I'm working on an application that's supposed to read and process flat files. These files don't always use a consistent encoding for every field in a record, so it was decided that we should read/write bytes and avoid the necessary decoding/encoding of turning them into Strings.
However, a lot of these fields are simple integers, and I need to validate them (test that they are really integers and in a certain range). I need a function that receives a byte[] and turns that into an int. I'm assuming all the digits are plain ASCII.
I know I could do this by first turning the byte[] into a CharBuffer, decoding to ISO-8859-1 or UTF-8, and then calling Integer.parseInt() but that seems like a lot of overhead and performance is important.
So, basically what I need is a Java equivalent of atoi(). I would prefer an API function (including 3rd party APIs). Also, the function should report errors in some way.
As a side note, I'm having the same issue with fields representing date/time (these are more rare though). It would be great if someone could mention some fast C-like library for Java.
While I cannot give you a ready Java solution, I want to point you to some interesting C code to read: the author of qmail has a small function, scan_ulong, that quickly parses unsigned longs from a byte array. You can find lots of incarnations of that function all over the web:
unsigned int scan_ulong(register const char *s,register unsigned long *u)
{
    register unsigned int pos = 0;
    register unsigned long result = 0;
    register unsigned long c;
    while ((c = (unsigned long) (unsigned char) (s[pos] - '0')) < 10) {
        result = result * 10 + c;
        ++pos;
    }
    *u = result;
    return pos;
}
(taken from here: https://github.com/jordansissel/djbdnsplus/blob/master/scan_ulong.c )
That code should translate pretty smoothly to Java.
The atoi function from the C library is an incredibly dull piece of code: you can translate it to Java in five minutes or less. If you must avoid writing your own, you could use the String(byte[] buf, int offset, int length) constructor to make a Java String, bypassing CharBuffer, and parse it to complete the conversion.
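A hand-written translation along the lines suggested above might look like the following. This is only a sketch: the class name AsciiInts and the choice to report errors through NumberFormatException are assumptions made for illustration, and unlike scan_ulong it rejects trailing garbage instead of stopping at the first non-digit.

```java
final class AsciiInts {
    private AsciiInts() {}

    /**
     * Parses an optionally signed decimal integer from the ASCII bytes in
     * buf[offset .. offset + length). Throws NumberFormatException on an
     * empty field, a non-digit byte, or a value outside the int range.
     */
    static int parseInt(byte[] buf, int offset, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty field");
        }
        int pos = offset;
        int end = offset + length;
        boolean negative = false;
        if (buf[pos] == '-' || buf[pos] == '+') {
            negative = buf[pos] == '-';
            pos++;
            if (pos == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        long result = 0; // accumulate in a long so overflow is easy to detect
        for (; pos < end; pos++) {
            int digit = buf[pos] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("non-digit byte at index " + pos);
            }
            result = result * 10 + digit;
            if (result > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("value out of int range");
            }
        }
        if (negative) {
            result = -result;
        }
        if (result > Integer.MAX_VALUE) {
            throw new NumberFormatException("value out of int range");
        }
        return (int) result;
    }

    /** Same as parseInt, but also enforces an inclusive [min, max] range. */
    static int parseIntInRange(byte[] buf, int offset, int length, int min, int max) {
        int value = parseInt(buf, offset, length);
        if (value < min || value > max) {
            throw new NumberFormatException(value + " is outside [" + min + ", " + max + "]");
        }
        return value;
    }
}
```

No String or CharBuffer is created on the success path, which is the point of the exercise; whether that matters in practice would need measuring.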
I've run into mind twisting bafflement after putting my hands into an old legacy project. The project consists of a Java application and a c++ application, which communicate with sockets. Both applications are designed to work on cross platform environments, so I'd be happy to keep the code as universal as possible.
I ended up rewriting parts of the communication logic, since the previous implementation had some issues with foreign characters. Now I ran into a problem with endianness, which I hope someone could spell out for me.
The Java software writes messages to socket with OutputStreamWriter, using UTF-16LE encoding, as follows.
OutputStream out = _socket.getOutputStream();
outputWriter = new OutputStreamWriter(new BufferedOutputStream(out), "UTF-16LE");
// ... create msg
outputWriter.write(msg, 0, msg.length());
outputWriter.flush();
The c++ program receives the message character by character as follows:
char buf[1];
std::queue<char> q;
std::u16string recUtf16Msg;
do {
    int iResult = recv(socket, buf, 1, 0);
    if (iResult <= 0)
        break; // Error or EOS
    for (int i = 0; i < iResult; i++) {
        q.push(buf[i]);
    }
    while (q.size() >= 2) {
        char firstByte = q.front();
        q.pop();
        char secondByte = q.front();
        q.pop();
        char16_t utf16char = (firstByte << (sizeof(char) * CHAR_BIT)) ^
                             (0x00ff & secondByte);
        // Change endianness, if necessary
        utf16char = ntohs(utf16char);
        recUtf16Msg.push_back(utf16char);
    }
    // ... end of message check removed for clarity
} while (true);
Now the issue I'm really facing is that the code above actually works, but I'm not sure why. The C++ side is written to receive messages in network byte order (big endian), but Java seems to be sending the data in little-endian encoding.
On the C++ side we even use the ntohs function to change endianness to the one desired by the host machine. As I understand the specification, ntohs is supposed to swap endianness if the host platform uses little-endian byte order. However, ntohs actually swaps the bytes of the received little-endian characters, which ends up as big endian, and the software works flawlessly.
Maybe someone can point out what exactly is happening? Do I accidentally swap the bytes already when creating utf16char? Why does ntohs make everything work, while it seems to act exactly opposite to the documentation? To compile I'm using Clang with libc++.
I left out parts of the code for clarity, but you should get the general idea. Also, I'm aware that using queue and dynamic array may not be the most effective way of handling data, but it's clean and performs well enough for this purpose.
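The arithmetic in the receive loop can be traced in a few lines of Java. This is a sketch under two assumptions: the host is little-endian (so ntohs swaps the two bytes), and the bytes are masked to avoid the sign extension that a signed char could introduce in the C++ version. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Trace {
    private Utf16Trace() {}

    /** Mirrors the C++ loop: assemble the code unit, then swap it as ntohs would. */
    static int assembleAndSwap(byte first, byte second) {
        // The first byte received is placed in the HIGH half, so little-endian
        // input is already byte-swapped at this point.
        int assembled = ((first & 0xFF) << 8) | (second & 0xFF);
        // On a little-endian host ntohs swaps the two bytes, undoing that swap.
        return Short.toUnsignedInt(Short.reverseBytes((short) assembled));
    }

    public static void main(String[] args) {
        byte[] wire = "A".getBytes(StandardCharsets.UTF_16LE); // bytes 0x41, 0x00
        int codeUnit = assembleAndSwap(wire[0], wire[1]);
        System.out.printf("0x%04X%n", codeUnit); // prints 0x0041
    }
}
```

Seen this way, the two swaps cancel out on a little-endian host. On a big-endian host ntohs would do nothing and the assembled value would stay swapped, so the approach depends on the host byte order.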
I am creating an easy to use server-client model with an extensible protocol, where the server is in Java and clients can be Java, C#, what-have-you.
I ran into this issue: Java data streams write strings with a short designating the length, followed by the data.
C# lets me specify the encoding I want, but it only reads one byte for the length. (actually, it says '7 bits at a time'...this is odd. This might be part of my problem?)
Here is my setup: The server sends a string to the client once it connects. It's a short string, so the first byte is 0 and the second byte is 9; the string is 9 bytes long.
//...
_socket.Connect(host, port);
var stream = new NetworkStream(_socket);
_in = new BinaryReader(stream, Encoding.UTF8);
Console.WriteLine(_in.ReadString()); //outputs nothing
Reading a single byte before reading the string of course outputs the expected string. But, how can I set up my stream reader to read a string using two bytes as the length, not one? Do I need to subclass BinaryReader and override ReadString()?
The C# BinaryWriter/BinaryReader behavior uses, if I recall correctly, the 8th (high) bit of each length byte to signal whether another length byte follows. This allows counts up to 127 to fit in a single byte while still allowing much larger count values (i.e. up to 2^31-1); it's a bit like UTF-8 in that respect.
For your own purposes, note that you are writing the whole protocol (presumably), so you have complete control over both ends. Both behaviors you describe, in C# and Java, are implemented by what are essentially helper classes in each language. There's nothing saying that you have to use them, and both languages offer a way to simply encode text directly into an array of bytes which you can send however you like.
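If you do want to stick with the Java-based protocol, you can use BitConverter to convert between a short and a byte[] so that you can send and receive those two bytes explicitly. Note that Java writes the length high byte first (big-endian), while BitConverter uses the byte order of the host, so the two bytes have to be reversed on a little-endian machine. For example:
_in = new BinaryReader(stream, Encoding.UTF8);
byte[] header = _in.ReadBytes(2);
// Java sent the length big-endian; BitConverter expects host byte order.
if (BitConverter.IsLittleEndian)
    Array.Reverse(header);
ushort count = BitConverter.ToUInt16(header, 0);
byte[] data = _in.ReadBytes(count);
string text = Encoding.UTF8.GetString(data);
Console.WriteLine(text); // outputs something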
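On the Java side, the same framing can be produced by hand rather than through the data stream helper. The sketch below (class and method names are hypothetical) writes a 2-byte big-endian length followed by UTF-8 bytes, and includes a DataOutputStream.writeUTF version for comparison. One caveat: writeUTF uses Java's modified UTF-8, so the two outputs are only guaranteed to match for strings without NUL characters or characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixed {
    private LengthPrefixed() {}

    /** Frames text as a 2-byte big-endian length followed by its UTF-8 bytes. */
    static byte[] frame(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for a 2-byte length");
        }
        ByteBuffer buf = ByteBuffer.allocate(2 + data.length); // big-endian by default
        buf.putShort((short) data.length);
        buf.put(data);
        return buf.array();
    }

    /** Reverses frame(): reads the unsigned 2-byte length, then decodes the payload. */
    static String unframe(byte[] framed) {
        int length = Short.toUnsignedInt(ByteBuffer.wrap(framed).getShort());
        return new String(framed, 2, length, StandardCharsets.UTF_8);
    }

    /** What DataOutputStream.writeUTF emits for the same text, for comparison. */
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Owning the framing on both ends like this avoids depending on either language's helper-class conventions.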
I'm trying to store a string into integers as follows:
I read the characters of the string, and every 4 characters I do this:
val = (int) ch << 24 | (int) ch << 16 | (int) ch << 8 | (int) ch;
Then I put the integer value in an array of integers called memory (=> int memory[16]).
I would like to do this automatically for a string of any length, and I'm also having difficulty reversing the procedure for an arbitrary-size string. Any help?
EDIT:
(from below)
Basically, I'm doing an exercise in Java. It's a MIPS simulator system. I have Register, Datum, Instruction, Label, Control, APSImulator classes and others. When I load the program from an array into the simulator's memory, I read each element of the array, which is called 'program', and put it in memory. Memory is 2048 long and 32 bits wide. Registers are also declared as 32-bit integers. So when the array contains an element like Datum.datum( "string" ) (the Datum class has IntDatum and StringDatum subclasses), I somehow have to store the "string" in the data segment of the simulator's memory. Memory is 0-1023 text and 1024-2047 data region. I also have to delimit the string with a null char, plus do any checks for full memory etc. I figured out that one way to store a String to MemContents (a reference type: an empty interface implemented by the class that the memory field belongs to) is to store the string a few symbols at a time (2 or maybe 4) to a register and then take the contents of the register and store it in memory. I found that very difficult to implement, and the reverse procedure also.
If you are working in C and your string is in a char array whose size is a multiple of sizeof(int), you can just take the pointer to the char array, cast it to a pointer to an int array and do whatever you want with your int array. If you don't have this last guarantee, you may simply write a function that creates your int array on the fly:
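#include <stdlib.h>
#include <string.h>

/* Returns the number of ints holding the string; *Dest is NULL if allocation failed. */
size_t IntArrayFromString(const char * Source, int ** Dest)
{
    size_t stringLength=strlen(Source);
    size_t intArrElements;
    intArrElements=stringLength/sizeof(int);
    if(stringLength%sizeof(int)!=0)
        intArrElements++;
    /* calloc zero-fills, so the unused tail of the last int is 0; allocating
       at least one element avoids indexing element -1 for an empty string. */
    *Dest=(int *)calloc(intArrElements ? intArrElements : 1, sizeof(int));
    if(*Dest==NULL)
        return 0;
    /* Copy into the new buffer (*Dest), not over the caller's pointer (Dest). */
    memcpy(*Dest, Source, stringLength);
    return intArrElements;
}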
The caller is responsible for freeing the Dest buffer.
(I'm not sure if it really works, I didn't test it)
Have you considered simply using String.getBytes()? You can then use the byte array to create the ints (for example, using the BigInteger(byte[]) constructor).
This may not be the most efficient solution, but is probably less prone to errors and more readable.
Assuming Java: You could look at the ByteBuffer class and its getInt method. The buffer has a byte order setting, which you need to configure first.
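A minimal sketch of that ByteBuffer approach, for packing a NUL-terminated string into 32-bit words and reading it back. The class name, the choice of big-endian order (first character in the high byte, as in the shift expression in the question) and the use of US-ASCII are assumptions for illustration; the simulator's real memory interface is not known here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class StringWords {
    private StringWords() {}

    /** Packs text plus a terminating NUL into 32-bit words, zero-padding the last word. */
    static int[] pack(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        int wordCount = (bytes.length + 1 + 3) / 4; // +1 for the NUL, rounded up to whole words
        ByteBuffer buf = ByteBuffer.allocate(wordCount * 4).order(ByteOrder.BIG_ENDIAN);
        buf.put(bytes); // the remaining bytes keep their initial value of zero
        int[] words = new int[wordCount];
        for (int i = 0; i < wordCount; i++) {
            words[i] = buf.getInt(i * 4); // absolute read, independent of position
        }
        return words;
    }

    /** Reads characters starting at words[start] until the first NUL byte. */
    static String unpack(int[] words, int start) {
        StringBuilder sb = new StringBuilder();
        for (int w = start; w < words.length; w++) {
            for (int shift = 24; shift >= 0; shift -= 8) {
                int b = (words[w] >>> shift) & 0xFF;
                if (b == 0) {
                    return sb.toString();
                }
                sb.append((char) b);
            }
        }
        throw new IllegalArgumentException("no terminating NUL found");
    }
}
```

The returned words could then be copied into the data region one element at a time, with the usual check that enough free memory remains.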
One common way to do this in C is to use a union. It could look like
union u_intstr {
char fourChars[4];
int singleInt;
};
Set the chars into the union as
union u_intstr myIntStr;
myIntStr.fourChars[0] = ch1;
myIntStr.fourChars[1] = ch2;
myIntStr.fourChars[2] = ch3;
myIntStr.fourChars[3] = ch4;
and then access the int as
printf("%d\n", myIntStr.singleInt);
Edit
In your case for 16 ints the union could be extended to look like
union u_my16ints {
char str[16*sizeof(int)];
int ints[16];
};
This is what I come up with
size_t len = strlen(str);
size_t count = (len + sizeof(int))/sizeof(int);
int *ptr = (int *)calloc(count, sizeof(int));
memcpy(ptr, str, len); /* copy only the string's bytes; calloc already zeroed the rest */
Due to the use of calloc(), the resulting buffer has at least one NUL byte, maybe more, to pad the last integer. This is not portable because the integers are in native byte order.
According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How can I create an exact replica of the binary output file (generated in C) using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
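As a sketch of what such explicit conversion functions can look like in Java, here is a 32-bit integer written and read in little-endian order with shifts and masks. The class and method names are hypothetical, and little-endian is just an example choice; the point is that the byte order is fixed by the code rather than by the machine.

```java
final class LittleEndianInt {
    private LittleEndianInt() {}

    /** Writes value into dest[offset .. offset + 4), least significant byte first. */
    static void putInt(byte[] dest, int offset, int value) {
        dest[offset]     = (byte) value;
        dest[offset + 1] = (byte) (value >>> 8);
        dest[offset + 2] = (byte) (value >>> 16);
        dest[offset + 3] = (byte) (value >>> 24);
    }

    /** Reads the 4 bytes written by putInt; the masks undo Java's sign extension of bytes. */
    static int getInt(byte[] src, int offset) {
        return (src[offset] & 0xFF)
                | ((src[offset + 1] & 0xFF) << 8)
                | ((src[offset + 2] & 0xFF) << 16)
                | ((src[offset + 3] & 0xFF) << 24);
    }
}
```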
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
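That rule of thumb can be turned into a small offset calculator. This is a heuristic sketch only: it assumes every field is a scalar whose alignment equals its size and that the struct is padded at the end to its largest alignment, which is typical but not guaranteed by any C compiler. The names are made up for illustration.

```java
final class StructLayout {
    private StructLayout() {}

    /** Rounds offset up to the next multiple of alignment. */
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) / alignment * alignment;
    }

    /**
     * Returns the offset of each field, followed by the total struct size
     * as the last element, under the natural-alignment assumption.
     */
    static int[] layout(int... fieldSizes) {
        int[] result = new int[fieldSizes.length + 1];
        int offset = 0;
        int maxAlign = 1;
        for (int i = 0; i < fieldSizes.length; i++) {
            offset = alignUp(offset, fieldSizes[i]); // insert padding before the field
            result[i] = offset;
            offset += fieldSizes[i];
            maxAlign = Math.max(maxAlign, fieldSizes[i]);
        }
        result[fieldSizes.length] = alignUp(offset, maxAlign); // trailing padding
        return result;
    }
}
```

For the struct with a char followed by a 4-byte int, layout(1, 4) gives offsets 0 and 4 and a total size of 8, which matches the three-byte hole described in the question. The result should still be checked against a real file or sizeof/offsetof output from the actual compiler.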
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
"Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way)."
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, (byte) someChar);  // a C char is one byte
bb.putInt(4, someInteger);   // putInt writes all four bytes; put() would write only one
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, you generally wrap() a ByteBuffer around a byte array that you've read from a file.
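The reading direction might look like this for the char-plus-int struct from the question. The layout used here (1 byte, 3 bytes of padding, then a 4-byte little-endian int) is an assumption about what the C compiler produced, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PaddedRecord {
    final byte c;
    final int i;

    private PaddedRecord(byte c, int i) {
        this.c = c;
        this.i = i;
    }

    /** Decodes one 8-byte record: char at offset 0, int at offset 4 (assumed layout). */
    static PaddedRecord read(byte[] raw) {
        ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        byte c = bb.get(0);  // absolute reads leave the buffer position untouched
        int i = bb.getInt(4); // offsets 1..3 are padding and are simply skipped
        return new PaddedRecord(c, i);
    }
}
```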
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?
With Java, the sizes of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seems to suggest that the output file will depend on the architecture.
you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
It can handle big/little-endian binary data, alignment (padding) and various numeric types, among other features. It is a very nice library; I like it very much.
my 0.02$
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
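int readInt(InputStream is,int len)
        throws IOException, PrematureEndOfDataException
{
    int n=0;
    while (len-->0)
    {
        int i=is.read(); // 0..255, or -1 at end of stream
        if (i==-1)
            throw new PrematureEndOfDataException();
        // Use the unsigned value directly: casting to byte would sign-extend
        // values above 0x7F and corrupt the result.
        n=(n<<8)|i;
    }
    return n; // bytes are combined most significant first (big-endian)
}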
You can alter the packing on the C side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.
I have to read a binary file in a legacy format with Java.
In a nutshell the file has a header consisting of several integers, bytes and fixed-length char arrays, followed by a list of records which also consist of integers and chars.
In any other language I would create structs (C/C++) or records (Pascal/Delphi) which are byte-by-byte representations of the header and the record. Then I'd read sizeof(header) bytes into a header variable and do the same for the records.
Something like this: (Delphi)
type
THeader = record
Version: Integer;
Type: Byte;
BeginOfData: Integer;
ID: array[0..15] of Char;
end;
...
procedure ReadData(S: TStream);
var
Header: THeader;
begin
S.ReadBuffer(Header, SizeOf(THeader));
...
end;
What is the best way to do something similar with Java? Do I have to read every single value on its own or is there any other way to do this kind of "block-read"?
To my knowledge, Java forces you to read a file as bytes rather than being able to block read. If you were serializing Java objects, it'd be a different story.
The other examples shown use the DataInputStream class with a File, but you can also use a shortcut: The RandomAccessFile class:
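RandomAccessFile in = new RandomAccessFile("filename", "r");
// readInt() reads big-endian; wrap in Integer.reverseBytes() for little-endian files
int version = in.readInt();
byte type = in.readByte();
int beginOfData = in.readInt();
byte[] tempId = new byte[16];
in.readFully(tempId); // read() may return fewer than 16 bytes
String id = new String(tempId);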
Note that you could turn the response objects into a class, if that would make it easier.
If you would be using Preon, then all you would have to do is this:
public class Header {
    @BoundNumber int version;
    @BoundNumber byte type;
    @BoundNumber int beginOfData;
    @BoundString(size="16") String id; // array[0..15] holds 16 chars
}
Once you have this, you create Codec using a single line:
Codec<Header> codec = Codecs.create(Header.class);
And you use the Codec like this:
Header header = Codecs.decode(codec, file);
You could use the DataInputStream class as follows:
DataInputStream in = new DataInputStream(new BufferedInputStream(
new FileInputStream("filename")));
int x = in.readInt();
double y = in.readDouble();
etc.
Once you get these values you can do with them as you please. Look up the java.io.DataInputStream class in the API for more info.
I may have misunderstood you, but it seems to me you're creating in-memory structures you hope will be a byte-per-byte accurate representation of what you want to read from hard-disk, then copy the whole stuff onto memory and manipulate thence?
If that's indeed the case, you're playing a very dangerous game. At least in C, the standard doesn't enforce things like padding or aligning of members of a struct. Not to mention things like big/little endianness or parity bits... So even if your code happens to run, it's very non-portable and risky: you depend on the compiler's creator not changing its mind in future versions.
Better to create an automaton that both validates the structure being read (byte by byte) from disk and fills an in-memory structure if it's indeed OK. You may lose some milliseconds (not as many as it may seem, since modern OSes do a lot of disk read caching), though you gain platform and compiler independence. Plus, your code will be easily ported to another language.
Post Edit: In a way I sympathize with you. In the good-ol' days of DOS/Win3.11, I once created a C program to read BMP files. And used exactly the same technique. Everything was nice until I tried to compile it for Windows - oops!! Int was now 32 bits long, rather than 16! When I tried to compile on Linux, discovered gcc had very different rules for bit fields allocation than Microsoft C (6.0!). I had to resort to macro tricks to make it portable...
I used Javolution and javastruct; both handle the conversion between bytes and objects.
Javolution provides classes that represent C types. All you need to do is to write a class that describes the C structure. For example, from the C header file,
struct Date {
    unsigned short year;
    unsigned char month;
    unsigned char day;
};
should be translated into:
public static class Date extends Struct {
public final Unsigned16 year = new Unsigned16();
public final Unsigned8 month = new Unsigned8();
public final Unsigned8 day = new Unsigned8();
}
Then call setByteBuffer to initialize the object:
Date date = new Date();
date.setByteBuffer(ByteBuffer.wrap(bytes), 0);
javastruct uses annotation to define fields in a C structure.
@StructClass
public class Foo{
    @StructField(order = 0)
    public byte b;
    @StructField(order = 1)
    public int i;
}
To initialize an object:
Foo f2 = new Foo();
JavaStruct.unpack(f2, b);
I guess FileInputStream lets you read in bytes. So, open the file with FileInputStream and read in sizeof(header) bytes. I am assuming that the header has a fixed format and size. I don't see that mentioned in the initial post, but I'm assuming that is the case, as it would get much more complex if the header had optional args and different sizes.
Once you have the info, there can be a header class in which you assign the contents of the buffer that you've already read. And then parse the records in a similar fashion.
Here is a link to read byte using a ByteBuffer (Java NIO)
http://exampledepot.com/egs/java.nio/ReadChannel.html
As other people mention, DataInputStream and Buffers are probably the low-level APIs you are after for dealing with binary data in Java.
However, you probably want something like Construct (the wiki page has good examples too: http://en.wikipedia.org/wiki/Construct_(python_library)), but for Java.
I don't know of any (Java versions) off hand, but taking that approach (declaratively specifying the struct in code) would probably be the right way to go. With a suitable fluent interface in Java it would probably be quite similar to a DSL.
EDIT: bit of googling reveals this:
http://javolution.org/api/javolution/io/Struct.html
Which might be the kind of thing you are looking for. I have no idea whether it works or is any good, but it looks like a sensible place to start.
I would create an object that wraps around a ByteBuffer representation of the data and provide getters to read directly from the buffer. In this way, you avoid copying data from the buffer to primitive types. Furthermore, you could use a MappedByteBuffer to get the byte buffer. If your binary data is complex, you can model it using classes and give each class a sliced version of your buffer.
class SomeHeader {
    private final ByteBuffer buf;
    SomeHeader( ByteBuffer fileBuffer){
        // you may need to set limits accordingly before
        // fileBuffer.limit(...)
        this.buf = fileBuffer.slice();
        // you may need to skip the sliced region
        // fileBuffer.position(endPos)
    }
    public short getVersion(){
        return buf.getShort(POSITION_OF_VERSION_IN_BUFFER);
    }
}
Also useful are the methods for reading unsigned values from byte buffers.
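ByteBuffer itself only has signed getters, so unsigned reads are usually done by widening and masking. A minimal sketch of such helpers (the class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;

final class UnsignedReads {
    private UnsignedReads() {}

    /** Unsigned 8-bit value at index, as an int in 0..255. */
    static int u8(ByteBuffer buf, int index) {
        return buf.get(index) & 0xFF;
    }

    /** Unsigned 16-bit value at index, as an int in 0..65535. */
    static int u16(ByteBuffer buf, int index) {
        return buf.getShort(index) & 0xFFFF;
    }

    /** Unsigned 32-bit value at index; needs a long to hold the full range. */
    static long u32(ByteBuffer buf, int index) {
        return buf.getInt(index) & 0xFFFFFFFFL;
    }
}
```

The multi-byte helpers honor whatever byte order has been set on the buffer with order().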
HTH
I've written up a technique to do this sort of thing in java - similar to the old C-like idiom of reading bit-fields. Note it is just a start but could be expanded upon.
here
In the past I used DataInputStream to read data of arbitrary types in a specified order. This will not allow you to easily account for big-endian/little-endian issues.
As of 1.4 the java.nio.Buffer family might be the way to go, but it seems that your code might actually be more complicated. These classes do have support for handling endian issues.
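For the header in the question, a ByteBuffer-based reader could look roughly like this. Several things are assumptions rather than facts from the post: that the record is packed (no padding after the Byte field), that integers are little-endian, and that the ID consists of single-byte characters padded with NULs. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class LegacyHeader {
    // Assumed packed layout: int (4) + byte (1) + int (4) + 16 single-byte chars.
    static final int SIZE = 4 + 1 + 4 + 16;

    final int version;
    final byte type;
    final int beginOfData;
    final String id;

    private LegacyHeader(int version, byte type, int beginOfData, String id) {
        this.version = version;
        this.type = type;
        this.beginOfData = beginOfData;
        this.id = id;
    }

    /** Decodes the first SIZE bytes of raw; throws if raw is shorter than SIZE. */
    static LegacyHeader parse(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw, 0, SIZE).order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt();
        byte type = buf.get();
        int beginOfData = buf.getInt();
        byte[] idBytes = new byte[16];
        buf.get(idBytes);
        // Treat the first NUL byte, if any, as the end of the ID.
        int idLength = 0;
        while (idLength < idBytes.length && idBytes[idLength] != 0) {
            idLength++;
        }
        String id = new String(idBytes, 0, idLength, StandardCharsets.ISO_8859_1);
        return new LegacyHeader(version, type, beginOfData, id);
    }
}
```

The bytes themselves can come from a single block read of SIZE bytes, so the "read it all at once" part of the Delphi idiom survives; only the field extraction is explicit.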
A while ago I found this article on using reflection and parsing to read binary data. In this case, the author is using reflection to read the java binary .class files. But if you are reading the data into a class file, it may be of some help.