Replicating C struct padding in Java - java

According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I to create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).

Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).

This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.

Is there an automatic way to apply C
padding in Java output? Or do I have
to go through compiler documentation
to see how it works (the compiler is
g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.

For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someChar);
bb.put(4, someInteger);
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.

A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.

This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?

With Java, the size of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seem to suggest that the output file will depend on the architecture.

you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
it can handle Big/Little endian binary data, alignment (padding) and various numeric types along other features. It is a very nice library, I like it very much
my 0.02$

I highly recommend protocol buffers for exactly this problem.

As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is,int len)
throws PrematureEndOfDataException
{
int n=0;
while (len-->0)
{
int i=is.read();
if (i==-1)
throw new PrematureEndOfDataException();
byte b=(byte) i;
n=(n<<8)+b;
}
return n;
}

You can alter the packing on the c side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.

Related

EBCDIC unpacking comp-3 data returns 40404** in Java

I have used the unpack data logic provided in below link for java
How to unpack COMP-3 digits using Java?
But for the null data in source it returns 404040404 like on Java unpack code. I understand this was space in ebcdic, but how to unpack by handling this space or to avoid it.
There are two problems that we have to deal with. First, is the data valid comp-3 data and second, is the data considered “valid” by older language implementations like COBOL since Comp-3 was mentioned.
If the offests are not misaligned it would appear that spaces are being interpreted by existing programs as 0 instead of spaces. This would be incorrect but could be an artifact of older programs that were engineered to tolerate this bad behaviour.
The approach I would take in a legacy shop (assuming no misalignment) is to consider “spaces” (which are sequences of 0x404040404040) as being zero. This would be a legacy check to compare the field with spaces and then assume that 0x00000000000f as the actual default. This is something an individual shop would have to determine and is not recognized as a general programming approach.
In terms of Java, one has to remember that bytes are “signed” so comparisons can be tricky based on how the code is written. The only “unsigned” data type I
recall in java is char which is really two bytes (unit 16) basically.
This is less of a programming problem than it is recognizing historical tolerance and remediation.

Java Serialization to transfer data between any language

Question:
Instead of writing my own serialization algorithm; would it be possible to just use the built in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working, would be as follows: I start up a process, that will be be a language-specific program - written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float- just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out how to do this is as follows: I would start with a 4-byte magic number - just to ensure we are working with the correct data. Following, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte, signifying the data type (of the above), following by any crucial information, e.x: length for the String and Array. Then, the data that follows.
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Due to this, a text-based format (such as JSON) may produce far larger operation times. Considering that non of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.
I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.
There is a binary json format called bson.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you measured.
Premature optimization is the root of all evil.
Can you have a try and benchmark the throughput? See if it fits your needs?
Thrift,Protobuf,JSON,MessagePack
complexity of installation Thrift >> Protobuf > BSON > MessagePack > JSON
serialization data size JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
time cost Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack

Mapping java variables data to C Structures and write a c compatible file

I have a java class file with three arrayLists, one with type String, one with type Integer and other is ArrayList with type (ArrayList(String)). I have to write these these arraylists to a structure in C with character arrays, integers and short and output a file in a specefic format extension. The file has to be readable again by the same application. What is the best way to trasnfer the data from java to c structure and then output the c structure in a file. Thank you
There is no "C compatible file" format. If you have C structs written to disk file directly, then those are in an ad-hoc binary format. Exact format depends on things like packing and padding of the struct, byte order, word size of the CPU (like, 32 or 64 bit), etc.
So, start by defining the format, then forget it is produced by C.
Once you have the format defined, you can write a program to parse it in Java. If it is short with fixed length records, I'd probably create a class, which internally has just a private byte[] array, and then methods to manipulate it, save it and load it.
I suggest you write/read the data to a ByteBuffer using native byte ordering. The rest is up to you are to how you do it.
A library which might help is Javolution's Struct library which helps you map C structs onto ByteBuffers. This can help with C's various padding rules i.e.the exact layout might not be obvious.

problem when Java and C++ talk with each other

I have a program in C++ and it writes a binary file on disk. Then I use a Java program to read the number. The problem is the number read is different from the number written.
Say, I write an integer 4 using c++ and get back 67108864 when use JAVA to read it (using readint()). I suspect its due to big or small endian. Do you have any simple solutions to solve this?
Java's java.nio buffers let you specify the endianness.
See http://download.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html especially the order method which lets you specify endianness and the getInt method which lets you read an int.
To read a file using a ByteBuffer do something like:
ByteBuffer buffer = new RandomAccessFile(myFile, "r")
.getChannel.map(MapMode.READ, offset, length);
Remember to close it when you're done.

Best way to read structured binary files with Java

I have to read a binary file in a legacy format with Java.
In a nutshell the file has a header consisting of several integers, bytes and fixed-length char arrays, followed by a list of records which also consist of integers and chars.
In any other language I would create structs (C/C++) or records (Pascal/Delphi) which are byte-by-byte representations of the header and the record. Then I'd read sizeof(header) bytes into a header variable and do the same for the records.
Something like this: (Delphi)
type
THeader = record
Version: Integer;
Type: Byte;
BeginOfData: Integer;
ID: array[0..15] of Char;
end;
...
procedure ReadData(S: TStream);
var
Header: THeader;
begin
S.ReadBuffer(Header, SizeOf(THeader));
...
end;
What is the best way to do something similar with Java? Do I have to read every single value on its own or is there any other way to do this kind of "block-read"?
To my knowledge, Java forces you to read a file as bytes rather than being able to block read. If you were serializing Java objects, it'd be a different story.
The other examples shown use the DataInputStream class with a File, but you can also use a shortcut: The RandomAccessFile class:
RandomAccessFile in = new RandomAccessFile("filename", "r");
int version = in.readInt();
byte type = in.readByte();
int beginOfData = in.readInt();
byte[] tempId;
in.read(tempId, 0, 16);
String id = new String(tempId);
Note that you could turn the responce objects into a class, if that would make it easier.
If you would be using Preon, then all you would have to do is this:
public class Header {
#BoundNumber int version;
#BoundNumber byte type;
#BoundNumber int beginOfData;
#BoundString(size="15") String id;
}
Once you have this, you create Codec using a single line:
Codec<Header> codec = Codecs.create(Header.class);
And you use the Codec like this:
Header header = Codecs.decode(codec, file);
You could use the DataInputStream class as follows:
DataInputStream in = new DataInputStream(new BufferedInputStream(
new FileInputStream("filename")));
int x = in.readInt();
double y = in.readDouble();
etc.
Once you get these values you can do with them as you please. Look up the java.io.DataInputStream class in the API for more info.
I may have misunderstood you, but it seems to me you're creating in-memory structures you hope will be a byte-per-byte accurate representation of what you want to read from hard-disk, then copy the whole stuff onto memory and manipulate thence?
If that's indeed the case, you're playing a very dangerous game. At least in C, the standard doesn't enforce things like padding or aligning of members of a struct. Not to mention things like big/small endianness or parity bits... So even if your code happens to run it's very non-portable and risky - you depend on the compiler's creator not changing its mind on future versions.
Better to create an automaton to both validate the structure being read (byte per byte) from HD is valid, and filling an in-memory structure if it's indeed OK. You may loose some milliseconds (not so much as it may seem for modern OSes do a lot of disk read caching) though you gain platform and compiler independence. Plus, your code will be easily ported to another language.
Post Edit: In a way I sympathize with you. In the good-ol' days of DOS/Win3.11, I once created a C program to read BMP files. And used exactly the same technique. Everything was nice until I tried to compile it for Windows - oops!! Int was now 32 bits long, rather than 16! When I tried to compile on Linux, discovered gcc had very different rules for bit fields allocation than Microsoft C (6.0!). I had to resort to macro tricks to make it portable...
I used Javolution and javastruct, both handles the conversion between bytes and objects.
Javolution provides classes that represent C types. All you need to do is to write a class that describes the C structure. For example, from the C header file,
struct Date {
unsigned short year;
unsigned byte month;
unsigned byte day;
};
should be translated into:
public static class Date extends Struct {
public final Unsigned16 year = new Unsigned16();
public final Unsigned8 month = new Unsigned8();
public final Unsigned8 day = new Unsigned8();
}
Then call setByteBuffer to initialize the object:
Date date = new Date();
date.setByteBuffer(ByteBuffer.wrap(bytes), 0);
javastruct uses annotation to define fields in a C structure.
#StructClass
public class Foo{
#StructField(order = 0)
public byte b;
#StructField(order = 1)
public int i;
}
To initialize an object:
Foo f2 = new Foo();
JavaStruct.unpack(f2, b);
I guess FileInputStream lets you read in bytes. So, opening the file with FileInputStream and read in the sizeof(header). I am assuming that the header has a fixed format and size. I don't see that mentioned in the initial post, but assuming that is the case as it would get much more complex if the header has optional args and different sizes.
Once you have the info, there can be a header class in which you assign the contents of the buffer that you've already read. And then parse the records in a similar fashion.
Here is a link to read byte using a ByteBuffer (Java NIO)
http://exampledepot.com/egs/java.nio/ReadChannel.html
As other people mention DataInputStream and Buffers are probably the low-level API's you are after for dealing with binary data in java.
However you probably want something like Construct (wiki page has good examples too: http://en.wikipedia.org/wiki/Construct_(python_library), but for Java.
I don't know of any (Java versions) off hand, but taking that approach (declaratively specifying the struct in code) would probably be the right way to go. With a suitable fluent interface in Java it would probably be quite similar to a DSL.
EDIT: bit of googling reveals this:
http://javolution.org/api/javolution/io/Struct.html
Which might be the kind of thing you are looking for. I have no idea whether it works or is any good, but it looks like a sensible place to start.
I would create an object that wraps around a ByteBuffer representation of the data and provide getters to read directly from the buffer. In this way, you avoid copying data from the buffer to primitive types. Furthermore, you could use a MappedByteBuffer to get the byte buffer. If your binary data is complex, you can model it using classes and give each class a sliced version of your buffer.
class SomeHeader {
private final ByteBuffer buf;
SomeHeader( ByteBuffer fileBuffer){
// you may need to set limits accordingly before
// fileBuffer.limit(...)
this.buf = fileBuffer.slice();
// you may need to skip the sliced region
// fileBuffer.position(endPos)
}
public short getVersion(){
return buf.getShort(POSITION_OF_VERSION_IN_BUFFER);
}
}
Also useful are the methods for reading unsigned values from byte buffers.
HTH
I've written up a technique to do this sort of thing in java - similar to the old C-like idiom of reading bit-fields. Note it is just a start but could be expanded upon.
here
In the past I used DataInputStream to read data of arbitrary types in a specified order. This will not allow you to easily account for big-endian/little-endian issues.
As of 1.4 the java.nio.Buffer family might be the way to go, but it seems that the your code might actually be more complicated. These classes do have support for handling endian issues.
A while ago I found this article on using reflection and parsing to read binary data. In this case, the author is using reflection to read the java binary .class files. But if you are reading the data into a class file, it may be of some help.

Categories

Resources