Best way to read structured binary files with Java

Best way to read structured binary files with Java - java

I have to read a binary file in a legacy format with Java.
In a nutshell the file has a header consisting of several integers, bytes and fixed-length char arrays, followed by a list of records which also consist of integers and chars.
In any other language I would create structs (C/C++) or records (Pascal/Delphi) which are byte-by-byte representations of the header and the record. Then I'd read sizeof(header) bytes into a header variable and do the same for the records.
Something like this: (Delphi)
type
THeader = record
Version: Integer;
Type: Byte;
BeginOfData: Integer;
ID: array[0..15] of Char;
end;
...
procedure ReadData(S: TStream);
var
Header: THeader;
begin
S.ReadBuffer(Header, SizeOf(THeader));
...
end;
What is the best way to do something similar with Java? Do I have to read every single value on its own or is there any other way to do this kind of "block-read"?

To my knowledge, Java forces you to read a file as bytes rather than being able to block read. If you were serializing Java objects, it'd be a different story.
The other examples shown use the DataInputStream class with a File, but you can also use a shortcut: The RandomAccessFile class:
RandomAccessFile in = new RandomAccessFile("filename", "r");
int version = in.readInt();
byte type = in.readByte();
int beginOfData = in.readInt();
byte[] tempId;
in.read(tempId, 0, 16);
String id = new String(tempId);
Note that you could turn the responce objects into a class, if that would make it easier.

If you would be using Preon, then all you would have to do is this:
public class Header {
#BoundNumber int version;
#BoundNumber byte type;
#BoundNumber int beginOfData;
#BoundString(size="15") String id;
}
Once you have this, you create Codec using a single line:
Codec<Header> codec = Codecs.create(Header.class);
And you use the Codec like this:
Header header = Codecs.decode(codec, file);

You could use the DataInputStream class as follows:
DataInputStream in = new DataInputStream(new BufferedInputStream(
new FileInputStream("filename")));
int x = in.readInt();
double y = in.readDouble();
etc.
Once you get these values you can do with them as you please. Look up the java.io.DataInputStream class in the API for more info.

I may have misunderstood you, but it seems to me you're creating in-memory structures you hope will be a byte-per-byte accurate representation of what you want to read from hard-disk, then copy the whole stuff onto memory and manipulate thence?
If that's indeed the case, you're playing a very dangerous game. At least in C, the standard doesn't enforce things like padding or aligning of members of a struct. Not to mention things like big/small endianness or parity bits... So even if your code happens to run it's very non-portable and risky - you depend on the compiler's creator not changing its mind on future versions.
Better to create an automaton to both validate the structure being read (byte per byte) from HD is valid, and filling an in-memory structure if it's indeed OK. You may loose some milliseconds (not so much as it may seem for modern OSes do a lot of disk read caching) though you gain platform and compiler independence. Plus, your code will be easily ported to another language.
Post Edit: In a way I sympathize with you. In the good-ol' days of DOS/Win3.11, I once created a C program to read BMP files. And used exactly the same technique. Everything was nice until I tried to compile it for Windows - oops!! Int was now 32 bits long, rather than 16! When I tried to compile on Linux, discovered gcc had very different rules for bit fields allocation than Microsoft C (6.0!). I had to resort to macro tricks to make it portable...

I used Javolution and javastruct, both handles the conversion between bytes and objects.
Javolution provides classes that represent C types. All you need to do is to write a class that describes the C structure. For example, from the C header file,
struct Date {
unsigned short year;
unsigned byte month;
unsigned byte day;
};
should be translated into:
public static class Date extends Struct {
public final Unsigned16 year = new Unsigned16();
public final Unsigned8 month = new Unsigned8();
public final Unsigned8 day = new Unsigned8();
}
Then call setByteBuffer to initialize the object:
Date date = new Date();
date.setByteBuffer(ByteBuffer.wrap(bytes), 0);
javastruct uses annotation to define fields in a C structure.
#StructClass
public class Foo{
#StructField(order = 0)
public byte b;
#StructField(order = 1)
public int i;
}
To initialize an object:
Foo f2 = new Foo();
JavaStruct.unpack(f2, b);

I guess FileInputStream lets you read in bytes. So, opening the file with FileInputStream and read in the sizeof(header). I am assuming that the header has a fixed format and size. I don't see that mentioned in the initial post, but assuming that is the case as it would get much more complex if the header has optional args and different sizes.
Once you have the info, there can be a header class in which you assign the contents of the buffer that you've already read. And then parse the records in a similar fashion.

Here is a link to read byte using a ByteBuffer (Java NIO)
http://exampledepot.com/egs/java.nio/ReadChannel.html

As other people mention DataInputStream and Buffers are probably the low-level API's you are after for dealing with binary data in java.
However you probably want something like Construct (wiki page has good examples too: http://en.wikipedia.org/wiki/Construct_(python_library), but for Java.
I don't know of any (Java versions) off hand, but taking that approach (declaratively specifying the struct in code) would probably be the right way to go. With a suitable fluent interface in Java it would probably be quite similar to a DSL.
EDIT: bit of googling reveals this:
http://javolution.org/api/javolution/io/Struct.html
Which might be the kind of thing you are looking for. I have no idea whether it works or is any good, but it looks like a sensible place to start.

I would create an object that wraps around a ByteBuffer representation of the data and provide getters to read directly from the buffer. In this way, you avoid copying data from the buffer to primitive types. Furthermore, you could use a MappedByteBuffer to get the byte buffer. If your binary data is complex, you can model it using classes and give each class a sliced version of your buffer.
class SomeHeader {
private final ByteBuffer buf;
SomeHeader( ByteBuffer fileBuffer){
// you may need to set limits accordingly before
// fileBuffer.limit(...)
this.buf = fileBuffer.slice();
// you may need to skip the sliced region
// fileBuffer.position(endPos)
}
public short getVersion(){
return buf.getShort(POSITION_OF_VERSION_IN_BUFFER);
}
}
Also useful are the methods for reading unsigned values from byte buffers.
HTH

I've written up a technique to do this sort of thing in java - similar to the old C-like idiom of reading bit-fields. Note it is just a start but could be expanded upon.
here

In the past I used DataInputStream to read data of arbitrary types in a specified order. This will not allow you to easily account for big-endian/little-endian issues.
As of 1.4 the java.nio.Buffer family might be the way to go, but it seems that the your code might actually be more complicated. These classes do have support for handling endian issues.

A while ago I found this article on using reflection and parsing to read binary data. In this case, the author is using reflection to read the java binary .class files. But if you are reading the data into a class file, it may be of some help.

Related

How to read different objects from byte[]

I have a byte[] with 3 different objects. How can I read from the byte[] and separate the objects?
My code:
public byte[] toByteArray() {
byte[] bytes;
byte[] sb = start.toString().getBytes();
byte[] gb = goal.toString().getBytes();
byte[] mb = gameBoard.toString().getBytes();
bytes = new byte[sb.length + gb.length + mb.length];
System.arraycopy(sb, 0, bytes, 0, sb.length);
System.arraycopy(gb, 0, bytes, sb.length, gb.length);
System.arraycopy(mb, 0, bytes, gb.length, mb.length);
return bytes;
}

Seems like you are talking about Java not JavaScript.
I recommend you to have a look at binary serialization which I guess is what you are looking for: Saving to binary/serialization java

If you store your data like this, it will be a very difficult task to read them.
I recommend using some build-in object-to-byte[] (and back) conversions like Serializable.
Also, to store several object inside one byte[] array, have a look into ObjectOutputStream

First of all you will need an actual byte[] where stuff can be read from. There are some issues about what you are trying.
toString() usually is not fit to get some data you can reconstruct the object from. It might work with an integer, get a bit messed up with floating point, and be outright impossible with complex objects which only tell about their type and id. (as Davide comment pointed)
There are no cues about where one object starts and ends. Even worse: you might have messed up the start position of 3rd object.
The JRE has a built-in serialization.
Other people use XML or JSON when they need to interact with something else. You might even implement your own flavor of java.text.Format which is able to format and parse your objects. Pick your poison.

Java: Efficiently converting an array of longs to an array of bytes

I have an array of longs I want to write to disk. The most efficient disk I/O functions take in byte arrays, for example:
FileOutputStream.write(byte[] b, int offset, int length)
...so I want to begin by converting my long[] to byte[] (8 bytes for each long). I'm struggling to find a clean way to do this.
Direct typecasting doesn't seem allowed:
ConversionTest.java:6: inconvertible types
found : long[]
required: byte[]
byte[] byteArray = (byte[]) longArray;
^
It's easy to do the conversion by iterating over the array, for example:
ByteBuffer bytes = ByteBuffer.allocate(longArray.length * (Long.SIZE/8));
for( long l: longArray )
{
bytes.putLong( l );
}
byte[] byteArray = bytes.array();
...however that seems far less efficient than simply treating the long[] as a series of bytes.
Interestingly, when reading the file, it's easy to "cast" from byte[] to longs using Buffers:
LongBuffer longs = ByteBuffer.wrap(byteArray).asLongBuffer();
...but I can't seem to find any functionality to go the opposite direction.
I understand there are endian considerations when converting from long to byte, but I believe I've already addressed those: I'm using the Buffer framework shown above, which defaults to big endian, regardless of native byte order.

No, there is not a trivial way to convert from a long[] to a byte[].
Your best option is likely to wrap your FileOutputStream with a BufferedOutputStream and then write out the individual byte values for each long (using bitwise operators).
Another option is to create a ByteBuffer and put your long values into the ByteBuffer and then write that to a FileChannel. This handles the endianness conversion for you, but makes the buffering more complicated.

Concerning the efficiency, many details will, in fact, hardly make a difference. The hard disk is by far the slowest part involved here, and in the time that it takes to write a single byte to the disk, you could have converted thousands or even millions of bytes to longs. Every performance test here will not tell you anything about the performance of the implementation, but about the performance of the hard disk. In doubt, one should make dedicated benchmarks comparing the different conversion strategies, and comparing the different writing methods, respectively.
Assuming that the main goal is a functionality that allows a convenient conversion and does not impose an unnecessary overhead, I'd like to propose the following approach:
One can create a ByteBuffer of sufficient size, view this as a LongBuffer, use the bulk LongBuffer#put(long[]) method (which takes care of endianness conversions, of necessary, and does this as efficient as it can be), and finally, write the original ByteBuffer (which is now filled with the long values) to the file, using a FileChannel.
Following this idea, I think that this method is convenient and (most likely) rather efficient:
private static void bulkAndChannel(String fileName, long longArray[])
{
ByteBuffer bytes =
ByteBuffer.allocate(longArray.length * Long.BYTES);
bytes.order(ByteOrder.nativeOrder()).asLongBuffer().put(longArray);
try (FileOutputStream fos = new FileOutputStream(fileName))
{
fos.getChannel().write(bytes);
}
catch (IOException e)
{
e.printStackTrace();
}
}
(Of course, one could argue about whether allocating a "large" buffer is the best idea. But thanks to the convenience methods of the Buffer classes, this could easily and with reasonable effort be modified to write "chunks" of data with an appropriate size, for the case that one really wants to write a huge array and the memory overhead of creating the corresponding ByteBuffer would be prohibitively large)

OP here.
I have thought of one approach: ByteBuffer.asLongBuffer() returns an instance of ByteBufferAsLongBufferB, a class which wraps ByteBuffer in an interface for treating the data as longs while properly managing endianness. I could extend ByteBufferAsLongBufferB, and add a method to return the raw byte buffer (which is protected).
But this seems so esoteric and convoluted I feel there must be an easier way. Either that, or something in my approach is flawed.

Resource file format processing in Java

I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile but it does not allow to read arrays of data, e.g. I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want but there's got to be a proper out-of-the-box class to do this kind of processing isn't it?
This is my reader for the header in C#:
protected override void ReadHeader()
{
Header = new string(this.BinaryReader.ReadChars(3));
byte dummy = this.BinaryReader.ReadByte();
NFiles = this.BinaryReader.ReadByte();
}

I think you got lucky with your C# code, as it relies on the character encoding to be set somewhere else, and if it didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java would be to strictly read bytes and do the conversion to characters yourself. If you need seek abilities, then indeed RandomAccessFile would be your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don`t need actual random access, just to skip some of the files, you could certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
String header = "";
int nFiles = 0;
RandomAccessFile raFile = new RandomAccessFile( "filename", "r" );
byte[] buffer = new byte[3];
int numRead = raFile.read( buffer );
header = new String( buffer, StandardCharsets.US_ASCII.name() );
int numSkipped = raFile.skipBytes(1);
nFiles = raFile.read(); // The byte is read as an integer between 0 and 255
Sanity checks (checking that actual 3 bytes were read, 1 byte was skipped and nFiles is not -1) and exception handling have been skipped for brevity.
It's more or less the same if you use InputStream.

I would go with MappedByteBuffer. This will allow you to seek arbitrarily, but will also deal efficiently and transparently with large files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.

Writing Bits to a file using BitSet & FileOutputStream

I've run into a bit of a problem when it comes to writing specific bits to a file. I apologise if this is a duplicate of anything but I could not find a reasonable answer with the searches I ran.
I have a number of difficulties with the following:
Writing a header (Long) bit by bit (converted to a byte array so the
FileOutputStream can utilise it) to the file.
Writing single bits to the file. For example, at one stage I am required to write a single bit set to 0 to the file so my initial thought would be to use a BitSet but Java seems to treat this as a null?
BitSet initialPadding = new BitSet();
initialPadding.set(0, false);
fileOutputStream.write(initialPadding.toByteArray());
1)
I create a FileOutputStream as shown below with the necessary file name:
FileOutputStream fileOutputStream = new FileOutputStream(file.getAbsolutePath());
I am attempting to create an ".amr" file so the first step before I perform any bit manipulation is to write a header to the beginning of the file. This has the following value:
Long defaultHeader = 0x2321414d520aL;
I've tried writing this to the file using the following method but I am pretty sure it does not write the correct result:
fileOutputStream.write(defaultHeader.byteValue());
Am I using the correct streams? Are my convertions completely wrong?
2)
I have a public BitSet fileBitSet;which has bits read in from a ".raw" file as the input. I need to be able to extract certain bits from the BitSet in order to write them to the file later. I do this using the following method:
public int getOctetPayloadHeader(int startPoint) {
int readLength = 0;
octetCMR = fileBitSet.get(0, 3);
octetRES = fileBitSet.get(4, 7);
if (octetRES.get(0, 3).isEmpty()) {
/* Keep constructing the payload header. */
octetFBit = fileBitSet.get(8, 8);
octetMode = fileBitSet.get(9, 12);
octetQuality = fileBitSet.get(13, 13);
octetPadding = fileBitSet.get(14, 15);
... }
What would be the best way to go for writing these bits to a file bearing in mind that I may be required to sometimes write a single bit or 81 bits at a particular offset in the fileBitSet ?

There is only one thing you can write to an OutputStream: bytes. You have to do the composing of your bits into bytes yourself; only you know the rules how the bits are to be put together into bytes.
As for stuff like:
Long defaultHeader = 0x2321414d520aL;
fileOutputStream.write(defaultHeader.byteValue());
You should take a close look at the javadocs for the methods you are using. byteValue() returns a single byte; so of course its not doing what you expect. Working with streams is well explained in oracles tutorials: http://docs.oracle.com/javase/tutorial/essential/io/streams.html
For writing single bits or groups of bits, you will need a custom OutputStream that handles grouping the bits into bytes to be written. Thats commonly called a BitStream (there is no such class in the JDK); you have to either write it yourself (which I highly recommend, its a very good excercise to teach you about bits and bytes) or find one on the web.

Replicating C struct padding in Java

According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I to create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).

Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).

This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.

Is there an automatic way to apply C
padding in Java output? Or do I have
to go through compiler documentation
to see how it works (the compiler is
g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.

For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someChar);
bb.put(4, someInteger);
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.

A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.

This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?

With Java, the size of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seem to suggest that the output file will depend on the architecture.

you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
it can handle Big/Little endian binary data, alignment (padding) and various numeric types along other features. It is a very nice library, I like it very much
my 0.02$

I highly recommend protocol buffers for exactly this problem.

As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is,int len)
throws PrematureEndOfDataException
{
int n=0;
while (len-->0)
{
int i=is.read();
if (i==-1)
throw new PrematureEndOfDataException();
byte b=(byte) i;
n=(n<<8)+b;
}
return n;
}

You can alter the packing on the c side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Best way to read structured binary files with Java - java

Here is a link to read byte using a ByteBuffer (Java NIO) http://exampledepot.com/egs/java.nio/ReadChannel.html

I've written up a technique to do this sort of thing in java - similar to the old C-like idiom of reading bit-fields. Note it is just a start but could be expanded upon. here

A while ago I found this article on using reflection and parsing to read binary data. In this case, the author is using reflection to read the java binary .class files. But if you are reading the data into a class file, it may be of some help.

Related

How to read different objects from byte[]

Java: Efficiently converting an array of longs to an array of bytes

Resource file format processing in Java

Writing Bits to a file using BitSet & FileOutputStream

Replicating C struct padding in Java

Categories

Resources