How to write/read binary files that represent objects?

How to write/read binary files that represent objects? - java

I'm new to Java programming, and I ran into this problem:
I'm creating a program that reads a .csv file, converts its lines into objects and then manipulate these objects.
Being more specific, the application reads every line giving it an index and also reads certain values from those lines and stores them in TRIE trees.
The application then can read indexes from the values stored in the trees and then retrieve the full information of the corresponding line.
My problem is that, even though I've been researching the last couple of days, I don't know how to write these structures in binary files, nor how to read them.
I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.
For the tree writing, I was looking for something like this (in C)
fwrite(tree, sizeof(struct TrieTree), 1, file)
For the "binary indexed file", I was thinking on writing objects like the TRIEs, and maybe reading each object until I've read enough to reach the corresponding index, but this probably wouldn't be very efficient.
Recapitulating, I need help in writing and reading objects in binary files and solutions on how to create an indexed file.

I think you are (for starters) best off when trying to do this with serialization.
Here is just one example from stackoverflow: What is object serialization?
(I think copy&paste of the code does not make sense, please follow the link to read)
Admittedly this does not yet solve your index creation problem.

Here is an alternative to Java native serialization, Google Protocol Buffers.
I am going to write direct quotes from documentation mostly in this answer, so be sure to follow the link at the end of answer if you are interested into more details.
What is it:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
In other words, you can serialize your structures in Java and deserialize at .net, pyhton etc. This you don't have in java native Serialization.
Performance:
This may vary according to use case but in principle GPB should be faster, as its built with performance and interchangeability in mind.
Here is stack overflow link discussing Java native vs GPB:
High performance serialization: Java vs Google Protocol Buffers vs ...?
How does it work:
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phone = 4;
}
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes.
You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person john = Person.newBuilder()
.setId(1234)
.setName("John Doe")
.setEmail("jdoe#example.com")
.build();
output = new FileOutputStream(args[0]);
john.writeTo(output);
Read all about it here:
https://developers.google.com/protocol-buffers/
You could look at GPB as an alternative format to XSD describing XML structures, just more compact and with faster serialization.

Related

How to parse freedict files (.dict and .index)

I was searching for free translation dictionaries. Freedict (freedict.org) provides the ones I need but I don't know, how to parse the *.index and *.dict files. I also don't really know, what to google, to find useful information about these formats.
The *.index files look following:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in python, but java, c, c++ are also ok) to understand how to handle these files.

It is too late. However, i hope that it can be useful for others like me.
JGoerzen writes a Dictdlib lib. You can see more details how he parse .index and .dict files.
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py

dictd considers its format of .index and .dict[.dz] as private, to reserve itself the right to change it in the future.
If you want to process it directly anyway, the index contains the headwords and the .dict[.dz] contains definitions. It is optionally compressed with a special modified gzip algorithm providing almost random access, which gzip normally does not. The index contains 3 columns per line, tab separated:
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
For more details see the dict(8) man page (section Database Format) you should have found in your research before asking your question. For processing the headwords correctly, you'd have to consider encoding and character collation.
Eventually it would be better to use an existing library to read dictd databases. But that really depends on whether the library is good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff and no need to bother parsing anything.
After getting this far the next problem would be that there is no one-to-one mapping between words in different lanuages...

Java Serialization to transfer data between any language

Question:
Instead of writing my own serialization algorithm; would it be possible to just use the built in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working, would be as follows: I start up a process, that will be be a language-specific program - written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float- just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out how to do this is as follows: I would start with a 4-byte magic number - just to ensure we are working with the correct data. Following, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte, signifying the data type (of the above), following by any crucial information, e.x: length for the String and Array. Then, the data that follows.
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Due to this, a text-based format (such as JSON) may produce far larger operation times. Considering that non of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.

I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.

There is a binary json format called bson.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you measured.
Premature optimization is the root of all evil.
Can you have a try and benchmark the throughput? See if it fits your needs?

Thrift,Protobuf,JSON,MessagePack
complexity of installation Thrift >> Protobuf > BSON > MessagePack > JSON
serialization data size JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
time cost Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack

Best Practice for Writing Data Using protobuf

We need our protobuf messages to contain as little data as possible. So what are the best practices we can follow in order to gain the maximum out of it. As an example writing byte[] as a String or ByteString ? What makes the difference? And adding a list of Integers as a repeated list or something else ?

As an example writing byte[] as a String or ByteString ?
If you want to write binary data, use a bytes fields (so ByteString). A string field is UTF-8-encoded text, so can't be used for all possible byte sequences.
And adding a list of integers as a repeated list or something else ?
Yes, use a repeated list - but with the [packed=true] option.
Basically, look over the whole encoding documentation and work out what's most appropriate for you. In particular, choose carefully between the various numeric representations, based on what your actual data will be. (If you're writing 32-bit values which are typically very large, consider using the fixed32 format instead of just int32 for example.)

Incremental streaming JSON library for Java

Can anyone recommend a JSON library for Java which allows me to give it chunks of data as they come in, in a non-blocking fashion? I have read through A better Java JSON library and similar questions, and haven't found precisely what I'd like.
Essentially, what I'd like is a library which allows me to do something like the following:
String jsonString1 = "{ \"A broken";
String jsonString2 = " json object\" : true }";
JSONParser p = new JSONParser(...);
p.parse(jsonString1);
p.isComplete(); // returns false
p.parse(jsonString2);
p.isComplete(); // returns true
Object o = p.getResult();
Notice the actual key name ("A broken json object") is split between pieces.
The closest I've found is this async-json-library which does almost exactly what I'd like, except it cannot recover objects where actual strings or other data values are split between pieces.

There are a few blocking streaming/incemental JSON parsers (as per Is there a streaming API for JSON?); but for async nothing yet that I am aware of.
The lib you refer to seems badly named; it does not seem to do real asynchronous processing, but merely allow one to parse sequence of JSON documents (which multiple other libs allow doing as well)
If there were people who really wanted this, writing one is not impossible -- for XML there is Aalto, and handling JSON is quite a bit simpler than XML.
For what it is worth, there is actually this feature request to add non-blocking parsing mode for Jackson; but very few users have expressed interest in getting that done (via voting for the feature request).
EDIT: (2016-01) while not async, Jackson ObjectMapper allows for convenient sub-tree by sub-tree binding of parts of the stream as well -- see ObjectReader.readValues() (ObjectReader created from ObjectMapper), or short-cut versions of ObjectMapper.readValues(...). Note the trailing s in there, which implies a stream of Objects, not just a single one.

Google Gson can incrementally parse Json from an InputStream
https://sites.google.com/site/gson/streaming

I wrote such a parser: JsonParser.java. See examples how to use it:JsonParserTest.java.

Replicating C struct padding in Java

According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I to create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).

Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).

This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.

Is there an automatic way to apply C
padding in Java output? Or do I have
to go through compiler documentation
to see how it works (the compiler is
g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.

For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someChar);
bb.put(4, someInteger);
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.

A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.

This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?

With Java, the size of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seem to suggest that the output file will depend on the architecture.

you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
it can handle Big/Little endian binary data, alignment (padding) and various numeric types along other features. It is a very nice library, I like it very much
my 0.02$

I highly recommend protocol buffers for exactly this problem.

As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is,int len)
throws PrematureEndOfDataException
{
int n=0;
while (len-->0)
{
int i=is.read();
if (i==-1)
throw new PrematureEndOfDataException();
byte b=(byte) i;
n=(n<<8)+b;
}
return n;
}

You can alter the packing on the c side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.