We need our protobuf messages to contain as little data as possible. So what are the best practices we can follow in order to gain the maximum out of it. As an example writing byte[] as a String or ByteString ? What makes the difference? And adding a list of Integers as a repeated list or something else ?
As an example writing byte[] as a String or ByteString ?
If you want to write binary data, use a bytes fields (so ByteString). A string field is UTF-8-encoded text, so can't be used for all possible byte sequences.
And adding a list of integers as a repeated list or something else ?
Yes, use a repeated list - but with the [packed=true] option.
Basically, look over the whole encoding documentation and work out what's most appropriate for you. In particular, choose carefully between the various numeric representations, based on what your actual data will be. (If you're writing 32-bit values which are typically very large, consider using the fixed32 format instead of just int32 for example.)
Related
Question:
Instead of writing my own serialization algorithm; would it be possible to just use the built in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working, would be as follows: I start up a process, that will be be a language-specific program - written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float- just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out how to do this is as follows: I would start with a 4-byte magic number - just to ensure we are working with the correct data. Following, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte, signifying the data type (of the above), following by any crucial information, e.x: length for the String and Array. Then, the data that follows.
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Due to this, a text-based format (such as JSON) may produce far larger operation times. Considering that non of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.
I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.
There is a binary json format called bson.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you measured.
Premature optimization is the root of all evil.
Can you have a try and benchmark the throughput? See if it fits your needs?
Thrift,Protobuf,JSON,MessagePack
complexity of installation Thrift >> Protobuf > BSON > MessagePack > JSON
serialization data size JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
time cost Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack
I am developing a Java-based downloader for binary data. This data is transferred via a text-based protocol (UU-encoded). For the networking task the netty library is used. The binary data is split by the server into many thousands of small packets and sent to the client (i.e. the Java application).
From netty I receive a ChannelBuffer object every time a new message (data) is received. Now I need to process that data, beside other tasks I need to check the header of the package coming from the server (like the HTTP status line). To do so I call ChannelBuffer.array() to receive a byte[] array. This array I can then convert into a string via new String(byte[]) and easily check (e.g. compare) its content (again, like comparison to the "200" status message in HTTP).
The software I am writing is using multiple threads/connections, so that I receive multiple packets from netty in parallel.
This usually works fine, however, while profiling the application I noticed that when the connection to the server is good and data comes in very fast, then this conversion to the String object seems to be a bottleneck. The CPU usage is close to 100% in such cases, and according to the profiler very much time is spent in calling this String(byte[]) constructor.
I searched for a better way to get from the ChannelBuffer to a String, and noticed the former also has a toString() method. However, that method is even slower than the String(byte[]) constructor.
So my question is: Does anyone of you know a better alternative to achieve what I am doing?
Perhaps you could skip the String conversion entirely? You could have constants holding byte arrays for your comparison values and check array-to-array instead of String-to-String.
Here's some quick code to illustrate. Currently you're doing something like this:
String http200 = "200";
// byte[] -> String conversion happens every time
String input = new String(ChannelBuffer.array());
return input.equals(http200);
Maybe this is faster:
// Ideally only convert String->byte[] once. Store these
// arrays somewhere and look them up instead of recalculating.
final byte[] http200 = "200".getBytes("UTF-8"); // Select the correct charset!
// Input doesn't have to be converted!
byte[] input = ChannelBuffer.array();
return Arrays.equals(input, http200);
Some of the checking you are doing might just look at part of the buffer. If you could use the alternate form of the String constructor:
new String(byteArray, startCol, length)
That might mean a lot less bytes get converted to a string.
Your example of looking for "200" within the message would be an example.
2
You might find that you can use the length of the byte array as a clue. If some messages are long and you are looking for a short one, ignore the long ones and don't convert to characters. Or something like that.
3
Along with what #EricGrunzke said, partially looking in the byte buffer to filter out some messages and find that you don't need to convert them from bytes to characters.
4
If your bytes are ASCII characters, the conversion to characters might be quicker if you use charset "ASCII" instead of whatever the default is for your server:
new String(bytes, "ASCII")
might be faster in that case.
In fact, you might be able to pick and choose the charset for conversion byte-character in some organized fashion that speeds up things.
Depending on what you are trying to do there are a few options:
If you are just trying to get the response status to then can't you just call getStatus()? This would probably be faster than getting the string out.
If you are trying to convert the buffer, then, assuming you know it will be ASCII, which it sounds like you do, then just leave the data as byte[] and convert your UUDecode method to work on a byte[] instead of a String.
The biggest cost of the string conversion is most likely the copying of the data from the byte array to the internal char array of the String, this combined with the conversion is most likely just a bunch of work that you don't need to do.
Given a protobuf serialization is it possible to get a list of all tag numbers that are in the message? Generally is it possible to view the structure of the message without the defining .proto files?
Most APIs will indeed have some form of reader-based API that allows you to enumerate a raw protobuf stream. However, that by itself is not enough to fully understand the data, since without the schema the interpretation is ambiguous:
a varint could be zig-zag encoded (sint32/sint64), or not (int32/int64/uint32/uint64) - radically changing the meaning, or a boolean, or an enum
a fixed-32/fixed-64 could be a signed or unsigned integer, or could be an IEEE754 float/double
a length-prefixed chunk could be a UTF-8 string, a BLOB, a sub-message, or a "packed" repeated set of primitives; if it is a sub-message, you'll have to repeat recursively
So... yes and no. Certainly you can get the field numbers of the outermost message.
Another approach would be to use the regular API against a type with no members (message Naked {}), and then query the unexpected data (i.e. all of it) via the "extension" API that many implementations provide.
You can get all the tag numbers which appear in one particular message, but you won't get any nested messages - and you won't know the types of those fields, only some subset of possible types.
If you look at the wire encoding, you can see that (for example) byte arrays, strings and nested messages are all encoded the same way - so you may know that "field 3 is a length-prefixed binary field" but you won't know whether that means it's a nested message, a string or a byte array.
It's probably a stupid question but here's the thing. I was reading this question:
Storing 1 million phone numbers
and the accepted question was what I was thinking: using a trie. In the comments Matt Ball suggested:
I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion
Problem: how do I do that in Java? And ASCII text does stand for String?
For in-memory storage as indicated in the question:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(
new GZIPOutputStream(baos), "US-ASCII");
for(String number : numbers){
out.write(number);
out.write('\n');
}
byte[] data = baos.toByteArray();
But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.
Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.
In answer to an implied, but different question;
Q: You have 1 billion phones numbers and you need to send these over a low bandwidth connection. You only need to send whether the phone number is in the collection or not. (No other information required)
A: This is the general approach
First sort the list if its not sorted already.
From the lowest number find regions of continuous numbers. Send the start of the region and the phones which are taken. This can be stored a BitSet (1-bit per possible number) Send the phone number at the start and the BitSet whenever the gap is more than some threshold.
Write the stream to a compressed data set.
Test this to compare with a simple sending of all numbers.
You can use Strings in a sorted TreeMap. One million numbers is not very much and will use about 64 MB. I don't see the need for a more complex solution.
The latest version of Java can store ASCII text efficiently by using a byte[] instead of a char[] however, the overhead of your data structure is likely to be larger.
If you need to store a phone numbers as a key, you could store them with the assumption that large ranges will be continous. As such you could store them like
NavigableMap<String, PhoneDetails[]>
In this structure, the key would define the start of the range and you could have a phone details for each number. This could be not much bigger than the reference to the PhoneDetails (which is the minimum)
BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory, in fact you can just discard it as it won't ever be needed.
Alot depending on what you want to do with the data and why you have it in memory at all.
You can Use DeflatorOutputStream to a ByteArrayOutputStream, which will be very small, but not very useful.
I suggest using DeflatorOutputStream as its more light weight/faster/smaller than GZIPOutputStream.
Java String are by default UTF-8 encoded, you have to change the encoding if you want to manipulate ASCII text.
According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I to create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
Is there an automatic way to apply C
padding in Java output? Or do I have
to go through compiler documentation
to see how it works (the compiler is
g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someChar);
bb.put(4, someInteger);
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?
With Java, the size of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seem to suggest that the output file will depend on the architecture.
you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
it can handle Big/Little endian binary data, alignment (padding) and various numeric types along other features. It is a very nice library, I like it very much
my 0.02$
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is,int len)
throws PrematureEndOfDataException
{
int n=0;
while (len-->0)
{
int i=is.read();
if (i==-1)
throw new PrematureEndOfDataException();
byte b=(byte) i;
n=(n<<8)+b;
}
return n;
}
You can alter the packing on the c side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.