I've had two java projects (simple multiplayer games) that relied on a byte-based connection-oriented protocol
for communication.
In both cases I was unhappy with the implementation of the communication, since I couldn't come up with an intelligent, non-verbose and object-oriented way of writing and especially parsing the bytes.
For writing I had something like
ProtocolDataUnitX pdux = new ProtocolDataUnitX("MyName", 2013);
byte[] bytes = pdux.getBytes();
out.write(bytes); // surrounded with try/catch etc.
That was acceptable to some extent, since I had an AbstractPDU class with some byte conversion convenience methods. But I had to define the getBytes() method
for every protocol data unit (pdu).
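To make the setup concrete, a getBytes() implementation looked roughly like this (the field layout and tag byte here are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical PDU layout: 1-byte type tag, 2-byte name length, name bytes, 4-byte int.
class ProtocolDataUnitX {
    private final String name;
    private final int year;

    ProtocolDataUnitX(String name, int year) {
        this.name = name;
        this.year = year;
    }

    byte[] getBytes() {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 2 + nameBytes.length + 4);
        buf.put((byte) 0x01);                  // PDU type tag (made up)
        buf.putShort((short) nameBytes.length); // length of the name field
        buf.put(nameBytes);
        buf.putInt(year);
        return buf.array();
    }
}
```

Every PDU class needs a hand-written method like this, which is the verbosity I want to avoid.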
My approach for parsing the incoming byte stream lacked even more innovation.
private InputStream in;
...
@Override
public void run() {
    try {
        int c;
        while ((c = in.read()) != -1) {
            if (c == 0x01) {
                // 0x01 means we have pdu #1 and can continue reading
                // since we know what is coming.
                // after we have all bytes and know the pdu
                // we can determine the parameters. I.e., every pdu has a
                // reverse constructor: bytes -> pdu
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
QUESTION
How do you handle these situations? What are the best practices here? Some protocols have the total length field encoded, some do not. Some protocol data units have variable length.
Is there a reasonable approach here? Maybe some kind of schema definition? I don't want to produce ugly and confusing code any longer for this matter.
Summary: best practice is to use an existing, mature protocol compiler. Google protobufs is a popular choice.
Over the years, many protocol definition systems have been developed. Most of these include compilers which take a protocol description and produce client and server code, often in multiple languages. The existence of such a compiler is very helpful in projects which are not restricted to a single client (or server) implementation, since it allows other teams to easily create their own clients or servers using the standard PDU definitions. Also, as you've observed, making a clean object-oriented interface is non-trivial, even in a language like Java which has most of the features you would need.
The question of whether PDUs should have explicit length or be self-delimiting (say, with an end-indicator) is interesting. There are a lot of advantages to explicit length: for one thing, it is not necessary to have a complete parser in order to accept the PDU, which can make for much better isolation of deserialization from transmission. If a transmission consists of a stream of PDUs, the explicit length field makes error recovery simpler, and allows early dispatch of PDUs to handlers. Explicit length fields also make it easier to embed a PDU inside another PDU, which is often useful, particularly when parts of the PDU must be encrypted.
On the other hand, explicit length fields require that the entire PDU be assembled in memory before transmission, which is awkward for large PDUs and impossible for streaming with a single PDU. If the length field itself is of variable length, which is almost always necessary, then it becomes awkward to create PDU components unless the final length is known at the start. (One solution to this problem is to create the serialized string backwards, but that is also awkward, and doesn't work for streaming.)
By and large, the balance has been in favour of explicit length fields, although some systems allow "chunking". A simple form of chunking is to define a maximum chunk size, and concatenate successive chunks with the maximum size along with the first following chunk with a size less than the maximum. (It's important to be able to specify 0-length chunks, in case the PDU is an even multiple of the maximum size.) This is a reasonable compromise; it allows streaming (with some work), but it's a lot more engineering effort and it creates a lot of corner cases which need to be tested and debugged.
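That chunking rule can be sketched as follows (a minimal illustration; the maximum chunk size is made up, and only the size computation is shown, not the actual I/O):

```java
import java.util.ArrayList;
import java.util.List;

class Chunker {
    static final int MAX_CHUNK = 4096; // made-up maximum chunk size

    // Returns the sizes of the chunks a payload would be split into.
    // Every chunk except the last has size MAX_CHUNK; the last is smaller,
    // including a 0-length chunk when the payload is an exact multiple,
    // so the receiver always knows where the PDU ends.
    static List<Integer> chunkSizes(int payloadLength) {
        List<Integer> sizes = new ArrayList<>();
        int remaining = payloadLength;
        while (remaining >= MAX_CHUNK) {
            sizes.add(MAX_CHUNK);
            remaining -= MAX_CHUNK;
        }
        sizes.add(remaining); // final short (possibly 0-length) chunk terminates the PDU
        return sizes;
    }
}
```

The corner cases mentioned above show up exactly here: the terminating 0-length chunk is easy to forget, and the receiver must handle it correctly.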
One important maxim in designing PDU formats is that every option is a potential information leak. To the extent possible, try to make any given internal object have only a single possible serialization. Also, remember that redundancy has a cost: anywhere there is duplication, it implies a test for validity. Keeping tests to a minimum is the key to efficiency, particularly on deserialization. Skipping a validity test is an invitation to security attacks.
In my opinion, making an ad hoc protocol parser is not usually a good idea. For one thing, it's a lot of work. For another, there are lots of subtle issues, and it's better to use a system which has dealt with them.
While I'm personally a fan of ASN.1, which is widely used particularly in the telecommunications industry, it is not an easy technology to fit into a small project. The learning curve is pretty steep and there are not as many open-source tools as one might like.
Currently, probably the most popular option is Google protobufs, which is available for C++, Java and Python (and a number of other languages through contributed plugins). It's simple, reasonably easy to use, and open source.
Related
I have a large file and I want to send it over the network to multiple consumers via a pub/sub method. For that purpose, I chose to use jeromq. The code is working, but I am not happy because I want to optimize it. The file is larger than 2 GB. My question now is: if I send a compressed file, for example with gzip, to the consumers, will the performance improve, or does the compress/decompress step introduce additional overhead so that the performance will not be improved? What do you think?
In addition to compression, is there any other technique to use?
For example, use erasure coding: split the data into chunks, send the chunks to consumers, and then have them communicate with each other to retrieve the original file.
(Maybe my second idea is stupid, or I have misunderstood something; please give me your directions.)
If I send a compressed file, for example with gzip, to the consumers, will the performance improve, or does the compress/decompress step introduce additional overhead so that the performance will not be improved?
Gzip has a relatively good compression ratio (though not the best), but it is slow. In practice, it is so slow that the network interconnect can be faster than compressing and decompressing the data stream; gzip is only fine for relatively slow networks. There are faster compression algorithms, though generally with a lower compression ratio. LZ4, for example, is very fast at both compressing and decompressing a data stream in real time. There is a catch though: the compression ratio depends strongly on the kind of file being sent. Binary files or already-compressed data will barely shrink, especially with fast algorithms like LZ4, so compression will not be worth it. For text-based data, or data with repeated patterns (or a reduced range of byte values), compression can be very useful.
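To get a feel for the speed/ratio trade-off on your own data, you can benchmark the JDK's Deflater at different levels as a stand-in (LZ4 itself requires a third-party library such as lz4-java, which is not shown here):

```java
import java.util.zip.Deflater;

class CompressionTradeoff {
    // Compress the input at the given level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length * 2 + 64]; // generous buffer for one pass
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }
}
```

Comparing Deflater.BEST_SPEED against Deflater.BEST_COMPRESSION on a sample of your actual file (and timing both) tells you whether the ratio gain is worth the extra CPU on your network.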
For example, use erasure coding: split the data into chunks, send the chunks to consumers, and then have them communicate with each other to retrieve the original file.
This is a common broadcast algorithm in distributed computing. This method is used in distributed hash table algorithms and also in MPI implementations (massively used on supercomputers). For example, MPI uses a tree-based broadcast method when the number of machines or the amount of data is relatively big. Note that this method introduces additional latency overheads, which should be small in your case unless the network has a very high latency. Distributed hash tables use more complex algorithms, since they generally assume a node can fail at any time (so they use dynamic adaptation) and cannot be fully trusted (so they use checks like hashes, and sometimes even fetch data from multiple sources to avoid malicious injection from specific machines). They also make no assumptions about the network structure/speed (resulting in imbalanced download speeds).
I want to stream protobuf messages onto a file.
I have a protobuf message
message car {
... // some fields
}
My java code would create multiple objects of this car message.
How should I stream these messages onto a file?
As far as I know there are 2 ways of going about it.
Have another message like cars
message cars {
repeated car c = 1;
}
and make the java code create a single cars type object and then stream it to a file.
Just stream the car messages onto a single file appropriately using the writeDelimitedTo function.
I am wondering which is the more efficient way to go about streaming using protobuf.
When should I use pattern 1 and when should I be using pattern 2?
This is what I got from https://developers.google.com/protocol-buffers/docs/techniques#large-data
I am not clear on what they are trying to say.
Large Data Sets
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages
within a large data set. Usually, large data sets are really just a
collection of small pieces, where each small piece may be a structured
piece of data. Even though Protocol Buffers cannot handle the entire
set at once, using Protocol Buffers to encode each piece greatly
simplifies your problem: now all you need is to handle a set of byte
strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data
sets because different situations call for different solutions.
Sometimes a simple list of records will do while other times you may
want something more like a database. Each solution should be developed
as a separate library, so that only those who need it need to pay the
costs.
Have a look at the previous question. Any difference in size and time will be minimal
(option 1 possibly faster, option 2 smaller).
My advice would be:
Option 2 for big files. You process message by message.
Option 1 if multiple languages are needed. In the past, delimited was not supported in all languages; this seems to be changing though.
Otherwise, personal preference.
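Option 2 is essentially length-prefixed framing. To see the underlying idea without protobuf, here is a hand-rolled sketch with a fixed 4-byte length prefix (protobuf's writeDelimitedTo uses a varint length instead, so the wire format differs):

```java
import java.io.*;

class Framing {
    // Write one message preceded by its length.
    static void writeFrame(DataOutputStream out, byte[] message) throws IOException {
        out.writeInt(message.length); // length prefix
        out.write(message);
    }

    // Read one length-prefixed message back.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] message = new byte[len];
        in.readFully(message);
        return message;
    }
}
```

With this structure you can append messages to a file one at a time and read them back one at a time, which is exactly why option 2 suits big files.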
Quick design question: I need to implement a form of communication between a client-server network in my game-engine architecture in order to send events between one another.
I had opted to create event objects and as such, I was wondering how efficient it would be to serialize these objects and pass them through an object stream over the simple socket network?
That is, how efficient is it comparatively to creating a string representation of the object, sending the string over via a char stream, and parsing the string client side?
The events will be sent every game loop, if not more; but the event object itself is just a simple wrapper for a few java primitives.
Thanks for your insight!
(tl;dr - are object streams over networks efficient?)
If performance is the primary issue, I suggest using Protocol Buffers over both your own custom serialization and Java's native serialization.
Jon Skeet gives a good explanation as well as benchmarks here: High performance serialization: Java vs Google Protocol Buffers vs ...?
If you can't use PBs, I suspect Java's native serialization will be more optimized than manually serializing/deserializing from a String. Whether or not this difference is significant is likely dependent on how complex of an object you're serializing. As always, you should benchmark to confirm your predictions.
The fact that you're sending things over a network shouldn't matter.
Edit: For time-critical applications, Protocol Buffers appear to be a better choice. However, it appears to me that there is a significant increase in development time. Effectively you'll have to code every exchange message twice: once as a .proto file which is compiled and spits out Java wrappers, and once as a POJO which makes something useful out of these wrappers. But that's guessing from the documentation.
End of Edit
Abstract: Go for the Object Stream
So, which takes less time? Coding the object, sending the byte stream, and decoding it - all by hand - or doing the same via the trusty and tried serialization mechanism?
You should make sure the objects you send are as small as possible. This can be achieved with enum values, lookup tables and such, where possible. Might shave a few bytes off each transmission. The serialization algorithm appears very speedy to me, and anything you would code would do exactly the same. When you reinvent the wheel, more often than not you end up with triangles.
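For reference, a round trip through Java's native serialization might look like this (the event class is a made-up stand-in for your wrapper of a few primitives; it goes through a byte array here, but over a socket you would wrap the socket streams instead):

```java
import java.io.*;

// A simple event wrapper around a few primitives, as described in the question.
class GameEvent implements Serializable {
    private static final long serialVersionUID = 1L;
    final int type;
    final double x, y;

    GameEvent(int type, double x, double y) {
        this.type = type;
        this.x = x;
        this.y = y;
    }
}

class SerializationDemo {
    static byte[] serialize(GameEvent e) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(e);
        }
        return baos.toByteArray();
    }

    static GameEvent deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (GameEvent) ois.readObject();
        }
    }
}
```

Note that ObjectOutputStream writes a stream header and per-class metadata, which is part of the overhead Protocol Buffers avoids.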
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + small wrapper; gzip: deflate + slightly larger wrapper; zip: deflate + even larger wrapper), bzip2 (I doubt this would be that good here, as it works best with a relatively large window), and perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput using several different approaches. Java has both GZIPOutputStream ("deflate in gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
If there is regularity between the packets then it is possible this could be utilized -- e.g. build cache mappings, Huffman tables, or just modify the "windows" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.
Also, the protocol should likely have a simple "fall back" of zero compression, because some [especially small random] data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but also has the "wrapper overhead", so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account for such small sizes. This is especially important to consider for string data of less than ~100 characters.
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings which suggests SMAZ, but I do not know how this algorithm will transfer to unicode / binary.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd party's (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
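The fall-back idea from the quote can be sketched with a one-byte flag (the flag values and class name here are made up, and plain deflate stands in for whatever compressor you pick):

```java
import java.io.*;
import java.util.zip.DeflaterOutputStream;

class MaybeCompress {
    static final byte RAW = 0;      // made-up flag values
    static final byte DEFLATED = 1;

    // Returns a flag byte followed by whichever payload is smaller:
    // the deflated bytes, or the original bytes untouched.
    static byte[] encode(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(baos)) {
            dos.write(input);
        }
        byte[] compressed = baos.toByteArray();
        boolean useCompressed = compressed.length < input.length;
        byte[] payload = useCompressed ? compressed : input;
        byte[] result = new byte[payload.length + 1];
        result[0] = useCompressed ? DEFLATED : RAW;
        System.arraycopy(payload, 0, result, 1, payload.length);
        return result;
    }
}
```

The receiver inspects the flag byte and either inflates or uses the payload as-is, so small incompressible packets cost only one extra byte.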
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
    zos.write(someText.getBytes(StandardCharsets.UTF_8));
}
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. it builds a data dictionary based on the data it has seen.
For this reason you can find that
small packets may be smaller or the same size with no compression.
a simple application/protocol specific compression is better
you have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible.
I usually go with the second option for small data packets.
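The first point is easy to demonstrate: gzip alone adds an 18-byte fixed header and trailer, so a small packet can come out larger than it went in. A quick sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

class SmallPacketDemo {
    // Returns the size of the input after gzip compression,
    // including gzip's header and trailer overhead.
    static int gzippedSize(byte[] input) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(input);
        }
        return baos.size();
    }
}
```

For a 5-byte payload the gzipped output is several times larger than the original, which is exactly why a protocol-specific scheme or a prebuilt dictionary wins for small packets.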
A good compression algorithm for short strings/URLs is an LZW implementation; this one is in Java and can be easily ported for client-side GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
Use a 9-bit code word length for small strings (though you may experiment to see which is better). The compression ratio ranges from 1 (very small strings; the compressed output is no larger than the original string) down to 0.5 (larger strings).
For client-side GWT with other code word lengths, it was necessary to adjust input/output processing to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated for JS.
I'm using it for complex URL parameter encoding in client-side GWT, together with base64 encoding and AutoBean serialization to JSON.
Update: a base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. change the following characters:
'+' -> '-'
'/' -> '~'
'=' -> '_'
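As a side note, on Java 8+ the built-in java.util.Base64 already offers a URL-safe variant (it maps '+' to '-' and '/' to '_', and padding can be dropped), which may spare you the manual character swapping, though its alphabet differs from the mapping above:

```java
import java.util.Base64;

class UrlSafeBase64 {
    // Encode without '+', '/', or '=' characters, safe for URL parameters.
    static String encode(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    static byte[] decode(String s) {
        return Base64.getUrlDecoder().decode(s);
    }
}
```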
I did some quick searching on the site and couldn't seem to find the answer I was looking for. That being said, what are some best practices for passing large XML files across a network? My thoughts on the matter are to stream chunks across the network in manageable segments; however, I am looking for other approaches and best practices for this. I realize that large is a relative term, so I will let you choose an arbitrary value to be considered large.
In case there is any confusion the question is "What are some best practices for sending large xml files across networks?"
Edit:
I am seeing a lot of compression being talked about; is there any particular compression algorithm that could be utilized, and what about decompressing said files? I do not have much desire to roll my own when I am aware there are proven algorithms out there. Also, I appreciate the responses so far.
Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control over both the client and server sides, is WBXML (WAP Binary XML Spec).
This spec defines how to convert the XML into a binary format which is not only compact, but also easy-to-parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside to this spec is that an application token table should exist on both sides which is a statically-defined code table to hold binary values for all possible tags and attributes in an application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most of the applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.
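The token-table concept can be illustrated with a toy example (this is not real WBXML; the tag names and byte codes are made up, and a real table would be defined by the application spec and shared by both sides):

```java
import java.util.Map;

class TokenTable {
    // Statically defined table, agreed on by client and server ahead of time.
    // Each well-known tag name is replaced on the wire by a single byte.
    static final Map<String, Byte> TAGS = Map.of(
        "contact", (byte) 0x05,
        "name",    (byte) 0x06,
        "email",   (byte) 0x07
    );

    static byte tokenFor(String tag) {
        Byte code = TAGS.get(tag);
        if (code == null) throw new IllegalArgumentException("unknown tag: " + tag);
        return code;
    }
}
```

Because the table is static, the receiver can map bytes back to tags with a simple lookup instead of running a general-purpose decompressor.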
For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:
This feature provides a means to synchronize an object whose size exceeds that which can be transmitted within one message (e.g. the maximum message size – declared in MaxMsgSize
element – that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and by sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will be sent. Every subsequent chunk is sent with a MoreData tag, except from the last one.
Depending on how large it is, you might want to consider compressing it first. This, of course, depends on how often the same data is sent and how often it's changed.
To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.
Compression is an obvious approach. This XML bugger will shrink like there is no tomorrow.
If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.
Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?
For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line-by-line if you can guarantee that the XML will not have any linefeeds within tags.
If you have code that can digest the XML a node-at-a-time, then spit it out a node-at-a-time, using something like Transfer-Encoding: chunked. You write the length of the chunk (in hex) followed by the chunk, then another chunk, or "0\n" at the end. To save bandwidth, gzip each chunk.
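That scheme can be sketched as follows (a minimal writer; the newline framing follows the description above rather than HTTP's exact CRLF format, and the per-chunk gzip step is omitted for brevity):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedWriter {
    // Write one chunk: its length in hex, a newline, the bytes, a newline.
    static void writeChunk(OutputStream out, byte[] chunk) throws IOException {
        out.write(Integer.toHexString(chunk.length).getBytes(StandardCharsets.US_ASCII));
        out.write('\n');
        out.write(chunk);
        out.write('\n');
    }

    // Terminate the stream with a zero-length chunk marker.
    static void finish(OutputStream out) throws IOException {
        out.write('0');
        out.write('\n');
    }
}
```

Each serialized XML node would be passed to writeChunk as it is produced, so neither side ever holds the whole document in memory.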