I have a C process which needs to send a lot of C structs (roughly 10,000 per second) to a Java process that needs to put the data into a matching class.
The size of the data that needs to be sent will be around 50-100 bytes per packet.
Latency is a concern, as the data needs to be displayed in real time, which is why I am looking for alternatives to my current solution.
Currently I'm doing this with JNI and a POSIX message queue. Is there a better way than using JNI and message queues/pipes? I read somewhere that calling JNI methods often has a lot of overhead. Could this become a problem when a lot of data has to be sent?
Another solution I had in mind was to just write the data to a Unix domain socket and parse it in Java.
If you must eventually process the data in Java, then remove as many intermediate steps as possible.
If you can read the data directly into Java (bypassing JNI and the C code), then do so. Avoid the JNI, the message queue and (presumably) a stage where C receives the data. The queue can't be helping latency either.
If the data starts in a C-friendly form that is unfriendly to Java, then I'd consider switching entirely to C or C++ rather than processing it in Java at all.
You can achieve high throughput by avoiding unnecessary copying of memory. Transfer data between C and Java through direct byte buffers. From the Java side, you can then read the data in the byte buffers and copy values into the object fields. For two processes talking to each other, you could use a memory mapped file to do this (you would use a MappedByteBuffer for this).
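A minimal sketch of the Java side of the memory-mapped approach. It assumes (hypothetically) that the C process writes packed, fixed-size records of `int32_t id; int64_t timestamp; double value;` (20 bytes, no padding, native byte order) to a shared file; `Sample` and `MappedReader` are illustrative names. A real design also needs a way for the reader to know how many records are valid (for example a sequence counter that the writer updates last), which is omitted here.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java class mirroring a packed C struct:
//   struct sample { int32_t id; int64_t timestamp; double value; } __attribute__((packed));
class Sample {
    static final int SIZE = 4 + 8 + 8; // 20 bytes, assuming no padding
    int id;
    long timestamp;
    double value;
}

class MappedReader {
    // Maps `count` records of the shared file and copies each field into a Sample.
    static List<Sample> readRecords(Path file, int count) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, (long) count * Sample.SIZE);
            buf.order(ByteOrder.nativeOrder()); // the C writer uses the machine's native byte order
            List<Sample> out = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                Sample s = new Sample();
                s.id = buf.getInt();
                s.timestamp = buf.getLong();
                s.value = buf.getDouble();
                out.add(s);
            }
            return out;
        }
    }
}
```

The field reads come straight out of the mapped pages, so the only copy on the Java side is the one into the object fields.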
For something simpler but with a bit more overhead, you could simply use the stdin/stdout of each process to communicate and send data that way. Or, as you suggested, a socket is another option.
With 10,000 structs at 100 bytes each, this will be about 1 MB/second of data to process. This really shouldn't be a problem on modern hardware (in one of my projects I easily managed over 1 GB/second between direct buffers and Java objects with primitive and array fields, but that was all in one process, between JNI and Java).
You might want to try the simpler approach (stdin/stdout) first and see whether it is fast enough before digging into memory-mapped files.
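For the stdin/stdout route, a minimal sketch of the Java reader, assuming the C process writes packed, fixed-size records (a hypothetical `int32_t id; int64_t timestamp; double value;` layout, 20 bytes, native byte order) to its stdout. `PipeRecordReader` and `Handler` are illustrative names; the method takes an `InputStream` so it can be fed `System.in` or a socket's input stream.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class PipeRecordReader {
    // Hypothetical packed C struct: int32_t id; int64_t timestamp; double value; (20 bytes)
    static final int RECORD_SIZE = 4 + 8 + 8;

    interface Handler { void onRecord(int id, long timestamp, double value); }

    // Reads fixed-size records until the writer closes its end; returns the record count.
    static long readAll(InputStream raw, Handler handler) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(raw, 64 * 1024));
        byte[] record = new byte[RECORD_SIZE];
        // One reusable view over the record bytes, in the machine's native byte order.
        ByteBuffer view = ByteBuffer.wrap(record).order(ByteOrder.nativeOrder());
        long count = 0;
        while (true) {
            try {
                in.readFully(record);
            } catch (EOFException eof) {
                return count; // end of stream (a partial trailing record, if any, is dropped)
            }
            view.clear();
            handler.onRecord(view.getInt(), view.getLong(), view.getDouble());
            count++;
        }
    }
}
```

With the C program piped in (e.g. `./producer | java Consumer`, both names hypothetical), call `PipeRecordReader.readAll(System.in, handler)`. C's `stdout` is buffered when writing to a pipe, so the C side would need to `fflush` after each struct (or periodically) if latency matters.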
This, I think, can be solved using typical IPC methods. I would use pipes rather than a message queue or sockets. The overhead of a message queue is going to slow the processing, and pipes are faster than sockets, but not by much.
For your issue, 10,000 structs at 100 bytes per packet comes to 1 MB/s. A modern multicore processor will handle this without a problem.
This will be a somewhat abstract question, since I don't even know whether there are any developments like this.
Suppose we have an application that delivers text data from point A to point B.
A and B are quite far apart, so the size of the data has a significant effect on all the important metrics we want to optimize for (speed, latency and throughput). The first thing that comes to mind is compression, but compression is not very effective when we have to compress many small messages; it is very effective when the amount of data being compressed is large.
I have no experience with compression algorithms, but my understanding is that the bigger the input, the better the compression ratio can be, since there is a greater likelihood of repeated chunks and other things that can be optimized.
Another option is batching: wait for some period N, collect all the tiny messages and compress them into one big message. That would give a good compression ratio, but we would sacrifice latency, because the first message to arrive would be delayed unnecessarily by up to N.
The solution I'm looking for is something like this: when a compression algorithm traverses the data set, it presumably builds a dictionary of things that it knows can be optimized. This dictionary is thrown away every time we finish compressing, and it is always sent with the message to B.
rawMsg -> [dictionary|compressedPayload] -> send to B
However, if we could keep this dictionary in memory and send it only when it changes, we could efficiently compress even small messages and avoid sending the dictionary to the other end every time...
rawMsg -> compress(existingDictionaryOfSomeVersion, rawMsg) -> [dictionaryVersion|compressedPayload] -> send to B
Now, obviously, the assumption here is that B will also keep an instance of the dictionary and keep updating it when a newer version arrives.
Note that something like this already happens with protocols like protobuf or FIX (in financial applications).
With any message you have a schema (dictionary) that is available on both ends, and then you just send raw binary data, which is efficient and fast, but your schema is fixed and unchanging.
I'm looking for something that can be used for free form text.
Is there any technology that allows this (without having some fixed schema)?
You can simply send the many small messages in a single compressed stream. Then they will be able to take advantage of the history of the previous small messages. With zlib you can flush after each message, which avoids having to wait for a whole block to be built up before transmitting. This will degrade compression, but not nearly as much as trying to compress each string individually (which will likely just end up expanding them). In the case of zlib, your dictionary is always the last 32 KB of data that you have sent.
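A minimal sketch of that approach using the JDK's built-in zlib binding (`java.util.zip.Deflater`/`Inflater`) rather than the C zlib API. It assumes packets reach the receiver in order and none are lost (e.g. over one TCP connection), because each packet depends on the history of the ones before it. `StreamingCompressor` and `StreamingDecompressor` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sender: one long-lived Deflater for the whole connection, so each message can
// reference up to the last 32 KB of previously sent data.
class StreamingCompressor {
    private final Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    private final byte[] chunk = new byte[4096];

    byte[] compress(String msg) {
        deflater.setInput(msg.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n;
        // SYNC_FLUSH emits all output for this message now, without resetting the history.
        // Per the Deflater docs, repeat while the output buffer came back completely full.
        do {
            n = deflater.deflate(chunk, 0, chunk.length, Deflater.SYNC_FLUSH);
            out.write(chunk, 0, n);
        } while (n == chunk.length);
        return out.toByteArray();
    }
}

// Receiver: one long-lived Inflater, fed the packets in the order they were produced.
class StreamingDecompressor {
    private final Inflater inflater = new Inflater();
    private final byte[] chunk = new byte[4096];

    String decompress(byte[] packet) throws DataFormatException {
        inflater.setInput(packet);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n;
        while ((n = inflater.inflate(chunk)) > 0) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Sending the same or a similar message a second time produces a much smaller packet than the first, because the compressor can refer back to the earlier text. Both objects hold native zlib state, so call `end()` on them when the connection closes.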
I want to stream data over a network continuously. The source gives me a byte array that I want to store in a data structure which serves as a buffer to compensate for any network lag.
What is the most efficient data structure for storing the bytes in a queue-like fashion? Think of it as a pipe where one thread pumps in the data and another one reads it and sends it over the network, while the pipe itself is long enough to contain multiple frames of the input data.
Is Queue efficient enough?
A Queue would not be efficient if you put bytes in one at a time. It would eat lots of memory, create GC pressure, and slow things down.
You could make the overhead of Queues reasonable if you put reasonably-sized (say 64kB) byte[]s or ByteBuffers in them. That buffer size could be tunable and changed based on performance experiments or perhaps even be adaptive at runtime.
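A minimal sketch of that suggestion: a bounded `ArrayBlockingQueue` of chunk-sized `byte[]`s, where the producer fills a chunk before handing it over. `ChunkPipe`, the chunk size and the capacity are illustrative assumptions, not tuned values.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded "pipe" of byte chunks between a producer and a network-sender thread.
class ChunkPipe {
    private final BlockingQueue<byte[]> queue;
    private final int chunkSize;
    private byte[] current; // chunk being filled; touched only by the producer thread
    private int filled;

    ChunkPipe(int chunkSize, int maxChunks) {
        this.chunkSize = chunkSize;
        this.queue = new ArrayBlockingQueue<>(maxChunks);
        this.current = new byte[chunkSize];
    }

    // Producer: copies bytes into the current chunk and hands it over when full.
    // put() blocks when maxChunks are queued, so a slow network applies back-pressure
    // instead of letting memory grow without bound.
    void write(byte[] src, int off, int len) throws InterruptedException {
        while (len > 0) {
            int n = Math.min(len, chunkSize - filled);
            System.arraycopy(src, off, current, filled, n);
            filled += n;
            off += n;
            len -= n;
            if (filled == chunkSize) flush();
        }
    }

    // Producer: hands over whatever is in the current chunk (e.g. at the end of a frame).
    void flush() throws InterruptedException {
        if (filled == 0) return;
        if (filled == chunkSize) {
            queue.put(current);
            current = new byte[chunkSize]; // the consumer now owns the old array
        } else {
            queue.put(Arrays.copyOf(current, filled));
        }
        filled = 0;
    }

    // Consumer (network thread): blocks until a chunk is available.
    byte[] take() throws InterruptedException {
        return queue.take();
    }
}
```

For example, `new ChunkPipe(64 * 1024, 64)` would buffer up to 4 MB before the producer starts blocking.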
TCP already compensates for network lags. If you are using UDP then you will need to handle congestion properly or things will go badly. In practice using TCP or UDP directly creates a lot of extra work and reinvention of wheels.
ZeroMQ (or the pure Java JeroMQ) is a good library option with an efficient wire protocol (good enough for real-time stock trading platforms). It handles the queueing transparently and gives a lot of options for different client models, including things like PUB/SUB, which would help if you have lots of clients on a broadcast. Within a process, ZeroMQ can manage the queueing of data between producers and consumers. You could even use it to efficiently broadcast the same bytes to workers that do independent things with the same stream (e.g. one doing usage metering and another doing transcoding).
There are other libraries that may also work. I think Netty handles things like this efficiently for example.
You should look into the Okio library.
I want to make this a general question. If we have a program that reads data from outside the program, should we first put the data in a container and then manipulate it, or should we work directly on the stream, if the language's stream API is powerful enough?
For example, I am writing a program that reads from a text file. Should I first put the data in a string and then manipulate it, instead of working directly on the stream? I am using Java, and let's say its stream classes are powerful enough for my needs.
Stream processing is generally preferable to accumulating data in memory.
Why? One obvious reason is that the file you are reading might not even fit into memory. You might not even know the size of the data before you've read it completely (imagine that you are reading from a socket or a pipe rather than a file).
It is also more efficient, especially when the size isn't known ahead of time: allocating large chunks of memory and moving data around between them can be taxing. Things like processing and concatenating large strings aren't free either.
If the I/O is slow (ever tried reading from a tape?) or if the data is being produced in real time by a peer process (socket/pipe), your processing of the data can, at least in part, happen in parallel with reading, which will speed things up.
Stream processing is inherently easier to scale and parallelize if necessary, because your logic is forced to depend only on the current element being processed, so you are free from state. If the amount of data becomes too large to process sequentially, you can trivially scale your app by adding more readers and splitting the stream between them.
You might argue that none of this matters in your case, because the file you are reading is only 300 bytes. Indeed, for small amounts of data this is not crucial (you may as well bubble sort it while you are at it), but adopting good patterns and practices makes you a better programmer and will help when it does matter. There is no disadvantage to it. No, it does not make your code more complicated. It might seem so to you at first, but that's simply because you are not used to stream processing. Once you get into the right mindset and it becomes natural to you, you'll see that, if anything, code that deals with one small piece of data at a time, without caring about indexes, pointers and positions, is simpler than the alternative.
All of the above applies to sequential processing though. You read the stream once, processing the data immediately, as it comes in, and discarding it (or, perhaps, writing out to the next stream in pipeline).
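A small illustration of sequential processing in Java: the method below looks at one line at a time and keeps only a counter, so memory use is bounded by the longest line rather than by the size of the input. `LineStats` and the "count lines containing a substring" task are hypothetical stand-ins for whatever processing you actually do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class LineStats {
    // Streams through the input line by line; no line is kept after it is examined.
    static long countMatchingLines(Reader source, String needle) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(needle)) count++;
            }
        }
        return count;
    }
}
```

`LineStats.countMatchingLines(Files.newBufferedReader(path), "ERROR")` handles a file of any size, whereas `Files.readAllLines(path)` would first load the whole file into a `List<String>`.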
You mentioned RandomAccessFile ... that's a completely different beast. If you need random access, and the data fits in memory, put it in memory. Seeking the file back and forth is the same thing conceptually, only much slower. There is no benefit to it other than saving memory.
You should certainly process it as you receive it. The other way adds latency and doesn't scale.
I have a cluster of servers (potentially remote from each other) which all run Tomcat and communicate over HTTP using Apache HttpClient. A large number of these servers are data stores, and one of the servers is a front-facing webserver that serves as an intermediary between the client and the stores. A user should be able to upload a file to the webserver and the webserver will pass that file to a given number of stores.
So, the question: is it possible to take the file part of the upload from the client as an InputStream and write to multiple POST requests to the stores at the same time? If I were simply writing to local files, the obvious solution would simply be to read chunks of the InputStream into a byte array buffer and write from the buffer to each of the outputs in turn, but I'm at a loss as to how to convince HttpClient to "share" a stream like this.
And yes, I could simply read the entire InputStream into an object on the webserver and write it out to each store sequentially, but since I could potentially be accepting very large files I'd have to write the data to disk and then read it back for each store server, and the number of disk operations could swiftly become prohibitive. This is an implementation I'd prefer to avoid.
If the stores do not have the network bandwidth to keep up, how would it "share" the stream?
You can split up the incoming file and pass it on to the stores without writing it to disk, but if just one of the stores cannot keep up, you'll have to keep that file data in memory until it can accept it. If it's a big file, or many users, it can potentially take all your memory.
More technically what I mean is that you can create 5 threads that will send data as fast as they can to the stores and keep the file data in a shared FIFO structure. When the last thread has accessed a portion and sent that portion, that data can be removed from the data structure, but not before. If one is slow, the data structure can grow huge.
The data has to be somewhere, if not memory and not hard drive, then where?
So, keep the incoming data in memory until (if?) you're running out of memory (never?), then flush it to the hard drive. Keep trying to empty the data structure with the data by getting it sent to the stores and then removing.
You can rather easily code an ExecutorService to handle the re-transmit of data and cleaning up the data structure, but it won't solve the problem magically. :)
I haven't provided source code, because you don't seem to want this solution. You might need help implementing it later if you accept that you can't magically pass the data on without some chance of having to buffer it on the hard drive (or, a worse solution, throttling the user uploads to MinimumBandwidth(store1, store2, store3, store4, store5)).
Edit/changing:
I'm not sure you really want an ExecutorService, even though I said that. I would actually create my own custom Threads to handle this. I would create a Collection from the concurrent package, probably a LinkedBlockingQueue that holds byte arrays (not bytes, arrays of bytes). Then I would create a map from Thread->Integer that holds the current index of each thread's progress in passing on the data. When all progress numbers are above, say, 10 (meaning all threads have sent the first 10 chunks), I remove the first 10 byte arrays and subtract 10 from every thread's progress to reset it.
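A minimal sketch of that idea, with two deliberate variations: it uses a `synchronized` list plus one absolute read position per sender thread instead of a `LinkedBlockingQueue` and a `Thread`->`Integer` map, and it trims as soon as every sender has moved past a chunk rather than in batches of 10. `FanOutBuffer` is a hypothetical name; spilling to disk when memory runs low is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical shared FIFO of upload chunks read by several sender threads (one per store).
// Each sender has its own absolute read position; a chunk is dropped only after
// every sender has read past it.
class FanOutBuffer {
    private final List<byte[]> chunks = new ArrayList<>();
    private final long[] position; // next absolute chunk index for each sender
    private long base = 0;         // absolute index of chunks.get(0)
    private boolean closed = false;

    FanOutBuffer(int senders) { position = new long[senders]; }

    // Called by the thread reading the client upload.
    synchronized void append(byte[] chunk) {
        chunks.add(chunk);
        notifyAll();
    }

    // Called once the client upload is complete.
    synchronized void close() {
        closed = true;
        notifyAll();
    }

    // Returns the next chunk for this sender, waiting if it has caught up,
    // or null once the upload is complete and this sender has read everything.
    synchronized byte[] next(int sender) throws InterruptedException {
        while (position[sender] - base >= chunks.size()) {
            if (closed) return null;
            wait();
        }
        byte[] chunk = chunks.get((int) (position[sender] - base));
        position[sender]++;
        trim();
        return chunk;
    }

    // Number of chunks currently held in memory (grows if one store is slow).
    synchronized int buffered() { return chunks.size(); }

    // Drops the chunks that every sender has already read.
    private void trim() {
        long slowest = Arrays.stream(position).min().getAsLong();
        int done = (int) (slowest - base);
        if (done > 0) {
            chunks.subList(0, done).clear();
            base = slowest;
        }
    }
}
```

Each store's sender thread would loop on `next(myIndex)` and write the chunk to its HTTP request body until `next` returns `null`.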
Create your own output stream and attach as many HTTP POST clients to it as you need. When data is written to your output stream, send it to each of the connected POST clients.
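A minimal sketch of such a fan-out stream (`FanOutOutputStream` is a hypothetical name). Wiring each target to an HttpClient POST body is not shown; one option is a `PipedOutputStream` per store whose `PipedInputStream` is sent by a separate thread. Note that `write` returns only after every target has accepted the bytes, so the slowest store sets the pace, which is the buffering problem raised in the other answer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical fan-out stream: every byte written is forwarded to each target in turn.
class FanOutOutputStream extends OutputStream {
    private final List<OutputStream> targets;

    FanOutOutputStream(List<? extends OutputStream> targets) {
        this.targets = new ArrayList<>(targets);
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream t : targets) t.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        for (OutputStream t : targets) t.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream t : targets) t.flush();
    }

    // Closes every target even if one of them fails; rethrows the first failure.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (OutputStream t : targets) {
            try {
                t.close();
            } catch (IOException e) {
                if (first == null) first = e; else first.addSuppressed(e);
            }
        }
        if (first != null) throw first;
    }
}
```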
Can someone with the natural gift to explain complex things in an easy and straightforward way address this question? To acquire the best performance when should I use direct ByteBuffers versus regular ByteBuffers when doing network I/O with Java NIO?
For example: Should I read into a heap buffer and parse it from there, doing many get() (byte by byte) OR should I read it into a direct buffer and parse from the direct buffer?
To acquire the best performance when should I use direct ByteBuffers versus regular ByteBuffers when doing network I/O with Java NIO?
Direct buffers have a number of advantages:
They avoid an extra copy of data passed between Java and native memory.
If they are reused, only the pages actually used are turned into real memory. This means you can make them much larger than they need to be, and the excess only wastes virtual memory.
You can access multi-byte primitives in native byte order efficiently. (Basically one machine code instruction)
Should I read into a heap buffer and parse it from there, doing many get() (byte by byte) OR should I read it into a direct buffer and parse from the direct buffer?
If you are reading a byte at a time, you may not get much advantage. However, with a direct byte buffer you can read 2 or 4 bytes at a time and effectively parse multiple bytes at once.
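A small illustration, assuming the data was written in the machine's native byte order (as a C peer would write it): `getInt()` consumes four bytes per call, while the byte-at-a-time version has to reassemble each int by hand. `DirectParse` is a hypothetical name, and whether the direct buffer is actually faster for your workload is something to measure.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DirectParse {
    // Reads 4 bytes per call instead of 4 separate get() calls.
    static long sumInts(ByteBuffer buf) {
        buf.order(ByteOrder.nativeOrder());
        long sum = 0;
        while (buf.remaining() >= Integer.BYTES) {
            sum += buf.getInt();
        }
        return sum;
    }

    // The same result assembled one byte at a time, assuming the data is in native order.
    static long sumIntsByteByByte(ByteBuffer buf) {
        boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        long sum = 0;
        while (buf.remaining() >= 4) {
            int b0 = buf.get() & 0xFF, b1 = buf.get() & 0xFF, b2 = buf.get() & 0xFF, b3 = buf.get() & 0xFF;
            sum += little ? (b0 | b1 << 8 | b2 << 16 | b3 << 24)
                          : (b0 << 24 | b1 << 16 | b2 << 8 | b3);
        }
        return sum;
    }
}
```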
If you are parsing real time data, I would avoid using selectors. I have found using blocking NIO or busy waiting NIO can give you the lowest latency performance (assuming you have a relatively small number of connections e.g. up to 20)
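A minimal sketch of the busy-waiting variant: the channel is put in non-blocking mode and `read()` is called in a tight loop instead of registering the channel with a `Selector`. This keeps one CPU core fully busy per connection, which is why it only makes sense for a small number of connections. `BusyWaitReader` is a hypothetical name; it works with any selectable, readable channel such as a `SocketChannel`. For the blocking variant, leave the channel in blocking mode (the default for a new `SocketChannel`) and `read()` waits instead of returning 0.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;

class BusyWaitReader {
    // Spins on a non-blocking channel until at least `needed` more bytes are in `buf`.
    // No Selector is involved: the thread never sleeps, avoiding wake-up latency
    // at the cost of a busy CPU core.
    static <C extends SelectableChannel & ReadableByteChannel> int readAtLeast(
            C ch, ByteBuffer buf, int needed) throws IOException {
        if (needed > buf.remaining()) {
            throw new IllegalArgumentException("buffer too small for " + needed + " bytes");
        }
        ch.configureBlocking(false);
        int total = 0;
        while (total < needed) {
            int n = ch.read(buf); // returns 0 immediately if nothing has arrived yet
            if (n < 0) throw new EOFException("peer closed the channel");
            total += n;
        }
        return total;
    }
}
```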
A direct buffer is best when you are just copying the data, say from a socket to a file or vice versa, as the data doesn't have to traverse the JNI/Java boundary, it just stays in JNI land. If you are planning to look at the data yourself there's no point in a direct buffer.
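A minimal sketch of the copy-only case: one reusable direct buffer moves data from one channel to another, and this code never copies it into a Java `byte[]`. `ChannelCopy` is a hypothetical name; the loop assumes blocking channels.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

class ChannelCopy {
    // Copies everything from `in` to `out` through one reusable direct buffer.
    static long copy(ReadableByteChannel in, WritableByteChannel out) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        long total = 0;
        // Keep going until the source is exhausted AND the buffer has been drained.
        while (in.read(buf) != -1 || buf.position() > 0) {
            buf.flip();
            total += out.write(buf);
            buf.compact(); // keep any bytes the writer did not accept yet
        }
        return total;
    }
}
```

When one side is a `FileChannel`, `FileChannel.transferTo`/`transferFrom` may let the OS do the copy without any user-space buffer at all.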