I want to stream data over the network continuously. The source gives me a byte array that I want to store in a data structure which serves as a buffer to compensate for any network lags.
What is the most efficient data structure to store the bytes in a queue fashion? Think of it as a pipe where one thread pumps in the data and the other reads it and sends it over the network, while the pipe itself is long enough to contain multiple frames of the input data.
Is a Queue efficient enough?
A Queue would not be efficient if you put bytes in one at a time. It would eat lots of memory, create GC pressure, and slow things down.
You could make the overhead of Queues reasonable if you put reasonably-sized (say 64kB) byte[]s or ByteBuffers in them. That buffer size could be tunable and changed based on performance experiments or perhaps even be adaptive at runtime.
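A minimal sketch of that idea, using a bounded BlockingQueue of ByteBuffer chunks (the class name, chunk size, and capacity are just placeholders):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A bounded pipe of 64 KB chunks: the producer blocks when the pipe
    // is full, which also gives you natural back-pressure on the source.
    public class ChunkPipe {
        private static final int CHUNK_SIZE = 64 * 1024;   // tunable
        private final BlockingQueue<ByteBuffer> queue;

        public ChunkPipe(int maxChunks) {
            this.queue = new ArrayBlockingQueue<>(maxChunks);
        }

        // Producer thread: copy the incoming bytes into one or more chunks.
        public void write(byte[] data, int off, int len) throws InterruptedException {
            while (len > 0) {
                int n = Math.min(len, CHUNK_SIZE);
                ByteBuffer chunk = ByteBuffer.allocate(n);
                chunk.put(data, off, n).flip();
                queue.put(chunk);                          // blocks if pipe is full
                off += n;
                len -= n;
            }
        }

        // Consumer thread: take the next chunk and send it over the network.
        public ByteBuffer take() throws InterruptedException {
            return queue.take();                           // blocks if pipe is empty
        }
    }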
TCP already compensates for network lags. If you are using UDP then you will need to handle congestion properly or things will go badly. In practice using TCP or UDP directly creates a lot of extra work and reinvention of wheels.
ZeroMQ (or the pure-Java JeroMQ) is a good library option with an efficient wire protocol (good enough for realtime stock trading platforms). It handles the queueing transparently and gives a lot of options for different client models, including things like PUB/SUB that would help if you have lots of clients on a broadcast. Within a process, ZeroMQ can manage the queueing of data between producers and consumers. You could even use it to efficiently broadcast the same bytes to workers that do independent things with the same stream (e.g. one doing usage metering and another doing transcoding).
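As a rough illustration, here is what an in-process PUSH/PULL pair might look like with JeroMQ (the inproc endpoint name is arbitrary; real code would run producer and consumer on separate threads):

    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    public class ZmqPipeSketch {
        public static void main(String[] args) {
            try (ZContext ctx = new ZContext()) {
                // PUSH/PULL pair over inproc: ZeroMQ queues frames internally,
                // so producer and consumer never share a buffer directly.
                ZMQ.Socket push = ctx.createSocket(SocketType.PUSH);
                push.bind("inproc://frames");

                ZMQ.Socket pull = ctx.createSocket(SocketType.PULL);
                pull.connect("inproc://frames");

                push.send(new byte[]{1, 2, 3}, 0);   // producer side
                byte[] frame = pull.recv(0);         // consumer side
                System.out.println(frame.length + " bytes received");
            }
        }
    }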
There are other libraries that may also work. I think Netty handles things like this efficiently for example.
You should look into the Okio library.
Related
I need to have lots of network connections open at the same time(!) and transfer data as fast as possible. Thousands of connections. Right now, I have one thread for each connection, reading char-wise from the InputStream of that connection.
And I have the strong suspicion that the CPU/switching between the thousands of threads might impose some performance problems here even though the servers are really slow (low two-digit KB/s), since I've observed that the throughput isn't even close to being proportional to the number of threads.
Therefore I'd like to ask some programmers experienced in parallel programming:
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round-robin-like fashion? Would that, if there is a speedup, be worth the programming? How many connections per thread? Or do you have another idea for reading really, really fast from multiple network input streams?
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
reading char-wise
You know data is transmitted in packets right? Reading a single character at a time is very inefficient. Each read has to traverse all the layers from your program to the network stack in the operating system. You should try to read one full segment of data at a time.
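For example, something along these lines (handle() is a hypothetical placeholder for your processing):

    import java.io.IOException;
    import java.io.InputStream;

    class BulkReader {
        // Drain the stream in large chunks: one read() system call can
        // return kilobytes at a time instead of a single character.
        static void copy(InputStream in) throws IOException {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                handle(buf, n);   // hypothetical handler for the n bytes read
            }
        }

        static void handle(byte[] data, int len) {
            // process the bytes here
        }
    }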
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
That's why the operating system has a buffer for incoming data; its free space is what TCP advertises to the sender as the receive window. When TCP segments arrive, they are put into the receive buffer. When your program requests to read from the socket, the operating system returns data from the receive buffer. If the receive buffer is full, the sender must stop sending (the advertised window shrinks to zero), and any segment that arrives with no buffer space left is dropped and has to be retransmitted.
For more about how TCP works, see https://beej.us/guide/bgnet/
Wikipedia is pretty good but fairly dense
https://en.wikipedia.org/wiki/Transmission_Control_Protocol
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round robin like fashion? Would that, if there is a speedup, be worth the programming?
What you're describing would require moving from blocking I/O to non-blocking I/O. Non-blocking will require fewer system resources, but it is significantly harder to implement correctly and efficiently. So don't do it unless you have a pressing reason.
Thousands of threads (and their stacks...) are probably too many for the OS scheduler, memory management units, caches...
You need just a few threads (one per CPU), each running a select()-based loop.
Have a look at Selector, ServerSocketChannel and SocketChannel.
(see pages 30-31 of https://www.enib.fr/~harrouet/Data/Courses/Memo_Sockets.pdf)
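A bare-bones sketch of such a select()-based loop (the port and buffer size are arbitrary; error handling is omitted):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class SelectorLoop {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocate(32 * 1024);
            while (true) {
                selector.select();                     // blocks until something is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        buf.clear();
                        int n = ch.read(buf);          // non-blocking read
                        if (n == -1) {                 // peer closed the connection
                            key.cancel();
                            ch.close();
                        }
                        // else: n bytes are now in buf for processing
                    }
                }
            }
        }
    }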
Edit (after a question in the comments)
Selector is not just a clever algorithm encapsulated in a class. It relies internally on the select() system call (or one of its many equivalents).
The operating system is aware of a set of file descriptors (communication means) it has to watch and, as soon as something happens on one (or several) of them, it wakes up the process (or thread) that is blocked on this selector.
The idea is to stay blocked as long as possible (to save resources) and to be woken up only when something useful has to be done with incoming (there are variants) data.
In your current implementation, you use thousands of threads which are all blocked on a read()/recv() operation because you cannot know beforehand which connection will be the next one to deliver something. With a select()-based implementation, on the other hand, a single thread can be blocked watching many connections at the same time but will only react to handle the few which have just delivered new data.
So I suggest that you start a pool of a few threads (one per CPU, for example) and, as soon as the main program accepts a new incoming connection, choose one of them (you can keep a connection count for each) to put it in charge of this new connection.
All of this requires proper synchronisation, of course, and probably a trick (a special file descriptor registered with the selector, for example) in order to wake up a blocked thread when it is assigned a new connection.
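As a sketch, NIO's Selector.wakeup() can play the role of that special file descriptor: the acceptor hands the new channel to a worker through a queue and then wakes its selector (the class and method names here are illustrative):

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // One instance per worker thread. The acceptor calls assign();
    // wakeup() pops the worker out of select() so it can register
    // the newly assigned channel.
    class ReaderThread implements Runnable {
        private final Selector selector;
        private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();

        ReaderThread() throws IOException {
            this.selector = Selector.open();
        }

        void assign(SocketChannel ch) {
            pending.add(ch);
            selector.wakeup();          // interrupts the blocking select() below
        }

        @Override
        public void run() {
            try {
                while (true) {
                    selector.select();
                    SocketChannel ch;
                    while ((ch = pending.poll()) != null) {
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    }
                    // ... iterate selectedKeys() and read, as in the loop above ...
                    selector.selectedKeys().clear();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }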
I have a C process which needs to send a lot of C structs (roughly 10,000 per second) to a Java process that needs to put the data into a matching class.
The size of the data that needs to be sent will be around 50-100 bytes per packet.
Latency is a concern as the data needs to be displayed in real time, which is why I am looking for alternatives to my current solution.
Currently I'm doing this with JNI and a POSIX message queue. Is there a better way than using JNI and message queues/pipes? I read somewhere that calling JNI methods often has a lot of overhead. Could this become a problem when a lot of data has to be sent?
Another solution I had in mind was to just write the data to a UNIX socket and parse it in Java.
If you must process the data eventually using Java, then remove as many intermediate steps as possible.
If you can read the data directly into Java ( bypassing JNI and the C code ), then do so. Avoid the JNI, message queue and (presumably) a stage where C receives the data. The queue can't be helping latency either.
If the data starts in a C-friendly form that is unfriendly to Java, then I'd consider switching entirely to C or C++ rather than processing in Java at all.
You can achieve high throughput by avoiding unnecessary copying of memory. Transfer data between C and Java through direct byte buffers. From the Java side, you can then read the data in the byte buffers and copy values into the object fields. For two processes talking to each other, you could use a memory mapped file to do this (you would use a MappedByteBuffer for this).
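A minimal sketch of the Java side of such a memory-mapped exchange (the file path and record layout are hypothetical, and a real implementation still needs a coordination protocol with the C side, e.g. a write index the producer advances):

    import java.io.IOException;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Java side of a shared-memory exchange: map the same file the C
    // process writes to, and decode each record straight from memory.
    public class StructReader {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("/tmp/structs.bin"),
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
                buf.order(ByteOrder.LITTLE_ENDIAN);   // must match the C side

                // Hypothetical record starting with two int fields.
                int id = buf.getInt();
                int value = buf.getInt();
                System.out.println(id + " -> " + value);
            }
        }
    }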
For something simpler but with a bit more overhead, you could simply use the stdin/stdout of each process to communicate and send data that way. Or, as you suggested, a socket is another option.
With 10,000 structs at 100 bytes each, this will be 1 MB / second of data processed. This really shouldn't be a problem on modern hardware (for one of my projects I have managed easily over 1GB / second between direct buffers and Java objects with primitive + array fields, but this was all in one process between JNI and Java).
You might want to try the simpler approach first (stdin/stdout) and see if that is fast enough before digging into memory-mapped files.
This, I think, can be solved using typical IPC methods. I would use pipes over message queues and sockets. The overhead of a message queue is going to slow the processing, and pipes are faster than sockets, but not by much.
For your issue, 10,000 structs at 100 bytes per packet comes to 1 MB/s. A modern multicore processor will handle this without a problem.
Is there any open source implementation of a "refillable" queue in Java?
Essentially, such a queue would be implemented as a class which reads data from a source and stores it in its memory buffer, which is replenished every time the number of buffered items falls below a predefined threshold. Therefore, it requires:
An in memory buffer to hold the data.
An input source to fill the buffer whenever it goes below threshold.
JMS queues or any other messaging system which uses network serialization are not suitable, for performance reasons.
The scenario is trivial and easy to implement, but if there is a library that offers this functionality already, there is no need to reinvent it.
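For reference, the trivial implementation might look like this (Supplier stands in for whatever your input source is; a thread-safe version would need locking around poll/refill):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.function.Supplier;

    // Minimal sketch of the "refillable" queue described above: whenever
    // the in-memory buffer drops below the threshold, pull more items
    // from the source until the buffer is back at capacity.
    public class RefillableQueue<T> {
        private final Queue<T> buffer = new ArrayDeque<>();
        private final Supplier<T> source;     // hypothetical input source
        private final int threshold;
        private final int capacity;

        public RefillableQueue(Supplier<T> source, int threshold, int capacity) {
            this.source = source;
            this.threshold = threshold;
            this.capacity = capacity;
            refill();
        }

        public T poll() {
            T item = buffer.poll();
            if (buffer.size() < threshold) {
                refill();
            }
            return item;
        }

        private void refill() {
            while (buffer.size() < capacity) {
                T next = source.get();
                if (next == null) break;      // source exhausted
                buffer.add(next);
            }
        }
    }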
RabbitMQ is a message broker. In essence, it accepts messages from producers, and delivers them to consumers. In-between, it can route, buffer, and persist the messages and data according to rules you give it.
You can also use Google Guava
I have a binary protocol with some unknown amount of initial header data (unknown length until header is fully decoded) followed by a stream of data that should be written to disk for later processing.
I have an implementation that decodes the header and then writes the file data to disk as it comes in from one or more frames in ChannelBuffers (so my handler subclasses FrameDecoder, and builds the message step by step, not waiting for one ChannelBuffer to contain the entire message, and writing the file data to disk with each frame). My concern is whether this is enough, or if using something like ChunkedWriteHandler does more than this and is necessary to handle large uploads.
Is there a more optimal way of handling file data than just writing it directly to disk from the ChannelBuffer with each frame?
It should be enough as long as the throughput is good enough. Otherwise, you might want to buffer the received data so that you don't make system calls too often (say, 32KiB bounded buffer) when the amount of received data is not large enough.
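In plain Java terms, the buffering idea amounts to something like this (the class and file name are illustrative):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    class BufferedSink implements AutoCloseable {
        private final OutputStream out;

        BufferedSink(String path) throws IOException {
            // 32 KiB buffer: roughly one write() system call per 32 KiB
            // instead of one per received frame.
            out = new BufferedOutputStream(new FileOutputStream(path), 32 * 1024);
        }

        void onFrame(byte[] frame) throws IOException {
            out.write(frame);   // usually just a memory copy into the buffer
        }

        @Override
        public void close() throws IOException {
            out.close();        // flushes any remaining buffered bytes
        }
    }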
Netty could be even faster if it exposed the transferTo/From operation to a user, but such a feature is not available yet.
You should also think about adding an ExecutionHandler in front of your handler. This will help you avoid getting blocked by the disk I/O. Otherwise you may see slowdowns on heavy disk access.
I'm working on an online game and I've hit a little snag while working on the server side of things.
When using nonblocking sockets in Java, what is the best course of action to handle complete packet data sets that cannot be processed until all the data is available? For example, sending a large 2D tiled map over a socket.
I can think of two ways to handle it:
Allocate the ByteBuffer large enough to handle the complete data set needed to process a large 2D tiled map from my example. Continue adding read data to the buffer until it has all been received, and process from there.
If the ByteBuffer is a smaller size (perhaps 1500 bytes), do subsequent reads and write the data out to a file until it can be processed completely from the file. This would avoid having large ByteBuffers, but degrades performance because of disk I/O.
I'm using a dedicated ByteBuffer for every SocketChannel so that I can keep reading in data until it's complete for processing. The problem is if my 2D Tiled Map amounts to 2MB in size, is it really wise to use 1000 2MB ByteBuffers (assuming 1000 is a client connection limit and they are all in use)? There must be a better way that I'm not thinking of.
I'd prefer to keep things simple, but I'm open to any suggestions and appreciate the help. Thanks!
Probably, the best solution for now is to use the full 2MB ByteBuffer and let the OS take care of paging to disk (virtual memory) if that's necessary. You probably won't have 1000 concurrent users right away, and when you do, you can optimize. You may be surprised what your real performance issues are.
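If allocation does become a problem later, one middle ground is to allocate the large buffer lazily, per message, once its size is known; a sketch assuming a hypothetical protocol with a 4-byte length prefix (EOF and error handling omitted):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    // Per-connection state: read a 4-byte length prefix first, then
    // allocate a buffer of exactly that size, so a 2 MB buffer only
    // exists for clients that are actually mid-transfer of a large map.
    class MessageReader {
        private final ByteBuffer header = ByteBuffer.allocate(4);
        private ByteBuffer body;   // allocated lazily once the length is known

        // Returns the complete message, or null if more reads are needed.
        ByteBuffer readFrom(SocketChannel ch) throws IOException {
            if (body == null) {
                ch.read(header);
                if (header.hasRemaining()) return null;   // prefix incomplete
                header.flip();
                body = ByteBuffer.allocate(header.getInt());
            }
            ch.read(body);
            if (body.hasRemaining()) return null;         // body incomplete
            body.flip();
            ByteBuffer msg = body;
            body = null;                                  // reset for next message
            header.clear();
            return msg;
        }
    }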
I decided the best course of action was to simply reduce the size of my massive dataset and send tile updates instead of an entire map update. That way I can simply send a list of tiles that have changed on a map instead of the entire map over again. This reduces the need for such a large buffer and I'm back on track. Thanks.