I need to have lots of network connections open at the same time(!) and transfer data as fast as possible. Thousands of connections. Right now I have one thread per connection, reading character by character from the InputStream of that connection.
And I have a strong suspicion that the CPU cost of switching between thousands of threads is causing performance problems here, even though the servers are really slow (low two-digit KB/s), since I've observed that the throughput isn't even close to being proportional to the number of threads.
Therefore I'd like to ask some programmers experienced in parallel programming:
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round-robin fashion? If there is a speedup, would it be worth the programming effort? How many connections per thread? Or do you have another idea for reading really, really fast from multiple network input streams?
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
reading charwise
You know data is transmitted in packets, right? Reading a single character at a time is very inefficient: each read has to traverse all the layers from your program down to the network stack in the operating system. You should try to read one full segment of data at a time.
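For example, here is a minimal sketch of the difference, assuming in is the InputStream of one connection (the 8 KB buffer size is just a guess; tune it for your traffic):

import java.io.IOException;
import java.io.InputStream;

class ReadStyles {
    // Inefficient: one call into the OS per character.
    static void readCharwise(InputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            // handle one byte at a time
        }
    }

    // Better: pull a whole chunk out of the OS receive buffer per call.
    static void readChunked(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            // handle buf[0..n)
        }
    }
}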
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
That's why the operating system keeps a buffer for incoming data, whose free space is advertised to the sender as the receive window. When TCP segments arrive, they are put into the receive buffer. When your program requests to read from the socket, the operating system returns data from the receive buffer. If the receive buffer fills up, the sender is told to stop, and any segment that arrives with no buffer space left is dropped and has to be sent again.
For more about how TCP works, see https://beej.us/guide/bgnet/
Wikipedia is pretty good but fairly dense
https://en.m.wikipedia.org/wiki/Transmission_Control_Protocol
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round robin like fashion? Would that, if there is a speedup, be worth the programming?
What you're describing would require moving from blocking I/O to non-blocking I/O. Non-blocking will require fewer system resources, but it is significantly harder to implement correctly and efficiently. So don't do it unless you have a pressing reason.
Thousands of threads (and stacks...) are probably too many for the OS scheduler, memory management units, caches...
You need just a few threads (one per CPU) and a select()-based solution on each of them.
Have a look at Selector, ServerSocketChannel and SocketChannel.
(see pages 30-31 of https://www.enib.fr/~harrouet/Data/Courses/Memo_Sockets.pdf)
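A bare-bones sketch of such a Selector loop (host, port, connection count and buffer size are only placeholders, and error handling is omitted):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        // Register a few non-blocking connections (placeholder endpoint).
        for (int i = 0; i < 3; i++) {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            ch.connect(new InetSocketAddress("example.com", 80));
            ch.register(selector, SelectionKey.OP_CONNECT);
        }
        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        while (selector.isOpen()) {
            selector.select();                       // blocks until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    if (ch.read(buf) == -1) {        // remote side closed
                        key.cancel();
                        ch.close();
                    } else {
                        buf.flip();
                        // consume the received bytes here
                    }
                }
            }
        }
    }
}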
Edit (after a question in the comments)
Selector is not just a clever algorithm encapsulated in a class. It relies internally on the select() system call (or an equivalent; there are many).
The operating system is aware of a set of file descriptors (communication channels) it has to watch and, as soon as something happens on one (or several) of them, it wakes up the process (or thread) that is blocked on this selector.
The idea is to stay blocked as long as possible (to save resources) and to be woken up only when something useful has to be done with incoming data (there are variants).
In your current implementation, you use thousands of threads which are all blocked on a read()/recv() operation, because you cannot know beforehand which connection will be the next one to deliver something.
On the other hand, with a select()-based implementation, a single thread can be blocked watching many connections at the same time but will only wake up to handle the few that have just delivered new data.
So I suggest that you start a pool of a few threads (one per CPU, for example) and, as soon as the main program accepts a new incoming connection, have it choose one of them (you can keep a connection count for each) to put in charge of this new connection.
All of this requires proper synchronisation of course, and probably a trick (a special file descriptor registered with the selector, for example) in order to wake up a blocked thread when it is assigned a new connection.
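A rough sketch of how that hand-off and wake-up could look, with one such worker per thread (class, field and buffer choices are only illustrative):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One instance per worker thread; the accepting thread picks the least-loaded worker.
class SelectorWorker implements Runnable {
    private final Selector selector = Selector.open();
    private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();

    SelectorWorker() throws IOException {}

    // Called from the accepting thread.
    void assign(SocketChannel ch) {
        pending.add(ch);
        selector.wakeup();            // unblock select() so the new channel gets registered
    }

    @Override
    public void run() {
        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        try {
            while (true) {
                selector.select();
                // Register any connections handed over since the last pass.
                for (SocketChannel ch; (ch = pending.poll()) != null; ) {
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                }
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        buf.clear();
                        if (((SocketChannel) key.channel()).read(buf) == -1) {
                            key.cancel();
                            key.channel().close();
                        }
                        // else: flip() and consume the received bytes here
                    }
                }
                selector.selectedKeys().clear();
            }
        } catch (IOException e) {
            // real code would log and clean up
        }
    }
}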
Related
My goal is to handle WebSocket connections inside threads. If I use the "in a new Thread" approach below, the number of WebSocket connections that the server can handle is unknown. If I use the "in a Thread pool" approach, the number of WebSocket connections that the server can handle is the thread pool size.
I am not sure about the correlation between available processors and threads. Does 1 processor execute 1 thread at a time?
My expected result: Creating more threads than the available processors is not advisable and you should re-design how you handle the WebSocket connections.
in a new Thread
final Socket socket = serverSocket.accept();
new Thread(new WebSocket(socket, listener)).start();
in a Thread pool
final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final Socket socket = serverSocket.accept();
es.execute(new WebSocket(socket, listener));
To avoid confusion, the WebSocket class is a custom class that implements Runnable. As far as I know, Java SE does not have a WebSocket server, only a WebSocket client.
Make threads. A thousand if you want.
At the CPU core level, here's what's happening:
The CPU core is chugging along, doing work for a given websocket.
Pretty soon the core runs into a roadblock: half of an incoming bunch of data has arrived, the rest is still making its way down the network cable, and thus the CPU can't continue until it arrives. Alternatively, the code the CPU core is running is sending data out, but the network card's buffer is full, so now the CPU core has to wait for the network card to send another packet down the cable before there's room.
Of course, if there's work to do (say, you have 10 cores in the box, and 15 web users are simultaneously connected, that leaves at least 5 users of your web site waiting around right now) - then the CPU should not just start twiddling its thumbs. It should go do something.
In practice, then, there's a whole boatload of memory that WAS relevant but no longer is (all the state and other 'working items' that were necessary to do the work for the websocket we were working on, but which is currently 'blocked' by the network), and a whole bunch of memory that wasn't relevant but now becomes relevant (all the state and working memory of a websocket connection that was earlier put in the 'have yourself a bit of a timeout and wait around for the network packet to arrive' pile, for which the network packet has since arrived, so if a CPU core is free to do work, it can now go do it).
This is called a 'context switch', and it is ridiculously expensive, 500+ cycles' worth. It is also completely unavoidable. You have to make the context switch. You can't avoid it. That means a cost is paid, and about 500 cycles' worth just go down the toilet. It is what it is.
The thing is, there are two ways to pay that cost: you can switch to another thread, which is a full-blown context switch. Or you can have a single thread running so-called 'async' code that manages all this stuff itself and hops to another job, but then there's still a context switch.
Specifically, CPUs can't interact with main memory directly anymore these days and haven't for the past decade; they can only interact with a CPU cache page. Machine code is not really 'run directly' anymore: there's a level below that where the CPU notices it's about to run an instruction that touches some memory and maps that memory access (after all, memory is far too slow to wait for) to the right spot in the cache. It will also notice if the memory your machine code is trying to access isn't in a cache page associated with that core at all, in which case it fires a miss that causes the memory subsystem of your CPU/memory bus to 'evict a page' (write it back out to main memory) and then load in the right page, and only then does the CPU continue.
This all happens 'under the hood', you don't have to write code to switch pages, the CPU manages it automatically. But it's a heavy cost. Not quite as heavy as a thread switch but almost as heavy.
CONCLUSION: Threads are good, have many of them. It ensures CPUs won't twiddle their thumbs when there is work to do. Note that there are MANY blog posts that extol the virtues of async, claiming that threads 'do not scale'. They are wrong. Threads scale fine, and async code also pays the cost of context switching, all the time.
In case you weren't aware, 'async code' is code that tries to never sleep (never do anything that would wait). So, instead of writing 'getMeTheNextBlockOfBytesFromTheNetworkCard', you'd write 'onceBytesAreAvailableRunThis(code goes here)'. Writing async code in Java is possible but incredibly difficult compared to using threads.
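For a taste of what that callback style looks like with the JDK's asynchronous channels (the structure here is only illustrative, not a recommendation):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

class AsyncReadSketch {
    static void readAsync(AsynchronousSocketChannel channel) {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        // "onceBytesAreAvailableRunThis(...)": the handler fires when data arrives;
        // the calling thread never blocks on the read.
        channel.read(buf, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer b) {
                if (bytesRead == -1) return;   // peer closed the connection
                b.flip();
                // consume b here, then issue the next read to keep the chain going
                b.clear();
                channel.read(b, b, this);
            }

            @Override
            public void failed(Throwable exc, ByteBuffer b) {
                // error handling goes here
            }
        });
    }
}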
Even in the extremely rare cases where async code would be a significant win, Project Loom is close to completion, which will grant Java the ability to have thread-like things that you can manually manage (so-called fibers). That is the route the OpenJDK has chosen for this. In that sense, even if you think async is the answer, no it's not; wait for Project Loom to complete instead. If you want to read more, read 'What color is your function?' and 'callback hell'. Neither post is Java-specific, but they cover some of the more serious problems inherent in async.
Okay, for my application I'm trying to decide on an architecture that's as DDoS resistant as possible. Obviously it will never be perfect but I'd like protection against simple attacks.
There's a few that I've thought of so far:
1) Single thread per connection.
This method seems to have unbelievable scalability problems, and with a tonne of connections, having too many threads seems like it would be a scheduling nightmare for the OS.
2) 2 threads. The first thread will accept connections and append them to a list; the second thread loops through the list (with the proper synchro here) and checks whether there's anything in the InputStream. Upon finding something, it reads a line. Any of the actual work, including the reply, will be done in a new event thread, which is just passed the line that was read.
This method seems to have even bigger problems. It appears as though a simple cat /dev/urandom | telnet server port would lock it down.
3) This is similar to #2, but it only reads a single byte from each connection on each iteration, processing the accumulated bytes as a string when a newline byte arrives.
This seems like my best option so far, but it means that if the attack initiates a lot of connections and sends input on all of them, it could slow the loop down considerably.
Are there any other potential architectures that might be better suited for the job?
Large companies have entire teams that spend all day every day working to combat various DoS attacks. It's too much to discuss here. After trivial mitigation techniques (like SYN cookies, etc.), your best bet is simply to have sufficient capacity to "eat it".
I would recommend writing your code for efficiency and then running it on a hosted service like Google's or Amazon's, and letting them deal with fending off DoS attacks and scaling your service to handle such spikes.
This question relates to the latest version of Java.
30 producer threads push strings to an abstract queue. One writer thread pops from the same queue and writes the string to a file that resides on a 5400 rpm HDD RAID array. The data is pushed at a rate of roughly 111 MBps, and popped/written at a rate of roughly 80MBps. The program lives for 5600 seconds, enough for about 176 GB of data to accumulate in the queue. On the other hand, I'm restricted to a total of 64GB of main memory.
My question is: What type of queue should I use?
Here's what I've tried so far.
1) ArrayBlockingQueue. The problem with this bounded queue is that, regardless of the initial size of the array, I always end up with liveness issues as soon as it fills up. In fact, a few seconds after the program starts, top reports only a single active thread. Profiling reveals that, on average, the producer threads spend most of their time waiting for the queue to free up. This is regardless of whether or not I use the fair-access policy (with the second argument in the constructor set to true).
2) ConcurrentLinkedQueue. As far as liveness goes, this unbounded queue performs better. Until I run out of memory, about seven hundred seconds in, all thirty producer threads are active. After I cross the 64GB limit, however, things become incredibly slow. I conjecture that this is because of paging issues, though I haven't performed any experiments to prove this.
I foresee two ways out of my situation.
1) Buy an SSD. Hopefully the I/O rate increases will help.
2) Compress the output stream before writing to file.
Is there an alternative? Am I missing something in the way either of the above queues are constructed/used? Is there a cleverer way to use them? The Java Concurrency in Practice book proposes a number of saturation policies (Section 8.3.3) in the case that bounded queues fill up faster than they can be exhausted, but unfortunately none of them---abort, caller runs, and the two discard policies---apply in my scenario.
Look for the bottleneck. You produce more than you consume, so a bounded queue makes absolute sense, since you don't want to run out of memory.
Try to make your consumer faster. Profile it and look where most of the time is spent. Since you write to a disk, here are some thoughts (see the sketch after this list):
Could you use NIO for your problem? (maybe FileChannel#transferTo())
Flush only when needed.
If you have enough CPU reserves, compress the stream? (as you already mentioned)
Optimize your disks for speed (RAID cache, etc.)
Faster disks
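A rough sketch combining the bounded queue and compression ideas (queue capacity, buffer size and file path handling are placeholders):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPOutputStream;

class CompressingWriter implements Runnable {
    private final BlockingQueue<String> queue;
    private final String path;

    CompressingWriter(BlockingQueue<String> queue, String path) {
        this.queue = queue;
        this.path = path;
    }

    @Override
    public void run() {
        // Compress before hitting the disk: trades spare CPU for HDD bandwidth.
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(
                        new BufferedOutputStream(new FileOutputStream(path), 1 << 20)),
                StandardCharsets.UTF_8)) {
            while (!Thread.currentThread().isInterrupted()) {
                out.write(queue.take());       // blocks when producers are behind
            }
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Producers share a bounded queue, so they block instead of exhausting the heap:
// BlockingQueue<String> q = new ArrayBlockingQueue<>(100_000);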
As Flavio already said, for the producer-consumer pattern I see no problem there; it should stay the way it is now. In the end, the slowest party controls the speed.
I can't see the problem here. In a producer-consumer situation, the system will always go with the speed of the slower party. If the producer is faster than the consumer, it will be slowed down to the consumer speed when the queue fills up.
If your constraint is that you can not slow down the producer, you will have to find a way to speed up the consumer. Profile the consumer (don't start too fancy, a few System.nanoTime() calls often give enough information), check where it spends most of its time, and start optimizing from there. If you have a CPU bottleneck you can improve your algorithm, add more threads, etc. If you have a disk bottleneck try writing less (compression is a good idea), get a faster disk, write on two disks instead of one...
According to the Java "Queue Implementations" tutorial, there are other classes that could be right for you:
LinkedBlockingQueue
PriorityBlockingQueue
DelayQueue
SynchronousQueue
LinkedTransferQueue
TransferQueue
I don't know the performance or memory usage of these classes, but you can try them yourself.
I hope that this helps you.
Why do you have 30 producers? Is that number fixed by the problem domain, or is it just a number you picked? If the latter, you should reduce the number of producers until their total production rate is only slightly larger than the consumption rate, and use a blocking queue (as others have suggested). Then you will keep your consumer busy, which is the performance-limiting part, while minimizing use of other resources (memory, threads).
You have only two ways out: make the producers slower or the consumer faster. Slowing the producers can be done in many ways, in particular by using bounded queues. To make the consumer faster, try memory-mapped files (https://www.google.ru/search?q=java+memory-mapped+file). Look at https://github.com/peter-lawrey/Java-Chronicle.
Another way is to free the writing thread from the work of preparing write buffers from strings. Let the producer threads emit ready buffers, not strings. Use a limited number of buffers, say 2 * threadNumber = 60. Allocate all buffers at the start and then reuse them. Use a queue for empty buffers: a producing thread takes a buffer from that queue, fills it, and puts it into the writing queue. The writing thread takes buffers from the writing queue, writes them to disk, and puts them back into the empty-buffer queue.
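A rough sketch of that buffer-recycling scheme (buffer count and size are arbitrary; a real producer would also split data larger than one buffer):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BufferRecycling {
    static final int BUFFERS = 60;                     // ~2 * producer count
    final BlockingQueue<ByteBuffer> empty = new ArrayBlockingQueue<>(BUFFERS);
    final BlockingQueue<ByteBuffer> filled = new ArrayBlockingQueue<>(BUFFERS);

    BufferRecycling() {
        for (int i = 0; i < BUFFERS; i++) {
            empty.add(ByteBuffer.allocateDirect(1 << 20));   // pre-allocate once, reuse forever
        }
    }

    // Producer side: take an empty buffer, fill it, hand it to the writer.
    void produce(byte[] data) throws InterruptedException {
        ByteBuffer buf = empty.take();
        buf.clear();
        buf.put(data, 0, Math.min(data.length, buf.remaining()));
        buf.flip();
        filled.put(buf);
    }

    // Writer thread: drain filled buffers to disk and recycle them.
    // The FileChannel could come from e.g. new FileOutputStream("out.dat").getChannel().
    void writerLoop(FileChannel out) throws InterruptedException, IOException {
        while (true) {
            ByteBuffer buf = filled.take();
            while (buf.hasRemaining()) {
                out.write(buf);
            }
            empty.put(buf);                            // back into the pool
        }
    }
}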
Yet another approach is to use asynchronous I/O. Producers initiate the writing operation themselves, without a special writing thread. The completion handler returns the used buffer to the empty-buffer queue.
Brief
I am running a multithreaded TCP server that uses a fixed thread pool with an unbounded Runnable queue. The clients dispatch the Runnables to the pool.
In my stress test scenario, 600 clients attempt to log in to the server and immediately broadcast messages to every other client simultaneously and repeatedly, without end and without sleeping (right now the clients just discard the incoming messages). Using a quad-core with 1 GB reserved for heap memory - and a parallel GC for both the young and old generations - the server crashes with an OOM exception after 20 minutes. Monitoring the garbage collector reveals that the tenured generation is slowly increasing, and a full GC only frees up a small fraction of memory. A snapshot of a full heap shows that the old generation is almost completely occupied by Runnables (and their outgoing references).
It seems the worker threads are not able to finish executing the Runnables faster than the clients are able to queue them for execution. (For each incoming "event" to the server, the server will create 599 Runnables, one per other client - 600 - 1 - assuming they are all logged in at the time.)
Question
Can someone please help me conceive a strategy on how to handle the overwhelmed thread pool workers?
Also
If I bound the queue, what policy should I implement to handle rejected execution?
If I increase the size of the heap, wouldn't that only prolong the OOM exception?
A calculation could be made to measure the amount of work represented by the queued Runnables. Perhaps this measurement could be used as the basis for a locking mechanism to coordinate the clients' dispatching of work?
What reaction should the client experience when the server is overwhelmed with work?
Do not use an unbounded queue. I cannot tell you what the bound should be; your load tests should give you an answer to that question. Anyhow, make the bound configurable: at least dynamically configurable, better yet adaptable to some load measurement.
You did not tell us how the clients submit their requests, but if HTTP is involved, there already is a status code for the overloaded case: 503 Service Unavailable.
I would suggest you limit the capacity of the queue and "push back" on the publisher to stop it publishing, or drop the requests gracefully. You can do the former by making the queue block when it's full.
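One way to get that blocking behaviour is a ThreadPoolExecutor over a bounded queue whose rejection handler re-queues the task with a blocking put (sizes here are only illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BackpressurePool {
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                // When the bounded queue is full, block the submitting (client-handling)
                // thread instead of growing memory without limit.
                (task, executor) -> {
                    try {
                        executor.getQueue().put(task);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RejectedExecutionException(e);
                    }
                });
    }
}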
You should be able to calculate your maximum throughput based on your network bandwidth and message size. If you are getting less than this, I would consider changing how your server distributes data.
Another approach is to make your message handling more efficient. You could have each reading thread from each client write directly to the listening clients. This avoids the need for an explicit queue (you might think of the buffers in the Socket as a queue of bytes) and limits the speed to whatever the server can handle. It will also not use more memory under load (than it does when idle)
Using this approach you can achieve message rates as high as your network bandwidth can handle (even with a 10 Gig-E network). This moves the bottleneck elsewhere, meaning you still have a problem, but your server shouldn't fail.
BTW: If you use direct ByteBuffers you can do this without creating garbage and with a minimum of heap. e.g. ~1 KB of heap per client.
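A rough sketch of that relay idea with one reused direct buffer per reading thread (the 1 KB buffer size is only an example):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.List;

class DirectRelay {
    // One reusable direct buffer per reading thread: no per-message garbage.
    private final ByteBuffer buf = ByteBuffer.allocateDirect(1024);

    void relay(SocketChannel from, List<SocketChannel> listeners) throws IOException {
        buf.clear();
        if (from.read(buf) == -1) {
            from.close();
            return;
        }
        buf.flip();
        for (SocketChannel to : listeners) {
            buf.rewind();                 // replay the same bytes to every listener
            while (buf.hasRemaining()) {
                to.write(buf);            // blocking channel: write until drained
            }
        }
    }
}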
It sounds as if you're doing load testing. I would determine what you consider to be "acceptable heavy load": what is the heaviest amount of traffic you can expect a single client to generate? Then double it. Or triple it. Or scale it in a similar manner. Use this threshold to throttle or deny clients that use this much bandwidth.
This has a number of perks. First, it gives you the kind of analysis you need to determine server load (users per server). Second, it gives you a first line of defense against DDoS attacks.
You have to somehow throttle the incoming requests, and the mechanism for doing that should depend on the work you are trying to do. Anything else will simply result in an OOM under enough load, and thus open you up to DoS attacks (even unintentional ones).
Fundamentally, you have 4 choices:
Make clients wait until you are ready to accept their requests
Actively reject client requests until you are ready to accept new requests
Allow clients to timeout while trying to reach your server when it is not ready to receive requests
A blend of 2 or 3 of the above strategies.
The right strategy depends on how your real clients will react under the various circumstances – is it better for them to wait, possibly (effectively) indefinitely, or is it better that they know quickly that their work won't get done unless they try again later?
Whichever way you do it, you need to be able to count the number of tasks currently queued and either add a delay, block completely, or return an error condition based on the number of items in the queue.
A simple blocking strategy can be implemented by using a BlockingQueue implementation. However, this doesn't give particularly fine-grained control.
Or you can use a Semaphore to control permits to add tasks to the queue, which has the advantage of supplying a tryAcquire(long timeout, TimeUnit unit) method if you want to apply a mild throttling.
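A small sketch of that Semaphore-based throttling (names and timeout handling are only illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class ThrottledSubmitter {
    private final ExecutorService pool;
    private final Semaphore permits;

    ThrottledSubmitter(ExecutorService pool, int maxQueued) {
        this.pool = pool;
        this.permits = new Semaphore(maxQueued);
    }

    // Returns false (so the caller can reject or delay the client) if the server
    // is still saturated after the timeout.
    boolean trySubmit(Runnable task, long timeoutMillis) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    permits.release();       // free the slot once the work is done
                }
            });
        } catch (RuntimeException e) {
            permits.release();
            throw e;
        }
        return true;
    }
}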
Whichever way, don't allow the threads that service the clients to grow without bounds, or else you'll simply end up with an OOM for a different reason!
Our company is running a Java application (on a single-CPU Windows server) that reads data from a TCP/IP socket and checks it against specific criteria (using regular expressions); if a match is found, the data is stored in a MySQL database. The data volume is huge: records are read at a rate of about 800 records/second, and about 70% of the records match, so there are a lot of database writes involved. The program uses a LinkedBlockingQueue to handle the data. The producer class just reads a record and puts it into the queue, and a consumer class removes it from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using single CPU)? I am looking for suggestions on how to speed up (without changing hardware).
Any suggestions would be really appreciated. Thanks
Simple: Try it and see.
This is one of those questions where you could argue several points on either side. But it sounds like you already have most of the infrastructure set up. Just create another consumer thread and see if that helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.
Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing, and RAM helps a lot, but that is a different question.
It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.
First of all:
It is wise to build your application on the Java 5 concurrency API (java.util.concurrent).
If your application is created around the ExecutorService it is fairly easy to change the number of threads used. For example: you could create a threadpool where the number of threads is specified by configuration. So if ever you want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as I know, it is not useful (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, and make the actions in that thread as few as possible (for example: read socket, put data in queue, read socket, etc.).
- About the consuming of the queue: it is wise to construct this part as pointed out above; that way it is easy to change the number of consuming threads.
- Note: you cannot really predict what is better; there might be another part that is the bottleneck, etcetera. Only monitoring/profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different numbers of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling to identify the bottlenecks, and use A/B testing to find the optimal number of consuming threads for your system.
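Roughly, such a setup could look like this (host, port, queue capacity and the consumer.threads property are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

class PipelineSketch {
    public static void main(String[] args) throws IOException {
        int consumerThreads = Integer.getInteger("consumer.threads", 4); // from configuration
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService consumers = Executors.newFixedThreadPool(consumerThreads);

        // Consumer part: each worker pops records and processes them (regex + DB write).
        for (int i = 0; i < consumerThreads; i++) {
            consumers.execute(() -> {
                try {
                    while (true) {
                        process(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer part: one thread does nothing but read the socket and enqueue.
        try (Socket socket = new Socket("data.example.com", 9000);   // placeholder endpoint
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                queue.put(line);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        consumers.shutdownNow();   // real code would drain the queue first
    }

    static void process(String record) {
        // match against the regular expressions and write to MySQL here
    }
}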
As an update on my earlier question:
We did run some comparison tests between a single consumer thread and multiple threads (adding 5, 10, 15 and so on), monitoring the queue size of yet-to-be-processed records. The difference was minimal, and what's more, the queue size was getting slightly bigger once the number of threads crossed 25 (compared to running 5 threads). This leads me to the conclusion that the overhead of maintaining the threads exceeded the processing benefit gained. Maybe this is particular to our scenario, but I'm just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using multi-row INSERT statements in MySQL instead of single inserts. If we had not had that to start with, we could not have handled this load.
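For readers who want to do something similar: one common way to avoid single inserts from Java is JDBC batching, roughly as below (table and column names are placeholders, and this is batching rather than a literal multi-row INSERT statement):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsert {
    // Insert many records per round trip instead of one statement per record.
    static void insertBatch(Connection conn, List<String> records) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO records (payload) VALUES (?)")) {
            for (String r : records) {
                ps.setString(1, r);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}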
End result: I am still not convinced that multi-threading will benefit processing time. Maybe it has other benefits, but I am looking only at the processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.
In your scenario where a) the processing is minimal, b) there is only one CPU, and c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front-end and back-end threads are I/O bound, with minimal processing in the middle. That's why you don't see much improvement.
What you can do is try to have three stages: the 1st is a single thread pulling data from the socket, the 2nd is a thread pool that does the processing, and the 3rd is a single thread that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporary growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.