Thread Pool Workers Overwhelmed With Runnables Crashing the JVM - java

Brief
I am running a multithreaded TCP server that uses a fixed thread pool with an unbounded Runnable queue. The clients dispatch the runnables to the pool.
In my stress-test scenario, 600 clients log in to the server and immediately broadcast messages to every other client, simultaneously and repeatedly, without ever sleeping (right now the clients just discard the incoming messages). On a quad-core machine with 1 GB reserved for heap memory, and a parallel GC for both the young and old generations, the server crashes with an OOM exception after 20 minutes. Monitoring the garbage collector reveals that the tenured generation is slowly increasing, and a full GC frees up only a small fraction of memory. A snapshot of the full heap shows that the old generation is almost completely occupied by Runnables (and their outgoing references).
It seems the worker threads cannot finish executing the Runnables as fast as the clients queue them for execution (for each incoming "event" to the server, the server will create 599 runnables, as there are 600 - 1 clients, assuming they are all logged in at the time).
Question
Can someone please help me conceive a strategy on how to handle the overwhelmed thread pool workers?
Also
If I bound the queue, what policy should I implement to handle rejected execution?
If I increase the size of the heap, wouldn't that only postpone the OOM exception?
A calculation could be made to measure the amount of work pending in the aggregation of Runnables. Perhaps this measurement could be used as the basis for a locking mechanism to coordinate the clients' dispatching of work?
What reaction should the client experience when the server is overwhelmed with work?

Do not use an unbounded queue. I cannot tell you what the bound should be; your load tests should give you an answer to that question. Anyhow, make the bound configurable: at least dynamically configurable, and better yet adaptable to some load measurement.
You did not tell us how the clients submit their requests, but if HTTP is involved, there already is a status code for the overloaded case: 503 Service Unavailable.
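As a concrete sketch of the bounded-queue idea, here is roughly what it looks like with ThreadPoolExecutor. The sizes (4 threads, a 100-slot queue) are placeholders; derive the real values from your load tests, and note that mapping the rejection to a 503 only applies if HTTP is involved:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    // 4 threads and a 100-slot queue are placeholders; size both from load tests.
    static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // the bound
            new ThreadPoolExecutor.AbortPolicy());  // throw instead of queueing forever

    /** Returns false (instead of OOM-ing later) when the server is saturated. */
    static boolean trySubmit(Runnable task) {
        try {
            POOL.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // report overload to the client, e.g. HTTP 503
        }
    }
}
```

The choice of AbortPolicy is deliberate: it surfaces overload immediately at the submission point, where the caller still knows which client to push back on.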

I would suggest you limit the capacity of the queue and "push back" on the publisher to stop it publishing, or drop the requests gracefully. You can do the former by making the queue block when it's full.
You should be able to calculate your maximum throughput based on your network bandwidth and message size. If you are getting less than this, I would consider changing how your server distributes data.
Another approach is to make your message handling more efficient. You could have each reading thread from each client write directly to the listening clients. This avoids the need for an explicit queue (you might think of the buffers in the Socket as a queue of bytes) and limits the speed to whatever the server can handle. It will also not use more memory under load than it does when idle.
Using this approach you can achieve message rates as high as your network bandwidth can handle (even with a 10 Gig-E network). This moves the bottleneck elsewhere, meaning you still have a problem but your server shouldn't fail.
BTW: If you use direct ByteBuffers you can do this without creating garbage and with a minimum of heap. e.g. ~1 KB of heap per client.

It sounds as if you're doing load testing. I would determine what you consider to be "acceptable heavy load". What is the heaviest amount of traffic you can expect a single client to generate? Then double it. Or triple it. Or scale it in a similar manner. Use this threshold to throttle or deny clients that use that much bandwidth.
This has a number of perks. First, it gives you the kind of analysis you need to determine server load (users per server). Second, it gives you a first line of defense against DDoS attacks.

You have to somehow throttle the incoming requests, and the mechanism for doing that should depend on the work you are trying to do. Anything else will simply result in an OOM under enough load, and thus open you up to DoS attacks (even unintentional ones).
Fundamentally, you have 4 choices:
Make clients wait until you are ready to accept their requests
Actively reject client requests until you are ready to accept new requests
Allow clients to timeout while trying to reach your server when it is not ready to receive requests
A blend of 2 or 3 of the above strategies.
The right strategy depends on how your real clients will react under the various circumstances – is it better for them to wait, possibly (effectively) indefinitely, or is it better that they know quickly that their work won't get done unless they try again later?
Whichever way you do it, you need to be able to count the number of tasks currently queued and either add a delay, block completely, or return an error condition based on the number of items in the queue.
A simple blocking strategy can be implemented by using a BlockingQueue implementation. However, this doesn't give particularly fine-grained control.
Or you can use a Semaphore to control permits to add tasks to the queue, which has the advantage of supplying a tryAcquire(long timeout, TimeUnit unit) method if you want to apply a mild throttling.
Whichever way, don't allow the threads that service the clients to grow without bounds, or else you'll simply end up with an OOM for a different reason!
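A minimal sketch of the Semaphore approach described above, assuming a hypothetical limit of 100 in-flight tasks (tune that number from measurements):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThrottledSubmitter {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    // Hypothetical cap on queued + running tasks; tune from measurements.
    private final Semaphore permits = new Semaphore(100);

    /** Waits up to the timeout for capacity; false = tell the client to retry later. */
    public boolean submit(Runnable task, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!permits.tryAcquire(timeout, unit)) {
            return false; // mild throttling: the caller blocked briefly, then gave up
        }
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release(); // capacity frees only when the work finishes
            }
        });
        return true;
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Releasing the permit in a finally block inside the task, rather than at submission, is what makes the count reflect work actually outstanding rather than merely accepted.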

Related

Will creating more threads than available processors have performance overhead?

My goal is to handle WebSocket connections inside threads. If I use "in a new Thread" (below), the number of WebSocket connections that the server can handle is unknown. If I use "in a Thread pool", the number of WebSocket connections that the server can handle is the thread pool size.
I am not sure about the correlation between available processors and threads. Does 1 processor execute 1 thread at a time?
My expected result: Creating more threads than the available processors is not advisable and you should re-design how you handle the WebSocket connections.
in a new Thread
final Socket socket = serverSocket.accept();
new Thread(new WebSocket(socket, listener)).start();
in a Thread pool
final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final Socket socket = serverSocket.accept();
es.execute(new WebSocket(socket, listener));
To avoid confusion, the WebSocket class is a custom class that implements Runnable. As far as I know, Java SE does not have a WebSocket server, only a WebSocket client.
Make threads. A thousand if you want.
At the CPU core level, here's what's happening:
The CPU core is chugging along, doing work for a given websocket.
Pretty soon the core runs into a road block: Half of an incoming bunch of data has arrived, the rest is still making its way down the network cable, and thus the CPU can't continue until it arrives. Alternatively, the code that the CPU core is running is sending data out, but the network card's buffer is full, so now the CPU core has to wait for that network card to find its way to sending another packet down the cable before there's room.
Of course, if there's work to do (say, you have 10 cores in the box, and 15 web users are simultaneously connected, that leaves at least 5 users of your web site waiting around right now) - then the CPU should not just start twiddling its thumbs. It should go do something.
In practice, then, there's a whole boatload of memory that WAS relevant that no longer is (all that memory that contained all the state and other 'working items' that were necessary to do the work for the websocket that we were working on, but which is currently 'blocked' by the network), and a whole bunch of memory that wasn't relevant that now becomes relevant (all the state and working memory of a websocket connection that was earlier put in the 'have yourself a bit of a timeout and wait around for the network packet to arrive' state, for which the network packet has since arrived, so if a CPU core is free to do work, it can now go do work).
This is called a 'context switch', and it is ridiculously expensive, 500+ cycles worth. It is also completely unavoidable. You have to make the context switch. You can't avoid it. That means a cost is paid, and about 500 cycles worth just go down the toilet. It's what it is.
The thing is, there are 2 ways to pay that cost: you can switch to another thread, which is a full-blown context switch. Or you can have a single thread running so-called 'async' code that manages all this stuff itself and hops to another job, but then there's still a context switch.
Specifically, CPUs can't interact with main memory directly anymore these days and haven't for the past decade; they can only interact with a CPU cache page. Machine code is not really 'run directly' anymore; instead there's a level below that where the CPU notices it's about to run an instruction that touches some memory and maps that memory access (after all, main memory is far too slow to wait for) to the right spot in the cache. It'll also notice if the memory you're trying to access with your machine code isn't in a cache page associated with that core at all, in which case it fires a page-miss interrupt that causes the memory subsystem of your CPU/memory bus to 'evict a page' (write it back out to main memory) and then load in the right page, and only then does the CPU continue.
This all happens 'under the hood', you don't have to write code to switch pages, the CPU manages it automatically. But it's a heavy cost. Not quite as heavy as a thread switch but almost as heavy.
CONCLUSION: Threads are good, have many of them. It ensures CPUs won't twiddle their thumbs when there is work to do. Note that there are MANY blog posts that extol the virtues of async, claiming that threads 'do not scale'. They are wrong. Threads scale fine, and async code also pays the cost of context switching, all the time.
In case you weren't aware, 'async code' is code that tries to never sleep (never do anything that would wait). So, instead of writing `getMeTheNextBlockOfBytesFromTheNetworkCard`, you'd write `onceBytesAreAvailableRunThis(code goes here)`. Writing async code in Java is possible but incredibly difficult compared to using threads.
Even in the extremely rare cases where async code would be a significant win, Project Loom is close to completion, which will grant Java the ability to have thread-like things that you can manually manage (so-called fibers). That is the route the OpenJDK has chosen for this. In that sense, even if you think async is the answer, it isn't; wait for Project Loom to complete instead. If you want to read more, read "What color is your function?" and "callback hell". Neither post is Java-specific, but both cover some of the more serious problems inherent in async.
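For reference, Loom's virtual threads have since shipped (Java 21). On such a JDK, the thread-per-task style the answer advocates looks like this sketch; `handle` is a stand-in for per-connection work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadSketch {
    static String handle() throws Exception {
        // One cheap virtual thread per task; the JVM multiplexes them
        // onto a small pool of carrier (platform) threads.
        try (ExecutorService es = Executors.newVirtualThreadPerTaskExecutor()) {
            return es.submit(() -> "handled").get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handle()); // handled
    }
}
```

The try-with-resources close waits for submitted tasks, so the sketch exits cleanly.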

Java Multithreading Network Connection Performance Advice

I need to have lots of network connections open at the same time(!) and transfer data as fast as possible. Thousands of connections. Right now, I have one thread for each connection, reading character-wise from the InputStream of that connection.
And I have the strong suspicion that the CPU/switching between the thousands of threads might impose some performance problems here even though the servers are really slow (low two-digit KB/s), since I've observed that the throughput isn't even close to being proportional to the number of threads.
Therefore I'd like to ask some programmers experienced in parallel programming:
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round robin like fashion? Would that, if there is a speedup, be worth the programming? How many connections per thread? Or do you have another idea for reading really really fast from multiple network input streams?
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
reading charwise
You know data is transmitted in packets, right? Reading a single character at a time is very inefficient. Each read has to traverse all the layers from your program down to the network stack in the operating system. You should try to read one full segment of data at a time.
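A sketch of the difference: read into a byte array so one call can return a whole segment's worth of data. The 8 KB buffer size is an arbitrary choice, and a ByteArrayInputStream stands in for the socket's stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BulkReader {
    /** One read() call returns up to buf.length bytes: one trip through the
     *  OS network stack per chunk instead of one per character. */
    static int readChunk(InputStream in, byte[] buf) throws IOException {
        return in.read(buf, 0, buf.length);
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8192]; // roughly a few TCP segments' worth
        // Stand-in for a socket's InputStream:
        InputStream in = new ByteArrayInputStream("hello".getBytes());
        System.out.println(readChunk(in, buf)); // 5
    }
}
```

Wrapping the socket stream in a BufferedInputStream achieves much the same effect if the calling code must stay character-oriented.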
If I don't read a char, will the server wait to send the next one until I do? What if my thread is sleeping?
That's why the operating system keeps a buffer for incoming data, bounded by the receive window. When TCP segments arrive, they are put into the receive buffer. When your program requests to read from the socket, the operating system returns data from that buffer. If the receive buffer is full, arriving segments are dropped and have to be sent again (and the advertised window tells the sender to slow down).
For more about how TCP works, see https://beej.us/guide/bgnet/
Wikipedia is pretty good but fairly dense
https://en.m.wikipedia.org/wiki/Transmission_Control_Protocol
Is it worth rewriting the entire program so that one thread reads from multiple InputStreams in a round robin like fashion? Would that, if there is a speedup, be worth the programming?
What you're describing would require moving from blocking I/O to non-blocking I/O. Non-blocking will require fewer system resources, but it is significantly harder to implement correctly and efficiently. So don't do it unless you have a pressing reason.
Thousands of threads (and stacks...) are probably too many for the OS scheduler, memory management units, caches...
You need just a few threads (one per CPU) and a select()-based solution on each of them.
Have a look at Selector, ServerSocketChannel and SocketChannel.
(see pages 30-31 of https://www.enib.fr/~harrouet/Data/Courses/Memo_Sockets.pdf)
Edit (after a question in the comments)
Selector is not just a clever algorithm encapsulated in a class. It relies internally on the select() system call (or an equivalent; there are many).
The operating system is aware of a set of file descriptors (communication means) it has to watch and, as soon as something happens on one (or several) of them, it wakes up the process (or thread) which is blocked on this selector.
The idea is to stay blocked as long as possible (to save resources) and to be woken up only when something useful has to be done with incoming data (there are variants).
In your current implementation, you use thousands of threads which are all blocked on a read()/recv() operation because you cannot know beforehand which connection will be the next one to deliver something. On the other hand, with a select()-based implementation, a single thread can be blocked watching many connections at the same time but will only react to handle the few which just delivered new data.
So I suggest that you start a pool of a few threads (one per CPU, for example) and, as soon as the main program accepts a new incoming connection, choose one of them (you can keep a count for each) to put it in charge of this new connection.
All of this requires proper synchronisation of course, and probably a trick (a special file descriptor in the selector, for example) in order to wake up a blocked thread when it is assigned a new connection.
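A minimal single-thread Selector loop might look like the following sketch. It binds to an ephemeral local port and polls once, just to show the registration and select() pattern; a real server would loop forever and read from channels whose keys report readable:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class SelectorSketch {
    /** Registers a listening channel and polls the selector once. */
    static int pollOnce() throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            server.configureBlocking(false); // required before register()
            server.register(selector, SelectionKey.OP_ACCEPT);

            // One thread watches many channels: select() blocks until a
            // registered channel is ready (bounded here by a 100 ms timeout).
            int ready = selector.select(100);
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    // a real server would then react to OP_READ with channel.read(buffer)
                    client.register(selector, SelectionKey.OP_READ);
                }
            }
            selector.selectedKeys().clear();
            return ready;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(pollOnce()); // 0 here: nothing connects in this demo
    }
}
```

Clearing selectedKeys() after each pass is easy to forget and a classic source of busy loops.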

How to execute non blocking HTTP calls in Java?

I have a third party API, which I call using an HTTP GET request. Each request takes a few seconds to get a response.
Currently I am using a CompletableFuture which I am executing on a FixedThreadPool of size 64. This causes the threads to block until a response to the GET request is received, i.e. the threads sit idle after sending the GET request until they receive a response. So the maximum number of simultaneous requests I can send out is limited by my thread pool size, i.e. 64 here.
What can I use instead of CompletableFuture so that my threads don't sit idle waiting for the response?
As @user207421 says:
A truly asynchronous (i.e. event driven) HTTP client application is complicated.
A multi-threaded (but fundamentally synchronous) HTTP client application is simpler, and scales to as many threads as you have memory for.
Assuming that you have 64 worker threads processing requests, the actual bottleneck is likely to be EITHER your physical network bandwidth, OR your available client-side CPU. If you have hit those limits, then:
increasing the number of worker threads is not going to help, and
switching to an asynchronous (event driven) model is not going to help.
A third possibility is that the bottleneck is server-side resource limits or rate limiting. In this scenario, increasing the client-side thread count might help, have no effect, or make the problem worse. It will depend on how the server is implemented, the nature of the requests, etc.
If your bottleneck really is the number of threads, then a simple thing to try is reducing the worker thread stack size so that you can run more of them. The default stack size is typically 1MB, and that is likely to be significantly more than it needs to be. (This will also reduce the ... erm ... memory overhead of idle threads, if that is a genuine issue.)
There are a few Java asynchronous HTTP client libraries around. But I have never used one and cannot recommend one. And like @user207421, I am not convinced that the effort of changing will actually pay off.
What can I [do] so that my threads don't sit idle waiting for the response?
Idle threads is actually not the problem. An idle thread is only using memory (and some possible secondary effects which probably don't matter here). Unless you are short of memory, it will make little difference.
Note: if there is something else for your client to do while a thread is waiting for a server response, the OS thread scheduler will switch to a different thread.
So my maximum number of simultaneous requests I can send out is limited by my thread [pool] size i.e. 64 here.
That is true. However, sending more simultaneous requests probably won't help. If the client-side threads are sitting idle, that probably means that the bottleneck is either the network, or something on the server side. If this is the case, adding more threads won't increase throughput. Instead individual requests will take (on average) longer, and throughput will stay the same ... or possibly drop if the server starts dropping requests from its request queue.
Finally, if you are worried of the overheads of a large pool of worker threads sitting idle (waiting for the next task to do), use an execution service or connection pool that can shrink and grow its thread pool to meet changing workloads.
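If you do want to see the asynchronous shape without a third-party library, the JDK's own java.net.http.HttpClient (Java 11+) exposes sendAsync. A sketch, with the URL as a placeholder for the third-party API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class AsyncGetSketch {
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Fires a GET without tying up a worker thread while waiting. */
    static CompletableFuture<Integer> statusOf(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // sendAsync returns immediately; the continuation runs when the
        // response arrives, on the client's internal executor.
        return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                     .thenApply(HttpResponse::statusCode);
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the real API endpoint.
        statusOf("https://example.com/")
                .thenAccept(code -> System.out.println("status: " + code))
                .join();
    }
}
```

Whether this is worth it over the 64-thread pool is exactly the question the answer above raises; the sketch only shows what the change would look like.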

how can i utilize the power of CLUSTER ENVIRONMENT for my thread pool that is dealing with I/O bound jobs?

I have developed a Java-based server with a thread pool that grows dynamically with the client request rate. This strategy is known as FBOS (Frequency Based Optimization Strategy) for thread pool systems.
For example, if the request rate is 5 requests per second then my thread pool will have 5 threads to service clients' requests. The client requests are I/O-bound jobs of 1 second, i.e. each request is a runnable Java object that calls sleep() to simulate an I/O operation.
If the client request rate is 10 requests per second then my thread pool will have 10 threads in it to process clients. Each thread has an internal timer object that is activated when its corresponding thread is idle; when the idle time reaches 5 seconds, the timer deletes the thread from the pool to dynamically shrink it.
My strategy is working well for short I/O intensities. My server works nicely for small request rates, but for large request rates my thread pool holds a large number of threads. For example, if the request rate is 100 requests per second then my thread pool will have 100 threads in it.
Now I have 3 questions in my mind
(1) Can I face memory leaks using this strategy at a large request rate?
(2) Can the OS or JVM incur excessive thread-management overhead at a large request rate that will slow down the system?
(3) Last and very important: I am very curious to implement my thread pool in a clustered environment (I am a DUMMY in clustering).
I just want advice from all of you on how a clustering environment can give me more benefit in the scenario of a frequency-based thread pool for I/O-bound jobs only. That is, can a clustering environment give me the benefit of using the memory of other systems (nodes)?
The simplest solution is a cached thread pool; see Executors. I suggest you try this first. It will create as many threads as are needed at once. For I/O-bound requests, a single machine can easily expand to thousands of threads without needing an additional server.
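A minimal sketch of the cached-pool approach, with a 100 ms sleep standing in for the I/O-bound job:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CachedPoolDemo {
    static int runJob() throws Exception {
        // Grows on demand, reuses idle threads, and (by default) lets
        // threads die after 60 s idle, so the pool shrinks by itself.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            return pool.submit(() -> {
                Thread.sleep(100); // stand-in for a 100 ms I/O-bound job
                return 42;
            }).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runJob()); // 42
    }
}
```

The built-in 60-second keep-alive gives you the same shrink-when-idle behaviour as the hand-rolled timers described in the question.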
Can i face memory leaks using this strategy, for large request rate?
No, 100 per second is not particularly high. If you are talking over 10,000 per second, you might have a problem (or need another server).
Can OS or JVM face excessive Thread management overhead on large request rate that will slow down the system
Yes, my rule of thumb is that 10,000 threads wastes about 1 cpu in overhead.
Last and very important: I am very curious to implement my thread pool in a clustered environment (I am a DUMMY in clustering).
Given you look to be using up to 1% of one machine, I wouldn't worry about using multiple machines to do the I/O. Most likely you want to process the results, but without more information one can't say whether more machines would help or not.
can a clustering environment give me benefit of using memories of other systems(nodes)?
It can help if you need it or it can add complexity you don't need if you don't need it.
I suggest you start with a real problem and look for a solution to solve it, rather than start with a cool solution and try to find a problem for it to solve.

busy spin to reduce context switch latency (java)

In my application there are several services that process information on their own thread, when they are done they post a message to the next service which then continue to do its work on its own thread. The handover of messages is done via a LinkedBlockingQueue. The handover normally takes 50-80 us (from putting a message on the queue until the consumer starts to process the message).
To speed up the handover on the most important services I wanted to use a busy spin instead of a blocking approach (I have 12 processor cores and want to dedicate 3 to these important services).
So.. I changed LinkedBlockingQueue to ConcurrentLinkedQueue
and did
for (;;)
{
    Message m = queue.poll();
    if (m != null)
        ....
}
Now.. the result is that the first message pass takes 1 us, but then the latency increases over the next 25 handovers until it reaches 500 us, and then the latency is suddenly back at 1 us and starts to increase again. So I have latency cycles of 25 iterations where latency starts at 1 us and ends at 500 us (messages are passed approximately 100 times per second).
With an average latency of 250 us, it is not exactly the performance gain I was looking for.
I also tried to use the LMAX Disruptor ring buffer instead of the ConcurrentLinkedQueue. That framework has its own built-in busy-spin implementation and a quite different queue implementation, but the result was the same. So I'm quite certain that it's not the fault of the queue or me abusing something.
Question is.. What the Heck is going on here? Why am I seeing this strange latency cycles?
Cheers!!
As far as I know, the thread scheduler can deliberately pause a thread for a longer time if it detects that the thread is using the CPU quite intensively, to distribute CPU time between different threads more fairly. Try adding LockSupport.park() in the consumer after the queue is empty, and LockSupport.unpark() in the producer after adding the message; it might make the latency less variable. Whether it will actually beat the blocking queue is a big question though.
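A rough sketch of that park/unpark idea. The parkNanos timeout is a defensive choice (so a racing unpark can never strand the consumer); this is illustrative, not tuned:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

public class ParkingQueue<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private volatile Thread consumer; // registered before the consumer parks

    public void put(T msg) {
        queue.offer(msg);
        Thread c = consumer;
        if (c != null) {
            LockSupport.unpark(c); // wake the consumer if it is parked
        }
    }

    public T take() {
        consumer = Thread.currentThread();
        try {
            T msg;
            while ((msg = queue.poll()) == null) {
                // parkNanos (not park) so a missed wake-up only costs 100 µs
                // instead of hanging forever; still far cheaper than spinning hot.
                LockSupport.parkNanos(100_000);
            }
            return msg;
        } finally {
            consumer = null;
        }
    }
}
```

This keeps the low-latency wake-up of unpark while avoiding the scheduler penalties that a pure busy spin can attract.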
If you really need to do the job the way you described (and not the way Andrey Nudko replied at Jan 5 at 13:22), then you definitely need to look at the problem from other viewpoints as well.
Just some hints:
Try checking your overall environment (outside the JVM). For example:
The OS CPU scheduler has a huge impact on this... currently the default is very likely the Completely Fair Scheduler:
http://en.wikipedia.org/wiki/Completely_Fair_Scheduler
number of running processes, etc.
"problems" inside your JVM
garbage collector (try different one: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#1.1.%20Types%20of%20Collectors%7Coutline)
Try changing thread priorities: Setting priority to Java's threads
This is just wild speculation (since as others have mentioned, you're not gathering any information on the queue length, failed polls, null polls etc.):
I used the force and read the source of ConcurrentLinkedQueue, or rather, briefly leafed through it for a minute or two. The polling is not quite your trivial O(1) operation. It might be the case that you're traversing more than a few nodes which have become stale, holding null; and there might be additional transitory states involving nodes linking to themselves as the next node as indication of staleness/removal from the queue. It may be that the queue is starting to build up garbage due to thread scheduling. Try following the links to the abstract algorithm mentioned in the code:
Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue by Maged M. Michael and Michael L. Scott (link has a PDF and pseudocode).
Here is my 2 cents. If you are running on Linux/Unix based systems, there is a way to dedicate a certain CPU to a certain thread. In essence, you can make the OS ignore that CPU for any scheduling. Check out CPU isolation (e.g. the isolcpus kernel parameter).
