I have a third party API, which I call using an HTTP GET request. Each request takes a few seconds to get a response.
Currently I am using a CompletableFuture which I am executing on a FixedThreadPool of size 64. This causes the threads to be blocked until they receive a response for the GET request, i.e. the threads sit idle after sending the GET request until they receive the response. So the maximum number of simultaneous requests I can send out is limited by my thread pool size, i.e. 64 here.
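For reference, the pattern described looks roughly like this (my own sketch; callApi stands in for the blocking GET call, which isn't shown in the question):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockingGetSketch {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(64);

    static CompletableFuture<String> fetch(String url) {
        // supplyAsync hands the work to a pool thread, but that thread then
        // blocks inside callApi() until the HTTP response arrives
        return CompletableFuture.supplyAsync(() -> callApi(url), POOL);
    }

    static String callApi(String url) {
        return "..."; // placeholder for the blocking HTTP GET
    }
}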
What can I use instead of CompletableFuture so that my threads don't sit idle waiting for the response?
As @user207421 says:
A truly asynchronous (i.e. event driven) HTTP client application is complicated.
A multi-threaded (but fundamentally synchronous) HTTP client application is simpler, and scales to as many threads as you have memory for.
Assuming that you have 64 worker threads processing requests, the actual bottleneck is likely to be EITHER your physical network bandwidth, OR your available client-side CPU. If you have hit those limits, then:
increasing the number of worker threads is not going to help, and
switching to an asynchronous (event driven) model is not going to help.
A third possibility is that the bottleneck is server-side resource limits or rate limiting. In this scenario, increasing the client-side thread count might help, have no effect, or make the problem worse. It will depend on how the server is implemented, the nature of the requests, etc.
If your bottleneck really is the number of threads, then a simple thing to try is reducing the worker thread stack size so that you can run more of them. The default stack size is typically 1MB, and that is likely to be significantly more than it needs to be. (This will also reduce the ... erm ... memory overhead of idle threads, if that is a genuine issue.)
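For example (my own sketch, untested for your workload), you can either pass -Xss to the JVM or give the pool a ThreadFactory that requests a smaller per-thread stack. Note that the stackSize argument is only a hint, which some platforms ignore:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

class SmallStackPool {
    public static void main(String[] args) {
        // request ~256 KB stacks instead of the typical ~1 MB default;
        // the JVM/OS may round this up or ignore it entirely
        ThreadFactory smallStacks = r -> new Thread(null, r, "worker", 256 * 1024);
        ExecutorService pool = Executors.newFixedThreadPool(64, smallStacks);
        pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}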
There are a few Java asynchronous HTTP client libraries around, but I have never used one and cannot recommend one. And like @user207421, I am not convinced that the effort of changing will actually pay off.
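For completeness: since Java 11 the JDK itself ships an asynchronous client, java.net.http.HttpClient. A minimal sketch of the non-blocking style (the URL is a placeholder), in case you want to experiment despite the caveats above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class AsyncHttpSketch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api"))
                .GET()
                .build();

        // sendAsync returns immediately; no worker thread sits blocked waiting for the response
        CompletableFuture<Void> done = client
                .sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(response -> System.out.println(response.statusCode()));

        done.join(); // only so this demo doesn't exit before the response arrives
    }
}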
What can I [do] so that my threads don't sit idle waiting for the response?
Idle threads are actually not the problem. An idle thread only uses memory (and has some possible secondary effects which probably don't matter here). Unless you are short of memory, it will make little difference.
Note: if there is something else for your client to do while a thread is waiting for a server response, the OS thread scheduler will switch to a different thread.
So my maximum number of simultaneous requests I can send out is limited by my thread [pool] size i.e. 64 here.
That is true. However, sending more simultaneous requests probably won't help. If the client-side threads are sitting idle, that probably means that the bottleneck is either the network, or something on the server side. If this is the case, adding more threads won't increase throughput. Instead individual requests will take (on average) longer, and throughput will stay the same ... or possibly drop if the server starts dropping requests from its request queue.
Finally, if you are worried about the overhead of a large pool of worker threads sitting idle (waiting for the next task to do), use an executor service or connection pool that can shrink and grow its thread pool to meet changing workloads.
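A sketch of such a pool using the standard ThreadPoolExecutor (the numbers are arbitrary placeholders):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ElasticPool {
    static ThreadPoolExecutor create() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4,                    // core threads kept around under light load
                64,                   // upper bound; extra threads are only added once the queue is full
                30, TimeUnit.SECONDS, // idle threads above the core are retired after 30s
                new LinkedBlockingQueue<>(1_000));
        pool.allowCoreThreadTimeOut(true); // let even the core shrink away when fully idle
        return pool;
    }
}

Alternatively, Executors.newCachedThreadPool() gives similar grow-and-shrink behaviour with no tuning, at the cost of having no upper bound on the number of threads.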
Related
My goal is to handle WebSocket connections inside threads. If I use a new Thread per connection, the number of WebSocket connections the server can handle is unknown. If I use a thread pool, the number of WebSocket connections the server can handle equals the thread pool size.
I am not sure about the correlation between available processors and threads. Does 1 processor execute 1 thread at a time?
My expected result: Creating more threads than the available processors is not advisable and you should re-design how you handle the WebSocket connections.
in a new Thread
final Socket socket = serverSocket.accept();
new Thread(new WebSocket(socket, listener)).start(); // one new thread per accepted connection
in a Thread pool
// pool sized to the number of available processors
final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final Socket socket = serverSocket.accept();
es.execute(new WebSocket(socket, listener)); // connections share the fixed pool
To avoid confusion, the WebSocket class is a custom class that implements Runnable. As far as I know, Java SE does not have a WebSocket server, only a WebSocket client.
Make threads. A thousand if you want.
At the CPU core level, here's what's happening:
The CPU core is chugging along, doing work for a given websocket.
Pretty soon the core runs into a road block: Half of an incoming bunch of data has arrived, the rest is still making its way down the network cable, and thus the CPU can't continue until it arrives. Alternatively, the code that the CPU core is running is sending data out, but the network card's buffer is full, so now the CPU core has to wait for that network card to find its way to sending another packet down the cable before there's room.
Of course, if there's work to do (say, you have 10 cores in the box, and 15 web users are simultaneously connected, that leaves at least 5 users of your web site waiting around right now) - then the CPU should not just start twiddling its thumbs. It should go do something.
In practice, then, there's a whole boatload of memory that WAS relevant but no longer is (all the memory that contained the state and other 'working items' necessary to do the work for the websocket we were working on, but which is currently 'blocked' by the network), and a whole bunch of memory that wasn't relevant but now becomes relevant (all the state and working memory of a websocket connection that was earlier put in the 'have yourself a bit of a timeout and wait around for the network packet to arrive' pile, for which the network packet has since arrived, so if a CPU core is free to do work, it can now go do it).
This is called a 'context switch', and it is ridiculously expensive, 500+ cycles worth. It is also completely unavoidable. You have to make the context switch. You can't avoid it. That means a cost is paid, and about 500 cycles worth just go down the toilet. It's what it is.
The thing is, there are two ways to pay that cost: you can switch to another thread, which is a full context switch. Or you can have a single thread running so-called 'async' code that manages all this stuff itself and hops to another job, but then there's still a context switch.
Specifically, CPUs can't interact with main memory directly anymore these days and haven't for the past decade; they can only interact with a CPU cache page. Machine code is not really 'run directly' anymore: there's a level below it where the CPU notices it's about to run an instruction that touches some memory and maps that memory access (after all, main memory is far too slow to wait for) to the right spot in the cache. It will also notice if the memory your machine code is trying to access isn't in a cache page associated with that core at all, in which case there's a cache miss, which causes the memory subsystem of your CPU/memory bus to 'evict a page' (write it back out to main memory) and then load in the right page, and only then does the CPU continue.
This all happens 'under the hood', you don't have to write code to switch pages, the CPU manages it automatically. But it's a heavy cost. Not quite as heavy as a thread switch but almost as heavy.
CONCLUSION: Threads are good, have many of them. It ensures CPUs won't twiddle their thumbs when there is work to do. Note that there are MANY blog posts that extoll the virtues of async, claiming that threads 'do not scale'. They are wrong. Threads scale fine, and async code also pays the cost of context switching, all the time.
In case you weren't aware, 'async code' is code that tries never to sleep (never do something that would wait). So, instead of writing 'getMeTheNextBlockOfBytesFromTheNetworkCard', you'd write 'onceBytesAreAvailableRunThis(code goes here)'. Writing async code in Java is possible, but incredibly difficult compared to using threads.
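To make the contrast concrete, here is a sketch of my own using the NIO.2 API (handle() is a stand-in for your processing code):

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

class TwoStyles {
    // Blocking style: the calling thread sleeps inside read() until bytes arrive.
    static void blockingRead(InputStream in, byte[] buf) throws IOException {
        int n = in.read(buf);   // thread parked here while waiting on the network
        handle(buf, n);
    }

    // Callback ("async") style: register the code to run once bytes are available,
    // then return immediately and go do something else.
    static void asyncRead(AsynchronousSocketChannel channel) {
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        channel.read(buffer, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer n, Void att) {
                handle(buffer.array(), n);   // "code goes here" - runs later
            }
            @Override public void failed(Throwable exc, Void att) {
                exc.printStackTrace();
            }
        });
        // the current thread is already free again at this point
    }

    static void handle(byte[] data, int length) { /* stand-in for real processing */ }
}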
Even in the extremely rare cases where async code would be a significant win, Project Loom is close to completion, which will grant Java the ability to have thread-like things that you can manually manage (so-called fibers). That is the route the OpenJDK has chosen for this. In that sense, even if you think async is the answer, no it's not. Wait for Project Loom to complete instead. If you want to read more, read What color is your function?, and callback hell. Neither post is Java-specific, but both cover some of the more serious problems inherent in async.
I am working with the Volley library in Android for HTTP communication. By default Volley keeps 4 threads which take HTTP Request objects (a Request object contains all the details for making an HTTP request, such as the URL, HTTP method, data to be posted, etc.) from a BlockingQueue and make HTTP requests concurrently. When I analyze my app's requirements, I will be using all 4 threads at once less than 10% of the time; the rest of the time I will be using 1 or 2 threads from that thread pool. So in effect 2 to 3 threads will be in wait() mode almost 90% of the time.
So here is my question,
1) What is the overhead of a thread which is in wait() mode? Does it consume a significant amount of CPU cycles, and is it a good idea for me to keep all those threads waiting?
I assume that since a waiting thread will be continuously checking on a monitor/lock in a loop or so (in the internal implementation) in order to wake up, it might consume a considerable amount of CPU cycles to maintain a waiting thread. Correct me if I am wrong.
Thanks.
What is the overhead of a thread which is in wait() mode
None. A waiting thread doesn't consume any CPU cycles at all; it just waits to be woken up. So don't worry about it.
I assume that since a waiting thread will be continuously polling on a monitor/lock internally to wake up, it might consume a considerable amount of CPU cycles to maintain a waiting thread. Correct me if I am wrong.
That's not true. A waiting thread doesn't do any polling on a monitor/lock/anything.
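A quick way to convince yourself (my own toy example): a thread blocked in wait() is simply descheduled, shows up in the WAITING state, and does nothing until notify() is called:

public class WaitDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();

        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait();   // parked here: no polling, no CPU use
                } catch (InterruptedException ignored) {
                }
            }
        }, "waiter");
        waiter.start();

        Thread.sleep(100);                     // give the waiter time to reach wait()
        System.out.println(waiter.getState()); // prints WAITING

        synchronized (lock) {
            lock.notify();                     // only now does the waiter become runnable again
        }
        waiter.join();
    }
}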
The only situation where a large number of threads can hurt performance is when there are many active threads (far more than the number of CPUs/cores) which are frequently switched back and forth, because CPU context switching also comes with some cost. Waiting threads only consume memory, not CPU.
If you want to look at the internal implementation of threads, I have to disappoint you. Methods like wait()/notify() are native, which means that their implementation depends on the JVM. In the case of the HotSpot JVM you can take a look at its source code (written in C++, with a bit of assembler).
But do you really need this? Why not just trust the JVM documentation?
I have developed a Java based server with a thread pool that grows dynamically with the client request rate. This strategy is known as FBOS (Frequency Based Optimization Strategy) for thread pool systems.
For example, if the request rate is 5 requests per second then my thread pool will have 5 threads to service clients' requests. The client requests are I/O bound jobs of 1 second, i.e. each request is a Runnable Java object that calls sleep() to simulate an I/O operation.
If the client request rate is 10 requests per second then my thread pool will have 10 threads to process clients. Each thread has an internal timer object that is activated when its thread is idle; once the idle time reaches 5 seconds, the timer removes that thread from the pool, dynamically shrinking it.
My strategy works well for short I/O intensities. My server works nicely for small request rates, but for large request rates my thread pool holds a large number of threads. For example, if the request rate is 100 requests per second then my thread pool will have 100 threads inside it.
Now I have 3 questions in my mind
(1) Can I face memory leaks using this strategy at a large request rate?
(2) Can the OS or JVM face excessive thread management overhead at a large request rate that will slow down the system?
(3) Last, and a very important question: I am very curious to implement my thread pool in a clustered environment (I am a dummy in clustering).
I just want advice from all of you on how a clustering environment could give me more benefit in this scenario of a Frequency Based Thread Pool for I/O bound jobs only. That is, can a clustering environment give me the benefit of using the memory of other systems (nodes)?
The simplest solution is to use a cached thread pool; see Executors. I suggest you try this first. It will create as many threads as you need, when you need them. For IO bound requests, a single machine can easily expand to thousands of threads without needing an additional server.
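A sketch of what that looks like (clientJob is a placeholder for your Runnable request objects):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CachedPoolSketch {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    static void submitRequest(Runnable clientJob) {
        // a new thread is created only if no idle one is available;
        // threads idle for 60 seconds are removed automatically
        POOL.execute(clientJob);
    }
}

This gives you the grow-and-shrink behaviour you built by hand with per-thread timers, without any of the bookkeeping.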
Can I face memory leaks using this strategy at a large request rate?
No, 100 per second is not particularly high. If you are talking about over 10,000 per second, you might have a problem (or need another server).
Can the OS or JVM face excessive thread management overhead at a large request rate that will slow down the system?
Yes, my rule of thumb is that 10,000 threads wastes about 1 cpu in overhead.
Last, and a very important question: I am very curious to implement my thread pool in a clustered environment (I am a dummy in clustering).
Given that you look to be using up to 1% of one machine, I wouldn't worry about using multiple machines to do the IO. Most likely you want to process the results, but without more information I couldn't say whether more machines would help or not.
Can a clustering environment give me the benefit of using the memory of other systems (nodes)?
It can help if you need it or it can add complexity you don't need if you don't need it.
I suggest you start with a real problem and look for a solution to solve it, rather than start with a cool solution and try to find a problem for it to solve.
Brief
I am running a multithreaded tcp server that uses a fixed thread pool with an unbounded Runnable queue. The clients dispatch the runnables to the pool.
In my stress test scenario, 600 clients attempt to log in to the server and immediately broadcast messages to every other client, simultaneously, repeatedly, without end and without sleeping (right now the clients just discard the incoming messages). Using a quad-core with 1GB reserved for heap memory - and a parallel GC for both the young and old generations - the server crashes with an OOM exception after 20 minutes. Monitoring the garbage collector reveals that the tenured generation is slowly increasing, and a full GC only frees up a small fraction of memory. A snapshot of the full heap shows that the old generation is almost completely occupied by Runnables (and their outgoing references).
It seems the worker threads are not able to finish executing the Runnables as fast as the clients are able to queue them for execution (for each incoming "event" to the server, the server creates 599 Runnables, as there are 600 - 1 other clients, assuming they are all logged in at the time).
Question
Can someone please help me conceive a strategy on how to handle the overwhelmed thread pool workers?
Also
If I bound the queue, what policy should I implement to handle rejected execution?
If I increase the size of the heap, wouldn't that only postpone the OOM exception?
A calculation could be made to measure the amount of work pending in the aggregation of Runnables. Perhaps this measurement could be used as the basis for a locking mechanism to coordinate clients' dispatching of work?
What reaction should the client experience when the server is overwhelmed with work?
Do not use an unbounded queue. I cannot tell you what the bound should be; your load tests should give you an answer to that question. Anyhow, make the bound configurable: at least dynamically configurable, better yet adaptable to some load measurement.
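A sketch of a bounded pool with an explicit policy for the overloaded case (the sizes are placeholders to be derived from your load tests):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPool {
    static ThreadPoolExecutor create(int workers, int queueCapacity) {
        return new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // the bound: keep it configurable
                new ThreadPoolExecutor.AbortPolicy());     // rejected tasks throw, so the server
                                                           // can report "overloaded" to the client
    }
}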
You did not tell us how the clients submit their requests, but if HTTP is involved, there already is a status code for the overloaded case: 503 Service Unavailable.
I would suggest you limit the capacity of the queue and "push back" on the publisher to stop it publishing, or drop the requests gracefully. You can do the former by making the queue block when it's full.
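One way to get that blocking behaviour with the standard executor (my own sketch; sizes arbitrary) is a rejection handler that re-queues the task with put(), which blocks the publishing thread until there is room:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PushBackPool {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10_000),       // bounded work queue
                (task, executor) -> {                   // invoked when the queue is full
                    try {
                        executor.getQueue().put(task);  // block the publisher instead of dropping
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
    }
}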
You should be able to calculate your maximum throughput based on your network bandwidth and message size. If you are getting less than this, I would consider changing how your server distributes data.
Another approach is to make your message handling more efficient. You could have each reading thread for each client write directly to the listening clients. This avoids the need for an explicit queue (you might think of the buffers in the Socket as a queue of bytes) and limits the speed to whatever the server can handle. It will also not use more memory under load than it does when idle.
Using this approach you can achieve message rates as high as your network bandwidth can handle (even with a 10 Gig-E network). This moves the bottleneck elsewhere, meaning you still have a problem, but your server shouldn't fail.
BTW: If you use direct ByteBuffers you can do this without creating garbage and with a minimum of heap. e.g. ~1 KB of heap per client.
It sounds as if you're doing load testing. I would determine what you consider to be an "acceptable heavy load": what is the heaviest amount of traffic you can expect a single client to generate? Then double it, or triple it, or scale it in a similar manner. Use this threshold to throttle or deny clients that use this much bandwidth.
This has a number of perks. First, it gives you the kind of analysis you need to determine server load (users per server). Second, it gives you a first line of defense against DDoS attacks.
You have to somehow throttle the incoming requests, and the mechanism for doing that should depend on the work you are trying to do. Anything else will simply result in an OOM under enough load, and thus open you up to DoS attacks (even unintentional ones).
Fundamentally, you have 4 choices:
Make clients wait until you are ready to accept their requests
Actively reject client requests until you are ready to accept new requests
Allow clients to timeout while trying to reach your server when it is not ready to receive requests
A blend of 2 or 3 of the above strategies.
The right strategy depends on how your real clients will react under the various circumstances – is it better for them to wait, possibly (effectively) indefinitely, or is it better that they know quickly that their work won't get done unless they try again later?
Whichever way you do it, you need to be able to count the number of tasks currently queued and either add a delay, block completely, or return an error condition based on the number of items in the queue.
A simple blocking strategy can be implemented by using a BlockingQueue implementation. However, this doesn't give particularly fine-grained control.
Or you can use a Semaphore to control permits to add tasks to the queue, which has the advantage of supplying a tryAcquire(long timeout, TimeUnit unit) method if you want to apply a mild throttling.
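A sketch of that idea (the class and its names are mine; error handling around execute() is omitted): each submission must acquire a permit, waiting briefly before the client is told to back off:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class ThrottledSubmitter {
    private final ExecutorService pool;
    private final Semaphore permits;

    ThrottledSubmitter(ExecutorService pool, int maxQueued) {
        this.pool = pool;
        this.permits = new Semaphore(maxQueued);
    }

    /** Returns false if the server is too busy; the client should retry later. */
    boolean trySubmit(Runnable task) throws InterruptedException {
        // mild throttling: wait up to 100 ms for capacity before giving up
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            return false;
        }
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release();   // free the slot once the work is done
            }
        });
        return true;
    }
}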
Whichever way, don't allow the threads that service the clients to grow without bounds, or else you'll simply end up with an OOM for a different reason!
Paul Tyma's presentation has this line:
Executors.newCacheThreadPool evil, die die die
Why is it evil ?
I will hazard a guess: is it because the number of threads will grow in an unbounded fashion? Thus a server that has been slashdotted would probably die if the JVM's max thread count was reached?
(This is Paul)
The intent of the slide was (apart from the facetious wording) that, as you mention, that thread pool grows without bound, creating new threads.
A thread pool inherently represents a queue and transfer point of work within a system. That is, something is feeding it work to do (and it may be feeding work elsewhere too). If a thread pool starts to grow, it's because it cannot keep up with demand.
In general, that's fine as computer resources are finite and that queue is built to handle bursts of work. However, that thread pool doesn't give you control over being able to push the bottleneck forward.
For example, in a server scenario, a few threads might be accepting on sockets and handing a thread pool the clients for processing. If that thread pool starts to grow out of control - the system should stop accepting new clients (in fact, the "acceptor" threads then often hop into the thread-pool temporarily to help process clients).
The effect is similar if you use a fixed thread pool with an unbounded input queue. Anytime you consider the scenario of the queue filling out of control - you realize the problem.
IIRC, Matt Welsh's seminal SEDA servers (which are asynchronous) created thread pools which modified their size according to server characteristics.
The idea of stop accepting new clients sounds bad until you realize the alternative is a crippled system which is processing no clients. (Again, with the understanding that computers are finite - even an optimally tuned system has a limit)
Incidentally, JVMs limit threads to 16k (usually) or 32k threads depending on the JVM. But if you are CPU bound, that limit isn't very relevant - starting yet another thread on a CPU-bound system is counterproductive.
I've happily run systems at 4 or 5 thousand threads. But nearing the 16k limit things tend to bog down (that limit is JVM enforced - we had many more threads in Linux C++), even when not CPU bound.
The problem with Executors.newCachedThreadPool() is that the executor will create and start as many threads as necessary to execute the tasks submitted to it. While this is mitigated by the fact that idle threads are eventually released (the thresholds are configurable), this can indeed lead to severe resource starvation, or even crash the JVM (or some badly designed OS).
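For reference, Executors.newCachedThreadPool() is equivalent to the following ThreadPoolExecutor configuration, which is where the unbounded growth comes from:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class WhatCachedPoolIs {
    static ExecutorService equivalentOfCachedPool() {
        return new ThreadPoolExecutor(
                0,                          // no threads are kept when everything is idle
                Integer.MAX_VALUE,          // effectively no upper bound on thread creation
                60L, TimeUnit.SECONDS,      // idle threads die after 60 seconds
                new SynchronousQueue<>());  // zero-capacity hand-off queue
    }
}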
There are a couple of issues with it. Unbounded growth in terms of threads is an obvious issue – if you have CPU bound tasks then allowing many more threads than the available CPUs to run them is simply going to create scheduler overhead, with your threads context switching all over the place and none actually progressing much. If your tasks are IO bound, though, things get more subtle. Knowing how to size pools of threads that are waiting on network or file IO is much more difficult, and depends a lot on the latencies of those IO events. Higher latencies mean you need (and can support) more threads.
The cached thread pool continues adding new threads as the rate of task production outstrips the rate of execution. There are a couple of small barriers to this (such as locks that serialise new thread id creation), but this unbounded growth can lead to out-of-memory errors.
The other big problem with the cached thread pool is that it can be slow for the task producer thread. The pool is configured with a SynchronousQueue for tasks to be offered to. This queue implementation basically has zero size and only works when there is a matching consumer for a producer (there is a thread polling when another is offering). The actual implementation was significantly improved in Java 6, but it is still comparatively slow for the producer, particularly when it fails (as the producer is then responsible for creating a new thread to add to the pool). Often it is more appropriate for the producer thread to simply drop the task on an actual queue and continue.
The problem is, no-one has a pool that has a small core set of threads which, when they are all busy, creates new threads up to some max and then enqueues subsequent tasks. ThreadPoolExecutor seems to promise this, but it only starts adding threads beyond the core size when the underlying queue rejects more tasks (i.e. it is full). A LinkedBlockingQueue never gets full, so such pools never grow beyond the core size. An ArrayBlockingQueue has a capacity, but since the pool only grows once that capacity is reached, this doesn't mitigate the production rate until it is already a big problem. Currently the solution requires using a good rejected execution policy such as caller-runs, but it needs some care.
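A sketch of that workaround (sizes are placeholders): a small core, a bounded queue so growth can actually trigger, and caller-runs as the pressure valve once the pool is at its maximum and the queue is full:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class GrowingPool {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                4,                             // small core
                32,                            // extra threads, added only once the queue fills
                60L, TimeUnit.SECONDS,         // extra threads retire when idle
                new ArrayBlockingQueue<>(100), // bounded, so growth can actually happen
                new ThreadPoolExecutor.CallerRunsPolicy()); // at max size with a full queue,
                                                            // the submitting thread runs the task itself
    }
}

The "needs some care" part is that with caller-runs the producer thread ends up executing tasks itself under load, which provides back-pressure but can also delay whatever else that thread was supposed to be doing.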
Developers see the cached thread pool and blindly use it without really thinking through the consequences.