Just wondering: what is the best way to decide when to stop creating new threads on a single-core machine that runs the same task multiple times, each in its own thread?
The threads are fetching web content and doing a bit of processing, which means the load of each thread is not constant all the way until the thread terminates.
I'm thinking of having a thread which monitors the CPU/RAM load and stops creating threads if the load reaches a certain threshold, but also stops creating threads once a certain thread count has been reached, to make sure the CPU doesn't get overloaded.
Any feedback on what techniques are out there to achieve this?
Many thanks,
Vladimir
It is going to be difficult to do this by monitoring the CPU used by the current process. Those numbers tend to lag reality, and the result is going to be peaks and valleys to a large degree. The problem is that your threads are mostly going to be blocked by IO, and there is no good way to anticipate when bytes will be available to be read in the near future.
That said, you could start out with a ThreadPoolExecutor at a certain max thread number (for a single processor, let's say 4) and then check the load average every 10 seconds or so. If the load average is below what you want, you could call setMaximumPoolSize(...) with a larger value to increase it for the next 10 seconds. You may need to wait 30 or more seconds between each calculation to smooth out the performance of your application.
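A rough sketch of that polling loop might look like the following. The 0.8 per-core target and the one-thread step are arbitrary assumptions, and note that a pool from Executors.newFixedThreadPool keeps core == max over an unbounded queue, so both sizes have to move together:
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
double targetLoad = 0.8; // assumed per-core load we are willing to drive the box to

ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
monitor.scheduleAtFixedRate(() -> {
    // load average is negative on platforms that don't report it (e.g. Windows)
    double perCore = os.getSystemLoadAverage() / os.getAvailableProcessors();
    if (perCore >= 0 && perCore < targetLoad) {
        int size = pool.getMaximumPoolSize() + 1; // headroom: grow by one thread
        pool.setMaximumPoolSize(size);            // raise max before core
        pool.setCorePoolSize(size);
    } else if (pool.getCorePoolSize() > 1) {
        pool.setCorePoolSize(pool.getCorePoolSize() - 1); // lower core before max
        pool.setMaximumPoolSize(Math.max(1, pool.getMaximumPoolSize() - 1));
    }
}, 30, 30, TimeUnit.SECONDS);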
You could use the following code to track the total CPU time for all threads. I'm not sure if that's the best way to do it:
ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean(); // java.lang.management
long total = 0;
for (long id : threadMxBean.getAllThreadIds()) {
    long cpuTime = threadMxBean.getThreadCpuTime(id);
    if (cpuTime > 0) { // -1 means the thread has died or CPU timing is unsupported
        total += cpuTime;
    }
}
// getThreadCpuTime() returns nanoseconds
long currentCpuMillis = total / 1000000;
Instead of trying to maximize the CPU level for your spider, you might consider trying to maximize throughput. Sample the number of pages spidered per unit of time and increase or decrease the max number of threads in your ExecutorService until this is maximized.
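As a sketch of that feedback loop (the 30-second interval, the step of two threads, and the pagesSpidered counter are all assumptions, not measured values):
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.atomic.AtomicLong;

// hypothetical hill-climbing tuner: compare pages spidered in the last interval
// with the previous one and move the pool size in whichever direction helps
class ThroughputTuner implements Runnable {
    private final ThreadPoolExecutor pool;
    private final AtomicLong pagesSpidered; // workers increment this per page
    private long lastCount = 0;
    private double lastRate = 0;
    private int step = 2; // threads to add or remove per interval

    ThroughputTuner(ThreadPoolExecutor pool, AtomicLong pagesSpidered) {
        this.pool = pool;
        this.pagesSpidered = pagesSpidered;
    }

    @Override
    public void run() { // schedule this every 30s with a ScheduledExecutorService
        long count = pagesSpidered.get();
        double rate = count - lastCount; // pages in the last interval
        if (rate < lastRate) {
            step = -step; // throughput dropped: reverse direction
        }
        int newSize = Math.max(1, pool.getMaximumPoolSize() + step);
        if (newSize > pool.getMaximumPoolSize()) { // grow: raise max before core
            pool.setMaximumPoolSize(newSize);
            pool.setCorePoolSize(newSize);
        } else {                                   // shrink: lower core before max
            pool.setCorePoolSize(newSize);
            pool.setMaximumPoolSize(newSize);
        }
        lastCount = count;
        lastRate = rate;
    }
}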
One thing to consider is to use NIO and selectors so your threads are always busy as opposed to always waiting for IO. Here's a good example tutorial about NIO/Selectors. You might also consider using Pyronet which seems to provide some good features around NIO.
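For illustration, a minimal single-threaded selector loop might look like the sketch below; the host names and the bare HTTP/1.0 request are placeholders:
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

Selector selector = Selector.open();
// register a handful of non-blocking connections with one selector
for (String host : new String[] { "example.com", "example.org" }) {
    SocketChannel ch = SocketChannel.open();
    ch.configureBlocking(false);
    ch.connect(new InetSocketAddress(host, 80));
    ch.register(selector, SelectionKey.OP_CONNECT);
}
ByteBuffer buf = ByteBuffer.allocate(8192);
while (!selector.keys().isEmpty()) {
    if (selector.select(1000) == 0) continue; // wake periodically so the loop can end
    for (SelectionKey key : selector.selectedKeys()) {
        SocketChannel ch = (SocketChannel) key.channel();
        if (key.isConnectable() && ch.finishConnect()) {
            key.interestOps(SelectionKey.OP_WRITE);    // connected: send the request
        } else if (key.isWritable()) {
            ch.write(ByteBuffer.wrap("GET / HTTP/1.0\r\n\r\n".getBytes()));
            key.interestOps(SelectionKey.OP_READ);     // request sent: await the response
        } else if (key.isReadable()) {
            buf.clear();
            if (ch.read(buf) < 0) { key.cancel(); ch.close(); } // server closed: done
            // otherwise: process the bytes in buf here
        }
    }
    selector.selectedKeys().clear();
}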
If async I/O is not a good fit, I would consider using thread pools, e.g. ThreadPoolExecutor, so you don't have the overhead of creating, destroying and recreating threads.
Then I would do performance testing to find the max number of threads that offers the best performance.
You could start with 10 threads, then rerun your performance test with 20 threads, and so on, until you home in on an optimal value. At the same time I would use system tools (depending on your OS) to monitor the thread run queue, JVM, etc.
For the performance test you would have to ensure that your test is repeatable (i.e. using the same inputs) and representative of the actual input that your program would be using.
Related
I'm in the design phase of a program that will have 10-30 threads, where each thread will process many small blocks of information.
I have the option of sleeping for 5 ms after each block, or not sleeping at all. I should choose whichever reduces load on the CPU.
Normally I would sleep to reduce CPU utilization, but I am concerned that many 5 ms sleeps may, because of context switching, cause CPU utilization to increase rather than decrease.
Are there any studies already done on the trade off between short sleeps and context switching on CPU utilization?
Sleeping is an option of last resort. Instead, have a look at the tools in the java.util.concurrent API, especially the blocking queues, which allow you to put a thread to sleep until a message arrives for it to act on.
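For example, a worker parked on a BlockingQueue consumes no CPU at all until a message shows up; here's a minimal sketch:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        // the worker parks inside take() until a message arrives -- no polling, no sleeps
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = inbox.take(); // blocks without burning CPU
                    System.out.println("processing " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down politely
            }
        });
        worker.start();
        inbox.put("block #1"); // producer side: wakes the worker immediately
        inbox.put("block #2");
        Thread.sleep(100);
        worker.interrupt(); // stop the demo worker
    }
}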
Or look at Akka which allows you to easily build a system where a few threads process thousands of messages. The main drawback here is that you have to design your system around Akka - it's really not something that you can easily retrofit.
If speed is your main concern, look at lock-free algorithms like the one in "Single Producer/Consumer lock free Queue step by step" or the LMAX Disruptor.
Related:
http://ashkrit.blogspot.com/2012/11/lock-free-bounded-queue.html
A fast queue in Java
I tested it. Anecdotally, I started to see a benefit at a 3 ms sleep, increasing dramatically with each additional ms after that.
The test was done with 30 threads running on an Intel Core i3-4130. Each thread created an ArrayList of 1000 Strings; each String was randomly generated with Commons Lang and was 1000 characters long.
The thread then alternated between shuffling and sorting the list.
Disclaimer: this is a short set of tests on one desktop cpu - not a study.
QUESTION
How do I scale to use more threads if and only if there is free cpu?
Something like a ThreadPoolExecutor that uses more threads when cpu cores are idle, and fewer or just one if not.
USE CASE
Current situation:
My Java server app processes requests and serves results.
There is a ThreadPoolExecutor to serve the requests with a reasonable number of max threads following the principle: number of cpu cores = number of max threads.
The work performed is cpu heavy, and there's some disk IO (DBs).
The code is linear, single threaded.
A single request takes between 50 and 500 ms to process.
Sometimes there are just a few requests per minute, and other times there are 30 simultaneous.
A modern server with 12 cores handles the load nicely.
The throughput is good, the latency is ok.
Desired improvement:
When there is a low number of requests, as is the case most of the time, many cpu cores are idle.
Latency could be improved in this case by running some of the code for a single request multi-threaded.
Some prototyping shows improvements, but as soon as I test with a higher number of concurrent requests,
the server goes bananas. Throughput goes down, memory consumption goes overboard.
30 simultaneous requests sharing a queue of 10, meaning that at most 10 can run while 20 are waiting, with each of the 10 using up to 8 threads at once for parallelism, seems to be too much for a machine with 12 cores (6 of which are virtual).
This seems to me like a common use case, yet I could not find information by searching.
IDEAS
1) request counting
One idea is to count the current number of requests being processed. If it is 1 or low, use more parallelism; if it is high, don't use any and continue single-threaded as before.
This sounds simple to implement. Drawbacks: the request counter must be reset reliably, without bugs (think finally blocks), and it does not actually check available cpu; another process might be using the cpu as well.
In my case the machine is dedicated to just this application, but still.
2) actual cpu querying
I'd think that the correct approach would be to just ask the cpu, and then decide.
Since Java 7 there is OperatingSystemMXBean.getSystemCpuLoad(), see http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
but I can't find any webpage that mentions getSystemCpuLoad and ThreadPoolExecutor together, or a similar combination of keywords, which tells me that's probably not a good path to take.
The JavaDoc says "Returns the "recent cpu usage" for the whole system", and I'm wondering what
"recent cpu usage" means, how recent that is, and how expensive that call is.
UPDATE
I had left this question open for a while to see if more input is coming. Nope. Although I don't like the "no-can-do" answer to technical questions, I'm going to accept Holger's answer now. He has good reputation, good arguments, and others have approved his answer.
Myself I had experimented with idea 2 a bit. I queried the getSystemCpuLoad() in tasks to decide how large their own ExecutorService could be. As Holger wrote, when there is a SINGLE ExecutorService, resources can be managed well. But as soon as tasks start their own tasks, they cannot - it didn't work out for me.
There is no way of limiting based on “free CPU” and it wouldn’t work anyway. The information about “free CPU” is outdated as soon as you get it. Suppose you have twelve threads running concurrently and detecting at the same time that there is one free CPU core and decide to schedule a sub-task…
What you can do is limiting the maximum resource consumption which works quite well when using a single ExecutorService with a maximum number of threads for all tasks.
The tricky part is the dependency of the tasks on the results of the sub-tasks, which are enqueued at a later time and might still be pending due to the limited number of worker threads.
This can be adjusted by revoking the parallel execution if the task detects that its sub-task is still pending. For this to work, create a FutureTask for the sub-task manually and schedule it with execute rather than submit.
Then proceed within the task as normal, and at the place where you would perform the sub-task in a sequential implementation, check whether you can remove the FutureTask from the ThreadPoolExecutor. Unlike cancel, this works only if the sub-task has not started yet, and hence is an indicator that there are no free threads. So if remove returns true you can perform the sub-task in-place, letting all other threads perform tasks rather than sub-tasks. Otherwise, you can wait for the result.
It's worth noting here that it is ok to have more threads than CPU cores if the tasks accommodate I/O operations (or may wait for sub-tasks). The important point is to have a limit.
FutureTask<Integer> coWorker = new FutureTask<>(/* callable wrapping the sub-task */);
executor.execute(coWorker); // executor must be a ThreadPoolExecutor: remove() lives there
// proceed in the task's sequence
if (executor.remove(coWorker)) coWorker.run(); // not started yet, so do it in-place
subTaskResult = coWorker.get();
// proceed
It sounds like the ForkJoinPool introduced in Java 7 would be exactly what you need. The ForkJoinPool is specifically designed to keep all your CPUs exactly busy, meaning that there are as many threads as there are CPUs and that all those threads are also working and not blocking (for the latter, make sure that you use ManagedBlockers for DB queries).
In a ForkJoinTask there is the method getSurplusQueuedTaskCount for which the JavaDoc says "This value may be useful for heuristic decisions about whether to fork other tasks." and as such serves as a better replacement for your getSystemCpuLoad solution to make decisions about task decompositions. This allows you to reduce the number of decompositions when system load is high and thus reduce the impact of the task decomposition overhead.
Also see my answer here for some more indepth explanation about the principles of Fork/Join-pools.
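To illustrate the heuristic, here is a sketch of a hypothetical divide-and-conquer task that stops splitting once the surplus count gets high; the threshold of 3 and the chunk size of 1024 are assumptions:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        // stop decomposing when the chunk is small or the queues are already full
        if (to - from <= 1024 || getSurplusQueuedTaskCount() >= 3) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        left.fork(); // hand one half to another worker
        long right = new SumTask(data, mid, to).compute(); // do the other half here
        return right + left.join();
    }
}
// usage: long total = new ForkJoinPool().invoke(new SumTask(array, 0, array.length));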
I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.
The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.
This is similar to, but not the same as, "There is a way to determine the ideal number of threads?" and "Java best way to determine the optimal number of threads [duplicate]"... the primary difference is that I am only bound by latency.
I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:
ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();
for (int quad3 = 0; quad3 < 256; ++quad3) {
    for (int quad4 = 0; quad4 < 256; ++quad4) {
        byte[] quads = { quad1, quad2, (byte) quad3, (byte) quad4 };
        futures.add(executorService.submit(new RetrieverCallable(quads)));
    }
}
... I then dequeue all the elements in the queue and put the results in the required data structure:
int[] results = new int[65536];
int i = 0;
while (!futures.isEmpty()) {
    try {
        results[i] = futures.remove().get();
    } catch (Exception e) {
        results[i] = -1; // mark a failed lookup
    }
    i++;
}
My first question is: is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed are moved out of available slots so that other threads may begin?
My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?
It may shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have learned that multithreading does not multiply your network bandwidth. Now it is time to learn that threads do computation; they do not do the (high-latency) communication. The communication is performed by the network adapter, which runs truly in parallel with the CPU. It is wasteful to allocate a thread (see the resources a thread occupies, as listed by the gentleman who claims that you need 1 thread) just to sleep until the network adapter finishes its job. You need no threads for I/O = you need 0 threads.
It makes sense to allocate threads for computation to run in parallel with the I/O request(s). The number of threads will depend on the computation-to-communication ratio and is limited by the number of cores in your CPU.
Sorry, I had to say this even though you have clearly implied a commitment to blocking I/O; too many people do not understand this basic thing. Take the advice: use asynchronous I/O and you'll see that the issue does not exist.
As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.
He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.
Tuning the pool size
Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
My emphasis.
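Plugged into code, with hypothetical profiling numbers for illustration:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

int nCpus = Runtime.getRuntime().availableProcessors();
double waitTimeMs = 4750.0;   // assumption: a request spends most of its time waiting on the network
double serviceTimeMs = 250.0; // assumption: processing each response is cheap
int poolSize = (int) (nCpus * (1 + waitTimeMs / serviceTimeMs)); // N * (1 + WT/ST)
ExecutorService pool = Executors.newFixedThreadPool(poolSize);   // e.g. 4 cores -> 80 threads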
Have you considered using Actors?
Best practices:
Actors should be like nice co-workers: do their job efficiently
without bothering everyone else needlessly and avoid hogging
resources. Translated to programming this means to process events and
generate responses (or more requests) in an event-driven manner.
Actors should not block (i.e. passively wait while occupying a Thread)
on some external entity—which might be a lock, a network socket,
etc.—unless it is unavoidable; in the latter case see below.
Sorry, I can't elaborate, because I haven't used this much.
UPDATE
Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?
Pretty sure that in the described circumstances, the optimal number of threads is 1. In fact, that is surprisingly often the answer to any question of the form 'how many threads should I use?'
Each additional thread adds extra overhead in terms of stack (and associated GC roots), context switching and locking. This may or may not be measurable: the effort to meaningfully measure it in all target environments is non-trivial. In return, there is little scope to provide any benefit, as processing is neither cpu- nor io-bound.
So less is always better, if only for reasons of risk reduction. And you can't have less than 1.
I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections the network layer can bear, and make this number a parameter of your program.
Then, spending a thread on each request requires a lot of memory. You can avoid this by using non-blocking sockets. In Java, there are 2 options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so better to find a ready-made library and reuse it. NIO2 is simple but only available since JDK 1.7.
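For a taste of the NIO2 route, a single request could look roughly like the sketch below; the host and port are placeholders, and no thread is tied up while the connect or read is in flight:
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

AsynchronousSocketChannel ch = AsynchronousSocketChannel.open();
ch.connect(new InetSocketAddress("example.com", 7), null, new CompletionHandler<Void, Void>() {
    @Override public void completed(Void result, Void attachment) {
        ByteBuffer buf = ByteBuffer.allocate(64); // the responses are single small packets
        ch.read(buf, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer bytesRead, Void a) {
                // the one (simple) line of response processing goes here
            }
            @Override public void failed(Throwable exc, Void a) { /* record a failure, e.g. -1 */ }
        });
    }
    @Override public void failed(Throwable exc, Void attachment) { /* record a failure */ }
});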
Processing the responses should be done on a thread pool. I don't think the number of threads in the thread pool greatly affects the overall performance in your case. Just tune the thread pool size from 1 up to the number of available processors.
In our high-performance systems, we use the actor model as described by @Andrey Chaschev.
The optimal no. of threads in your actor model differs with your CPU structure and how many processes (JVMs) you run per box. Our findings:
If you have 1 process only, use total CPU cores - 2.
If you have multiple processes, check your CPU structure. We found it's good to have no. of threads = no. of cores in a single CPU - e.g. if you have a 4-CPU server, each CPU having 4 cores, then using 4 threads per JVM gives you the best performance. After that, always leave at least 1 core to your OS.
A partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.
You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.
You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.
I had this brilliant idea to speed up the time needed for generating 36 files: use 36 threads!! Unfortunately, if I start one connection (one j2ssh connection object) with 36 threads/sessions, everything lags way more than if I execute the threads one at a time.
Now if I try to create 36 new connections (36 j2ssh connection objects) so that each thread has a separate connection to the server, I get an out of memory exception (somehow the program still runs and successfully ends its work, but slower than when I execute one thread after another).
So what should I do? How do I find the optimal number of threads to use?
Also, why is Thread.activeCount() 3 before starting my 36 threads? I'm using a Lenovo laptop with an Intel Core i5.
You could narrow it down to a more reasonable number of threads with an ExecutorService. You probably want to use something near the number of processor cores available, e.g:
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(threads);
for (int i = 0; i < 36; i++) {
    service.execute(new Runnable() {
        public void run() {
            // do what you need per file here
        }
    });
}
service.shutdown();
A good practice would be to spawn threads equivalent to the number of cores in your processor. I normally use an Executors.newFixedThreadPool(numOfCores) executor service and keep feeding it jobs from my job queue. Simple. :-)
Your Intel i5 has two cores; hyperthreading makes them look like four. So you only get four cores' worth of parallelization; the rest of your threads are time sliced.
Assume 1 MB of RAM per thread just for thread creation, then add the memory that each thread requires to process its file. That will give you an idea of why you're getting out of memory errors. How big are the files you're dealing with? You can see that you'll have a problem if they're very large and all in memory at the same time.
I'll assume that the server receiving the files can accept multiple connections, so there's value in trying this.
I'd benchmark with 1 thread and then increase them until I found that the performance curve was flattening out.
Brute force: profile incrementally. Increase the number of threads gradually and check the performance. As the number of connections is just 36, it should be easy.
You need to understand that if you create 36 threads you still have only one or two processors, and they will be switching between threads most of the time.
I would say you increase the threads a little bit, let's say to 6, see the behavior, and then go from there.
One way to tune the number of threads to the size of the machine is to use
int processors = Runtime.getRuntime().availableProcessors();
int threads = processors * N; // N could be 1, 2 or more depending on what you are doing.
ExecutorService es = Executors.newFixedThreadPool(threads);
First you have to find out where the bottleneck is.
If it is the SSH connection, it usually does not help to open multiple connections in parallel. Better use multiple channels on one connection, if needed.
If it is the disk IO, creating multiple threads writing (or reading) only helps if they are accessing different disks (which is seldom the case). But you could have another thread doing CPU-bound things while you are waiting on your disk IO in one thread.
If it is the CPU, and you have enough idle cores, more threads can help. Even more so if they don't need to access common data. But still, more threads than cores (+ some threads doing IO) does not help. (Also keep in mind that usually there are other processes on your server, too.)
Using more threads than the number of cores on your machine is only going to slow down the whole process. It will speed things up until you reach that number.
Be sure you don't create more threads than you have processing units, or you are likely to create more overhead with context switching than you gain in concurrency. Also remember that you only have 1 HDD and 1 HDD controller; as a result, I doubt multithreading is going to help you at all here.
A Java process starts 5 threads, and each thread takes 5 minutes. What will be the minimum and maximum time taken by the process? It would be of great help if someone could explain in terms of Java threads and OS threads.
Edit: I want to know how Java schedules threads at the OS level.
This depends on the number of logical processor cores you have, the processes already running, and the priority of the threads. The theoretical minimum would be 5 minutes plus the small overhead of starting and controlling threads, if you have at least five logical processor cores. The theoretical maximum would be 25 minutes plus the small overhead, if you have only one logical processor core available. The mentioned overhead is usually no more than a few milliseconds.
The theoretical maximum can however be unpredictably (much) higher if there are at the same time a lot of another running threads with a higher priority from other processes than the JVM.
Edit: I want to know how Java schedules threads at the OS level.
The JVM just spawns another native thread, and it gets assigned to the process associated with the JVM itself.
Minimum time, 5 minutes, assuming that threads run entirely concurrently with no interdependencies and have a dedicated core available. Maximum time, 25 minutes, assuming that each thread has to have exclusive use of some global resource and so can't run in parallel with any other thread.
A glib (but realistic answer) for the maximum is that they might take an infinite amount of time to complete, as multi-threaded programs often contain deadlock bugs.
It depends! There isn't enough information to quantify this.
Missing info: Hardware - how many threads can run at the same time on your CPU? Workload - does it take 5 minutes because it's waiting on something for 5 minutes, or is it performing a calculation that usually takes about 5 minutes and uses a lot of CPU resources?
When you run multiple threads concurrently there can be lock waits for resources, or the threads may even have to take turns executing, so although they have been running for 5 minutes they may only have had a few CPU-seconds each.
5 threads never equals 5x output. It can get close but will never reach 5x.
I am not sure whether you are looking for the CPU time spent by each thread. If that is the case, you can measure it, see below:
ThreadMXBean tb = ManagementFactory.getThreadMXBean();
long startTime = tb.getCurrentThreadCpuTime();
Call the above when the thread is created, and the following when it finishes:
long endTime = tb.getCurrentThreadCpuTime();
The difference endTime - startTime is the CPU time (in nanoseconds) that the thread used.