Thread number and Java application performance

Hi: I have a multithreaded Java application. The current thread pool size is already 100, and we are currently using a 4-core CPU. In the near future, though, the number of CPU cores will double, or even grow to 32. To fully utilize those cores, we need to increase our thread pool size. But as you may know (maybe I am wrong), Java is fine with 100 threads, but there could be performance problems with 200, 500, or 1000 threads. Should we then use another programming language, for example Scala? Is my worry reasonable?

With modern JVMs, a Java process can create as many threads as the operating system will permit. Whether or not your application will be able to make good use of those threads depends on the design of your application.
If scalability is a concern, I would recommend that in the first instance you focus on your application's architecture (data structures, synchronization, etc). These issues need to be considered irrespective of the programming language, and there's nothing about Java that makes it inherently unsuitable for heavily multithreaded apps.

I once ran experiments with threads to find out whether there is a significant difference between Linux and Windows, and hit a kind of barrier at about 2000 threads on both platforms. The test is some years old and I haven't repeated it. I later found the same number mentioned by others, but I didn't save the link.
Without testing it, I think you're right about Scala. The technique used there, actors, works with smaller objects, AFAIK, but I can't give you numbers.

If you have 4 cores, the optimal thread pool size may be 4, as this is the minimum number of threads required to keep all the CPUs busy. However, you can have any number of idle/waiting threads up to about 10K. This is a JVM thread library tipping point, so switching to Scala won't make any difference. Note: you can have far more threads, but I wouldn't recommend it.
If you have 10K threads and you want more, I suggest you buy another server. You can buy a lot of server for about $1000.
I ran a test creating lots of threads on my machine with Java 6 update 26, 32-bit and 64-bit, on Ubuntu 11. The first 1000 threads took 72 ms to create; going from 31K to 32K took 3,861 ms. At about 32K threads I got this error:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)

@user84592: not sure about my answer, just brainstorming.
How about installing virtual machine software on this machine and distributing the CPU cores among the VMs? That gives you many machines instead of one physical machine, and you can then slice the Java application's workload across them.

Related

Optimal number of threads [duplicate]

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a machine with two quad-core CPUs running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.

I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but on different ranges (no two threads access the same index), so the results may differ if they work on different arrays.
The 1.86 machine is a MacBook Air with an SSD. The other Mac is an iMac with a normal HDD (I think it's 7200 rpm). The Windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.

I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads per core. So if you have 8 cores, each of which supports 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads are good. If you have very heavy memory-intensive work (read source 1, read source 2, fast computation, write), then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256-bit values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy, if not actually faster. This will depend on your process and memory architecture; some advanced servers manage separate memory ranges for separate cores, so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)

The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of processing times Tn and Tm for two arbitrary numbers of threads 'n' and 'm'. For linear algorithms, the optimal number of threads will be N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) ).
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
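As a rough illustration, the formula quoted in this answer can be written as a small Java helper. The class and method names are made up for this sketch, and the timings in `main` are invented numbers, not measurements. As far as I can tell, the formula is what you get by fitting T(k) = a/k + b*(k-1) to the two measurements and minimizing it, which gives N = sqrt(a/b); that reading is my own assumption, not something the answer states.

```java
final class OptimalThreadEstimate {
    private OptimalThreadEstimate() {}

    /**
     * Applies the formula from the answer above:
     * N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) )
     * where tn and tm are the run times measured with n and m threads.
     */
    static double estimate(int n, double tn, int m, double tm) {
        double numerator = (double) m * n * (tm * (n - 1) - tn * (m - 1));
        double denominator = n * tn - m * tm;
        double ratio = numerator / denominator;
        // A zero denominator or a non-positive ratio means the two
        // measurements do not fit the assumed model.
        if (denominator == 0.0 || Double.isNaN(ratio) || ratio <= 0.0) {
            throw new IllegalArgumentException("measurements do not fit the model");
        }
        return Math.sqrt(ratio);
    }

    public static void main(String[] args) {
        // Invented timings: 51 s with 2 threads, 28 s with 4 threads.
        System.out.println(estimate(2, 51.0, 4, 28.0)); // prints 10.0
    }
}
```

The result is only as good as the two measurements and the assumption that the algorithm is linear, so it is best treated as a starting point for testing.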

The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.
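One way to pick a first guess before experimenting is a commonly cited rule of thumb that is not part of this answer: threads = cores * (1 + wait time / compute time). The sketch below assumes you can estimate how long a task waits versus how long it computes; the class and method names are hypothetical.

```java
final class PoolSizing {
    private PoolSizing() {}

    /**
     * Rule-of-thumb starting point (an assumption, not a guarantee):
     * cores * (1 + waitTime / computeTime). Both times use the same unit.
     */
    static int suggestedThreads(int cores, double waitTime, double computeTime) {
        if (cores < 1 || waitTime < 0 || computeTime <= 0) {
            throw new IllegalArgumentException("invalid sizing inputs");
        }
        return Math.max(1, (int) Math.round(cores * (1.0 + waitTime / computeTime)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // 100% CPU-bound: no waiting, so one thread per core.
        System.out.println(suggestedThreads(cores, 0.0, 1.0));
        // Tasks that wait 90 ms for every 10 ms of computation.
        System.out.println(suggestedThreads(cores, 90.0, 10.0));
    }
}
```

For a purely CPU-bound task this reduces to one thread per core, which matches the answer above. For anything that waits, the number is only a seed value for the experiments the answer recommends.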

I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size, so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's the L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example, assume there are 8 arbitrary units (AU) of work in the program, which will be executed on a 4-core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than available cores, depending on the data size you're dealing with.
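The arithmetic in the two cases can be checked with a few lines of Java. This is only the simplified model from this answer (per-thread time * threads / cores, no context-switch cost); the class and method names are invented for the sketch.

```java
final class StrongScalingExample {
    private StrongScalingExample() {}

    /** Simplified total time: per-thread time * threads / cores, ignoring overheads. */
    static double totalSeconds(double secondsPerThread, int threads, int cores) {
        return secondsPerThread * threads / cores;
    }

    public static void main(String[] args) {
        // Case 1: 4 threads, 10 s each, 4 cores.
        System.out.println(totalSeconds(10.0, 4, 4)); // prints 10.0
        // Case 2: 8 threads, 2 s each, 4 cores.
        System.out.println(totalSeconds(2.0, 8, 4));  // prints 4.0
    }
}
```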

4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.

Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three-five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should find that the four-thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, i.e., the dataset you're loading into memory is huge, you're getting lots of cache misses, etc., such that 2 threads are optimal.
You can't know until you test.
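A minimal sketch of such a benchmark in Java might look like the following. The `burn` method is a made-up CPU-bound stand-in for your real work, and the chunk count, iteration count, and thread counts are arbitrary assumptions. It reports the best of several trials per thread count, which is one choice among several; plotting all trials would also be reasonable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadCountBenchmark {
    static final int CHUNKS = 64;
    static final int ITERATIONS_PER_CHUNK = 2_000_000;

    /** Hypothetical CPU-bound workload; replace with the real task. */
    static long burn(int seed) {
        long acc = seed;
        for (int i = 0; i < ITERATIONS_PER_CHUNK; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    /** Runs all chunks on a pool of the given size; returns elapsed nanoseconds. */
    static long timeRun(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int c = 0; c < CHUNKS; c++) {
                final int seed = c;
                tasks.add(() -> burn(seed));
            }
            long start = System.nanoTime();
            for (Future<Long> result : pool.invokeAll(tasks)) {
                result.get(); // surfaces any task failure
            }
            return System.nanoTime() - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Best-of-N elapsed time in milliseconds for each thread count. */
    static Map<Integer, Double> sweep(int[] threadCounts, int trials) {
        Map<Integer, Double> bestMillis = new LinkedHashMap<>();
        for (int threads : threadCounts) {
            long best = Long.MAX_VALUE;
            for (int t = 0; t < trials; t++) {
                best = Math.min(best, timeRun(threads));
            }
            bestMillis.put(threads, best / 1e6);
        }
        return bestMillis;
    }

    public static void main(String[] args) {
        timeRun(1); // warm-up run so JIT compilation does not skew the first row
        sweep(new int[] {1, 2, 4, 8, 16, 32, 64, 100}, 3)
            .forEach((threads, ms) -> System.out.printf("%3d threads: %.1f ms%n", threads, ms));
    }
}
```

The printed table is the data for the graph described above. A toy loop like `burn` will not show cache or bandwidth effects, so the numbers only mean something once the real workload is plugged in.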

You can see how many threads and processes are running on your machine with htop or the ps command.
The man page for ps describes its options:
man ps
To count the processes of all users, you can use one of these commands (the second also counts threads, one line per thread):
ps -aux | wc -l
ps -eLf | wc -l
To count the processes of a single user:
ps --User root | wc -l
You can also use htop:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.

The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.

One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.

Speaking from a computation- and memory-bound point of view (scientific computing), 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching and, most likely, very poor memory locality.
But it also depends on your architecture. From what I've heard, Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.

Hope this makes sense: check the CPU and memory utilization and set a threshold value. If the threshold is crossed, don't allow new threads to be created; otherwise allow it.

Java thread limit, JVM 9

So according to most things I've read on the internet, the number of threads you can have in Java caps out around 10,000. However, in practice I can create nearly 500,000, at which point my computer becomes unresponsive. (The task manager goes a little funny - it starts claiming that though 99% of my 16 GB of memory is used, the highest-using program uses only ~300 MB. After everything stops responding, the fan quiets down, and the disk access light flashes only periodically, leading me to believe neither CPU nor disk is under heavy load.) I waited for about 15 minutes in one test, and never got an exception (well, as far as I know).
For repeatability, I've (also) used the following code: https://github.com/jheusser/core-java-performance-examples/blob/master/src/test/java/com/google/code/java/core/threads/MaxThreadsMain.java as referenced here: https://dzone.com/articles/java-what-limit-number-threads .
I did, however, increase the upper limit on i from 100 * 1000 to 1000 * 1000, because it was successfully creating all the threads. One of the last messages it gave before the computer froze up was 440,000 threads: Time to create 4,000 threads was 1.002 seconds - it looks like it was averaging around 2 seconds per 4000, though.
I am using Windows 10 Pro, version 1703.
JRE: Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)
The next highest thread count I know of is about 100k, https://stackoverflow.com/a/46697264/513038. Now, a lot of the claimed limits were given many years ago, but they're based on stack size vs memory, and at 500,000 threads in 16 GB RAM (even assuming ALL of it was used), that's 32kb per thread by default, which is supposedly less than the minimum stack size. If that were true, I'd at least expect more StackOverflowErrors during normal operation. Has the threading system changed silently in the past 10 years? (Or even in the past few months: one of the posts I referenced was made just a few months ago, April 2018.)

Has the threading system changed silently in the past 10 years?
Nope. On Linux, MacOS and Windows, Java threads are implemented as native threads ... since a long time ago.
What has changed is the way that various different operating systems schedule native threads. The OS is where Java thread scheduling takes place, and where any hard limits on the number of threads supported will be enforced.
Basically, your tests try to see what happens when you try to use a pathologically large number of threads. The answer on Windows is that it breaks the OS.
And even if it didn't break the OS out-right, the chances are that for a Java application using 100,000's of threads:
the JVM's resource usage (stack memory) would be terrible,
native scheduler performance would be terrible, and
the application performance would be terrible.
Huge numbers of threads is the wrong way to write a practical Java application. Actors may be a better solution, or maybe an ExecutorService (with a bounded thread pool) or a ForkJoin pool. It will depend on the application, and other factors.
In short, those tests you are running are not instructive for a properly designed Java application. The solution for applications that use huge numbers of threads is to rewrite them.
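As a sketch of the bounded-pool alternative mentioned in this answer, the following runs a large number of small tasks on a `ThreadPoolExecutor` with one thread per available processor and a bounded queue. The queue size of 1,000 and the use of `CallerRunsPolicy` (the submitting thread runs a task itself when the queue is full) are assumptions chosen for illustration, not recommendations from the answer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedPoolExample {
    private BoundedPoolExample() {}

    /** Runs taskCount trivial tasks on a bounded pool; returns how many completed. */
    static int runTasks(int taskCount) {
        int threads = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads,                 // fixed number of worker threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1_000),  // bounded backlog (size is an assumption)
            new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when the queue is full
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(completed::incrementAndGet);
        }
        pool.shutdown(); // already-queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // 500,000 units of work, but only a handful of threads.
        System.out.println(runTasks(500_000));
    }
}
```

The point of the design is that the number of tasks and the number of threads are decoupled: the work items can number in the hundreds of thousands while the OS only ever schedules a few native threads.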

How to decide the suitable number of threads to create in java?

I have a java application that creates SSL socket with remote hosts. I want to employ threads to hasten the process.
I want the maximum possible utilization that does not affect the program performance. How can I decide the suitable number of threads to use? After running the following line: Runtime.getRuntime().availableProcessors(); I got 4. My processor is an Intel core i7, with 8 GB RAM.

If you have 4 cores, then in theory you should have exactly four worker threads going at any given time for maximum optimization. Unfortunately what happens in theory never happens in practice. You may have worker threads who, for whatever reason, have significant amounts of downtime. Perhaps they're hitting the web for more data, or reading from a disk, much of which is just waiting and not utilizing the cpu.
Depending on how much waiting you're doing, you'll want to bump up the number. The cost of increasing the number of threads is that you'll have more context switching and competition for resources. The benefit is that you'll have another thread ready to work in case one of the other threads decides it has to break for something.
Your best bet is to set it to something (let's start with 4) and work your way up. Profile your code with each setting, and see if your benchmarks go up or down. You should soon see a pattern and a break-even point.
When it comes to optimization, you can theorize all you want about what should be the fastest, but you won't beat actually running and timing your code to truly answer this question.

As DarthVader said, you can use a ThreadPool (CachedThreadPool). With this construct you don't have to specify a concrete number of threads.
From the oracle site:
The newCachedThreadPool method creates an executor with an expandable thread pool. This executor is suitable for applications that launch many short-lived tasks.
Maybe that's what you are looking for.
As for the number of threads, it's hard to say. You have 4 hyperthreading cores, and you should leave at least one core for your OS. I would say 4-6 threads.

How can I get the cpu usage a jvm process consumes for each cpu core using java?

I'm using Java to monitor the CPU usage of a certain Java process. Here are my questions:
First, is a single process limited to consuming CPU time on one core, or on a limited number of cores? Or can it use CPU time on every core?
Second, if I want to monitor the CPU usage of a certain Java process on each CPU core, how can I do that?
I would prefer to handle it in pure Java, without native methods.

To the operating system, a single thread (which I assume is what you mean by "Java process") essentially cannot use CPU on more than one "processor" (which may or may not mean a physical core-- see below) simultaneously.
Generally, whenever a given thread gets a "turn at running", Windows (and I assume other operating systems) will attempt to schedule a given thread on to the same "processor" that it last ran on.
However, the situation is complicated by hyperthreading CPUs, which actually present to the operating system several "processors" for what is actually a single physical core. In this case, it is actually the CPU itself that switches between what instruction of what thread is running on which component of the given core at any one time. (Because, e.g. the core's arithmetic unit could be performing an arithmetic instruction for Thread 1 while the load/store unit is fetching data from memory for an instruction for Thread 2, etc.)
So given the complexity of this situation, even if you can get per-core measurements, I'm not entirely sure quite what useful meaning you would attach to them.
P.S. If you'll permit the plug, I don't know if this Java-focussed article on thread scheduling that I wrote a couple of years ago might be useful. I should say I wrote it before either Windows 7 or the latest Intel Core CPUs were released, and there may be one or two updates to the information that would be pertinent (in particular, I don't address the issue of variable core speeds and how that could affect scheduling).
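For the "pure Java" part of the question, the closest thing I know of in the standard library is per-thread CPU time from `java.lang.management.ThreadMXBean`, not a per-core breakdown. The sketch below is run inside the JVM being measured (monitoring another JVM would need a JMX connection, which is not shown). The class name and the map layout are made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

final class ThreadCpuTimes {
    private ThreadCpuTimes() {}

    /** CPU time in milliseconds consumed so far by each live thread of this JVM. */
    static Map<String, Double> snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Double> cpuMillis = new LinkedHashMap<>();
        // Not every JVM supports or enables CPU time measurement.
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return cpuMillis;
        }
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id);   // -1 if the thread is no longer alive
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread is no longer alive
            if (nanos >= 0 && info != null) {
                cpuMillis.put(info.getThreadName() + "#" + id, nanos / 1e6);
            }
        }
        return cpuMillis;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, ms) -> System.out.printf("%-40s %.2f ms%n", name, ms));
    }
}
```

Taking two snapshots and dividing the difference by the elapsed wall-clock time gives a per-thread utilization figure. Which core each thread ran on is decided by the OS scheduler and is not reported by this API.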

Java Performance Processes vs Threads

I am implementing a worker pool in Java.
This is essentially a whole load of objects which will pick up chunks of data, process the data and then store the result. Because of IO latency there will be significantly more workers than processor cores.
The server is dedicated to this task and I want to wring the maximum performance out of the hardware (but no I don't want to implement it in C++).
The simplest implementation would be to have a single Java process which creates and monitors a number of worker threads. An alternative would be to run a Java process for each worker.
Assuming, for argument's sake, a quad-core Linux server, which of these solutions would you anticipate being more performant, and why?
You can assume the workers never need to communicate with one another.

One process, multiple threads - for a few reasons.
When context-switching between jobs, it's cheaper on some processors to switch between threads than between processes. This is especially important in this kind of I/O-bound case with more workers than cores. The more work you do between getting I/O blocked, the less important this is. Good buffering will pay for threads or processes, though.
When switching between threads in the same JVM, at least some Linux implementations (x86, in particular) don't need to flush cache. See Tsuna's blog. Cache pollution between threads will be minimized, since they can share the program cache, are performing the same task, and are sharing the same copy of the code. We're talking savings on the order of 100's of nanoseconds to several microseconds per switch. If that's small potatoes for you, then read on...
Depending on the design, the I/O data path may be shorter for one process.
The startup and warmup time for a thread is generally much shorter. The OS doesn't have to start a process, Java doesn't have to start another JVM, classloading is only done once, JIT-compilation is only done once, and HotSpot optimizations are done once, and sooner.

Usually, when discussing multiprocessing (with one thread per process) versus multithreading in the same process, the theoretical overhead is bigger in the first case than in the second (and thus multiprocessing is theoretically slower than multithreading), but in reality, on most modern OSs, this is not such a big issue. However, in the Java context, starting a new process is a lot more costly than starting a new thread. Starting a new process means starting up a new instance of the JVM, which is very costly, especially in terms of memory. I recommend that you start multiple threads in the same JVM.
Moreover, if you say inter-thread communication is not an issue, you can use Java's ExecutorService to get a fixed thread pool of size 2 x (number of available CPUs). The number of available CPUs can be detected at runtime via Java's Runtime class. This way you get quick, simple multithreading without any boilerplate code.
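The suggestion above can be sketched roughly as follows. The `process` method is a placeholder for the real "pick up a chunk, process it, store the result" work, and the `String` chunk type is an assumption made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class WorkerPool {
    private WorkerPool() {}

    /** Pool size suggested in the answer: twice the number of available CPUs. */
    static int poolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    /** Placeholder for the real per-chunk work (hypothetical). */
    static int process(String chunk) {
        return chunk.length();
    }

    /** Processes every chunk on a fixed pool and returns the results in input order. */
    static List<Integer> processAll(List<String> chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize());
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                Callable<Integer> task = () -> process(chunk);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that chunk is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a", "bb", "ccc"))); // prints [1, 2, 3]
    }
}
```

The factor of 2 is the answer's starting point; with heavy I/O latency a different multiplier may work better, and measuring is the only way to find out.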

Actually, if you do this with large-scale tasks, using multiple JVM processes is way faster than one JVM with multiple threads. At least, we never got one JVM running as fast as multiple JVMs.
We do some calculations where each task uses around 2-3 GB of RAM and does some heavy number crunching. If we spawn 30 JVMs and run 30 tasks, they perform around 15-20% better than spawning 30 threads in one JVM. We tried tuning the GC and the various memory sections and never caught up to the first variant.
We did this on various machines: 14 tasks on a 16-core server, 34 tasks on a 36-core server, etc. Multithreading in Java always performed worse than multiple JVM processes.
It may not make any difference on simple tasks, but on heavy calculations it seems the JVM performs badly with threads.
