JVM in container calculates processors wrongly?

I recently did some research again, and stumbled upon this. Before crying about it to the OpenJDK team, I wanted to see if anyone else has observed this, or disagrees with my conclusions.
So, it's widely known that the JVM for a long time ignored memory limits applied to the cgroup. It's almost as widely known that it now takes them into account, starting with Java 8 update something, and 9 and higher. Unfortunately, the calculations done based on the cgroup limits are so useless that you still have to do everything by hand. See google and the hundreds of articles on this.
What I only discovered a few days ago, and did not read in any of those articles, is how the JVM checks the processor count in cgroups. The processor count is used to decide on the number of threads used for various tasks, including also garbage collection. So getting it correct is important.
In a cgroup (as far as I understand, and I'm no expert) you can set a limit on the cpu time available (--cpus Docker parameter). This limits time only, and not parallelism. There are also cpu shares (--cpu-shares Docker parameter), which are a relative weight to distribute cpu time under load. Docker sets a default of 1024, but it's a purely relative scale.
Finally, there are cpu sets (--cpuset-cpus for Docker) to explicitly assign the cgroup, and thus the Docker container, to a subset of processors. This is independent of the other parameters, and actually impacts parallelism.
So, when it comes to checking how many threads my container can have running in parallel, as far as I can tell, only the cpu set is relevant. The JVM though ignores that, instead using the cpu limit if set, otherwise the cpu shares (assuming the 1024 default to be an absolute scale). This is IMHO already very wrong. It calculates available cpu time to size thread pools.
It gets worse in Kubernetes. It's AFAIK best practice to set no cpu limit, so that the cluster nodes have high utilization. Also, you should set a low cpu request for most apps, since they will be idle most of the time and you want to schedule many apps on one node. Kubernetes sets the request in milli cpus as cpu shares, and the request is most likely below 1000m. The JVM then always assumes one processor, even if your node is running on some 64 core cpu monster.
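To make the rule described above concrete (cpu limit if set, otherwise cpu shares divided by 1024), here is a minimal Java sketch. It is not HotSpot's actual code: the class and method names are made up for illustration, rounding up is an assumption, and the real implementation handles more cases than this.

```java
class CgroupCpuEstimate {
    // 1024 shares are treated as one CPU, as described above.
    static final int PER_CPU_SHARES = 1024;

    /**
     * hostCpus: processors visible on the node.
     * quotaUs / periodUs: cpu limit; values <= 0 mean "not set".
     * shares: cpu shares; values <= 0 mean "not set".
     */
    static int estimateProcessors(int hostCpus, long quotaUs, long periodUs, long shares) {
        int limit = hostCpus;
        if (quotaUs > 0 && periodUs > 0) {
            // A cpu limit is set: use quota / period, rounded up (assumption).
            limit = (int) Math.ceil((double) quotaUs / periodUs);
        } else if (shares > 0) {
            // No limit: fall back to shares / 1024, rounded up (assumption).
            limit = (int) Math.ceil((double) shares / PER_CPU_SHARES);
        }
        // Never report fewer than 1 or more than the host has.
        return Math.max(1, Math.min(hostCpus, limit));
    }

    public static void main(String[] args) {
        // Kubernetes pod with a 500m request (512 shares) and no limit on a 64-core node.
        System.out.println(estimateProcessors(64, -1, -1, 512));
        // Container started with --cpus 4 (quota 400000 us per 100000 us period).
        System.out.println(estimateProcessors(64, 400_000, 100_000, 2));
    }
}
```

Under these assumptions the first call yields 1 and the second yields 4, which is the behaviour complained about here: a low request on a large node collapses to a single processor.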
Has anyone ever observed this as well? Am I missing something here? Or did the JVM devs actually make things worse when implementing cgroup limits for the cpu?
For reference:
https://bugs.openjdk.java.net/browse/JDK-8146115
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
cat /sys/fs/cgroup/cpu/cpu.shares while inside a container, locally or on a cluster of your choice, to see the setting used at startup
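The same check can be done from inside the JVM, next to what the JVM itself reports. This is a small sketch that assumes the cgroup v1 layout under /sys/fs/cgroup/cpu (other layouts use different paths); the class name and the -1 fallback are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CgroupProbe {
    // Reads a single number from a cgroup file; returns the fallback if the
    // file is missing, unreadable, or does not contain a plain number.
    static long readLongOrDefault(Path file, long fallback) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return Long.parseLong(text);
        } catch (IOException | NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumed cgroup v1 mount point.
        Path cpu = Paths.get("/sys/fs/cgroup/cpu");
        System.out.println("cpu.shares          = " + readLongOrDefault(cpu.resolve("cpu.shares"), -1));
        System.out.println("cpu.cfs_quota_us    = " + readLongOrDefault(cpu.resolve("cpu.cfs_quota_us"), -1));
        System.out.println("cpu.cfs_period_us   = " + readLongOrDefault(cpu.resolve("cpu.cfs_period_us"), -1));
        // What the JVM decided to use for sizing its thread pools.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}
```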

Being a developer of a large scale service (>15K containers running distributed Java applications in our own cloud), I also admit that the so-called "Java container support" is far from perfect. At the same time, I can understand the reasoning of the JVM developers who implemented the current resource detection algorithm.
The problem is, there are so many different cloud environments and use cases for running containerized applications that it's virtually impossible to address the whole variety of configurations. What you claim to be the "best practice" for most apps in Kubernetes is not necessarily typical for other deployments. E.g. it's definitely not the usual case for our service, where most containers require a certain minimum guaranteed amount of CPU resources, and thus also have a quota they cannot exceed, in order to guarantee CPU for other containers. This policy works well for low-latency tasks. OTOH, the policy you've described is better suited to high-throughput or batch tasks.
The goal of the current implementation in HotSpot JVM is to support popular cloud environments out of the box, and to provide the mechanism for overriding the defaults.
There is an email thread where Bob Vandette explains the current choice. There is also a comment in the source code, describing why JVM looks at cpu.shares and divides it by 1024.
/*
* PER_CPU_SHARES has been set to 1024 because CPU shares' quota
* is commonly used in cloud frameworks like Kubernetes[1],
* AWS[2] and Mesos[3] in a similar way. They spawn containers with
* --cpu-shares option values scaled by PER_CPU_SHARES. Thus, we do
* the inverse for determining the number of possible available
* CPUs to the JVM inside a container. See JDK-8216366.
*
* [1] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
* In particular:
* When using Docker:
* The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially
* fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the
* --cpu-shares flag in the docker run command.
* [2] https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
* [3] https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/docker/docker.cpp#L648
* https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/slave/containerizer/mesos/isolators/cgroups/constants.hpp#L30
*/
As to parallelism, I also agree with the HotSpot developers that the JVM should take cpu.quota and cpu.shares into account when estimating the number of available CPUs. When a container has a certain number of vcores assigned to it (in either way), it can rely only on this amount of resources, since there is no guarantee that more resources will ever be available to the process. Consider a container with 4 vcores running on a 64-core machine. Any CPU intensive task (GC is an example of such a task) running in 64 parallel threads will quickly exhaust the quota, and the OS will throttle the container for a long period. E.g. for about 94 out of every 100 ms the application will be in a stop-the-world pause, since the default period for accounting quota (cpu.cfs_period_us) is 100 ms.
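The 94 ms figure follows from simple arithmetic, sketched below in Java. The sketch assumes every thread is CPU-bound and runs on its own core until the quota is spent; the class and method names are made up for illustration.

```java
class QuotaThrottle {
    // Returns how many milliseconds of each accounting period the container
    // spends throttled, assuming busyThreads CPU-bound threads run in parallel.
    static double throttledMsPerPeriod(double vcores, int busyThreads, double periodMs) {
        // CPU time the container may consume per period, e.g. 4 * 100 ms = 400 ms.
        double quotaCpuMs = vcores * periodMs;
        // Wall-clock time until that budget is used up, e.g. 400 / 64 = 6.25 ms.
        double wallMsUntilExhausted = quotaCpuMs / busyThreads;
        // The rest of the period is spent throttled (never negative).
        return Math.max(0.0, periodMs - wallMsUntilExhausted);
    }

    public static void main(String[] args) {
        // 4 vcores, 64 parallel GC threads, default 100 ms period.
        System.out.println(throttledMsPerPeriod(4, 64, 100)); // prints 93.75
    }
}
```

With 4 threads instead of 64 the same calculation gives 0 ms of throttling, which is why sizing the GC thread count from the quota avoids the pause.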
Anyway, if the algorithm does not work well in your particular case, it's always possible to override the number of available processors with the -XX:ActiveProcessorCount=N option, or to disable container awareness entirely with -XX:-UseContainerSupport.

Related

Optimal number of threads [duplicate]

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a machine with two quad-core CPUs running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.

I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.

I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which support 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)

The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by measuring the processing times Tn and Tm for two arbitrary numbers of threads n and m. For linear algorithms, the optimal number of threads will be N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) ).
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
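As a quick way to try the formula above, here is a minimal Java sketch that evaluates it for two measurements. The class name and the sample timings are made up for illustration; the timings were generated from T(k) = 100/k + (k - 1), a model chosen only because the formula recovers its minimum exactly.

```java
class OptimalThreads {
    // Evaluates N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) )
    // for measured times tn (with n threads) and tm (with m threads).
    static double optimalThreads(int n, double tn, int m, double tm) {
        double numerator = (double) m * n * (tm * (n - 1) - tn * (m - 1));
        double denominator = n * tn - m * tm;
        return Math.sqrt(numerator / denominator);
    }

    public static void main(String[] args) {
        // Hypothetical measurements: 51 s with 2 threads, 28 s with 4 threads.
        System.out.println(optimalThreads(2, 51.0, 4, 28.0)); // prints 10.0
    }
}
```

The result is only as good as the two measurements and the assumption that the workload follows the linear model, so it is best treated as a starting point for testing.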

The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.

I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size, so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example assume there's 8 arbitrary unit (or AU) of work in the program which will be executed on a 4 core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than available cores, depending on the data size you're dealing with.
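The arithmetic behind the two cases can be written down as a tiny Java sketch. It uses the same simplification as above (total time = time per thread * threads / cores, no scheduling overhead); the class and method names are made up for illustration.

```java
class StrongScaling {
    // Simplified wall-clock estimate: threads share the cores evenly
    // and context-switch overhead is ignored.
    static double wallSeconds(double secondsPerThread, int threads, int cores) {
        return secondsPerThread * threads / cores;
    }

    public static void main(String[] args) {
        // Case 1: 4 threads, 2AU each, 10 s per thread on 4 cores.
        System.out.println(wallSeconds(10.0, 4, 4)); // prints 10.0
        // Case 2: 8 threads, 1AU each, 2 s per thread on 4 cores.
        System.out.println(wallSeconds(2.0, 8, 4));  // prints 4.0
    }
}
```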

4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.

Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three-five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should find that the four thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, i.e., the dataset you're loading into memory is huge, you're getting lots of cache misses, etc., such that 2 threads are optimal.
You can't know until you test.
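A minimal Java sketch of such a ramp-up is shown below. The workload (summing i % 7 over a range), the thread counts and the item count are placeholders made up for illustration; swap in your own task. It runs three trials per thread count and prints the best time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadCountBenchmark {
    // Placeholder CPU-bound workload: sum of i % 7 over [from, to).
    static long work(long from, long to) {
        long acc = 0;
        for (long i = from; i < to; i++) {
            acc += i % 7;
        }
        return acc;
    }

    // Splits [0, totalItems) into one chunk per thread and returns the combined result.
    static long runWith(int threads, long totalItems) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = totalItems / threads;
            for (int t = 0; t < threads; t++) {
                final long from = t * chunk;
                // The last thread also takes the remainder.
                final long to = (t == threads - 1) ? totalItems : from + chunk;
                Callable<Long> task = () -> work(from, to);
                parts.add(pool.submit(task));
            }
            long sum = 0;
            for (Future<Long> part : parts) {
                sum += part.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] counts = {1, 2, 4, 8, 16, 32, 64, 100};
        long items = 200_000_000L;
        for (int threads : counts) {
            long best = Long.MAX_VALUE;
            for (int trial = 0; trial < 3; trial++) {
                long start = System.nanoTime();
                runWith(threads, items);
                best = Math.min(best, System.nanoTime() - start);
            }
            System.out.printf("%3d threads: %d ms%n", threads, best / 1_000_000);
        }
    }
}
```

Plotting the printed times against the thread count gives the graph described above. Note that the first trials include JIT warm-up, which is one reason to repeat each measurement.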

You can see how many processes and threads are currently running on your machine with the htop or ps commands.
The man page for the 'ps' command describes the options:
man ps
If you want to calculate number of all users process, you can use one of these commands:
ps -aux| wc -l
ps -eLf | wc -l
Counting the processes of a single user:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.

The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.

One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.

Speaking from a computation and memory bound point of view (scientific computing), 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching and most likely very poor memory locality.
But it also depends on your architecture. From what I have heard, Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.

Hope this makes sense: check the CPU and memory utilization and set a threshold value. If the threshold is crossed, don't allow new threads to be created; otherwise allow them.

JVM in Docker uses 100% CPU across multiple cores

I have a Spring Boot app running in Docker which seems to struggle with its processing, I'll need to fix it.
Anyway, to get an idea of where the bottleneck is I ran a simple top, and I see that my Java process uses 100% CPU on a 4 core machine. Good enough, I guess I need to parallelize some expensive actions in order to spread them across multiple cores.
The thing is even if my main Java process seems to max out around 100%, machine wise I see that all 4 cores are used around 25%.
I'm clearly not an expert in Docker or JVM but I have to do something about it :/
To me, it looks like my JVM only sees 1 core but Docker manages to spread the work across all cores.
Any thoughts about what might be going on ?
Oh and about the versions, it's running Docker 17.05, JDK 7. I might update Docker but not Java :(

I faced such an issue with Docker on AWS EC2 with 64 cores. The problem was that only one core was visible when calling Java with no options. All cores were visible if I used the -XX:ActiveProcessorCount or -XX:-UseContainerSupport options. But in the latter case each of the cores was used less than 2-3%, summing up to about 100% altogether. After a long search I found that tools like htop can see all physical cores from the container, but there can be constraints limiting the available CPU capacity, for instance the --cpu-shares option. One can check its value from within the container with cat /sys/fs/cgroup/cpu/cpu.shares. 1024 points correspond to 1 core. For example, one can set --cpu-shares 716 and only 70% of one core will be available to the container. This was my case. In your case the number of physical processors is 4 and cpu.shares probably has 1024 points. Thus, you load 25% of every core.
Useful links for reference:
JVM in container calculates processors wrongly?
https://bugs.openjdk.org/browse/JDK-8146115
https://docs.docker.com/config/containers/resource_constraints/

How to define program's requirements

Is there any easy, cheap (i.e. not requiring me to test the program on many hardware configurations) and painless method to define the hardware requirements (like CPU, RAM etc.) that are needed to run my own program? How should it be done?
I have a quite resource-hungry program written in Java and I don't know how to define a hardware specification that will be enough to run this application smoothly.

No, I don't think there is any generally applicable way to determine the minimum requirements that does not involve testing on some specified reference hardware.
You may be able to find some of the limitations by using Virtual Machines of some kind - it is easier to modify the parameters of some VM than modifying hardware. But there are artifacts generated by the interaction between host and VM that may influence your results.
It is also difficult to define the criteria for "acceptable performance" in general without knowing a lot about use cases.
Many programs will use more resources if they are available, but can also get along with less.
For example, consider a program using a thread pool with a size based on the number of CPU cores. When running on a CPU with more cores, more work can be done in parallel, but at the same time overhead due to thread creation, synchronisation and aggregation of results increases. The effects are non-linear in the number of CPUs and depend a lot on the actual program and data. Similarly, the effects of decreasing available memory range from potentially throwing OutOfMemoryErrors for some inputs (but possibly not for others) to just running GC a bit more frequently (and the effects of that depend on the GC strategy, ranging from noticeable freezes to just a bit more CPU load).
All that is without even considering that programs don't usually live in isolation - they run on an operating system in parallel with other tasks that also consume resources.

Why do we use multi application server instances on the same server

I guess there is a good reason, but I don't understand why sometimes we put for example 5 instances having the same webapplications on the same physical server.
Has it something to do with an optimisation for a multi processor architecture?
The max allowed ram limit for JVM or something else?

Hmmm... After a long time I am seeing this question again :)
Well, running multiple JVM instances on a single machine solves a lot of issues. First off, let us face this: although JDK 1.7 is coming into the picture, a lot of legacy applications were developed using JDK 1.3, 1.4 or 1.5, and a major chunk of JDK usage is still divided among them.
Now to your question:
Historically, there are three primary issues that system architects have addressed by deploying multiple JVMs on a single box:
Garbage collection inefficiencies: As heap sizes grow, garbage collection cycles--especially for major collections--tended to introduce significant delays into processing, thanks to the single-threaded GC. Multiple JVMs combat this by allowing smaller heap sizes in general and enabling some measure of concurrency during GC cycles (e.g., with four nodes, when one goes into GC, you still have three others actively processing).
Resource utilization: Older JVMs were unable to scale efficiently past four CPUs or so. The answer? Run a separate JVM for every 2 CPUs in the box (mileage may vary depending on the application, of course).
64-bit issues: Older JVMs were unable to allocate heap sizes beyond the 32-bit maximum. Again, multiple JVMs allow you to maximize your resource utilization.
Availability: One final reason that people sometimes run multiple JVMs on a single box is for availability. While it's true that this practice doesn't address hardware failures, it does address a failure in a single instance of an application server.
Taken from ( http://www.theserverside.com/discussions/thread.tss?thread_id=20044 )
I have mostly seen weblogic. Here is a link for further reading:
http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/WLSTuning.html#wp1104298
Hope this will help you.

I guess you are referring to application clustering.
AFAIK, JVM's spawned with really large heap size have issues when it comes to garbage collection though I'm sure by playing around with the GC algorithm and parameters you can bring down the damage to a minimum. Plus, clustered applications don't have a single point of failure. If one node goes down, the remaining nodes can keep servicing the clients. This is one of the reasons why "message based architectures" are a good fit for scalability. Each request is mapped to a message which can then be picked up by any node in a cluster.
Another point would be to service multiple requests simultaneously in case your application unfortunately uses the synchronized keyword liberally. We currently have a legacy application which has a lot of shared state (unfortunately) and hence concurrent request handling is done by spawning around 20 JVM processes with a central dispatching unit which does all the dispatching work. ;-)

I would suggest you use at least one JVM per NUMA region. If a single JVM uses more than one NUMA region (often a single CPU), the performance can degrade significantly, due to a significant increase in the cost of accessing the main memory of another CPU.
Additionally using multiple servers can allow you to
use different versions of Java or your application server.
isolate different applications which could interfere (they shouldn't but they might)
limit GC pause times between services.
EDIT: It could be historical. There may have been any number of reasons to have separate JVMs in the past but since you don't know what they were, you don't know if they still apply and it may be simpler to leave things as they are.

An additional reason to use multiple instances is serviceability.
For example, if you host multiple different applications for multiple customers, then having separate instances of the appserver for each application can make life a little easier when you have to do an appserver restart during a release.

Suppose you have an average configuration host and have installed a single instance of the web/app server. Now your application becomes more popular and the number of hits doubles. What do you do now?
Add one more physical server of the same configuration, install the application and load balance these two hosts.
This is not the end of the story for your application. Your application will keep on becoming more popular, hence the need to scale it up. What's going to be your strategy?
keep adding more hosts of same configuration
buy a more powerful machine where you can create more logical application servers
Which option will you go for?
You will do a cost analysis, which will involve factors like the actual hardware cost, the cost of managing these servers (power cost, space occupied in the data center) etc.
It turns out that the decision is not very easy. And in most cases it's more cost effective to have a more powerful machine.

one high-end server with one Application Server or multiple Application Servers?

If I have a high-end server, for example with 1T memory and 8x4core CPU...
will it bring more performance if I run multiple App Server (on different JVM) rather than just one App Server?
On the App Server I will run some services (EAR with message driven beans) which exchange messages with each other.
btw, has java 64bit now no memory limitation any more?
http://java.sun.com/products/hotspot/whitepaper.html#64

will it bring more performance if I run multiple App Server (on different JVM) rather than just one App Server?
There are several things to take into account:
A single app server means a single point of failure. For many applications, this is not an option and using horizontal and vertical scaling is a common configuration (i.e. multiple VMs per machine and multiple machines). And adding more machines is obviously easier/cheaper if they are small.
A large heap takes longer to fill so the application runs longer before a garbage collection occurs. However, a larger heap also takes longer to compact and causes garbage collection to take longer. Sizing the VM usually means finding a good compromise between frequency and duration (in other words, you don't always want to give as much RAM as possible to one VM)
So, in my experience, running multiple machines hosting multiple JVMs is the usual choice (and is usually cheaper than a huge beast and gives you more flexibility).

There is automatically a performance hit when you need to do out-of-process communications, so the question is if the application server does not scale well enough so this can pay off.
As a basic rule of thumb the JVM design allows the usage of any number of CPUs and any amount of RAM the operating system provides. The actual limits are JVM implementation specific, and you need to read the specifications very carefully before choosing to see if there are any limits relevant to you.
Given you have a JVM which can utilize the hardware, you then need an app server which can scale appropriately. A common bottleneck these days is the amount of web requests that can be processed per second - a modern server should be able to process 10000 requests per second (see http://www.kegel.com/c10k.html) but not all do.
So, first of all identify your most pressing needs (connections per second? memory usage? network bandwidth?) and use that to identify the best platform + jvm + app server combination. If you have concrete needs, vendors will usually be happy to assist you to make a sale.

Most likely you will gain by running multiple JVMs with smaller heaps instead of a single large JVM. There are a couple of reasons for this:
Smaller heaps mean shorter garbage collections
More JVMs mean less contention for internal resources inside the JVM, such as thread pools and other synchronized access.
How many JVMs you should fit into that box depends on what the application does. The best way to determine this is to set up a load test that simulates production load and observe how the number of requests the system can handle grows with the number of added JVMs. At some point you will see that adding more JVMs does not improve throughput. That's where you should stop.
Yet, there is another consideration. It is better to have multiple physical machines rather than a single big fat box. This is reliability. Should this box go offline for some reason, it will take with it all the app servers that are running inside it. The infrastructure running many separate smaller physical machines is going to be less affected by the failure of a single machine as compared to a single box.
