I have a Spring Boot app running in Docker that seems to struggle with its processing, and I need to fix it.
Anyway, to get an idea of where the bottleneck is I ran a simple top, and I see that my Java process uses 100% CPU on a 4-core machine. Fair enough, I guess I need to parallelize some expensive actions in order to spread them across multiple cores.
The thing is, even though my main Java process seems to max out around 100%, machine-wise I see that all 4 cores are used at around 25% each.
I'm clearly not an expert in Docker or the JVM, but I have to do something about it :/
To me, it looks like my JVM only sees 1 core, but Docker manages to spread the work across all cores.
Any thoughts on what might be going on?
Oh, and about the versions: it's running Docker 17.05 and JDK 7. I might update Docker, but not Java :(
I faced such an issue with Docker on an AWS EC2 instance with 64 cores. The problem was that only one core was visible when Java was started with no options. All cores were visible if I used the -XX:ActiveProcessorCount or -XX:-UseContainerSupport options, but in the latter case each core was used at only 2-3%, summing to about 100% in total.
After a long search I found that tools like htop can see all the physical cores from inside the container, but there can be constraints limiting the CPU capacity actually available, for instance the --cpu-shares option. You can check its value from within the container with cat /sys/fs/cgroup/cpu/cpu.shares; 1024 points correspond to 1 core. For example, with --cpu-shares 716 only 70% of one core is available inside the container. That was my case.
In your case the number of physical processors is 4 and cpu.shares probably has 1024 points, so you load 25% of every core.
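If you want to double-check this from inside the container, a minimal sketch like the following (the class name is mine; the cgroup v1 path is the one mentioned above and may not exist on cgroup v2 hosts) prints what the JVM believes next to what the cgroup says:
import java.io.BufferedReader;
import java.io.FileReader;

public class CpuCheck {
    public static void main(String[] args) {
        // What the JVM believes it may use; this value drives GC, JIT and
        // default thread-pool sizing.
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());

        // cgroup v1 file mentioned above; absent on cgroup v2 hosts.
        try (BufferedReader reader = new BufferedReader(
                new FileReader("/sys/fs/cgroup/cpu/cpu.shares"))) {
            System.out.println("cpu.shares = " + reader.readLine());
        } catch (Exception e) {
            System.out.println("cpu.shares not readable: " + e.getMessage());
        }
    }
}
Run it with and without -XX:ActiveProcessorCount or -XX:-UseContainerSupport and compare the first number.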
Useful links for reference:
JVM in container calculates processors wrongly?
https://bugs.openjdk.org/browse/JDK-8146115
https://docs.docker.com/config/containers/resource_constraints/
Related
A quick question regarding performance of the Spring Webflux event loop.
We performed some rudimentary performance tests with Gatling + JMeter + YourKit for Spring Webflux.
We noticed a major difference between two scenarios:
1 - Scenario 1, our test setup: send N HTTP requests to the Webflux web app deployed on a bare-metal host. We used an out-of-the-box MacBook Pro with 8 cores, then repeated the same test on a physical server with 8 cores. The results of the two matched. We saw the event loop juggling the IO across the cores, we saw Webflux shine, and we were very happy.
2 - Scenario 2, exact same test, except it is now running on virtual cores. By virtual cores, I mean we request VMs with 8 cores, or we request 8 CPUs from Kubernetes. Here, we saw a major drop in performance.
Just a simple question: is there supposed to be a difference between Webflux and its event loop running on bare-metal physical cores and running on virtual ones?
Thank you
Your operations are executed by real cores in both bare-metal and virtual mode. The difference is that with virtual cores there is multi-tenancy on the real cores, and there are other tasks those cores must also run.
Your performance also depends on other factors such as network, disk I/O and so on. But if we assume those are fixed, you would probably get roughly equal results in both setups if no other VMs or tasks were running alongside your VM. In many cases, however, you do have interference from other tasks in the virtualized setup and get lower performance than on bare metal. Furthermore, in virtual mode you may see different performance on every single run.
Take a look at Bare Metal vs. Virtualization: What Performs Better? and see the performance comparison depicted at the end.
I recently did some research again, and stumbled upon this. Before crying about it to the OpenJDK team, I wanted to see if anyone else has observed this, or disagrees with my conclusions.
So, it's widely known that the JVM for a long time ignored memory limits applied to the cgroup. It's almost as widely known that it now takes them into account, starting with Java 8 update something, and 9 and higher. Unfortunately, the calculations done based on the cgroup limits are so useless that you still have to do everything by hand. See google and the hundreds of articles on this.
What I only discovered a few days ago, and did not read in any of those articles, is how the JVM checks the processor count in cgroups. The processor count is used to decide on the number of threads used for various tasks, including also garbage collection. So getting it correct is important.
In a cgroup (as far as I understand, and I'm no expert) you can set a limit on the cpu time available (--cpus Docker parameter). This limits time only, and not parallelism. There are also cpu shares (--cpu-shares Docker parameter), which are a relative weight to distribute cpu time under load. Docker sets a default of 1024, but it's a purely relative scale.
Finally, there are cpu sets (--cpuset-cpus for Docker) to explicitly assign the cgroup, and thus the Docker container, to a subset of processors. This is independent of the other parameters, and actually impacts parallelism.
So, when it comes to checking how many threads my container can have running in parallel, as far as I can tell, only the cpu set is relevant. The JVM though ignores that, instead using the cpu limit if set, otherwise the cpu shares (assuming the 1024 default to be an absolute scale). This is IMHO already very wrong. It calculates available cpu time to size thread pools.
It gets worse in Kubernetes. It's AFAIK best practice to set no cpu limit, so that the cluster nodes have high utilization. Also, you should set a low cpu request for most apps, since they will be idle most of the time and you want to schedule many apps on one node. Kubernetes sets the request in milli cpus as the cpu share, which is most likely below 1000m. The JVM then always assumes one processor, even if your node is running on some 64-core cpu monster.
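To make that concrete, here is a small sketch of the arithmetic (illustrative only; the class, method names and example values are mine, not the actual HotSpot or Kubernetes code): a CPU request below 1000m ends up as a --cpu-shares value below 1024, which the JVM then rounds down to a single visible processor.
// Illustrative arithmetic only, not the real implementations.
public class SharesMath {
    // Kubernetes: request in millicpu -> --cpu-shares (core value * 1024, min 2).
    static int requestToShares(int milliCpu) {
        return Math.max(2, milliCpu * 1024 / 1000);
    }

    // JVM heuristic described above: shares / 1024, but never less than 1.
    static int sharesToProcessors(int cpuShares) {
        return Math.max(1, cpuShares / 1024);
    }

    public static void main(String[] args) {
        System.out.println(sharesToProcessors(requestToShares(500)));  // prints 1
        System.out.println(sharesToProcessors(requestToShares(4000))); // prints 4
    }
}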
Has anyone ever observed this as well? Am I missing something here? Or did the JVM devs actually make things worse when implementing cgroup limits for the cpu?
For reference:
https://bugs.openjdk.java.net/browse/JDK-8146115
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
cat /sys/fs/cgroup/cpu/cpu.shares while inside a container, locally or on a cluster of your choice, to see the setting used at startup
Being a developer of a large-scale service (>15K containers running distributed Java applications in our own cloud), I also admit that the so-called "Java container support" is far from perfect. At the same time, I can understand the reasoning of the JVM developers who implemented the current resource detection algorithm.
The problem is, there are so many different cloud environments and use cases for running containerized applications that it's virtually impossible to address the whole variety of configurations. What you claim to be the "best practice" for most apps in Kubernetes is not necessarily typical for other deployments. E.g. it's definitely not the usual case for our service, where most containers require a certain guaranteed minimum amount of CPU resources, and thus also have a quota they cannot exceed, in order to guarantee CPU for other containers. This policy works well for low-latency tasks. OTOH, the policy you've described suits high-throughput or batch tasks better.
The goal of the current implementation in HotSpot JVM is to support popular cloud environments out of the box, and to provide the mechanism for overriding the defaults.
There is an email thread where Bob Vandette explains the current choice. There is also a comment in the source code describing why the JVM looks at cpu.shares and divides it by 1024:
/*
* PER_CPU_SHARES has been set to 1024 because CPU shares' quota
* is commonly used in cloud frameworks like Kubernetes[1],
* AWS[2] and Mesos[3] in a similar way. They spawn containers with
* --cpu-shares option values scaled by PER_CPU_SHARES. Thus, we do
* the inverse for determining the number of possible available
* CPUs to the JVM inside a container. See JDK-8216366.
*
* [1] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
* In particular:
* When using Docker:
* The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially
* fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the
* --cpu-shares flag in the docker run command.
* [2] https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
* [3] https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/docker/docker.cpp#L648
* https://github.com/apache/mesos/blob/3478e344fb77d931f6122980c6e94cd3913c441d/src/slave/containerizer/mesos/isolators/cgroups/constants.hpp#L30
*/
As to parallelism, I also second the HotSpot developers in that the JVM should take cpu.quota and cpu.shares into account when estimating the number of available CPUs. When a container has a certain number of vcores assigned to it (in either way), it can rely only on that amount of resources, since there is no guarantee that more resources will ever be available to the process. Consider a container with 4 vcores running on a 64-core machine. Any CPU-intensive task (GC is an example of such a task) running in 64 parallel threads will quickly exhaust the quota, and the OS will throttle the container for a long period: e.g. for 94 out of every 100 ms the application will be in a stop-the-world pause, since the default period for accounting the quota (cpu.cfs_period_us) is 100 ms.
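To make the arithmetic behind that throttling explicit, here is an illustrative sketch (the class and the numbers are mine, chosen to match the 4-vcore / 64-core example above, not taken from any real implementation):
// Illustrative numbers for the 4-vcore container on a 64-core host above.
public class ThrottleMath {
    public static void main(String[] args) {
        double periodMs = 100;               // cpu.cfs_period_us default
        double quotaMs  = 4 * periodMs;      // 4 vcores of CPU time per period
        int    threads  = 64;                // GC sized for the 64-core host
        double runMs    = quotaMs / threads; // ~6.25 ms until the quota is gone
        double stallMs  = periodMs - runMs;  // ~93.75 ms throttled per period
        System.out.printf("runs %.2f ms, throttled %.2f ms of every %.0f ms%n",
                runMs, stallMs, periodMs);
    }
}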
Anyway, if the algorithm does not work well in your particular case, it's always possible to override the number of available processors with -XX:ActiveProcessorCount option, or disable container awareness entirely with -XX:-UseContainerSupport.
What are the principles for app deployment in Docker?
I see two concepts:
Create an image per app version
Create app binaries and somehow deploy them to a running container (utilizing e.g. Tomcat's hot deploy)
Maybe there are others. I personally like the first, but it must produce a tremendous amount of data if you release very often. How would one choose one over the other?
I'd like to know how others deploy their Java applications so I can form my own opinion.
Update 2019: See "Docker memory limit causes SLUB unable to allocate with large page cache"
I mentioned in "Docker support in Java 8 — finally!" last May (2019) that new evolutions from Java 10, backported to Java 8, mean that the memory available is reported more accurately to the JVM running in Docker.
As mbluke adds in the comments:
The resource issues have been addressed in later versions of Java.
As of Java SE 8u131, and in JDK 9, the JVM is transparently Docker-aware with respect to Docker CPU limits.
Starting with Java JDK 8u131+ and JDK 9, there's an experimental VM option that allows the JVM ergonomics to read the memory values from cgroups.
To enable it, you must explicitly set the parameters -XX:+UnlockExperimentalVMOptions and -XX:+UseCGroupMemoryLimitForHeap on the JVM. Java 10 has these set by default, and there is no need for the flags.
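A quick way to see what the JVM actually settled on inside your container is a trivial check like this (a minimal sketch; nothing container-specific in the code itself): print the maximum heap and compare it against the container's memory limit, with and without the flags above.
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM is willing to grow to; compare this against the
        // container memory limit to see whether the cgroup limit was honoured.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxHeapMb + " MB");
    }
}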
January 2018: original answer
As with any trade-off, it depends on your situation/release cycle.
But also consider that Java might be ill-suited to a Docker environment in the first place, depending on its nature.
See "Nobody puts Java in a container"
So we have finished developing our JVM-based application, and now we package it into a Docker image and test it locally on our notebook. All works great, so we deploy 10 instances of that container onto our production cluster. All of a sudden the application is throttling and not achieving the same performance as we saw on our test system. And our test system is even this high-performance system with 64 cores…
What has happened?
In order to allow multiple containers to run isolated side by side, we have specified each to be limited to one cpu (or the equivalent ratio in CPU shares). Unfortunately, the JVM will see the overall number of cores on that node (64) and use that value to initialize the number of default threads we have seen earlier. As we started 10 instances, we end up with:
10 * 64 Jit Compiler Threads
10 * 64 Garbage Collection threads
10 * 64 ….
And our application, being limited in the number of cpu cycles it can use, is mostly dealing with switching between different threads and cannot get any actual work done.
All of a sudden the promise of containers, "package once, run anywhere", seems violated…
So to be specific, how do you cope with the amount of data generated when you build an image per release? If you build your app every time on top of a Tomcat image, the disk space needed to store the images will grow quickly, right?
2 techniques:
multi-stage builds, to make sure your application does not include anything but what is needed at runtime (and no compilation artifacts). See my answer here;
bind mounts: you could simply copy your WARs into a volume mounted by a single Tomcat container.
I have a situation in which I need to create thousands of instances of a class from a third-party API. Each new instance creates a new thread. I start getting OutOfMemoryError once there are more than 1000 threads, but my application requires creating 30,000 instances. Each instance is active all the time. The application is deployed on a 64-bit Linux box with 8 GB of RAM and only 2 GB available to my application.
The way the third party library works, I cannot use the new Executor framework or thread pooling.
So how can I solve this problem?
Note that using a thread pool is not an option. All threads are running all the time to capture events.
Since the memory size of the Linux box is not in my control: if I had the choice to have 25 GB available to my application on a 32 GB system, would that solve my problem, or would the JVM still choke?
Are there some optimal Java settings for the above scenario?
The system uses Oracle Java 1.6 64 bit.
I concur with Ryan's Answer. But the problem is worse than his analysis suggests.
HotSpot JVMs have a hard-wired minimum stack size: 128k for Java 6 and 160k for Java 7.
That means that even if you set the stack size to the smallest possible value, you'd need to use roughly twice your allocated space ... just for thread stacks.
In addition, having 30k native threads is liable to cause problems on some operating systems.
I put it to you that your task is impossible. You need to find an alternative design that does not require you to have 30k threads simultaneously. Alternatively, you need a much larger machine to run the application.
Reference: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2012-June/003867.html
I'd say give up now and figure out another way to do it. The default stack size is 512K. At 30k threads, that's 15G in stack space alone. To fit into 2G, you'll need to cut it down to less than 64K stacks, and that leaves you with zero memory for the heap, including all the Thread objects, or the JVM itself.
And that's just the most obvious problem you're likely to run into when running that many simultaneous threads in one JVM.
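For what it's worth, you can ask for a smaller stack per thread through the four-argument Thread constructor, which does exist in the standard library; the sketch below is illustrative only (thread name and the 64 KB request are mine), and as noted above the JVM treats the value as a hint and will not go below its platform minimum.
public class SmallStackThread {
    public static void main(String[] args) {
        Runnable idle = new Runnable() {       // anonymous class, Java 6 compatible
            public void run() {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                    // exit when interrupted
                }
            }
        };
        long requestedStackBytes = 64 * 1024;  // 64 KB request; actual minimum is higher
        Thread t = new Thread(null, idle, "worker-0", requestedStackBytes);
        t.start();
        t.interrupt();                         // let the demo exit cleanly
        // 30,000 threads * 512 KB default stack is roughly 15 GB of address
        // space for stacks alone, which is why this approach does not fit in 2 GB.
    }
}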
I think we are missing lots of details, but would a distributed platform work? Each individual instance would manage a range of your class instances. Those platforms could run on different PCs or virtual machines and communicate with each other.
I had the same problem with an SNMP provider that required a thread for each outstanding get (I wanted to have tens of thousands of outstanding gets going on at once). Now that NIO exists I'd just rewrite the library myself if I had to do this again.
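For reference, the NIO-style rewrite hinted at above looks roughly like this: one thread and one Selector multiplexing many non-blocking sockets instead of one blocking thread per outstanding request. This is only a rough sketch; the host, port and connection count are placeholders.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();

        // Register many non-blocking connections; host, port and count are placeholders.
        for (int i = 0; i < 500; i++) {
            SocketChannel channel = SocketChannel.open();
            channel.configureBlocking(false);
            channel.connect(new InetSocketAddress("example.com", 80));
            channel.register(selector, SelectionKey.OP_CONNECT);
        }

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (selector.select() > 0) {
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isConnectable()) {
                    SocketChannel channel = (SocketChannel) key.channel();
                    if (channel.finishConnect()) {
                        key.interestOps(SelectionKey.OP_READ);
                    }
                } else if (key.isReadable()) {
                    buffer.clear();
                    if (((SocketChannel) key.channel()).read(buffer) < 0) {
                        key.channel().close();   // peer closed the connection
                    }
                    // ...hand the bytes in 'buffer' to the protocol handler here
                }
            }
        }
    }
}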
You cannot solve it in "Java code" or configuration. Windows chokes at around 2,000-3,000 threads in my experience (this may have changed in later versions). When I was doing this, I was surprised to find that Linux supported even fewer threads (around 1,000).
When the system stops supplying threads, "Out of Memory" is the exception you should expect to see, so I'm sure that's it; I started getting this exception long before I actually ran out of memory. Perhaps you could hack Linux somehow to support more, but I have no idea how.
Using the concurrent package will not help here. If you could switch over to "green" threads it might, but that might require recompiling the JVM (it would be nice if it were available as a command-line switch, but I really don't think it is).
I guess there is a good reason, but I don't understand why we sometimes put, for example, 5 instances of the same web application on the same physical server.
Does it have something to do with optimisation for a multi-processor architecture?
The maximum allowed RAM limit for the JVM, or something else?
Hmmm... After a long time I am seeing this question again :)
Well, multiple JVM instances on a single machine solve a lot of issues. First of all, let's face it: although JDK 1.7 is coming into the picture, a lot of legacy applications were developed using JDK 1.3, 1.4, or 1.5, and a major chunk of JDK usage is still spread among them.
Now to your question:
Historically, there are three primary issues that system architects have addressed by deploying multiple JVMs on a single box:
Garbage collection inefficiencies: As heap sizes grew, garbage collection cycles, especially for major collections, tended to introduce significant delays into processing, thanks to the single-threaded GC. Multiple JVMs combat this by allowing smaller heap sizes in general and enabling some measure of concurrency during GC cycles (e.g., with four nodes, when one goes into GC, you still have three others actively processing).
Resource utilization: Older JVMs were unable to scale efficiently past four CPUs or so. The answer? Run a separate JVM for every 2 CPUs in the box (mileage may vary depending on the application, of course).
64-bit issues: Older JVMs were unable to allocate heap sizes beyond the 32-bit maximum. Again, multiple JVMs allow you to maximize your resource utilization.
Availability: One final reason that people sometimes run multiple JVMs on a single box is for availability. While it's true that this practice doesn't address hardware failures, it does address a failure in a single instance of an application server.
Taken from ( http://www.theserverside.com/discussions/thread.tss?thread_id=20044 )
I have mostly seen WebLogic. Here is a link for further reading:
http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/WLSTuning.html#wp1104298
Hope this will help you.
I guess you are referring to application clustering.
AFAIK, JVMs spawned with a really large heap size have issues when it comes to garbage collection, though I'm sure that by playing around with the GC algorithm and parameters you can keep the damage to a minimum. Plus, clustered applications don't have a single point of failure. If one node goes down, the remaining nodes can keep servicing the clients. This is one of the reasons why "message-based architectures" are a good fit for scalability: each request is mapped to a message, which can then be picked up by any node in the cluster.
Another point is serving multiple requests simultaneously in case your application unfortunately uses the synchronized keyword liberally. We currently have a legacy application which has a lot of shared state (unfortunately), and hence concurrent request handling is done by spawning around 20 JVM processes with a central dispatching unit which does all the dispatching work. ;-)
I would suggest you use at least one JVM per NUMA region. If a single JVM uses more than one NUMA region (often a single CPU), performance can degrade significantly, due to a significant increase in the cost of accessing the main memory of another CPU.
Additionally, using multiple servers can allow you to
use different versions of Java or of your application server.
isolate different applications which could interfere (they shouldn't but they might)
limit GC pause times between services.
EDIT: It could be historical. There may have been any number of reasons to have separate JVMs in the past but since you don't know what they were, you don't know if they still apply and it may be simpler to leave things as they are.
An additional reason to use multiple instances is serviceability.
For example, if you run multiple different applications for multiple customers, then having separate instances of the app server for each application can make life a little easier when you have to restart an app server during a release.
Suppose you have an average-configuration host with a single instance of the web/app server installed. Now your application becomes more popular and the number of hits doubles. What do you do now?
Add one more physical server of the same configuration, install the application, and load-balance across the two hosts.
That is not the end of the road for your application. It will keep getting more popular, and hence you will keep needing to scale it up. What's your strategy going to be?
keep adding more hosts of the same configuration
buy a more powerful machine where you can create more logical application servers
Which option will you go for?
You will do a cost analysis, which will involve factors like the actual hardware cost and the cost of managing these servers (power, space occupied in the data center), etc.
As it turns out, the decision is not very easy, and in most cases it's more cost-effective to have a more powerful machine.