How does garbage collection affect jetty latencies? - java

I have a Jetty application processing about 2k requests a second. The machine has 8 cores and the JVM heap size is 8 GB. There are a lot of memory-mapped files and internal caches, and those take up most of the heap space (4.5 GB).
Here are the stats after the application is stable and the JVM is done tuning Young and Old gen spaces:
Young Generation : 2.6GB
Old Generation : 5.4GB
I'm seeing that my young GC is invoked every 3 seconds and the entire Eden space is cleared (i.e. very little data is promoted to the old generation). I understand that filling up the young generation so quickly means I'm allocating way too many objects and that this is an issue. But there is definitely no memory leak in my application, since the servers have been up for 2 weeks with no OOM crashes.
Young GC is a stop-the-world event, so my understanding is that all threads are paused during this time. When I monitor latencies from the logs, I can see that every 2-3 seconds about 6-9 requests have a response time of > 100 ms (my average response time is < 10 ms). And when a full GC runs, I see that 6-9 requests have a response time of > 3 seconds (that's how long a full GC takes, and since it's invoked very rarely it is not an issue here).
My question is: since my Jetty application has a 200-thread pool and an unbounded request queue, shouldn't a young GC have an accordion effect on my response times? Will a ~100 ms delay be added to all the requests sitting in my queue?
If so, what is the best way to measure response times from the moment a request is added to the queue until the response is written? The 6-9 request figure I mentioned above comes from checking the logs: from when the application logic is invoked to just before the response is sent, I maintain start and end time variables, subtract the two, and dump the result to the logs.
One way would be to check my load balancer, but since these servers are behind an ELB I don't really have much access there other than average response times, which don't really help me.
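For reference, a minimal sketch of pushing that timing out of the application logic and into a servlet Filter (not Jetty-specific; the filter name and threshold below are made up for illustration). This starts the clock as soon as a worker thread picks the request up rather than when the handler code begins; time spent queued before dispatch is still not included, which only a connector-level timestamp (e.g. the latency recorded by Jetty's own request log) can cover:

import java.io.IOException;
import javax.servlet.*;

// Illustrative filter: times each request from the moment a worker thread
// starts processing it until the response has been generated. Queue wait
// before dispatch is NOT included.
public class RequestTimingFilter implements Filter {

    private static final long SLOW_THRESHOLD_MS = 100;

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime();
        try {
            chain.doFilter(req, resp);
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > SLOW_THRESHOLD_MS) {
                // replace with your logger; printing keeps the sketch self-contained
                System.out.println("slow request: " + elapsedMs + " ms");
            }
        }
    }

    @Override
    public void destroy() { }
}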

You should enable GC logging for your application. Try adding the following JVM command-line arguments:
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+PrintGCApplicationStoppedTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<REPLACE_ME> -XX:GCLogFileSize=20M -Xloggc:<path_to_gc_log_dir>/gc.log
Then look at the events in the GC logs and try to correlate them with your application logs.
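With -XX:+PrintGCApplicationStoppedTime the log contains lines along the lines of "Total time for which application threads were stopped: 0.0123456 seconds" (the exact wording can vary between JVM versions). A rough sketch of pulling out the pauses worth correlating with slow requests, assuming that line format and the log path passed as the first argument:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans a gc.log for safepoint pauses longer than 50 ms and prints the full
// line (including its date stamp) so it can be matched against slow requests
// in the application log. The line format is an assumption and may differ
// between JVM versions.
public class StoppedTimeScanner {

    private static final Pattern STOPPED =
            Pattern.compile("Total time for which application threads were stopped: ([0-9.]+) seconds");

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = STOPPED.matcher(line);
            if (m.find() && Double.parseDouble(m.group(1)) > 0.050) {
                System.out.println(line);
            }
        }
    }
}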

Related

Why is my java process minor GC'ing for so long?

I have a Java process which does a lot of JSON processing and therefore creates a bunch of objects that are frequently collected by the Garbage Collector.
Typically the GCs are minor and last 500 ms to 1 second at most.
However, during peak processing times the minor GCs run 40x longer; we are talking about GCs that last 40-60 seconds, and sometimes over a minute.
In a 5 minute period this process can spend 70% of its time doing GCs.
My initial troubleshooting led me to believe the machine was swapping, which it indeed was, so I turned swapping off, but I am still getting these long minor GCs.
So I have turned off swapping, and I have ensured the VM the process runs on is not overloaded, by which I mean there is plenty of free memory on the host and free CPU cores.
What else should I be looking at?
What else could cause a JVM to Minor GC for such a long duration of time?

Full GC does not seem to execute (out of memory)

I have 2 web servers (4 cores / 16 GB RAM / CentOS 7) behind a load balancer using a round-robin algorithm.
The applications are built with Java using Apache/Tomcat.
The servers, Apache, Tomcat and the webapps have the same configuration, with heap size: -Xms12840m -Xmx12840m
The problem is that the 1st server runs out of memory: the kernel kills the Java process because of an out-of-memory condition, while the 2nd server is more stable.
I tried to monitor and analyse a heap dump using VisualVM and also the GC with jstat.
In the heap dump I didn't find any memory leak, which does not mean there isn't one.
But with VisualVM / Monitor I can observe that a full GC is done on the 2nd server when the old generation is full, which is not the case on the 1st server. In fact, it seems the first server is constantly busier than the second despite the round-robin algorithm used by the load balancer.
So, on the 1st server it seems that the JVM does not have time to perform a full GC before the out of memory.
By default, the ratio between the young/old generation is 1:2
Minor GCs on the young generation are OK: when Eden is full, a minor GC is done. But when the old generation grows to nearly 100%, there is no full GC.
So, how can I optimise the GC in order to avoid the out of memory?
Why is the full GC not done on server 1?
Is it because of a peak of requests on the server, so that the JVM is not able to perform a full GC in time?
Thanks for your help.

Web-application execution gets unresponsive with high GC, CPU activity and metaspace doesn't seem to increase

We are performing performance testing and tuning activities in one of our projects. I have used the JVM configs mentioned in this article.
Exact JVM options are:
set "JAVA_OPTS=-Xms1024m -Xmx1024m
-XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1024m
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintHeapAtGC -Xloggc:C:\logs\garbage_collection.logs
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100m -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=C:\logs\heap_dumps\'date'.hprof
-XX:+UnlockDiagnosticVMOptions"
Still, we see that the issue is not resolved. I am sure that there are some issues within our code (thread implementation etc.) and the external libraries that we use (like Log4j), but I was at least hoping for some performance improvement from employing these JVM tuning options.
The reports from Gceasy.io suggest that:
It looks like your application is waiting due to lack of compute resources (either CPU or I/O cycles). Serious production applications shouldn't be stranded because of compute resources. In 1 GC event(s), 'real' time took more than 'usr' + 'sys' time.
Some known code issues:
There is a lot of network traffic to an external webapp which accepts only one connection at a time, but this delay is acceptable for our application.
Some threads block on Log4j. We are using Log4j for console, DB and file appending.
There can be issues with MySQL tuning as well. But for now we want to rule out these possibilities and just understand any other factors that might be affecting our execution.
What I was hoping for from the tuning was less GC activity and properly managed metaspace. Why is this not observed?
Here are some of the snapshots:
Here we can see how metaspace is stuck at 40 MB and does not exceed that.
There is also a lot of GC activity to be seen.
Another image depicting overall system state:
What could be our issue? We need some definitive pointers on this!
UPDATE-1: Disk usage monitoring
UPDATE-2: Added the screenshot with heap.
SOME MORE UPDATES: Well, I did not mention earlier that our processing involves Selenium (test automation) execution, which spawns more than a couple of web browsers using the Chrome/Firefox webdrivers. While monitoring I saw that, among the background processes, Chrome is using a lot of memory. Can this be a possible reason for the slowdown?
Here are the screenshots for the same:
Another picture shows the background processes:
EDIT No-5: Adding the GC logs
GC_LOGS_1
GC_LOGS_2
Thanks in advance!
You don't seem to have a GC problem. Here's a plot of your GC pause times over the course of more than 40 hours of your app running:
From this graph we can see that most of the GC pause times are below 0.1 seconds, some of them are in the 0.2-0.4 seconds, but since the graph itself contains 228000 data points, it's hard to figure out how the data is distributed. We need a histogram plot containing the distribution of the GC pause times. Since the vast majority of these GC pause times are very low, with a very few outliers, plotting the distribution in a histogram linearly is not informative. So I created a plot containing the distribution of the logarithm of those GC pause times:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the number of occurrences. The histogram has 500 bins.
As you can see from these two graphs, the GC pause times are clustered into two groups, and most of the GC pause times are very low on the order of magnitude of milliseconds or less. If we plot the same histogram on a log scale on the y axis too, we get this graph:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the base-10 logarithm of the number of occurrences. The histogram has 50 bins.
On this graph it becomes visible that you have a few tens of GC pause times that might be noticeable to a human, on the order of tenths of seconds. These are probably the 120 full GCs in your first log file. You might be able to reduce those times further if you used a machine with more memory and a disabled swap file, so that all of the JVM heap stays in RAM. Swapping, especially on a non-SSD drive, can be a real killer for the garbage collector.
I created the same graphs for the second log file you posted, which is a much smaller file spanning around 8 minutes and consisting of around 11000 data points, and I got these images:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the number of occurrences. The histogram has 500 bins.
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the base-10 logarithm of the number of occurrences. The histogram has 50 bins.
In this case, since you've been running the app on a different computer and with different GC settings, the distribution of the GC pause times differs from the first log file. Most of them are in the sub-millisecond range, with a few tens, maybe hundreds, in the hundredths-of-a-second range. We also have a few outliers here that are in the 1-2 second range. There are 8 such GC pauses and they all correspond to the 8 full GCs that occurred.
The difference between the two logs and the lack of high GC pause times in the first log file might be attributed to the fact that the machine running the app that produced the first log file has double the RAM vs the second (8GB vs 4GB) and the JVM was also configured to run the parallel collector. If you're aiming for low latency, you're probably better off with the first JVM configuration as it seems that the full GC times are consistently lower than in the second config.
It's hard to tell what your issue is with your app, but it seems it's not GC related.
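For reference, the binning behind those histograms is only a few lines once the pause times have been parsed out of the log. A sketch (the sample pause values below are made up, and the actual plotting is left to whatever tool you prefer):

// Sketch of the base-10 log binning used for the histograms above. The pause
// times (in seconds) are assumed to already be parsed from the GC log.
public class Log10Histogram {

    public static int[] histogram(double[] pauses, int bins, double min, double max) {
        int[] counts = new int[bins];
        for (double p : pauses) {
            double x = Math.log10(p);                  // e.g. 0.001 s -> -3.0
            int bin = (int) ((x - min) / (max - min) * bins);
            if (bin < 0) bin = 0;                      // clamp outliers into the edge bins
            if (bin >= bins) bin = bins - 1;
            counts[bin]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] pauses = {0.0004, 0.0012, 0.003, 0.25, 1.8};   // made-up sample values
        int[] counts = histogram(pauses, 10, -4.0, 1.0);        // bins from 1e-4 s to 10 s
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("bin %d: %d%n", i, counts[i]);
        }
    }
}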
The first thing I would check is disk I/O... If your processor is not loaded at 100% during performance testing, most likely disk I/O is the problem (e.g. you are using a hard drive)... Just switch to an SSD (or an in-memory disk) to resolve this.
GC just does its work... You selected the concurrent collector to perform GC.
From the documentation:
The mostly concurrent collector performs most of its work concurrently (for example, while the application is still running) to keep garbage collection pauses short. It is designed for applications with medium-sized to large-sized data sets in which response time is more important than overall throughput because the techniques used to minimize pauses can reduce application performance.
What you see matches this description: GC takes time but mostly does not pause the application for long.
As an option you may try enabling the Garbage-First collector (use -XX:+UseG1GC) and compare results. From the docs:
G1 is planned as the long-term replacement for the Concurrent Mark-Sweep Collector (CMS). Comparing G1 with CMS reveals differences that make G1 a better solution. One difference is that G1 is a compacting collector. Also, G1 offers more predictable garbage collection pauses than the CMS collector, and allows users to specify desired pause targets.
This collector lets you set a pause-time target, e.g. adding the -XX:MaxGCPauseMillis=200 option tells the collector to aim for GC pauses of no more than 200 ms.
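For example (only a sketch based on the options posted in the question, with the CMS flags swapped for G1 and the logging options left as-is; the 200 ms target is just a starting point to tune):

set "JAVA_OPTS=-Xms1024m -Xmx1024m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:C:\logs\garbage_collection.logs"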
Check your log files. I saw a similar issue in production recently, and guess what the problem was: the logger.
We use Log4j (non-async), but it was not a Log4j issue. Some exception or condition led to around a million lines being logged in the span of 3 minutes. Coupled with high volume and other activity in the system, that led to high disk I/O and the web application became unresponsive.

How to prevent ParNew from stopping the application for a few minutes

One day my application got stuck for ~5 min. I believe it happened because of ParNew GC. I don't have GC logs, but an internal tool shows that ParNew consumed ~35% CPU at that time. Now I wonder how to prevent that in future.
The application runs with JDK 1.8 and a 2.5 GB heap. The GC options are -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode.
I know I can use -XX:MaxGCPauseMillis and -XX:GCTimeRatio. What else would you propose to prevent ParNew from stopping the application for a few minutes?
GC is optimized not to eat your CPU time; being stuck for 5 minutes is far from normal. Either you released millions of objects, or, most likely, you allocated almost all (if not all) of your heap space, forcing the GC to reclaim every piece of memory it could find. I would also check for memory leaks.

PSYoungGen Pause Jitter

I am observing some strange behaviors in my java application. Minor GC pause times are very stable, between 2 to 4 milliseconds, and spaced several seconds apart (ranging from around 4 seconds to minutes depending on busyness). I have noticed that before the first full GC, minor collection pause times can spike to several hundred milliseconds, sometimes breaching the seconds mark. However, after the first full collection, these spikes go away and minor collection pause times do not spike anymore and remain rock steady between 2-4 milliseconds. These spikes do not appear to correlate with tenured heap usage.
I'm not sure how to diagnose the issue. Obviously something changed from the full collection, or else the spikes would continue to happen. I'm looking for some ideas on how to resolve it.
Some details:
I am using the -server flag. The throughput parallel collector is used.
Heap size is 1.5 GB, and the default ratio is used between the young and tenured generations. Survivor ratios remain at default. I'm not sure how relevant these are to this investigation, as the same behavior is shown despite more tweaking.
On startup, I make several DB calls. Most of the information can be GC'd away (and does upon a full collection). Some instances of my application will GC while others will not.
What I've tried/thought about:
Full Perm Gen? I think the parallel collector handles this fine and does not need more flags, unlike CMS.
Manually triggering a full GC after startup. I will be trying this, hopefully making the problem go away based on observations. However, this is only a temporary solution as I still don't understand why there is even an issue.
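For completeness, the experiment in that last item is essentially a one-liner; a sketch (assuming -XX:+DisableExplicitGC is not set, otherwise the call is ignored; with the throughput collector this is a full stop-the-world collection, so it should only run before traffic arrives):

// Illustrative startup hook: once the startup DB calls are done and the caches
// are built, request a full collection so short-lived startup data is cleared
// before the application starts serving traffic.
public final class StartupGcKick {
    private StartupGcKick() { }

    public static void afterStartup() {
        System.gc();   // a hint; ignored when -XX:+DisableExplicitGC is set
    }
}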
First, I wanted more information on this, but since commenting needs 50 rep, which I don't have, I am asking here.
This is too little info to work with.
To diagnose the issue you can switch on GC logging and post the behaviour that you notice. Also, you can use jstat to view the heap space usage live while the application is running: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
To turn on GC logging you can read here: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
