I am observing some strange behavior in my Java application. Minor GC pause times are very stable, between 2 and 4 milliseconds, and spaced several seconds apart (ranging from around 4 seconds to minutes depending on busyness). I have noticed that before the first full GC, minor collection pause times can spike to several hundred milliseconds, sometimes breaching the one-second mark. However, after the first full collection these spikes go away, and minor collection pause times remain rock steady at 2-4 milliseconds. The spikes do not appear to correlate with tenured heap usage.
I'm not sure how to diagnose the issue. Obviously something changed as a result of the full collection, or else the spikes would continue to happen. I'm looking for some ideas on how to resolve it.
Some details:
I am using the -server flag. The throughput parallel collector is used.
Heap size is 1.5G, and the default ratio between the young and tenured generations is used. Survivor ratios remain at default. I'm not sure how relevant these are to this investigation, as the same behavior shows up despite further tweaking.
On startup, I make several DB calls. Most of that data can be GC'd away (and is, upon a full collection). Some instances of my application will GC while others will not.
What I've tried/thought about:
A full Perm Gen? I think the parallel collector handles this fine and does not need extra flags, unlike CMS.
Manually triggering a full GC after startup. I will be trying this; based on the observations above it should make the problem go away. However, this is only a workaround, as I still don't understand why there is an issue in the first place.
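A minimal sketch of what that manual trigger could look like at the end of startup (class and method names here are made up for illustration); note that System.gc() is only a hint to the JVM and is ignored entirely if -XX:+DisableExplicitGC is set:

    public class StartupGcHint {
        public static void main(String[] args) {
            loadReferenceDataFromDatabase(); // stand-in for the startup DB calls
            // Request a full collection now, before real traffic arrives.
            // This is only a hint to the JVM, not a guarantee.
            System.gc();
            // ... start serving traffic ...
        }

        private static void loadReferenceDataFromDatabase() {
            // placeholder for the real startup work
        }
    }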
First, I wanted to ask for more information in a comment, but commenting needs 50 reputation, which I don't have, so I'm asking here.
This is too little information to work with.
To diagnose the issue, you can switch on GC logging and post the behaviour that you notice. You can also use jstat to view heap space usage live while the application is running: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html
To learn how to turn on GC logging, you can read here: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
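For example (the jar name, log file, and PID below are placeholders; the flags are the pre-Java-9 HotSpot GC logging options described in that document):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar yourapp.jar
    jstat -gcutil <pid> 1000

The jstat line prints heap occupancy percentages and GC counts/times once per second.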
Related
We are performing performance testing and tuning activities in one of our projects. I have used the JVM configs mentioned in this article.
Exact JVM options are:
set "JAVA_OPTS=-Xms1024m -Xmx1024m
-XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1024m
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintHeapAtGC -Xloggc:C:\logs\garbage_collection.logs
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100m -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=C:\logs\heap_dumps\'date'.hprof
-XX:+UnlockDiagnosticVMOptions"
Still, we see that the issue is not resolved. I am sure that there are some issues within our code (thread implementation, etc.) and the external libraries that we use (like Log4j), but I was at least hoping for some performance improvement from employing these JVM tuning options.
The reports from Gceasy.io suggest that:
It looks like your application is waiting due to lack of compute resources (either CPU or I/O cycles). Serious production applications shouldn't be stranded because of compute resources. In 1 GC event(s), 'real' time took more than 'usr' + 'sys' time.
Some known code issues:
There is a lot of network traffic to an external webapp which accepts only one connection at a time, but this delay is acceptable for our application.
Some of the threads block on Log4j. We are using Log4j for console, DB, and file appenders.
There can be an issue with MySQL tuning as well. But for now, we want to rule out these possibilities and just understand any other factors that might be affecting our execution.
What I was hoping for from the tuning was less GC activity and properly managed Metaspace. Why is this not what we observe?
Here are some of the snapshots:
Here we can see how Metaspace is stuck at 40 MB and does not exceed that.
A lot of GC activity can also be seen.
Another image depicting overall system state:
What could our issue be? We need some definitive pointers on this!
UPDATE-1: Disk usage monitoring
UPDATE-2: Added the screenshot with heap.
SOME MORE UPDATES: Well, I did not mention earlier that our processing involves Selenium (test automation) execution, which spawns more than a couple of web browsers using the Chrome/Firefox WebDrivers. While monitoring, I saw that among the background processes Chrome is using a lot of memory. Could this be a possible reason for the slowdown?
Here are the screenshots:
Another picture shows the background processes:
EDIT No-5: Adding the GC logs
GC_LOGS_1
GC_LOGS_2
Thanks in advance!
You don't seem to have a GC problem. Here's a plot of your GC pause times over the course of more than 40 hours of your app running:
From this graph we can see that most of the GC pause times are below 0.1 seconds and some of them are in the 0.2-0.4 second range, but since the graph itself contains 228,000 data points, it's hard to figure out how the data is distributed. We need a histogram showing the distribution of the GC pause times. Since the vast majority of these GC pause times are very low, with very few outliers, plotting the distribution on a linear scale is not informative. So I created a plot containing the distribution of the logarithm of those GC pause times:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the number of occurrences. The histogram has 500 bins.
As you can see from these two graphs, the GC pause times are clustered into two groups, and most of the GC pause times are very low on the order of magnitude of milliseconds or less. If we plot the same histogram on a log scale on the y axis too, we get this graph:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the base-10 logarithm of the number of occurrences. The histogram has 50 bins.
On this graph it becomes visible that you have a few tens of GC pauses that might be noticeable to a human, in the order of magnitude of tenths of seconds. These are probably the 120 full GCs that appear in your first log file. You might be able to reduce those times further if you were using a computer with more memory and with the swap file disabled, so that all of the JVM heap stays in RAM. Swapping, especially on a non-SSD drive, can be a real killer for the garbage collector.
I created the same graphs for the second log file you posted, which is a much smaller file spanning around 8 minutes and consisting of around 11,000 data points, and I got these images:
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the number of occurrences. The histogram has 500 bins.
In the above image, the X axis is the base-10 logarithm of the GC pause time and the Y axis is the base-10 logarithm of the number of occurrences. The histogram has 50 bins.
In this case, since you've been running the app on a different computer and using different GC settings, the distribution of the GC pause times is different from the first log file. Most of them are in the sub-millisecond range, with a few tens, maybe hundreds, in the hundredth-of-a-second range. We also have a few outliers here that are in the 1-2 second range. There are 8 such GC pauses, and they all correspond to the 8 full GCs that occurred.
The difference between the two logs and the lack of high GC pause times in the first log file might be attributed to the fact that the machine running the app that produced the first log file has double the RAM vs the second (8GB vs 4GB) and the JVM was also configured to run the parallel collector. If you're aiming for low latency, you're probably better off with the first JVM configuration as it seems that the full GC times are consistently lower than in the second config.
It's hard to tell what your issue is with your app, but it seems it's not GC related.
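For what it's worth, pause times like those above can be pulled out of a -XX:+PrintGCDetails log with a small parser along these lines (a rough sketch that buckets pauses per order of magnitude, unlike the 500-bin plots shown above; it assumes the log lines contain a "real=... secs" part):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PauseHistogram {
        public static void main(String[] args) throws Exception {
            // Matches the "real=0.0123 secs" part of -XX:+PrintGCDetails output.
            Pattern real = Pattern.compile("real=([0-9.]+) secs");
            int[] bins = new int[10]; // one bin per order of magnitude, from 10^-6 s upwards
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = real.matcher(line);
                if (m.find()) {
                    double pause = Double.parseDouble(m.group(1));
                    if (pause > 0) {
                        int bin = (int) Math.floor(Math.log10(pause)) + 6; // 1 microsecond -> bin 0
                        if (bin >= 0 && bin < bins.length) bins[bin]++;
                    }
                }
            }
            in.close();
            for (int i = 0; i < bins.length; i++) {
                System.out.printf("10^%d s: %d pauses%n", i - 6, bins[i]);
            }
        }
    }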
The first thing I would check is disk I/O... If your processor is not loaded at 100% during performance testing, most likely disk I/O is the problem (e.g. you are using a hard drive)... Just switch to an SSD (or an in-memory disk) to resolve this.
GC is just doing its work... You selected the concurrent collector to perform GC.
From the documentation:
The mostly concurrent collector performs most of its work concurrently (for example, while the application is still running) to keep garbage collection pauses short. It is designed for applications with medium-sized to large-sized data sets in which response time is more important than overall throughput because the techniques used to minimize pauses can reduce application performance.
What you see matches this description: GC takes time, but "mostly" does not pause the application for long.
As an option, you may try enabling the Garbage-First collector (G1, via -XX:+UseG1GC) and compare the results. From the docs:
G1 is planned as the long-term replacement for the Concurrent Mark-Sweep Collector (CMS). Comparing G1 with CMS reveals differences that make G1 a better solution. One difference is that G1 is a compacting collector. Also, G1 offers more predictable garbage collection pauses than the CMS collector, and allows users to specify desired pause targets.
This collector lets you set a target for the maximum GC pause length, e.g. add the -XX:MaxGCPauseMillis=200 option, which says that you're OK as long as a GC pause takes less than 200 ms.
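A hypothetical startup line with those two changes applied to the options quoted above might look like this (the application name is a placeholder; the CMS-specific flags would have to be removed when switching to G1):

    java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar yourapp.jar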
Check your log files. I have seen a similar issue in production recently, and guess what the problem was: the logger.
We use Log4j in non-async mode, but it was not a Log4j issue. Some exception or condition led to around a million lines being logged to the log file in the span of 3 minutes. Coupled with high volume and other activity in the system, that led to high disk I/O, and the web application became unresponsive.
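If the logger turns out to be the culprit here as well, one option worth evaluating is routing file output through Log4j 1.x's AsyncAppender, so the disk I/O happens on a background thread. A minimal programmatic sketch (the file name and pattern are made up; in practice this is usually configured in log4j.xml instead):

    import org.apache.log4j.AsyncAppender;
    import org.apache.log4j.FileAppender;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;

    public class AsyncLoggingSetup {
        public static void configure() throws java.io.IOException {
            AsyncAppender async = new AsyncAppender();
            async.setBufferSize(8192);  // queue up to 8192 events in memory
            async.setBlocking(false);   // drop events instead of stalling callers when the queue is full
            async.addAppender(new FileAppender(new PatternLayout("%d %p %c - %m%n"), "app.log"));
            Logger.getRootLogger().addAppender(async);
        }
    }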
I am running a build system. We used to use the CMS collector, but we started suffering under very long full GC cycles; throughput (time not doing GC) was around 90%. So I decided to switch to G1 with the assumption that even if overall GC time is longer, the pauses will be shorter, hence ensuring higher availability. This idea seemed to work even better than I expected: I saw no full GC for almost 3 days, throughput was 97%, and overall GC performance was way better. (All screenshots and data come from GCViewer.)
Until now (day 6), that is. Today the system simply went berserk. Old space utilization is just barely under 100%, and I am seeing a full GC triggered almost every 2-3 minutes or so:
Old space utilization:
Heap size is 20G (128G Ram total). The flags I am currently using are:
-XX:+UseG1GC
-XX:MaxPermSize=512m
-XX:MaxGCPauseMillis=800
-XX:GCPauseIntervalMillis=8000
-XX:NewRatio=4
-XX:PermSize=256m
-XX:InitiatingHeapOccupancyPercent=35
-XX:+ParallelRefProcEnabled
plus logging flags. What I seem to be missing is -XX:ParallelGCThreads=20 (I have 32 processors); the default should be 8. I have also read from Oracle that -XX:G1NewSizePercent=4 would be suggested for a 20G heap; the default should be 5.
I am using Java HotSpot(TM) 64-Bit Server VM 1.7.0_76, Oracle Corporation
What would you suggest? Do I have obvious mistakes? What to change?
Am I too greedy by giving Java only 20G? The assumption here is that giving it too much heap would mean longer GCs, as there is simply more to clean (peasant logic).
PS: The application is not mine. For me it's a boxed product.
What would you suggest? Do I have obvious mistakes? What to change? Am I too greedy by giving Java only 20G? The assumption here is that giving it too much heap would mean longer GCs, as there is simply more to clean (peasant logic).
If it triggers full GCs but your occupancy stays near those 20GB, then it's possible that the GC simply does not have enough breathing room, either to meet the demand of huge allocations or to meet some of its goals (throughput, pause times), forcing full GCs as a fallback.
So what you can attempt is increasing the heap limit or relaxing the throughput goals.
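Purely as an illustration (the exact numbers would have to come from your GC logs, and the flag values below are guesses, not recommendations):

    -Xms28g -Xmx28g              (more headroom than the current 20G; the box has 128G)
    -XX:MaxGCPauseMillis=1500    (relax the pause goal so G1 is not pushed into fallback full GCs)
    -XX:G1ReservePercent=15      (keep more free heap in reserve for evacuation; the default is 10)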
As mentioned earlier in my comment, you can also try upgrading to Java 8 for improved G1 heuristics.
For further advice, GC logs covering the "berserk" behavior would be useful.
I have an application that's running on a 24x6 schedule. Currently, after running for a few days, a Full GC is performed automatically - and usually during a busy part of the day, which negatively impacts user response times.
What I'd like to do is force a Full GC - perhaps at midnight each night, during very low usage times - to prevent it from happening during the day. I've tried System.gc(), but it doesn't seem to guarantee when a Full GC will happen, or even if it will. Is there some means of doing this?
Version info:
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Server VM (build 11.0-b16, mixed mode)
Additionally -
Minor GCs are running about every 10-15 seconds, but these are not freeing up enough RAM to get the app through a full week. When a Full GC does happen, nearly 50% of the heap is freed.
In relation to the above, calling System.gc() doesn't seem to mean that any of the next GCs will be of the full form needed to free those large chunks of memory. Yes, explicit GC is enabled (or rather not disabled, depending on how you read the -XX option).
I've already played around with several of the CMS GC settings, which has helped greatly, but not solved the problem. Originally it was throwing OOMs, two to three times a week.
I want to stop the endless loops of:
constantly adding to heap space - which can only go on for so long
constant tuning and testing of GC settings - it is long past the point of diminishing returns
I don't want to treat this like an NT machine and bounce it nightly. There are active user sessions throughout the night, and bouncing the app would mean losing session data.
To be more specific, I'm looking more for a technique to use to ensure that a Full GC is going to happen, rather than a simple method/function to call to run it.
At the moment I'm looking at modifying the percentage threshold used by CMS to determine when a full collection is required.
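For reference, the threshold in question is controlled by flags along these lines (60 is just an example value):

    -XX:CMSInitiatingOccupancyFraction=60
    -XX:+UseCMSInitiatingOccupancyOnly

The first flag sets the tenured occupancy at which CMS starts a cycle; the second makes the JVM always honor that value instead of its own heuristics.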
Thanks for any help.
jmap -histo:live <PID> will force a full GC as a "side effect" of finding all live objects. You can schedule it to run against your JVM processes during off-peak hours.
Your JVM build 1.6.0_11-b03 is pretty ancient, but jmap should be supported on all 1.6 HotSpot JVMs.
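A hypothetical way to schedule this (assuming jmap is on the PATH for the cron user and the application writes its PID to a pid file):

    30 0 * * * jmap -histo:live $(cat /var/run/myapp.pid) > /dev/null 2>&1

This runs nightly at 00:30; the live-object histogram forces the full GC as described above.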
No.
System.gc() suggests to the GC that you would like a collection.
Further, there is probably very little garbage generated during the quiet period and that is why calling System.gc() doesn't do much.
During peak time there is, presumably, more activity and therefore more garbage being generated - hence the need for a collection.
It should be obvious that you cannot defer collection in that simplistic manner. The JVM will collect when it needs to.
You need to look into tuning your GC - if you have long stop-the-world collections happening then you have some issue. This shouldn't really happen on a modern server JVM.
You should look into tuning the CMS collector - this is a pretty good article on the basics of the GC system in Java. In Java 7 there is the new G1GC which may or may not be better.
You should find a way to simulate load conditions and try different GC parameters, the CMS GC has many tuning parameters and configuring it is somewhat of a dark art...
This is a somewhat more involved article on GC tuning and benchmarking.
I'd say yes - schedule a process for your quiet time that does something you believe will trigger a GC.
Eat some serious memory. Allocate a truckload of objects and use weak references to keep track of them - just do something at your quiet time that should trigger a GC.
Do make sure you have in place some logic that detects the GC and stops the process.
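Here is a minimal sketch of that idea: allocate garbage until a weak reference is cleared, which tells you that at least one collection has actually run. Note that this mostly provokes young-generation collections; a full GC is still not guaranteed.

    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;

    public class GcNudge {
        public static void main(String[] args) {
            // The sentinel object has no strong references, so the first collection clears it.
            WeakReference<Object> sentinel = new WeakReference<Object>(new Object());
            List<byte[]> ballast = new ArrayList<byte[]>();
            while (sentinel.get() != null) {
                ballast.add(new byte[1024 * 1024]); // 1 MB of soon-to-be garbage per iteration
            }
            ballast.clear(); // release the ballast so it can be collected too
            System.out.println("At least one GC cycle has run.");
        }
    }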
There is no way to force an immediate collection, as the garbage collector is non-deterministic.
Our JBoss 3.2.6 application server is having some performance issues. After turning on verbose GC logging and analyzing the logs with GCViewer, we've noticed that after a while (7 to 35 hours after a server restart) the GC goes crazy. Initially the GC works fine, doing a collection every hour or so, but at a certain point it starts performing full GCs every minute. As this only happens in our production environment, we have not been able to try turning off explicit GCs (-XX:+DisableExplicitGC) or modifying the RMI GC interval yet, but as this only starts after several hours it does not seem to be caused by the known RMI GC issues.
Any ideas?
Update:
I'm not able to post the GCViewer output just yet, but it does not seem to be hitting the max heap limit at all. Before the GC goes crazy it is GC-ing just fine, and even when the GC goes crazy the heap doesn't get above 2GB (24GB max).
Besides RMI are there any other ways explicit GC can be triggered? (I checked our code and no calls to System.gc() are being made)
Is your heap filling up? Sometimes the VM will get stuck in a 'GC loop' when it can free up just enough memory to prevent a real OutOfMemoryError but not enough to actually keep the application running steadily.
Normally this would trigger an "OutOfMemoryError: GC overhead limit exceeded", but there is a certain threshold that must be crossed before that happens (98% of time spent on GC, off the top of my head).
Have you tried enlarging heap size? Have you inspected your code / used a profiler to detect memory leaks?
You almost certainly have a memory leak, and if you let the application server continue to run it will eventually crash with an OutOfMemoryError. You need to use a memory analysis tool - one example would be VisualVM - and determine the source of the problem. Usually memory leaks are caused by static or global objects that never release the object references they store.
Good luck!
Update:
Rereading your question, it sounds like things are fine and then suddenly you get into this situation where GC is working much harder to reclaim space. That sounds like there is some specific operation that consumes (and doesn't release) a large amount of heap.
Perhaps, as #Tim suggests, your heap requirements are just at the threshold of the max heap size, but in my experience you'd need to be pretty lucky to hit that exactly. At any rate, some analysis should determine whether it is a leak or you just need to increase the size of the heap.
Apart from the more likely event of a memory leak in your application, there could be 1-2 other reasons for this.
In a Solaris environment, I once had such an issue when I allocated almost all of the available 4GB of physical memory to the JVM, leaving only around 200-300MB for the operating system. This led to the VM process suddenly swapping to disk whenever the OS came under increased load. The solution was not to exceed 3.2GB. A real corner case, but maybe it's the same issue as yours?
The reason this led to increased GC activity is that heavy swapping slows down the JVM's memory management, which caused many short-lived objects to escape the survivor space and end up in the tenured space, which in turn filled up much more quickly.
I recommend when this happens that you do a stack dump.
More often than not I have seen this happen with a thread population explosion.
Anyway, look at the stack dump file and see what's running. You could easily set up some cron jobs or monitoring scripts to run jstack periodically.
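For example, a crontab entry along these lines would take a thread dump every 10 minutes (the process name and output path are placeholders; percent signs have to be escaped in crontab):

    */10 * * * * jstack -l $(pgrep -f MyApp) >> /var/tmp/MyApp-stacks-$(date +\%Y\%m\%d).log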
You can also compare the sizes of the stack dumps. If they grow really big, you have something that's making lots of threads.
If it doesn't get bigger you can at least see which objects (call stacks) are running.
You can use VisualVM or some fancy JMX crap later if that doesn't work, but first start with jstack, as it's easy to use.
I have an application that is responsible for archiving old applications. It processes a large number of applications at a time, so it needs to run for days at a time.
When my company developed this, they did a fair bit of performance testing on it and seemed to get decent numbers, but I have been running an archive for a customer recently and it seems to be running really slowly, with the performance degrading even more the longer it runs.
There does not appear to be a memory leak: since I have been monitoring it with JConsole, there is still plenty of memory available and it does not appear to be shrinking.
I have noticed, however, that the survivor space and tenured generation of the heap can fill up very quickly until a garbage collection comes along and clears them out, which seems to be happening rather frequently; I am not sure whether that could be a source of the apparent slowdown.
The application has now been running for 7 days and 3 hours, and according to JConsole it has spent 6 hours performing copy garbage collection (772,611 collections) and 12 hours and 25 minutes on mark-sweep compactions (145,940 collections).
This seems like a large amount of time to spend on garbage collection, and I am wondering whether anyone has looked into something like this before and knows if this is normal or not.
Edits
Local processing seems to be slow; for instance, I am looking at one part of the logs where it took 5 seconds to extract some XML from a SOAP envelope using XPath and append it to a string buffer along with a root tag - that's all it does. I haven't profiled it yet, as this is running in production; I would either have to pull the data down over the net or set up a large test base in our dev environment, which I may end up having to do.
Running Java HotSpot Client VM version 10.0-b23
I really just need high throughput. I haven't configured any specific garbage collection parameters, so it is running whatever the defaults are. I'm not sure how to find out which collectors are in use.
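For what it's worth, two quick ways to see which collector HotSpot has picked (the PID is a placeholder):

    java -XX:+PrintCommandLineFlags -version
    jmap -heap <pid>

The first prints the ergonomics-selected flags (e.g. -XX:+UseParallelGC) for a fresh JVM with the same defaults; the second reports the collector and generation sizes of an already running process.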
Fix
I ended up getting a profiler going on it. It turned out the cause of the slowdown was some code that was constantly trimming lines off a status box outputting logging statements, which was pretty badly done. I should have figured that the garbage collection was a symptom of constantly copying the status text into memory, rather than the actual cause.
Cheers Guys.
According to your numbers, total garbage collection time was about 18 hours out of 7 days of execution time. At about 10% of total execution time that's slightly elevated, but even if you managed to get it down to 0%, you'd only have saved 10% of the execution time... so if you're looking for substantial savings, you had better look into the other 90%, for instance with a profiler.
Without proper profiling, this is a guessing game. As an anecdote, though: a few years ago a web app I was involved with suddenly slowed down (in response time) by a factor of 10 after a JDK upgrade. We ended up chasing it down to an explicit GC invocation added by a genius who was no longer with the company.
There is a balance you will try to maintain between JVM heap footprint and GC time. Another question might be whether the heap (and its generations) is under-allocated in a way that mandates too-frequent GCing. When deploying multi-tenant JVMs on these systems, I've tried to keep the balance at under 5% total GC time, along with aggressive heap shrinkage to keep the footprint low (again, multi-tenant). The heap and generations will almost ALWAYS fill up to whatever size they are set, so as to avoid frequent GCing. Remove the -Xms parameter to see a more realistic steady state (if the application has any idle time).
+1 to the suggestion on profiling, though; it may be something not related to GC but to the code.