Full GC becoming very frequent - java

I've got a Java webapp running on one tomcat instance. During peak times the webapp serves around 30 pages per second and normally around 15.
My environment is:
O/S: SUSE Linux Enterprise Server 10 (x86_64)
RAM: 16GB
server: Tomcat 6.0.20
JVM: Java HotSpot(TM) 64-Bit Server VM 1.6.0_14
JVM options:
CATALINA_OPTS="-Xms512m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m
-XX:+UseParallelGC
-Djava.awt.headless=true
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
JAVA_OPTS="-server"
After a couple of days of uptime, Full GCs start occurring more frequently and become a serious problem for the application's availability. After a Tomcat restart the problem goes away but, of course, returns after anywhere from 5 to 30 days (not consistent).
The Full GC log before and after a restart is at http://pastebin.com/raw.php?i=4NtkNXmi
It shows a log before the restart at 6.6 days uptime where the app was suffering because Full GC needed 2.5 seconds and was happening every ~6 secs.
Then it shows a log just after the restart where Full GC only happened every 5-10 minutes.
I've got two dumps using jmap -dump:format=b,file=dump.hprof PID while the Full GCs were occurring (I'm not sure whether I took them exactly while a Full GC was running or between two Full GCs) and opened them in http://www.eclipse.org/mat/ but didn't get anything useful in Leak Suspects:
60MB: 1 instance of "org.hibernate.impl.SessionFactoryImpl" (I use hibernate with ehcache)
80MB: 1,024 instances of "org.apache.tomcat.util.threads.ThreadWithAttributes" (these are probably the 1024 workers of tomcat)
45MB: 37 instances of "net.sf.ehcache.store.compound.impl.MemoryOnlyStore" (these should be my ~37 cache regions in ehcache)
Note that I never get an OutOfMemoryError.
Any ideas on where should I look next?

When we had this issue we eventually tracked it down to the young generation being too small. Although we had given the JVM plenty of RAM, the young generation wasn't given its fair share.
This meant that minor garbage collections happened more frequently and caused some young objects to be promoted into the tenured generation, which in turn meant more major collections as well.
Try using -XX:NewRatio with a fairly low value (say 2 or 3) and see if this helps.
More info can be found here.
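As an illustration of that suggestion (the heap and permgen sizes are simply carried over from the question; only the NewRatio flag is new, and 2 is just an example value):
CATALINA_OPTS="-Xms512m -Xmx1024m -XX:NewRatio=2 -XX:PermSize=128m -XX:MaxPermSize=256m -XX:+UseParallelGC -Djava.awt.headless=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
With -XX:NewRatio=2 the old generation is twice the size of the young generation, i.e. the young generation gets roughly one third of the heap, so minor collections have more room and fewer objects are promoted to the tenured generation prematurely.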

I've switched from -Xmx1024m to -Xmx2048m and the problem went away. I now have 100 days of uptime.

Besides tuning the various JVM options, I would also suggest upgrading to a newer release of the VM, because later versions have a much better tuned garbage collector (even without trying the new experimental one).
Besides that, even if it's (partially) true that assigning more RAM to the JVM can increase the time required to perform a GC, there is a trade-off between using the whole 16 GB of memory and increasing your memory occupation, so to start you could try doubling all the values:
-Xms1024m -Xmx2048m -XX:PermSize=256m -XX:MaxPermSize=512m
Regards
Massimo

What might be happening in your case is that you have a lot of objects that live a little longer than the NewGen life cycle. If the survivor space is too small, they go straight to the OldGen. -XX:+PrintTenuringDistribution could provide some insight. Your NewGen is large enough, so try decreasing SurvivorRatio.
Also, jconsole will probably give you more visual insight into what happens with your memory; try it.
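A minimal sketch of how that might look added to the existing options (SurvivorRatio=6 is only an illustrative value, not a recommendation from the thread):
CATALINA_OPTS="-Xms512m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m -XX:+UseParallelGC -XX:SurvivorRatio=6 -XX:+PrintTenuringDistribution -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
With -XX:SurvivorRatio=6 each survivor space is 1/8 of the young generation instead of 1/10 with the default of 8, and the tenuring distribution printed at every minor GC shows how many bytes survive each age before being promoted.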

Related

Java native memory leak with G1 and huge memory

We currently have problems with a Java native memory leak. The server is quite big (40 CPUs, 128 GB of memory). The Java heap size is 64G and we run a very memory-intensive application that reads lots of data into strings with about 400 threads and throws them away after some minutes.
So the heap fills up very fast, but the objects on the heap become obsolete and can be GCed very fast, too. So we have to use G1 to avoid stop-the-world breaks of several minutes.
Now, that seems to work fine - the heap is big enough to run the application for days, nothing leaking here. Anyway, the Java process keeps growing over time until all 128G are used and the application crashes with an allocation failure.
I've read a lot about native Java memory leaks, including the glibc issue with the maximum number of arenas (we run wheezy with glibc 2.13, so no fix is possible here by setting MALLOC_ARENA_MAX=1 or 4 without a dist upgrade).
So we tried jemalloc, which gave us profiling graphs for inuse-space and inuse-objects.
I don't get what the issue is here; does anyone have an idea?
If I set MALLOC_CONF="narenas:1" for jemalloc as an environment variable for the Tomcat process running our app, could it still somehow end up using the glibc malloc implementation anyway?
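A minimal sketch of the kind of setup meant here (the library path is an assumption and differs per distribution; MALLOC_CONF is only read by jemalloc, so it only matters if the library is actually preloaded, otherwise the process keeps using glibc malloc and only MALLOC_ARENA_MAX would apply):
# e.g. in Tomcat's bin/setenv.sh
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
export MALLOC_CONF=narenas:1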
This is our G1 setup, maybe some issue here?
-XX:+UseCompressedOops
-XX:+UseNUMA
-XX:NewSize=6000m
-XX:MaxNewSize=6000m
-XX:NewRatio=3
-XX:SurvivorRatio=1
-XX:InitiatingHeapOccupancyPercent=55
-XX:MaxGCPauseMillis=1000
-XX:PermSize=64m
-XX:MaxPermSize=128m
-XX:+PrintCommandLineFlags
-XX:+PrintFlagsFinal
-XX:+PrintGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-XX:-UseAdaptiveSizePolicy
-XX:+UseG1GC
-XX:MaxDirectMemorySize=2g
-Xms65536m
-Xmx65536m
Thanks for your help!
We never called System.gc() explicitly, and we have meanwhile stopped using G1, not specifying anything other than -Xms and -Xmx.
Therefore we are using nearly all of the 128G for the heap now. The Java process memory usage is high, but constant for weeks. I'm sure this is some G1 or at least general GC issue. The only disadvantage of this "solution" is high GC pauses, but they decreased from up to 90s to about 1-5s with the larger heap, which is OK for the benchmark we run on our servers.
Before that, I played around with the -XX:ParallelGCThreads option, which had a significant influence on the speed of the memory leak when decreasing it from 28 (the default for 40 CPUs) down to 1. The memory graphs looked somewhat like a hand fan when using different values on different instances...

wso2 axis2 OutOfMemory few times a day

What is the possible reason for constant OOM errors?
In our development environment we have 3 nodes; the VM on the master node is initialized with Xms=3GB, Xmx=3GB. There are also about 30 proxies and 30 endpoints defined. Developers load their changes (car files) constantly during the day without restarting Carbon. A few times a day it freezes. Maybe the constant configuration changes kill Carbon? In our preproduction environment Carbon works flawlessly :/
I did a heap dump and the 'leak suspects report' result is:
One instance of "org.apache.axis2.context.ConfigurationContext" loaded
by "axis2" occupies 661 590 744 (79,50%) bytes. The memory is
accumulated in one instance of
"java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "".
Report result and histogram: (screenshots not reproduced here)
According to the heap dump, most of the retained heap is occupied by ConfigurationContext instances. So this OOM issue seems to occur due to the heavy configuration load on your development system, perhaps due to large CApps being deployed frequently by a lot of developers. Since that does not happen in your production system, that ESB never goes OOM.
Thanks & Regards,
Ravindra.
Freezing happens mostly due to full GC cycles running in the system. You can change your GC settings in order to get a proper GC configuration.
For a typical deployment WSO2 suggests the GC configuration below: -Xms512m -Xmx2048m -XX:MaxPermSize=1024m. Since you are using 3 GB of RAM you have to adjust that based on your scenario. You can refer to https://docs.wso2.com/display/CLUSTER44x/Production+Deployment+Guidelines for more information regarding deployments.
Since you are using Xms=3GB, Xmx=3GB, there is a possibility that it will wait until the whole 3 GB is filled and then run full GC cycles. Therefore it is better to adjust the Xms value to about 1 GB so the JVM is encouraged to run minor GC cycles and clean up unnecessary objects rather than waiting for a full GC.
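A minimal sketch of what that adjustment could look like for this node (the values are only an illustration of the advice above, not an official WSO2 recommendation):
-Xms1024m -Xmx3072m -XX:MaxPermSize=1024m
This keeps the 3 GB ceiling but lets the heap start small and grow on demand, so the collector works against a smaller committed heap early on instead of only reacting once the full 3 GB is occupied.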

GC pauses get really long after several days

I am running a build system. We used to use the CMS collector, but we started suffering under very long full GC cycles; throughput (time not doing GC) was around 90%. So I decided to switch to G1 with the assumption that even if I have longer overall GC time, the pauses will be shorter, hence ensuring higher availability. This idea seemed to work even better than I expected: I saw no full GC for almost 3 days, throughput was 97%, and overall GC performance was way better. (All screenshots and data were obtained from GCViewer.)
Until now (day 6). Today the system simply went berserk. Old space utilized is just barely under 100%, and I am seeing a Full GC triggered almost every 2-3 minutes or so, as the old space utilization graph shows.
Heap size is 20G (128G RAM total). The flags I am currently using are:
-XX:+UseG1GC
-XX:MaxPermSize=512m
-XX:MaxGCPauseMillis=800
-XX:GCPauseIntervalMillis=8000
-XX:NewRatio=4
-XX:PermSize=256m
-XX:InitiatingHeapOccupancyPercent=35
-XX:+ParallelRefProcEnabled
plus logging flags. What I seem to be missing is -XX:ParallelGCThreads=20 (I have 32 processors); the default should be 8. I have also read from Oracle that -XX:G1NewSizePercent=4 would be suggested for a 20G heap; the default should be 5.
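If those two flags are added, note that G1NewSizePercent is an experimental option on the HotSpot versions that have it and needs to be unlocked first; a sketch of the extra lines (using the values quoted above, which are not a recommendation):
-XX:ParallelGCThreads=20
-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=4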
I am using Java HotSpot(TM) 64-Bit Server VM 1.7.0_76, Oracle Corporation
What would you suggest? Do I have obvious mistakes? What to change?
Am I too greedy by giving Java only 20G? The assumption here is that giving it too much heap would mean longer GC as there is simply more to clean (peasant logic).
PS: The application is not mine; for me it's a boxed product.

What would you suggest? Do I have obvious mistakes? What to change? Am I too greedy by giving Java only 20G? The assumption here is that giving it too much heap would mean longer GC as there is simply more to clean (peasant logic).
If it triggers full GCs but your occupancy stays near those 20GB, then it's possible that the GC simply does not have enough breathing room, either to meet the demand of huge allocations or to meet some of its goals (throughput, pause times), forcing full GCs as a fallback.
So what you can attempt is increasing the heap limit or relaxing the throughput goals, for example as sketched below.
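All the values in this sketch are illustrative assumptions to be validated against your GC logs, not known-good settings for your workload:
-Xmx28g
-XX:MaxGCPauseMillis=2000
-XX:GCPauseIntervalMillis=20000
A larger -Xmx gives G1 headroom above the roughly 20G live set, and a more relaxed pause-time goal lets each young/mixed collection do more work instead of forcing G1 to fall back to full GCs when it cannot keep the 800 ms target.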
As mentioned earlier in my comment, you can also try upgrading to Java 8 for improved G1 heuristics.
For further advice, GC logs covering the "berserk" behavior would be useful.

Does the behaviour of the Java garbage collector evolve over time or get impacted by JIT?

We have a production web application running on our intranet which:
is restarted at 0300 each day in order to perform a backup of its database
has the same load on it throughout the working day (0800 to 1700)
is running on Java HotSpot(TM) 64-Bit Server VM version 20.45-b01
is running on a physical machine with 16 cores and 32 gigs of RAM, running Linux 2.6.18-128.el5
does not share the machine with any other significant process
is configured with:
-Xms2g
-XX:PermSize=256m
-Xmx4g
-XX:MaxPermSize=256m
-Xss192k
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=50
-XX:+DisableExplicitGC
Each day the heap usage:
rises gradually to 90% from startup until 0800
remains at 90% until 0930
remains at 70% from 0930 until 1415
drops to 50% at 1415
drops to 37% at 1445
at which point the heap rises to 55% in about 40 minutes and is collected back to 37%, ad infinitum until the next restart.
We have AppDynamics installed on the JVM and can see that Major Garbage Collections take place roughly every minute without much of an impact on the memory (except the falls outlined above of course) until the memory reaches 37%, when the Major collections become much less frequent.
There are obviously hundreds of factors external to the behaviour of a web application, but one avenue of research is the fact that HotSpot JIT information is lost when the JVM is stopped.
Are there GC optimisations/etc which are also lost with the shutdown of the JVM? Is the JVM effectively consuming more memory than it needs to because certain Hotspot optimisations haven't yet taken place?
Is it possible that we would get better memory performance from this application if the JVM wasn't restarted and we found another way to perform a backup of the database?
(Just to reiterate, I know that there are a hundred thousand things that could influence the behaviour of an application, especially an application that hardly anyone else knows! I really just want to know whether there are certain things to do with the memory performance of a JVM which are lost when it is stopped)
Yes, it is possible for the behaviour of the GC to change over time due to JIT optimisation. One example is escape analysis, which has been enabled by default since Java 6u23. This type of optimisation can prevent some objects from being allocated on the heap at all, so they never need to be garbage collected.
For more information see Java 7's HotSpot Performance Enhancements.
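As a small illustration of the kind of allocation escape analysis can remove (the class and method names here are made up purely for the example):

// Point never escapes distance(), so once the method is JIT-compiled with
// escape analysis the allocation can be replaced by stack/register values
// (scalar replacement) and produces no garbage. In the interpreter, e.g.
// shortly after a restart, every call still allocates a Point on the heap.
public class EscapeDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double distance(double x, double y) {
        Point p = new Point(x, y); // candidate for scalar replacement
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    public static void main(String[] args) {
        double sum = 0;
        for (int i = 0; i < 10000000; i++) {
            sum += distance(i, i + 1);
        }
        System.out.println(sum);
    }
}

Until the JIT has compiled such hot paths, the same code can allocate noticeably more, which is one way GC behaviour right after a restart can differ from steady state.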

Appropriate JVM/GC tuning for 4GB JVM with 3GB cache

I am looking for the appropriate settings to configure the JVM for a web application. I have read about the old/young/perm generations, but I have trouble working out how best to use those parameters for this configuration.
Out of the 4 GB, around 3 GB are used for a cache (an application cache using EhCache), so I'm looking for the best setup considering that. FYI, the cache is static during the lifetime of the application (loaded from disk, never expires), but heavily used.
I have already profiled my application and optimized the DB queries, the application's architecture, the cache size, etc. I am just looking for JVM configuration advice here. I have measured 99% throughput for the garbage collector, and 6-8s pauses when the Full GC runs (approximately once every half hour).
Here are the current JVM parameters:
-XX:+UseParallelGC -XX:+AggressiveHeap -Xms2048m -Xmx4096m
-XX:NewSize=64m -XX:PermSize=64m -XX:MaxPermSize=512m
-verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log
Those parameters may be completely off, because they were written a long time ago... before the application became this big.
I am using 64-bit Java 1.5.
Do you see any possible improvements?
Edit: the machine has 4 cores.
-XX:+UseParallelOldGC should speed up the Full GCs on a multi-core machine.
You could also profile with different NewRatio values. Your cached objects will live in the tenured generation, so profile it with -XX:NewRatio=7 and then again with some higher and lower values.
You may not be able to accurately replicate realistic use during profiling, so make sure you monitor GC in real-life use; then you can make minor changes (e.g. to survivor spaces) and see what effect they have.
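A minimal sketch of a first profiling run combining these suggestions with the existing options (the heap sizes are the original ones; dropping -XX:+AggressiveHeap and the explicit -XX:NewSize here is an assumption, made so that NewRatio actually governs the young-generation size):
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=7
-Xms2048m -Xmx4096m -XX:PermSize=64m -XX:MaxPermSize=512m
-verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log
With NewRatio=7 the young generation gets roughly 1/8 of the heap, leaving most of the 4 GB to the tenured generation where the 3 GB cache lives.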
Old advice was not to use AggressiveHeap with Xms and Xmx, I am not sure if that is still true.
Edit: Please let us know which OS/hardware platform you are deployed on.
Full collections every 30 minutes indicate that the old generation is quite full. A high value for NewRatio will give it more space at the expense of the young gen. Can you give the JVM more than 4 GB, or are you limited to that?
It would also be useful to know what your goals / non-functional requirements are. Do you want to avoid these 6-7 second pauses at the risk of lower throughput, or are those pauses an acceptable compromise for the highest possible throughput?
If you want to minimise the pauses, try the CMS collector by removing both
-XX:+UseParallelGC -XX:+UseParallelOldGC
and adding
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Profile that with various NewRatio values and see how you get on.
One downside of the CMS collector is that unlike the parallel old and serial collectors, it doesn't compact the old generation. If the old generation gets too fragmented and a minor collection needs to promote a lot of objects to the old gen at once, a full serial collection may be invoked which could mean a long pause. (I've seen this once in prod but with the IBM JVM which went out of memory instead of invoking a compacting collection!)
This might not be a problem for you - it depends on the nature of the application - but you can insure against it by restarting nightly or weekly.
I would use Java 6 update 30 or 7 update 2, 64-bit, as they are much more efficient; e.g. they use 32-bit references by default.
I would also configure Ehcache to use direct memory or a memory mapped file if possible. This should minimise the impact on GC.
Using these options it's possible to almost eliminate your heap footprint. E.g. I have an app which uses up to 180 GB of memory-mapped files on a machine with 16 GB of memory, and the heap size is 6 MB. A full GC takes up to 11 ms when triggered manually, not that it ever GCs. ;)
If you want a simple example where I map an 8 TB file into memory and update it, see http://vanillajava.blogspot.com/2011/12/using-memory-mapped-file-for-huge.html
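For illustration, a minimal sketch of the memory-mapped-file idea (the file name and size are made up; the point is that the mapped region lives outside the Java heap, so the GC barely sees it):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStoreDemo {
    public static void main(String[] args) throws Exception {
        long size = 64L * 1024 * 1024; // 64 MB backing file, created/extended on demand
        try (RandomAccessFile file = new RandomAccessFile("store.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // The mapping is backed by the OS page cache, not by heap objects,
            // so it adds essentially nothing to garbage collection work.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buffer.putLong(0, System.currentTimeMillis()); // write a value
            System.out.println(buffer.getLong(0));         // read it back
        }
    }
}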
I hope you just removed -server so as not to inflate the post; otherwise you should enable it right away. Apart from the slightly longer startup time (which really isn't an issue for a web application that should run for days), I don't see any reason to use anything but C2. That could give some nice performance improvements in general. Um, back to topic:
Sadly the best thing I can think of won't work with your ancient JVM. The G1 garbage collector was basically designed to reduce latency. Not only does it try to reduce pauses in general, it also offers some tuning parameters to set pause goals and intervals. See this page.
There is an experimental backport to Java 6, though I doubt it's kept up to date. And nobody is spending any time on optimizing GC or anything else for Java 1.5 anymore, I fear.
PS: There are also IBM's JVM and obviously Azul Systems (OK, that wasn't a serious proposition ;)), but those are obviously out of the question... just wanted to mention them.
