I have a Java application running on OpenJDK 1.7.0_95 (64-bit) and I am seeing two major garbage collections every hour, which pushes application response times to a peak I would like to avoid. I am currently profiling the application with the YourKit profiler. Can anyone help me with the steps to troubleshoot the cause of these major GCs so that I can avoid them if possible? The application is deployed on JBoss EAP 6.2 on Linux.
OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
Please let me know if anything else is needed from my side.
Thanks in advance!
Regards,
Divya Garg
The question you ask is of the kind "my car doesn't go fast enough, what are the steps to find out what's going on?". The answer to that is: consult a qualified auto repair service.
What you can try:
Turn on -XX:+PrintGC to see more of what's happening (a fuller logging setup is sketched after this list).
A major GC roughly once an hour which halts the application is a hint of a bad GC configuration. It normally happens in memory-greedy apps which are given a huge amount of memory: when the time comes, the GC has to clean several gigabytes, which can halt processing for a few minutes. Check the parallel collector or, more generally, "garbage collection tuning", which is an art of its own.
You may (or may not) have a memory leak problem; profile to find out (as it seems you are already doing). Profiling is also an art of its own. Look for the biggest allocations and where they come from, and try to find out why they are still referenced and cannot be cleaned up.
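Building on the -XX:+PrintGC suggestion, a fuller GC-logging setup could look like the line below. These are standard HotSpot 6/7 flags; standalone.conf is where JBoss EAP 6 typically picks up JVM options, and the log path is just an example.
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/jboss/gc.log"
The resulting log shows, for each collection, which generation was collected, how much memory was freed and how long the pause was - usually enough to tell whether the hourly major GCs come from old-generation exhaustion or from something like explicit System.gc() calls.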
Related
I have a Tomcat 7 with a few applications installed. For now it is usual to upload new versions of the applications pretty often: I pack the WAR, undeploy the existing one and deploy the new WAR, or redeploy an existing application. The problem comes a few weeks later, when I see that memory is almost full and I have to act in order to prevent unexpected OutOfMemory errors that would leave our customers without service.
Do I have to restart Tomcat from time to time? Is that normal?
Is there a pragmatic solution for zero Tomcat restarts?
Expert answers only, please.
UPDATE: This library can really help to avoid PermGen issues and classloader leaks:
https://github.com/mjiderhamn/classloader-leak-prevention. But remember that it is your responsibility not to introduce leaks; there is no silver bullet. VisualVM can really help in detecting classloader leaks.
The pragmatic solution for zero Tomcat restarts is coding applications that do not leak memory. It's that easy - Tomcat is not likely to leak that memory on its own ;)
In other words: as long as your application is leaking memory, restarting is unavoidable. You can tune memory settings, but you're just postponing the inevitable.
Tomcat 7 can detect some memory leaks (and even remedy some simple ones) - this feature is exposed through the Tomcat Manager if you install that. It does not, however, help you point out where the leak is.
Speaking from recent experience, I had an app that did not seem to leak during normal operation but had a PermGen leak - meaning I did not really see the leak until I tried to do what you're doing: hot deploys. I could only do a few deploys before filling PermGen, as Tomcat was unable to unload the old instances of the classes...
There may be better explanations of PermGen leaks on the web, but a quick Google search gave me this: http://cdivilly.wordpress.com/2012/04/23/permgen-memory-leak/
If, however, your application crashes after running for some time, it is likely that it has a proper memory leak. You can tune heap sizes and garbage collection settings in an attempt to remedy or work around this, but it may very well be that the only real fix is to track down the leak, fix the code and build a version that does not leak.
...but unless you can ensure every release will be leak-free, you need a strategy for restarting your service. If you don't do it during releases, look into doing it during off-peak hours.
If you have a cluster you can look into setting up Tomcat with memcached for session replication so as to not disrupt users when restarting a node in the cluster. You can also look into setting up monit, god, runit, upstart, systemd etc. for automatic restart of failed services...
You need to add VM arguments with minimum and maximum heap sizes in Eclipse. In Eclipse go to Window --> Preferences --> Java --> Installed JREs, then double-click on the JRE entry and add the VM arguments -Xms768m -Xmx1024m.
I am considering running VisualVM against a production JVM to see what is going on there - it has started to consume too much CPU for some reason.
It must not result in a JVM failure so I'm trying to estimate all the risks.
The only issue that I see on their site that could potentially bring JVM down is related to class sharing and -Xshare JVM option, but afaik class sharing is not enabled in server mode and/or on x64 systems.
So is it really safe to run VisualVM against a production JVM, if it's not - what are the risks that one should consider, and how much load (CPU/memory) does running VisualVM against a JVM (and profiling with it) put on it?
Thanks
AFAIK VisualVM can be used in production, but I would only use it on a server which is lightly loaded. What you could do is wait for the service to slow down and later, when it is not used as much, examine it to see whether some of the collections are surprisingly large. Or you could trigger a heap dump and analyze it offline.
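For the heap-dump route, jmap (which ships with the JDK) can be pointed at the live process; the file name and pid below are placeholders, and note that the live option forces a full GC first:
jmap -dump:live,format=b,file=/tmp/app-heap.hprof <pid>
The resulting .hprof file can then be opened offline in VisualVM or Eclipse MAT, keeping the heavy analysis off the production box.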
And you can't get stats on method calls without significant overhead. Java 6 and 7 are better than Java 5, but it could still slow your application down by 30%, even with a commercial profiler.
Actually, you can get some information without a lot of overhead by using stack dumps. There is even a script to help you do this at https://gist.github.com/851961
This type of profiling is the least intrusive that you can get.
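As a rough illustration of the idea (this is not the script from the gist, just a minimal in-process sketch), a poor man's sampler can be built on Thread.getAllStackTraces():
import java.util.Map;

public class PoorMansSampler {
    public static void main(String[] args) throws InterruptedException {
        // Take a handful of samples, one second apart; methods that show up
        // repeatedly near the top of RUNNABLE threads are the likely hot spots.
        for (int i = 0; i < 10; i++) {
            Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
                System.out.println("--- " + entry.getKey().getName()
                        + " (" + entry.getKey().getState() + ")");
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(1000);
        }
    }
}
Run as-is it samples its own JVM, so to inspect the production process you would either run the same loop in a spare thread inside that JVM, or simply take repeated jstack / kill -3 thread dumps from outside, which achieves the same thing without touching the code.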
I've got (the currently latest) JDK 1.6.0_18 crashing unexpectedly while running a web application on (the currently latest) Tomcat 6.0.24, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 million pageviews/day). This is on RHEL 5.2 (Tikanga).
The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:
a SIGSEGV is being thrown
on libjvm.so
eden space is always full (100%)
JVM runs with the following options:
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"
I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.
I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to look for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing the same amount of memory.
My application does not, directly, use any native code.
Any ideas of where I should look next?
Edit - more info:
1) There is no client vm in this JDK:
[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
2) Changing the O/S is not possible.
3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.
4) I've done static analysis on my application but nothing serious came up.
5) The memory does not grow over time. The memory usage equilibrates very quickly (after startup) at a very steady trend which does not seem suspicious.
6) /var/log/messages does not contain any useful information before or during the time of the crash
More info: I forgot to mention that there was an Apache (2.2.14) fronting Tomcat using mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash relates to the mod_jk native code which connects to the JVM (the Tomcat connector).
After that (if the JVM crashes again) I'll try removing some components from my application (caching, Lucene, Quartz) and later on will try using Jetty. Since the crash currently happens anywhere between 4 hours and 8 days in, it may take a lot of time to find out what's going on.
Do you have compiler output? i.e. -XX:+PrintCompilation (and, if you're feeling particularly brave, -XX:+LogCompilation).
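In flag form, those would be roughly the following additions to CATALINA_OPTS (LogCompilation is a diagnostic option, so it needs to be unlocked first; it writes an XML log, hotspot.log by default, into the working directory):
CATALINA_OPTS="$CATALINA_OPTS -XX:+PrintCompilation"
CATALINA_OPTS="$CATALINA_OPTS -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation"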
I have debugged a case like this in the past by watching what the compiler was doing and, eventually (it took a long time until the light-bulb moment), realising that my crash was caused by the compilation of a particular method in the Oracle JDBC driver.
Basically what I'd do is;
switch on PrintCompilation
since that doesn't give timestamps, write a script that watches the logfile (e.g. sleep for a second, print any new rows) and reports when methods were compiled (or not)
repeat the test
check the compiler output to see if the crash corresponds with compilation of some method
repeat a few more times to see if there is a pattern
If there is a discernible pattern, use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see if it doesn't blow up. Obviously in your case this process could theoretically take months, I'm afraid.
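For illustration, a .hotspot_compiler file is just a plain text file in the JVM's working directory with one directive per line; the class and method named here are purely hypothetical:
exclude com/example/dao/OrderStatement executeQuery
The VM should note the excluded method in its output, so you can confirm the file was picked up.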
some references
for dealing with logcompilation output --> http://wikis.sun.com/display/HotSpotInternals/LogCompilation+tool
for info on .hotspot_compiler --> http://futuretask.blogspot.com/2005/01/java-tip-7-use-hotspotcompiler-file-to.html or http://blogs.oracle.com/javawithjiva/entry/hotspotrc_and_hotspot_compiler
a really simple, quick & dirty script for watching the compiler output --> http://pastebin.com/Haqjdue9
note that this was written for Solaris, which always has bizarre options to its utils compared to the GNU equivalents, so there are no doubt easier ways to do this on other platforms or using different languages
The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or old GC, what about TLABs?). Your dump indicates you're using parallel scavenge, so try (flag examples follow the list):
the serial (young) collector (IIRC it can be combined with a parallel old)
ParNew + CMS
G1
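Roughly, the corresponding flags on 6u18 would be the following (worth double-checking against your exact JVM version):
-XX:+UseSerialGC
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC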
If it doesn't recur with the different GC algorithms, then you know it's down to that (and you have no fix but to change the GC algorithm and/or walk back through older JVMs until you find a version of that algorithm that doesn't blow up).
A few ideas:
Use a different JDK, Tomcat and/or OS version
Slightly modify test parameters, e.g. 25 threads at 7.2 M pageviews/day
Monitor or profile memory usage
Debug or tune the Garbage Collector
Run static and dynamic analysis
Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere, too. A timing of "between 4-24 hours" is quite spread out for it to be just a software issue. Although you do say the system log has no errors, so I could be way off. Still, I think it's worth a try.
Does your memory grow over time? If so, I suggest lowering the memory limits to see whether the system fails more frequently when memory is exhausted.
Can you reproduce the problem faster if:
You decrease the memory available to the JVM?
You decrease the available system resources (i.e. drain system memory so the JVM does not have enough)?
You change your use cases to a simpler model?
One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use-case specific. Try logging the start and end of use cases to see if you can determine which ones are more likely to cause the problem (a minimal sketch follows below). If you partition your use cases in half, see which half fails the fastest; that is likely to be the more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
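A minimal sketch of that kind of bracketing, assuming plain java.util.logging (the class and use-case names are hypothetical):
import java.util.concurrent.Callable;
import java.util.logging.Logger;

public class UseCaseLogger {
    private static final Logger LOG = Logger.getLogger("usecases");

    // Wrap each use case so the log carries timestamped START/END markers
    // that can be lined up against the time of the JVM crash.
    public static <T> T logged(String useCase, Callable<T> work) throws Exception {
        LOG.info("START " + useCase);
        try {
            return work.call();
        } finally {
            LOG.info("END " + useCase);
        }
    }
}
The last START without a matching END before the crash then points at the suspect use case.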
I have also been known either to change the server to do very little work, or to loop on the work the server is doing: the former shifts the stress onto the web server and application server, the latter makes your application code work a lot harder.
Good luck,
Jacob
Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.
If I was you, I'd do the following:
Try slightly older Tomcat/JVM versions. You seem to be running the newest and greatest; I'd go down two versions or so, and possibly try the JRockit JVM.
Do a thread dump (kill -3 java_pid) while the app is running to see the full stacks. Your current dump shows lots of threads being blocked - but it is not clear where they block (I/O? some internal lock starvation? anything else?). I'd maybe even schedule kill -3 to run every minute, to compare a random thread dump with the one just before the crash.
I have seen cases where the Linux JDK just dies whereas the Windows JDK is able to gracefully catch an exception (it was a StackOverflowError in that case), so if you can modify the code, add a "catch Throwable" somewhere in the top class, just in case (a minimal sketch follows this list).
Play with GC tuning options: turn concurrent GC on/off, adjust NewSize/MaxNewSize. And yes, this is not scientific - rather a desperate search for a working solution. More details here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
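A minimal sketch of the "catch Throwable at the top" idea, written as a servlet filter (this assumes the javax.servlet API already on Tomcat's classpath; the class name is made up, and it would still need to be mapped in web.xml):
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CatchAllFilter implements Filter {
    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        try {
            chain.doFilter(req, res);
        } catch (Throwable t) {
            // Log and swallow instead of letting the error propagate. This can
            // only help if the failure surfaces as a Java-level Error first -
            // it cannot catch a native SIGSEGV inside libjvm.so.
            System.err.println("Top-level throwable: " + t);
            t.printStackTrace();
        }
    }

    public void destroy() { }
}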
Let us know how this was sorted out!
Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.
I've just tried enabling the -XX:+DoEscapeAnalysis option on a JDK 6u18 VM (on Solaris) and had a rather disappointing experience. I'm running a Scala application which has rather a lot of actors (20,000 of them). This is a recipe for garbage creation!
Typically the app can run with 256Mb of heap but generates huge amounts of garbage. In its steady state it:
spends 10% of time in GC
generates >150Mb of garbage in <30s which then gets GC'd
I thought that escape analysis might help, so I enabled the option and re-ran the app. I found that the app became increasingly unable to clear away the garbage it had accumulated, until it seemed eventually to spend the entire time doing GC, with the app "flatlining" at its full allocation.
At this point I should say that the app did not throw an OutOfMemoryError, which I would have expected. Perhaps JConsole (which I was using to perform the analysis) does not properly display GC stats with this option on (I'm not convinced)?
I then removed the option and restarted and the app became "normal" again! Anyone have any idea what might be going on?
1. Did escape analysis show up as being enabled in JConsole? You need to make sure you're running the VM with the -server option. I assume you had this working, but I just thought I'd check.
2. I don't think escape analysis will help the situation with Scala actors. You might see a big gain if you do something like:
def act(): Unit = {
  // omgHugeObject never escapes act(), so escape analysis could in principle
  // allocate it on the stack instead of the heap (placeholder class and method names)
  val omgHugeObject = new OMGHugeObject()
  omgHugeObject.doSomethingCrazy()
}
In the example above, escape analysis would make it possible for omgHugeObject to be allocated on the stack instead of the heap and thus not create garbage. I don't think it is likely that escape analysis will help with actors, though: their references will always "escape" to the actor subsystem.
3. Are you on the most recent release of Scala? There was a memory leak that I believe was fixed in a recent version. That issue even caused Lift to spawn off its own actor library, which you might look into.
4. You might try the G1 garbage collector. You can enable it with:
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
From the JDK 6u18 release notes:
Note that Escape analysis-based optimization (-XX:+DoEscapeAnalysis) is disabled in 6u18. This option will be restored in a future Java SE 6 update.
I suggest you try increasing the new generation size, e.g. -XX:NewSize=96M -XX:NewRatio=3. Use JVisualVM (included in the JDK) with the Visual GC plugin to watch how the young and old spaces are utilised.
I've just upgraded some old Java source, which had been running on a Sun Java 1.4.2 VM, to the Sun Java (JRE) 6 VM. More or less the only thing I had to change was to add explicit type parameters for some collections (HashMaps, Vectors and so on). The code itself is quite memory intensive, using up to 1 GB of heap memory (the VM is started with -Xmx1024m).
Since I had read a lot about better performance on newer Java VMs, this was one of the reasons I did the upgrade.
Can anyone think of a reason why the performance is worse in my case now (just in general, of course, since you can't take a look at the code)?
Does anyone have advice for a non-Java-guru on what to look for when optimizing (speed-wise) existing code? Any hints, recommended docs, tools?
Thanks.
Not much information here. But here are a couple of things you might want to explore:
Start the VM with -Xmx and -Xms set to the same value (in your case 1024M)
Ensure that the server JVM (the server VM DLL, i.e. the -server option) is being used to start the virtual machine.
Run a profiler to see what objects are hogging memory or what objects are not being garbage collected
Hook up your VM with the jconsole and trace through the objects
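Putting the first two suggestions together, the launch line would look roughly like this (the main class is a placeholder, and on Windows the -server VM typically requires the JDK rather than the plain JRE):
java -server -Xms1024m -Xmx1024m -verbose:gc com.example.MainClass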
If your application nearly runs out of free heap, garbage collection time may dominate computation time.
Enable GC logging to look for this. Or, even better, simply start JConsole and attach it to your program.
Theoretically it could be that your application consumes more memory because there were changes to the way Strings share their internal char[]; less sharing is done after 1.4.
Check my old blog at http://www.sdn.sap.com/irj/scn/weblogs?blog=/pub/wlg/5100
I would compare the Garbage Collector logs to see whether memory usage is really the problem.
If that doesn't help, use a profiler such as YourKit to find the differences.
Definitely use a profiler on the app (YourKit is great)... it's easy to waste a lot of time guessing at the problem, when most of the time you'll be able to narrow it down really quickly in the profiler.