JVM crashes under stress on RHEL 5.2 - java

I've got the (currently latest) JDK 1.6.0_18 crashing unexpectedly while running a web application on the (currently latest) Tomcat 6.0.24, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 million pageviews/day). This is on RHEL 5.2 (Tikanga).
The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:
a SIGSEGV is being thrown
in libjvm.so
eden space is always full (100%)
JVM runs with the following options:
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"
I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.
I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to look for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing about the same amount of memory.
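(For reference, those logging flags sit alongside the existing options roughly like this; the -Xloggc path is only an illustration, not part of the original setup:)
CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log"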
My application does not, directly, use any native code.
Any ideas of where I should look next?
Edit - more info:
1) There is no client vm in this JDK:
[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
2) Changing the O/S is not possible.
3) I don't want to change the JMeter stress test variables since this could hide the problem. Since I've got a use case (the current stress test scenario) which crashes the JVM I'd like to fix the crash and not change the test.
4) I've done static analysis on my application but nothing serious came up.
5) The memory does not grow over time. The memory usage equilibrates very quickly (after startup) at a very steady trend which does not seem suspicious.
6) /var/log/messages does not contain any useful information before or during the time of the crash.
More info: I forgot to mention that there was an Apache (2.2.14) instance fronting Tomcat via mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash is related to the mod_jk native code which connects to the JVM (the Tomcat connector).
After that (if the JVM crashes again) I'll try removing some components from my application (caching, Lucene, Quartz) and later on will try using Jetty. Since the crash is currently happening anywhere between 4 hours and 8 days in, it may take a lot of time to find out what's going on.

Do you have compiler output? i.e. -XX:+PrintCompilation (and, if you're feeling particularly brave, -XX:+LogCompilation).
I have debugged a case like this in the past by watching what the compiler was doing and, eventually (it took a long time until the light-bulb moment), realising that my crash was caused by compilation of a particular method in the Oracle JDBC driver.
Basically, what I'd do is:
switch on PrintCompilation
since that doesn't give timestamps, write a script that watches the logfile (e.g. sleep for a second and print any new rows) and reports when methods were compiled (or not) - see the sketch just after this list
repeat the test
check the compiler output to see if the crash corresponds with compilation of some method
repeat a few more times to see if there is a pattern
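A minimal sketch of such a watcher, assuming the JVM's stdout (and therefore the PrintCompilation output) is redirected to a file called compile.log; the Solaris-flavoured original is linked in the references below:
#!/bin/sh
# crude watcher: prefix each new line of the compilation log with a timestamp
tail -F compile.log | while read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done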
If there is a discernible pattern, then use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see if it doesn't blow up (an example .hotspot_compiler entry is sketched after the references below). Obviously in your case this process could theoretically take months, I'm afraid.
some references
for dealing with LogCompilation output --> http://wikis.sun.com/display/HotSpotInternals/LogCompilation+tool
for info on .hotspot_compiler --> http://futuretask.blogspot.com/2005/01/java-tip-7-use-hotspotcompiler-file-to.html or http://blogs.oracle.com/javawithjiva/entry/hotspotrc_and_hotspot_compiler
a really simple, quick & dirty script for watching the compiler output --> http://pastebin.com/Haqjdue9
note that this was written for Solaris, whose utilities often have odd options compared to the GNU equivalents, so there are no doubt easier ways to do this on other platforms or in other languages
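To illustrate the .hotspot_compiler approach: the file lives in the working directory of the JVM and lists methods the JIT should skip, one per line. The class and method named here are purely hypothetical:
# .hotspot_compiler - placed in the JVM's working directory
# format: exclude <class, with / as package separator> <method name>
exclude com/example/dao/CustomerDao findByName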
The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or old GC? What about TLABs?). Your dump indicates you're using parallel scavenge, so try the following (the corresponding flags are sketched after this list):
the serial (young) collector (IIRC it can be combined with a parallel old)
ParNew + CMS
G1
if it doesn't recur with the different GC algorithms then you know it's down to that (and you have no fix but to change GC algorithm and/or walk back through older JVMs until you find a version of that algorithm that doesn't blow up).
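For reference, on a 6u18 VM the flags for those three setups would be roughly as follows (G1 was still experimental in that release, hence the unlock flag):
-XX:+UseSerialGC                                 # serial young and old generations
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC         # ParNew + CMS
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC    # G1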

A few ideas:
Use a different JDK, Tomcat and/or OS version
Slightly modify test parameters, e.g. 25 threads at 7.2 M pageviews/day
Monitor or profile memory usage
Debug or tune the Garbage Collector
Run static and dynamic analysis

Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A crash interval of anywhere from 4 hours to 8 days is quite spread out for a purely software issue. Although you do say the system log has no errors, so I could be way off. Still think it's worth a try.

Does your memory grow over time? If so, I suggest lowering the memory limits to see whether the system fails more frequently when memory is exhausted.
Can you reproduce the problem faster if:
You decrease the memory available to the JVM?
You decrease the available system resources (i.e. drain system memory so the JVM does not have enough)?
You change your use cases to a simpler model?
One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use case specific. Try logging the start and stopping of use cases to see if you can determine which use cases are more likely to cause the problem. If you partition your use cases in half, see which half fails the fastest. That is likely to be a more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
I have also been known to either change the server to do little work or loop on the work that the server is doing. One makes your application code work a lot harder, the other makes the web server and application server work a lot harder.
Good luck,
Jacob

Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.

If I was you, I'd do the following:
try slightly older Tomcat/JVM versions. You seem to be running the newest and greatest. I'd go down two versions or so, and possibly try the JRockit JVM.
do a thread dump (kill -3 java_pid) while the app is running to see the full stacks. Your current dump shows lots of threads being blocked - but it is not clear where they block (I/O? some internal lock starvation? anything else?). I'd even schedule kill -3 to run every minute, to compare any random thread dump with the one taken just before the crash (a minimal loop for this is sketched after this list).
I have seen cases where the Linux JDK just dies whereas the Windows JDK manages to catch an exception gracefully (it was a StackOverflowError in that case), so if you can modify the code, add a catch for Throwable somewhere in the top-level class. Just in case.
Play with GC tuning options. Turn concurrent GC on/off, adjust NewSize/MaxNewSize. And yes, this is not scientific - rather a desperate search for a working solution. More details here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
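A minimal sketch of that periodic thread-dump loop; the pid lookup is only an assumption about how your Tomcat is started, and the dumps end up in Tomcat's stdout (catalina.out):
# dump all thread stacks once a minute
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)   # assumed way of finding the Tomcat JVM
while true; do
  kill -3 "$PID"
  sleep 60
done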
Let us know how this was sorted out!

Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.

Related

Troubleshooting Major Garbage Collection Cause

I have a Java application running on OpenJDK 1.7.0_95 in a 64-bit environment, and I am seeing two major garbage collections every hour, which push application response times to a peak; I would like to avoid them. At the moment I am profiling the application with the YourKit profiler. Can anyone help me with the steps to troubleshoot the cause of these major GCs so that I can avoid them if possible? The application is deployed on JBoss EAP 6.2 on Linux.
OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
Please let me know if anything else is needed from my side.
Thanks in advance!
Regards,
Divya Garg
The question you ask is of the kind "my car doesn't go fast enough, what are the steps to find out what's going on". The answer to that is: consult a qualified auto repair service.
What you can try:
Turn on -XX:+PrintGC to see more of what's happening (a fuller set of logging flags is sketched after this list).
A major GC once an hour or so that halts the application is a hint of a bad GC configuration. It normally happens in memory-greedy apps that are given a huge amount of memory; when the time comes, the GC has to clean several gigabytes, which can halt processing for minutes. Look into the parallel collector or, more generally, "garbage collection tuning", which is an art of its own.
You may (or may not) have a memory leak problem; profile to find out (as it seems you are doing). Profiling is also an art of its own. Look for the biggest allocations and where they come from, and try to find out why they are still referenced and cannot be cleaned up.
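A hedged example of the kind of GC logging flags meant above, for a HotSpot/OpenJDK 7 VM; on JBoss EAP 6.2 they would usually be appended to JAVA_OPTS in bin/standalone.conf, and the log path is just an illustration:
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/jboss/gc.log"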

Out of memory error: Java heap space when memory is available

I'm running java with java -Xmx240g mypackage.myClass
OS is Ubuntu 12.10.
top says MiB Mem 245743 total, and shows that the java process has virt 254g from the very beginning, with res steadily increasing up to 169g. At that point it looks like it starts garbage collecting a lot; I think so because the program is single-threaded at that point, CPU% is mostly 100% up to this point and then jumps to around 1300-2000 (I conclude this is the multithreaded garbage collector), and then res slowly moves to 172g. At that point java crashes with
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at the line with new double[2000][5]
java -version says
java version "1.7.0_15"
OpenJDK Runtime Environment (IcedTea7 2.3.7) (7u15-2.3.7-0ubuntu1~12.10)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
Hardware is Amazon cr1.8xlarge instance
It seems to me that java crashes even when there is a lot of memory available. That clearly can't be right, so I must be interpreting some numbers wrong. Where should I look to understand what's going on?
Edit:
I don't specify any GC options. The only command-line option is -Xmx240g
My program works successfully on many inputs, and top sometimes said that it uses up to 98.3% of memory. However, I reproduced the situation described above with a certain program input.
Edit2:
This is a scientific application. It has a gigantic tree (1-10 million nodes); each node holds a couple of double arrays of size approx. 300x3 to 900x5. After the initial tree creation the program does not allocate much memory. Most of the time there are arithmetic operations going on with these arrays.
Edit3:
The HotSpot JVM died the same way: heavy CPU usage around the 170-172g mark, then a crash with the same error. It looks like 70-75% of memory is some magical line that the JVM does not want to cross.
Final solution:
With -XX:+UseConcMarkSweepGC -XX:NewRatio=12 the program made it through the 170g mark and is happily working on.
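(Reconstructed from the question rather than quoted from it, the full final invocation was presumably along the lines of:)
java -Xmx240g -XX:+UseConcMarkSweepGC -XX:NewRatio=12 mypackage.myClass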
Analysis
The first thing you need to do is get a heap dump so you can figure out exactly what the heap looks like when the JVM crashes. Add this set of flags to the command line:
-XX:+HeapDumpOnOutOfMemoryError -verbose:gc -XX:+PrintGCDetails
When a crash happens, the JVM is going to write out the heap to disk, and frankly, it's going to take a long time on a heap that size. Download Eclipse MAT, or install the plugin if you're already running Eclipse. From there, you can load up the heap dump and run a couple of canned reports. You'll want to check the Leak Suspects and Dominator Tree reports to see where your memory is going and determine that you don't have an actual leak.
After that, I would recommend you read this document by Oracle about Garbage Collection, however here are some things you can consider:
Concurrent GC
-XX:+UseConcMarkSweepGC
I've never heard of anyone getting away with using the parallel-only collector on a heap that size. You can activate the concurrent collector, and you'll want to read up on incremental mode and determine whether it's right for your workload/hardware combo.
Heap Free Ratio
-XX:MinHeapFreeRatio=25
Dial this down to lower the bar for the garbage collector when you do a full collection. This may prevent you from running out of memory doing a full collection. 40% is the default; experiment with smaller values.
New Ratio
-XX:NewRatio
We'll need to hear more about your actual workload: is this a webapp? A Swing app? How long objects are expected to remain alive on the heap will have an impact on the right new-ratio value. Server-mode VMs like the one you're running have a fairly high new ratio by default (8:1); this may not be ideal for you if you have a lot of long-lived objects.
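A sketch of how these suggestions might combine with the original command line; the heap-dump path and the NewRatio value are placeholders to experiment with, not recommendations:
java -Xmx240g \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/dumps \
     -verbose:gc -XX:+PrintGCDetails \
     -XX:+UseConcMarkSweepGC \
     -XX:MinHeapFreeRatio=25 \
     -XX:NewRatio=4 \
     mypackage.myClass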
If I understood your question correctly, it looks like a memory leak is actually happening before the program hits the line new double[2000][5]. It seems memory is already low when that line is reached, so the allocation there is simply the one that pushes it over the edge.
I would use jvisualvm or a similar tool to find out where the memory leak is. The memory leaks I've encountered mostly had to do with Strings being created in a loop, caches not being cleared, and so on.
As general advice: NEVER use OpenJDK, even less so for production environments; it is much slower than the one from Sun/Oracle.
Apart from that, I have never seen a VM using so much memory, but I guess that is what you need (or maybe you have code using more memory than needed?).
EDIT: OpenJDK for servers is fine; the only differences from the Sun/Oracle JDK concern desktop stuff (sound, GUI...), so ignore that part.

VisualVM in production?

I am considering running VisualVM against a production JVM to see what's going on there - it started to consume too much CPU for some reason.
It must not result in a JVM failure, so I'm trying to assess all the risks.
The only issue I see on their site that could potentially bring the JVM down is related to class sharing and the -Xshare JVM option, but AFAIK class sharing is not enabled in server mode and/or on x64 systems.
So is it really safe to run VisualVM against a production JVM, if it's not - what are the risks that one should consider, and how much load (CPU/memory) does running VisualVM against a JVM (and profiling with it) put on it?
Thanks
AFAIK VisualVM can be used in production, but I would only use it on a server which is lightly loaded. What you could do is wait for the service to slow down, and later, when it's not used as much, test it to see whether some of the collections are surprisingly large. Or you could trigger a heap dump and analyse it offline.
And you can't get stats on method calls without significant overhead. Java 6 and 7 are better than Java 5, but it could still slow your application by 30%, even with a commercial profiler.
Actually, you can get some information without a lot of overhead by using stack dumps. There is even a script to help you do this at https://gist.github.com/851961
This type of profiling is the least intrusive that you can get.
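The gist is not reproduced here, but the general idea is a loop of periodic jstack samples along these lines (the pid lookup is illustrative):
# take a stack sample once a second for a minute, then look for frames that keep recurring
PID=$(pgrep -f my-production-app)    # hypothetical way of locating the JVM
for i in $(seq 1 60); do
  jstack "$PID" >> stack-samples.txt
  sleep 1
done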

Java using too much memory on Linux?

I was testing the amount of memory Java uses on Linux. When just starting up an application that does absolutely NOTHING, it already reports that 11 MB is in use. When doing the same on a Windows machine, about 6 MB is in use. These were measured with the top command and the Windows Task Manager. The VM I use on Linux is 1.6.0_11, and the HotSpot VM is Server 11.2. Starting the application with -client did not change anything.
Why does java take this much memory? How can I reduce this?
EDIT: I measure memory using the Windows Task Manager, and on Linux I open a terminal and run top.
Also, I am only interested in how to reduce this, or whether I even CAN reduce it. I'll decide for myself whether a couple of megs is a lot or not. It's just that the difference of 5 MB between Windows and Linux is strange, and I want to know if I can get there on Linux too.
If you think 11 MB is "too much" memory... you'd better avoid using Java entirely. Seriously, the JVM needs to do quite a lot of stuff (bytecode verification, GC, loading all the essential classes), and in an age where average desktop machines have 4 GB of RAM, keeping the base JVM overhead (and memory use in general) very low is simply not a design priority.
If you need your app to run on an embedded system (pretty much the only case where 11 MB might legitimately be considered "too much"), then there are special JVMs designed for such systems that use less RAM - but at the cost of lacking many of the features and/or performance of mainstream JVMs.
You can control the heap size; otherwise default values will be used. java -X gives you an explanation of the meaning of these switches,
e.g.
export JAVA_OPTS="-Xms6m -Xmx6m"
java ${JAVA_OPTS} MyClass
The question you might really be asking is, "Do the Windows Task Manager and Linux top report memory in the same way?" I'm sure there are others who can answer this question better than I, but I suspect you may not be making an apples-to-apples comparison.
Try using the jconsole application on each respective machine to do a more granular inspection. You'll find jconsole in your SDK's bin directory.
There is also a very extensive discussion of Java memory management at http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/
The short answer is that how memory is allocated is a more complex question than a single figure at the top of a simplified user-facing system utility can answer.
Both top and Task Manager report how much memory has been allocated to a process, not how much the process is actually using, so I would say it's not an apples-to-apples comparison. Regardless, in the age of gigabytes of memory, what's a couple of megs here or there on startup?
Linux and Windows are radically different operating systems and use RAM very differently. Windows kind of allocates as you go, while Linux caches more at once and prepares for the future, so that the next operations are smooth.
This explanation is not quite right, but it's close enough for you.

debugging tomcat crash

I have an instance of Tomcat which periodically crashes for unknown reasons.
There are no errors left in the logs, only a line in Event Viewer saying "Tomcat terminated unexpectedly".
In a test environment I have been unable to replicate the issue. I am therefore mostly restricted to passive monitoring of the production environment.
The problem does not seem to be related to memory as the unexpected terminations show no obvious correlation to the process' memory usage.
What steps could I take to further diagnose this problem?
EDIT:
Some corrections/clarifications:
It is actually not a single "instance" of Tomcat, rather several instances with similar configurations.
OS is Windows 2003.
Java version is Java 6.
UPDATE:
Looks like the issue might be related to memory after all. I discovered some crash dumps which were created in the Tomcat directory (not .../Tomcat/logs).
The dumps mostly contained errors such as:
java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate. Out of swap space?
This is unexpected, as the process sometimes crashed when its memory usage was at a relatively low point (compared to historical usage).
In all dumps, perm gen space is at 99% usage, but in absolute terms this usage is not consistent, and is nowhere near the limit specified in -XX:MaxPermSize.
This indicates to me that the whole JVM crashed, which is a rather unusual thing. I would consider the following steps:
First check that the hardware is OK. Run memtest86+ (http://www.memtest86.com/, or from a Ubuntu CD) to test the memory. Let it run a while to be absolutely certain.
Then see if the version of Java you use is OK. Some versions of Java 6 broke some subtle functionality. The latest Java 5 might be a good solution at this point.
Disable the Tomcat native library, which Tomcat uses for improved performance. Since you have a crashing JVM, getting rid of native code is a very good place to start - a sketch of the relevant server.xml change follows this list.
See if there are any restrictions in the version of Windows you use - a CPU usage limit before termination, or any other quota.
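A sketch of one way to disable the native library, assuming a stock conf/server.xml: comment out the APR lifecycle listener so Tomcat falls back to its pure-Java connectors (the attribute values may differ in your file):
<!-- conf/server.xml: disable the Tomcat native (APR) library -->
<!--
<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
-->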
Generally, if a process crashes on Windows, a dump file is created. Load the dump file in WinDbg (the Windows debugger) and get a stack trace of the thread that caused the exception. This should give you a better idea of what the problem is.
