How serious is the Java 7 "Solr/Lucene" bug?

Apparently Java 7 has some nasty bug regarding loop optimization: a Google search turns up plenty of reports.
From the reports and bug descriptions I find it hard to judge how significant this bug is (unless you use Solr or Lucene).
What I'd like to know:
How likely is it that my (any) program is affected?
Is the bug deterministic enough that normal testing will catch it?
Note: I can't make users of my program use -XX:-UseLoopPredicate to avoid the problem.

The problem with any HotSpot compiler bug is that you need to reach the compilation threshold (e.g. 10,000 invocations) before it can get you: so if your unit tests are "trivial", you probably won't catch it.
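To make that concrete, here is a minimal sketch (the method and the 20,000 count are purely illustrative): HotSpot only compiles a method to native code after it has been executed on the order of the compile threshold (about 10,000 times by default on the server VM), so a test has to hammer the same code path well past that before the optimized, and potentially miscompiled, version even runs.

// Illustrative only: the hot method must run well past the compile threshold
// (~10,000 on the server VM) before HotSpot's optimized code can kick in.
public class WarmupSketch {
  static int sum(int[] data) {
    int s = 0;
    for (int i = 0; i < data.length; i++) {
      s += data[i];
    }
    return s;
  }

  public static void main(String[] args) {
    int[] data = new int[1000];
    for (int iter = 0; iter < 20000; iter++) {  // a "trivial" test never gets this far
      sum(data);
    }
  }
}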
For example, we caught the incorrect-results issue in Lucene because this particular test creates indexes of 20,000 documents.
In our tests we randomize different interfaces (e.g. different Directory implementations), indexing parameters and such, and the test only fails 1% of the time; of course it's then reproducible with the same random seed. We also run checkindex on every index the tests create, which does some sanity checks to ensure the index is not corrupt.
For the test we found, if you have a particular configuration, e.g. RAMDirectory + PulsingCodec + payloads stored for the field, then after it hits the compilation threshold, the enumeration loop over the postings returns incorrect calculations; in this case the number of documents returned for a term != the docFreq stored for that term.
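Roughly, the kind of consistency check that catches this looks like the following sketch against the Lucene 3.x API (the field and term are made up; the real verification lives in Lucene's CheckIndex):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

public class PostingsCheckSketch {
  // Sketch: enumerate the postings for a term and compare the count
  // against the docFreq stored in the term dictionary.
  static void checkTerm(Directory directory, Term term) throws java.io.IOException {
    IndexReader reader = IndexReader.open(directory);
    try {
      TermDocs docs = reader.termDocs(term);
      int count = 0;
      while (docs.next()) {
        count++;
      }
      docs.close();
      if (count != reader.docFreq(term)) {
        throw new IllegalStateException("postings count " + count
            + " != docFreq " + reader.docFreq(term));
      }
    } finally {
      reader.close();
    }
  }
}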
We have a good number of stress tests, and it's important to note that the normal assertions in this test actually pass; it's the checkindex part at the end that fails.
The big problem with this is that Lucene's incremental indexing fundamentally works by merging multiple segments into one: because of this, if these enums calculate invalid data, that invalid data is then stored into the newly merged index: aka corruption.
I'd say this bug is much sneakier than previous loop-optimizer HotSpot bugs we have hit (e.g. the sign-flip stuff, https://issues.apache.org/jira/browse/LUCENE-2975). In that case we got wacky negative document deltas, which made it easy to catch. We also only had to manually unroll a single method to dodge it. On the other hand, the only "test" we had initially for that one was a huge 10GB index of http://www.pangaea.de/, so it was painful to narrow it down to this bug.
In this case, I spent a good amount of time (every night last week) trying to manually unroll/inline various things, trying to create some workaround so we could dodge the bug and not have the possibility of corrupt indexes being created. I could dodge some cases, but there were many more I couldn't... and I'm sure if we can trigger this stuff in our tests, there are more cases out there...

A simple way to reproduce the bug: open Eclipse (Indigo in my case) and go to Help/Search. Enter a search string and you will notice that Eclipse crashes. Have a look at the log:
# Problematic frame:
# J org.apache.lucene.analysis.PorterStemmer.stem([CII)Z
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000007b79000): JavaThread "Worker-46" [_thread_in_Java, id=264, stack(0x000000000f380000,0x000000000f480000)]
siginfo: ExceptionCode=0xc0000005, reading address 0x00000002f62bd80e
Registers:

The problem still exists as of Dec 2, 2012,
in both the Oracle JDK
java -version
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)
and OpenJDK
java version "1.7.0_09-icedtea"
OpenJDK Runtime Environment (fedora-2.3.3.fc17.1-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
Strangely, either -XX:-UseLoopPredicate or -XX:LoopUnrollLimit=1 on its own prevents the bug from happening, but when the two are used together the JDK fails; see e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=849279
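For example (yourapp.jar is just a placeholder), launching with a single one of those flags, such as

java -XX:-UseLoopPredicate -jar yourapp.jar

works as a workaround on the affected builds, while adding -XX:LoopUnrollLimit=1 to the same command line brings the crash back.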

Well it's two years later and I believe this bug (or a variation of it) is still present in 1.7.0_25-b15 on OSX.
Through very painful trial and error I have determined that using Java 1.7 with Solr 3.6.2 and autocommit <maxTime>30000</maxTime> seems to cause index corruption. It only seems to happen with 1.7 and maxTime at 30000; if I switch to Java 1.6, I have no problems. If I lower maxTime to 3000, I have no problems.
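For reference, that autocommit setting is the one that normally lives in solrconfig.xml; with my value it looks roughly like this (only the 30000 comes from my setup, the surrounding element is standard Solr configuration):

<autoCommit>
  <maxTime>30000</maxTime>
</autoCommit>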
The JVM does not crash, but it causes RSolr to die with the following stack trace in Ruby:
https://gist.github.com/armhold/6354416. It does this reliably after saving a few hundred records.
Given the many layers involved here (Ruby, Sunspot, Rsolr, etc) I'm not sure I can boil this down into something that definitively proves a JVM bug, but it sure feels like that's what's happening here. FWIW I have also tried JDK 1.7.0_04, and it also exhibits the problem.

As I understand it, this bug is only found in the server JVM. If you run your program on the client JVM, you are in the clear. If you run your program on the server JVM, how serious the problem can be depends on the program.


How to enable the Java HotSpot VM compiler

I am using Java 1.8.0_05 with the Java HotSpot(TM) 64-Bit Server VM.
I am running a Java web app on Tomcat 8.0.43.
I recently deployed my .war file by dropping it in the webapps folder.
This resulted in the following message being logged:
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=245760Kb used=244058Kb max_used=244079Kb free=1701Kb
 bounds [...]
 total_blobs=48344 nmethods=47669 adapters=584
 compilation: disabled (not enough contiguous free space left)
How can I check what the current status of the compiler is now, to see if it's still disabled?
How can I enable the compiler? Can I simply restart tomcat?
There doesn't seem to be any noticeable difference in how my application is running (e.g. in terms of speed).
Interestingly, I didn't get this message when deploying the same application to an identical server. This is why I would like to first just turn the compiler back on rather than changing settings (e.g. ReservedCodeCacheSize) as the message recommends.
Then, if the problem persists I can see which settings I need to change.
Addressing your individual questions + 1 recommendation:
How to check if the JIT compiler is still disabled?
The easiest thing to do is to start up jvisualvm (already shipped with the JDK), then check the used code cache space. If your code cache is full, the JIT compiler will remain disabled. To check the Code Cache memory space:
install the MBeans JVisualVM plugin.
go to Mbeans
open java.lang/MemoryPool/Code Cache
check variable "Usage" (double-click)
This will give you an overview of where you are.
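If you would rather check it from inside the application (or over JMX) instead of clicking through jvisualvm, here is a minimal sketch using the standard java.lang.management API (the match on the pool name "Code Cache" is an assumption that holds for HotSpot 8):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class CodeCacheCheck {
  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      // On HotSpot 8 the non-heap code cache pool is named "Code Cache"
      if (pool.getName().contains("Code Cache")) {
        MemoryUsage usage = pool.getUsage();
        System.out.printf("%s: used=%dKb max=%dKb%n",
            pool.getName(), usage.getUsed() / 1024, usage.getMax() / 1024);
      }
    }
  }
}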
How can I enable the compiler? Can I simply restart tomcat?
Yes, a restart will certainly reset the state of the cache. The only other way to get the compiler going again would be if you had already started the JVM with the right parameters (enabling -XX:+UseCodeCacheFlushing).
No difference in how my application is running?
JIT optimizes your code, but depending on your application and the way you use it, you might not see any noticeable difference. Assuming you run a webapp (because of Tomcat), the network transmission speed or your browser rendering pages are likely orders of magnitude slower than what JIT gains you in terms of core Java speed.
"I didn't get this message when deploying the same application"
JIT compiling is dependent on the code that is being executed at that moment. The same application might run quite differently under the hood at the level where JIT works. When it comes to low-level functions, the more 3rd-party libraries you use, the less you can be sure about what is happening on all those threads you have no control over.
The suggestion:
Please upgrade that Java version. It is very rare to still be on such an early JDK 8 release (u05), and quite risky. Java 8 was not the most stable release when it came out, and had easily reproducible bugs even in later releases. There have been over 1000 bugs fixed in JDK 8, many of them directly addressing JIT issues. If you have any control over the environment you are talking to, upgrade it. If you do not, notify the responsible person.
I had this issue a while ago, and this is what I can tell you:
Once the Code Cache becomes full the compiler is automatically disabled.
Will it be automatically restarted?
No. And it will stay down until the JVM is restarted.
Can I simply restart tomcat?
Yes. But it will probably happen again.
There doesn't seem to be any noticeable difference in how my application is running (e.g. in terms of speed).
In the long run there will be some issues since code that could be cached and optimized can no longer be compiled and stored there.
What can you do?
You could increase -XX:ReservedCodeCacheSize a bit.
You could enable -XX:+UseCodeCacheFlushing. The drawback is that if your code cache size is way too low and you constantly hit the flushing threshold, performance will suffer, since you are spending CPU resources on the flushing process.
I would increase the code cache size a bit, enable the flush, and monitor the app with VisualVM or something else that lets you look at the current state of the code cache. Monitoring will help you understand whether you are reaching the threshold once in a while or all the time.
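For example, on Tomcat both options can go into CATALINA_OPTS, e.g. in bin/setenv.sh (the 384m below is purely illustrative, size it for your own workload):

CATALINA_OPTS="$CATALINA_OPTS -XX:ReservedCodeCacheSize=384m -XX:+UseCodeCacheFlushing"

On Windows, the equivalent would be a set CATALINA_OPTS=... line in setenv.bat.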
Remember that CodeCache is separated from the Heap, so looking at HeapSize won't help you.
Edit:
Regarding VisualVM, here are the official steps to connect to a remote JVM:
https://docs.oracle.com/javase/8/docs/technotes/guides/visualvm/applications_remote.html
Just make sure JMX is enabled and it should work right away.
Regarding the issue with many apps running at the same time... well yeah, technically a standard Tomcat starts one JVM for all the apps, so the cache space will be shared.
You could also monitor this case by attaching VisualVM to the JVM, undeploying an app and checking whether the space has been freed.
You could also consider using an Enterprise container which will let you create one JVM per App.

Java application running in Eclipse, random "fatal errors"

I have written a short application that converts files from their raw data to XML (ECGs). I have about 350,000 files to convert, and the conversion itself is done via a library that I got from the manufacturer of the ECG devices. To make use of the multiple processors and cores in the machine I'm using for the conversion, I wrote a "wrapper application" that creates a pool of threads, which is then used to do the conversion in separate threads. It works somewhat OK, but unfortunately I do get random errors that cause the whole application to stop (85k files have been converted over the past 3-4 days and I have had four of those errors):
A fatal error has been detected by the Java Runtime Environment:
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x71160a6c, pid=1468, tid=1396
JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
Java VM: Java HotSpot(TM) Client VM (25.20-b23 mixed mode windows-x86 )
Problematic frame:
C [msvcr100.dll+0x10a6c]
I would suspect that it's the library I'm using that causes these, so I don't think I can do all that much to fix it. If that error happens, I then rerun the program and let it start where it left off before crashing. Right now I have to do that manually, but I was hoping that there is some way to let Eclipse restart the program (with an argument of the filename where it should start). Does anyone know if there is some way to do that?
Thanks!
It is not entirely clear, but I think you are saying that you have a 3rd party Java library (with a native code component) that you are running within one JVM using multiple threads.
If so, I suspect that the problem is that the native part of the 3rd-party application is not properly multi-threaded, and that is the root cause of the crashes. (I don't expect that you want to track down the cause of the problem ...)
Instead of using one JVM with multiple converter threads, use multiple JVMs each with a single converter thread. You can spread the conversions across the JVMs either by partitioning the work statically, or by some form of queueing mechanism.
Or ... you could modify your existing wrapper so that the threads launch the converter in separate JVMs using ProcessBuilder. If a converter JVM crashes, the wrapper thread that launched it could just launch it again. Alternatively, it could just make a note of the failed conversion and move on to the next one. (You need to be a bit careful with retrying, in case it is something about the file you are converting that is triggering the JVM crash.)
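A rough sketch of that last idea, which each worker thread in your existing pool could run (the jar name and main class are hypothetical, and the error handling is deliberately simplistic):

import java.io.File;
import java.io.IOException;

public class ConversionLauncher {
  // Launch one child JVM per file; if the child crashes, log it and move on.
  static void convert(File file) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(
        "java", "-cp", "converter.jar", "com.example.ConverterMain", file.getPath());
    pb.inheritIO();                       // forward the child's output to our console
    Process child = pb.start();
    int exit = child.waitFor();
    if (exit != 0) {
      System.err.println("Conversion failed (exit code " + exit + "): " + file);
      // a retry could go here, but beware of files that reliably crash the JVM
    }
  }
}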
For the record, I don't know of an existing "off the shelf" solution.
It seems that you are using the x86 (32-bit) version of Java. Maybe you could try it with the x64 (64-bit) version. That has sometimes worked for me in the past.
The problem seems to be in the native library, but maybe if you try it with 64-bit Java, it will use a 64-bit version of the native library?

Linux identifier removed

I've encountered an interesting problem when running the following piece of Java code:
File.createTempFile("temp.cnt.ent", "cnt.feat.tmp", directory);
The following exception is thrown:
Exception in thread "main" java.io.IOException: Identifier removed
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1792)
I have never had this problem before and Google doesn't seem to have much for me. The system runs Scientific Linux release 5.8 (Linux 2.6.18-274.3.1.el5 x86_64) and the Java version is
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
The file system (Lustre) has 80TB of free space.
Any suggestions are greatly appreciated.
You are encountering synchronisation errors between the various instances. Lustre doesn't support file locking, which is probably what java.io.UnixFileSystem.createFileExclusively uses to avoid concurrency woes. (I say "probably" because it doesn't appear to be documented anywhere.)
Without locking it's only a matter of time until file operations interfere with each other. Reducing the number of instances is not a solution because it just makes it less likely to occur.
I believe the solution is to ensure that each instance creates its files in a different sub-directory.
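Something along these lines, for example (using a random UUID for the per-instance directory name is just one way to make it unique):

import java.io.File;
import java.io.IOException;
import java.util.UUID;

public class PerInstanceTempFiles {
  // One unique sub-directory per JVM instance, so the exclusive-create calls
  // of different instances never race on the same Lustre directory.
  private static final String INSTANCE = "instance-" + UUID.randomUUID();

  static File createTempFileIn(File baseDirectory) throws IOException {
    File instanceDir = new File(baseDirectory, INSTANCE);
    if (!instanceDir.mkdirs() && !instanceDir.isDirectory()) {
      throw new IOException("Could not create " + instanceDir);
    }
    return File.createTempFile("temp.cnt.ent", "cnt.feat.tmp", instanceDir);
  }
}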
I guess that you see an EIDRM. At least the error message looks like that. The IOException wraps an error message from the underlying native libraries.
This is not a real answer to your problem, but maybe a useful hint.
http://docs.oracle.com/cd/E19455-01/806-1075/msgs-1432/index.html has some information and additional pointers.
The problem seems to be related to having too many instances of the application running at a time (each in a separate VM). For some unknown reason the OS refuses to allow the creation of a temp file. Workaround: run fewer instances.

How do I investigate the cause of a JVM crash?

One day ago, after a few months of working normally, our Java app started to crash occasionally with the following error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:247), pid=2075, tid=140042095163136
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 6.0_23-b05
# Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
# An error report file with more information is saved as:
# /var/chat/jSocketer/build/hs_err_pid2075.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
I looked in hs_err_pid2075.log and saw that there was an active thread processing network communication. However, no application or environment changes were made in the last few months, and there wasn't any growth in load either.
What can I do to understand the reason for the crash? Are there any common steps for investigating a JVM crash?
UPD
http://www.wuala.com/ubear/public
The crash is in the JVM, not in external native code. However, the operation it crashed on was initiated by an external DLL.
This line in the hs_err_pid file explains the operation that crashed:
VM_Operation (0x00007f5e16e35450): GetAllStackTraces, mode: safepoint, requested by thread 0x0000000040796000
Now, thread 0x0000000040796000 is
0x0000000040796000 JavaThread "YJPAgent-Telemetry" daemon [_thread_blocked, id=2115, stack(0x00007f5e16d36000,0x00007f5e16e37000)]
which is a thread created by YourKit. "GetAllStackTraces" is something that a profiler needs to call in order to do sampling. If you remove the profiler, the crash will not happen.
With this information it's not possible to say what causes the crash, but you can try the following: remove all -XX VM parameters, -verbose:gc and the debugging VM parameters. They might interfere with the profiling interface of the JVM.
Update
Code that calls java.lang.Thread#getAllStackTraces() or java.lang.Thread#getStackTrace() may trigger the same crash.
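For example, monitoring code of roughly this shape exercises the same GetAllStackTraces safepoint operation as the profiler agent (the one-second interval is just illustrative):

import java.util.Map;

class StackSampler implements Runnable {
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        // Same VM operation that sampling profilers rely on
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        System.out.println("sampled " + traces.size() + " threads");
        Thread.sleep(1000);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}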
The two times I've witnessed recurring JVM crashes were both due to hardware failure, namely RAM. Running a memtest utility is the first thing I'd try.
I can see from the error report that you have the YourKit agent loaded. Its telemetry thread is mentioned as the requester for the operation that appears to fail. Try running the application without the YJP agent to see if you can still reproduce the crash.
Generally, JVM crashes are pretty hard to diagnose. They could happen due to a bug in some JNI code or in the JRE itself. If you suspect the latter, it may be worth submitting a bug report to Oracle.
Either way, I'd recommend upgrading to the latest release of Java 6 to make sure it's not a known issue that's already been fixed. At the time of this writing the current release is Java 6 update 29.
If you're not messing with anything that would cause this directly (which basically means using native code or libraries that call native code) then it's almost always down to a bug in the JVM or hardware issue.
If it's been running fine for ages and has now started to crash then it seems to me like the hardware issue is the most likely of the two. Can you run it on another machine to rule out the issue? Of course, it definitely wouldn't hurt to upgrade to the latest Java update as well.
Switching to another version of the Linux kernel "fixes" the JVM crash problem (http://forum.proxmox.com/threads/6998-Best-strategy-to-handle-strange-JVM-errors-inside-VPS?p=40286#post40286). It helped me with my real server, which ran Ubuntu Server 10.04 LTS with kernel version 2.6.32-33. So a kernel update resolved this issue; the JVM no longer crashes.

Troubleshooting Java process with very high CPU usage - Tomcat application

I have a Java application that runs on Tomcat (which runs as a service on Windows); the Java process continues to eat up CPU until I eventually have to restart the Tomcat service.
First my setup:
Windows 2003 server
Tomcat 6, running as service using Wrapper
JDK: 1.6.0_20
I was seeing catch issues here and there leading up to yesterday. I had to restart midday yesterday, then at 2:30 this morning, then today I could barely restart the application and open jconsole to monitor it before it was hitting 99% CPU usage again. Through a combination of things I'm not quite sure of, it seems like I got the JVM to cycle itself and the app was hovering in the 10-30% CPU usage range for a couple hours. However, then it started to creep up again, finally going into its 99% CPU usage breakdown. I was also having trouble with high memory usage, but that has stayed fairly normal and steady since I so-called got the JVM to "cycle" (bad terminology perhaps, but this is really what it seemed to do - and in the wrapper log there was a dump of all the classes it was reloading after).
Then I was digging around some more and found a JRE 6 Update 24 installed on the server (I didn't install it, as I do thorough testing with each Java update - but maybe my server admin did the update). I attempted to uninstall it, but can't. Thus, I get different versions when I do java -version versus javac -version:
java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
javac -version
javac 1.6.0_20
Could this difference be causing a JVM conflict of sorts? JAVA_HOME and my PATH variables both point to the correct JDK installation.
Hoping for more stability, I decided to change my app to run on the previous JDK that was still installed - JDK 1.6.0_04. I changed the wrapper.conf, set env variables, cleaned and rebuilt, and started. This does seem more stable and has been up for about 4 hours. The CPU usage has climbed to the 90s, then it seems to clear itself out again.
I've done heap dumps and run them through the Memory Analyzer in Eclipse (nothing new found there), and I've used jconsole with jtop to look at threads - nothing jumps out, which is why I continue to be curious whether it's a Java/JVM issue. So, I know this is a long post, but I don't really know where to go from here. Any ideas?
(I've done exhaustive web searching on this and some articles have pointed to possibly a Quartz issue or Hibernate queries not flushing. Nothing has changed in the app since I started seeing the CPU issues, so I'm not sure where to start troubleshooting if it could indeed be linked to either.)
This isn't an easy problem. You are doing all of the basics to see if something jumps out. It sounds like there is a slow leak that builds up over time to the point where the JVM can't operate; that would look like GC thrashing and the app becoming unresponsive. It could also be a runaway background job eating the CPU and never completing, which might explain the long delay. You could try turning off any Quartz jobs to see if the app stays up longer, which might lead you in a direction, or crank them up so the problem shows up sooner.
I know you've done some jconsole watching, but I think you need to revisit it and watch your memory usage, the threads' run time, how much time you're spending in GC, and which portions of memory are being eaten up (is it Eden or the tenured generation that's running out?).
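One cheap way to see which threads are actually burning the CPU, without pulling in a full profiler yet, is the standard ThreadMXBean; here is a sketch you could run inside the affected JVM (the output format is arbitrary, and getThreadCpuTime returns -1 if thread CPU time measurement isn't supported or enabled):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDump {
  public static void main(String[] args) {
    ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
    for (long id : tmx.getAllThreadIds()) {
      ThreadInfo info = tmx.getThreadInfo(id);
      long cpuNanos = tmx.getThreadCpuTime(id);   // -1 if not supported/enabled
      if (info != null && cpuNanos > 0) {
        System.out.printf("%-40s %8d ms%n",
            info.getThreadName(), cpuNanos / 1000000L);
      }
    }
  }
}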
I'd make sure you are writing out start and end messages for your background jobs running in Quartz. Then you can correlate when they start and finish with when this problem starts. Also will tell you if your jobs are finishing or not.
It's probably time to drop it into a profiler (instead of jconsole) so you can see where in the code it's spending time or what's blowing up memory. A real profiler will let you see all that data mashed up on your code and classes. My favorite is JProfiler, but YourKit is also good. You can get a 7-30 day trial, so you'll have plenty of time to profile and figure out your issue without having to buy it.
Start this early in the morning so you'll hopefully see something by early night.
