App hangs on futex_wait_queue_me() every a couple of minutes

App hangs on futex_wait_queue_me() every a couple of minutes - java

I am running a simple script in Groovy on an Ubuntu 11.10 machine, which takes key/value pairs and adds them to a JDBM map in a loop. Every ~3 minutes the script hangs for a couple of minutes and then resumes. When I look at the resource monitor I see that there is no CPU or Memory activity and the process is in futex_wait_queue_me().
Please suggest means to overcome this, on a Windows machine by the way the application runs without the hangs.
Could this be an OS issue? (found many similar threads about similar futex_wait_queue_me() problems in Ubuntu0
Thanks

Please check the version of the kernel. I ran into a similar problem (java and other multithreaded applications) on Centos6 and upgrading the kernel to version 2.6.32-504.16.2.el6.x86_64 solved the issue.
See the centos bug report: https://bugs.centos.org/view.php?id=8703 which contains this pointer to an explanation of the problem:
https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0 [^]
My stacktrace was:
cat /proc/23199/stack
[<ffffffff810b226a>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810b33a0>] futex_wait+0x1c0/0x310
[<ffffffff810b4c91>] do_futex+0x121/0xae0
[<ffffffff810b56cb>] sys_futex+0x7b/0x170
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

For anyone interested I used those parameters when running java:
-Xms16384M -Xmx16384M
You can find additional GC optimization tips at http://randomlyrr.blogspot.it/2012/03/java-tuning-in-nutshell-part-1.html

Related

How can I debug a non-responsive server, when the profiler can't collect samples?

I have been having occasional problems with a server I wrote. It's in Clojure, but I don't think that matters, and we can pretend it's in Java. Anyway, it works fine for hours at a time, but goes into fits where it behaves very badly: all activity stops, for around fifteen seconds, and then it works normally for a few seconds, then stops for fifteen seconds...and so on for (usually) about ten minutes or so, after which it goes back to behaving normally.
I've done a lot of profiling of it with YourKit, and I've ruled out a number of plausible suspects:
It's not a garbage collection issue: I'm running it with -XX:+UseConcMarkSweepGC, and I've verified that the server continues to run just fine during both minor and major collections, due to the concurrent nature of this garbage collector. And we're not thrashing as we run out of total memory or something: the current heap size is well below its max.
I don't think it's a locking/synchronization issue, but I'm not 100% sure on that. The YourKit profiler shows threads waiting sometimes, eg competing over the lock for System.out to produce log messages, but the only long waits are for worker threads in threadpools when there's nothing to do. And of course YourKit says it's never detected any deadlocks.
It's not something caused by having the profiler attached, because it still happens even if I boot the server up and then leave it alone without ever attaching the profiler.
It's not some other process on the system taking up all the CPU time: top shows CPU usage at 100% for my java process, and basically 0% for everything else.
My biggest problem is that I can't see what the server is doing during these strange funks, because the profiler stops receiving samples. Here's a graph of the CPU usage chart:
The left side of the graph is normal operation, during which we get profiler samples every second or so. The right side is "broken", and is very spiky because the profiler is only getting samples every ten seconds or so. In the samples it does get, the server seems to be doing its usual business: responding to requests and so on; and the logs confirm that it is doing normal stuff, but only at the times the profiler has samples for: during the upward-sloping "straight lines" on the graph, for which the profiler has no samples, the server is doing nothing at all.
So, does this graph look familiar to anyone? Have you had this problem before and fixed it? Or can you point me in the direction of a tool that can figure out what my server is doing during the time when YourKit can't? In case it matters, the server machine is running Ubuntu 10.04, and
$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.10) (rhel-1.28.1.10.10.el5_8-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

Okay, from the comments it seems clear to me we are not going to be able to figure this out with the information you've given so far. The best we can do is to give suggestions on how to debug it...
I would try to use jstack during one of the spikes and see if you can use that to figure out where it hangs.

If you have no chance to measure or debug in code try to look form the outside.
I would at first to try to reproduce the problem. In other words is there a external event that produce the behavior. Try to change the load on server. Switch every thing you can to reproduce the problem.
Maybe it's also a good idea to sniff the network traffic (tcpdump) to find something interesting around the time when you server hangs.
You can also run it on another operating system to check if it depends from your installation environment.
If you can't reproduce a situation where the problem occurs, try to find situations where you don't get the problem. For instance remove the server from net. Shutdown all other services.
If you can't find with that any change of behavior of your program try to reduce the complexity of your working code and see if you can find a internal module that seems to be related with the problem.

Have you had this problem before and fixed it? Or can you point me in
the direction of a tool that can figure out what my server is doing
during the time when YourKit can't?
If you have shell access on the server and can see stdout, try taking a thread dump when the server becomes unresponsive. Not sure if this will give you anything different than what jstack (mentioned in the other answer) would give you or not.
On Ubuntu: kill -QUIT <java-pid> (will not actually kill the Java process).
http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx

Slow Oracle connection on Fedora virtual machine

We have a Java application running on Solaris which makes a connection to Oracle and checks the database for work to perform, and it runs just fine. We tried running the same code on a standalone Fedora system, its performance is good too. However, when we move it to its home on a Fedora VMWare virtual machine, it can take upwards of five minutes for the application to make the connection to the database. It ultimately DOES make the connection - it's just snail-slow. We suspect it's a configuration issue somewhere but can't find it. So far as we can tell, the two Fedora boxes have nearly identical configurations. Has anyone run into this problem before? If so, how did you get around it?
Thanks in advance for your help.
Mike Preston

Found it! When we are running under Solaris, we are running a 32-bit JVM with 32-bit extensions. We are executing through a Korn shell script and had an added -d64 flag to coerce 64-bit processing. On the Linux boxes we removed the -d64 flag from the shell script and everybody's happy. Thanks Alex for your thoughts and assistance.

Here is the solution which settled the issue...our headless development server was only occasionally getting any keyboard activity to fill the entropy pool (please read the article - I won't try to explain it here) and I assume it was blocking until there was enough "noise" to generate the requisite random numbers.Since there is only one other developer working on the system, it might take a couple of minutes to fill the buffer. Once the buffer was full it went ahead and executed the connection as expected. That also explains why we sometimes would see crisp performance followed by slow. In a nutshell, we added the string "-Djava.security.egd=file:///dev/urandom" to the Korn shell script between the call to java and the jar file name and now it works like a champ. Here's the full command string:
/usr/bin/java -Xms64m -Xmx1024m -Djava.security.egd=file:///dev/urandom -jar $1 $2 $PID
If you DO read the article, be sure to read the comments below. One of them is really funny!

Troubleshooting Java process with very high CPU usage - Tomcat application

I have a java application that runs on Tomcat (which runs as a service on Windows), the java process for which continues to eat up CPU before eventually requiring me to restart the Tomcat service.
First my setup:
Windows 2003 server
Tomcat 6, running as service using Wrapper
JDK: 1.6.0_20
I was seeing catch issues here and there leading up to yesterday. I had to restart midday yesterday, then at 2:30 this morning, then today I could barely restart the application and open jconsole to monitor it before it was hitting 99% CPU usage again. Through a combination of things I'm not quite sure of, it seems like I got the JVM to cycle itself and the app was hovering in the 10-30% CPU usage range for a couple hours. However, then it started to creep up again, finally going into its 99% CPU usage breakdown. I was also having trouble with high memory usage, but that has stayed fairly normal and steady since I so-called got the JVM to "cycle" (bad terminology perhaps, but this is really what it seemed to do - and in the wrapper log there was a dump of all the classes it was reloading after).
Then I was digging around some more and found a JRE 6 Update 24 installed on the server (I didn't install it as I do thorough testing with each java update - but maybe my server admin did the update). I attempted, but can't uninstall this. Thus, I get different versions when I do a java -version versus javac -version
java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
javac -version
javac 1.6.0_20
Could this difference be causing a JVM conflict of sorts? JAVA_HOME and my PATH variables both point to the correct JDK installation.
Hoping for more stability, I decided to change my app to run on the previous JDK that was still installed - JDK 1.6.0_04. I changed the wrapper.conf, set env variables, cleaned and rebuilt, and started. This does seem more stable and has been up for about 4 hours. The CPU usage has climbed to the 90s, then it seems to clear itself out again.
I've done heapdumps then ran them through the Memory Analyzer in Eclipse (nothing new found there), I've used jconsole with jtop to look at threads - nothing jumps out, thus why I continue to be curious if it's a java/jvm issue. So, I know this is a long post - but I don't really know where to go from here. Any ideas?
(I've done exhaustive web searching on this and some articles have pointed to possibly a Quartz issue or Hibernate queries not flushing. Nothing has changed in the app since I started seeing the CPU issues, so I'm not sure where to start troubleshooting if it could indeed be linked to either.)

This isn't an easy problem. You are doing all of the basics to see if it something jumps out. It sounds like there is either a slow leak that builds up over time to the point where it can't operate. That sounds like GC is thrashing and app comes unresponsive. It could also be runaway background job(s) eating on the CPU and just doesn't complete, that might explain the long delay. You could try turning off any quartz to see if it stays up longer that might help lead you in a direction, or crank it up so it shows up sooner.
I know you've done some jconsole watching, but I think you need to revisit and watch your memory usage, the threads run time, how much time you're spending in GC, and watching what portions of memory are being eaten up (is it Eden, Tenure that's running out?).
I'd make sure you are writing out start and end messages for your background jobs running in Quartz. Then you can correlate when they start and finish with when this problem starts. Also will tell you if your jobs are finishing or not.
It's probably time to drop it into a profiler (instead of jconsole) so you can see where in the code it's spending time or what's blowing up memory. A real profiler will let you see all that data mashed up on your code and classes. My favorites is JProfiler, but YourKit is also good. You can get a 7-30 day trial so you'll have plenty of time to profile and figure your issue out without having to buy it.
Start this early in the morning so you'll hopefully see something by early night.

Invalid memory access of location in Java

I've been working on a Java project for year. My code had been working fine for months. A few days ago I upgraded the Java SDK to the newest version 1.6.0_26 on my Mac (Snow Leopard 10.6.8). After the upgrade, something very weird happens. When I run some of the classes, I get this error:
Invalid memory access of location 0x202 rip=0x202
But, if I run them with -Xint (interpreted) they work, slow but work fine. I get that problem in classes where I use bitwise operators (bitboards for the game Othello). I can't put any code here because I don't get an error, exception or something similar. I just get that annoying message.
Is it normal that the code doesn't run without -Xint but it works with it? What should I do?
Thanks in advance

When a JVM starts crashing like that, it is a sign that something has broken the JVM's execution model.
Does your application include any native code? Does it use any 3rd-party libraries with native code components? If neither is true, then the chances are that this is a bug in the Apple port of the JVM. It could be a JIT compiler bug, or a bug in some JVM native code library.
What can you do about a bug like that?
Not a lot.
Reduce your application by progressively chopping out bits until you have a small testcase that exhibits the problem.
Based on the testcase, see if there's some empirical way to avoid the problem.
Submit a bug report to Apple with the testcase.

I just came across this situation and it turned out to be related to a piece of code that was serializing a JSON object with a cyclic reference to itself. I removed the cycle and the error went away. I suspect this is related to a memory overflow error that is now handled differently by newer JVMs on Mac OSX. In this case, I was running Mac OSX 10.7.
For completeness the errors I was receiving were:
Invalid access of stack red zone 0x10e586d30 rip=0x10daabba6
Bus error: 10
And:
Invalid memory access of location 0x10b655890 rip=0x10a8baba6
Segmentation fault: 11

Also verify that you are building the GUI on the event dispatch thread and never updating a GUI component from any other thread.
Related errors are notoriously hard to reproduce, but the change associated with altered timing is suggestive.

Please check if /etc/hosts is empty and verify that it include these configurations ：
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost

Eclipse 3.6 frequently stalls during Content Assist

The auto complete stalls so frequently and for so long, I quit using it altogether.

I've had success with the following using Eclipse (Classic) 3.6.1 on Windows 7 x64.
"A workaround, until the fix is released in 3.6.2 is summarized here: http://groups.google.com/group/android-developers/msg/0f9d2a852e661cba"
(copied for convenience)
"You can replace your /plugins/
org.eclipse.jdt.core_3.6.1.v_A68_R36x.jar plugin with one from
http://www.google.com/url?q=http://adt-addons.googlecode.com/svn/patches/org.eclipse.jdt.core_3.6.1.v_A68_R36x.zip&ei=vg5aTf2RIMrUgAeI-qTvDA&sa=X&oi=unauthorizedredirect&ct=targetlink&ust=1297749446528273&usg=AFQjCNFv7FGlTrnoVhRGE35JPjHxOwI_Bw
and restart Eclipse. Content Assists will be much better. Just try it.
Don't forget backup your original plugins. "

This solved part of my problem.
In preferences, I defaulted all the 'Java->Editor->Content assist' screens and the performance is much improved. Any lag I have now is due to system speed and is negligible. I've gone from minutes to seconds building the suggestion list.
UPDATE: This didn't completely solve my problem, but it got me close. The search continues...
UPDATE: I'm developing in Java for Android using the default packages that are included and any that might have come down during a update(in retrospect, maybe choosing update all in the SDk update might not have been wise). The timing is fairly consistent online and offline. I did a few tests and found the following:
Startup Eclipse and enter a line of code that can use a .toString(). Typing the '.' populates the auto complete within 2-3 seconds. Type a 't' and it takes 70-75 seconds. After that, 10 seconds. Diff objects do the same thing(75 the first time, 10 after that). It's the filtering process that appears to stall. My CPU does not max, Memory is OK, but the program will go not responding till it's done. Any typeahead gets cached and eventually filters the list when Eclipse starts responding.

For me the problem went away when I increased the memory for the vm.
Put this in your eclipse.ini:
-Xms512m
-Xmx1024m

on my 4GB Windows Vista system this would happen A LOT !! (as well as debug issues when looking up variables).
This all went away after I built my new PC with 8GB RAM. I can now run 4 emulators simultaneously and it doesn't have any debug problems any more either. Auto complete with huge lists also works just fine.
it would seem to be just an issue with how much RAM you've got.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.