My application runs some complex threads that fetch maps in a background thread and draw them. Sometimes, if I run the app for a couple of hours on a slow network, I seem to get it into a weird state where all my threads' statuses show TimedWait or Wait (except the ones that are Native, such as main).
What is the cause of this? How can I debug it? I am absolutely lost, and I know this is a bit of a general question, but I would appreciate it if someone could point me in the right direction, e.g.:
How to pinpoint the cause of the problem.
What kind of issues generally cause all the threads to lock up?
Anybody seen anything similar?
Thanks
A timed wait is simply a thread blocked on some OS-level call which has a timeout specified, such as a simple wait primitive (Object.wait()), a socket operation (Socket read()/write()), a thread queue, etc. It's quite normal for any complex program to have several or many of these - I have an application server which routinely has hundreds, even thousands.
Your threads may be backing up on non-responsive connections and may not be misbehaving at all, per se. It may simply be that you need to program them to detect and abort an idle connection.
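If the map fetches go over plain sockets or HttpURLConnection, connect and read timeouts are the usual way to keep a worker thread from parking forever on a dead connection. A minimal sketch, assuming an HTTP tile fetch (the class name, URL and timeout values are made up, not from your code):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: TileFetch and the timeout values are illustrative.
public class TileFetch {
    static byte[] fetch(String tileUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(tileUrl).openConnection();
        conn.setConnectTimeout(5_000);   // give up if the server never accepts the connection
        conn.setReadTimeout(15_000);     // give up if the server accepts but stops sending data
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();    // a SocketTimeoutException propagates if either timeout fires
        } finally {
            conn.disconnect();
        }
    }
}

Either timeout surfaces as a SocketTimeoutException (an IOException) in the fetching thread, which can then retry or drop the request instead of blocking indefinitely.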
Click on each of the threads you are concerned about and analyze its stack trace to see how it got there.
Most decent profiling tools (and application containers) will have the option of printing a full stack trace, and more modern ones will do a dead-lock and live-lock analysis for you. The JVisualVM tool distributed with Sun's JDK and available on the net as VisualVM will do this and it's very effective. Most decent profilers will also show lock acquisition in the stack trace (yours, above, is not in that view).
Otherwise, you are looking for two or more threads contending for the same lock or acquiring the same locks in a different order. You may need to do this manually by actually examining the source and annotating your stack trace, but you should be able to whittle down likely candidates if your tool doesn't point right to the conflicting threads.
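If your tool doesn't do that analysis for you, the JVM can also report monitor deadlocks itself through ThreadMXBean. A small sketch you could run from a monitoring thread (standard java.lang.management API, nothing application-specific):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints any deadlocks (monitors or ownable synchronizers) the JVM can detect.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();               // null if no deadlock was found
        if (ids == null) {
            System.out.println("No deadlocked threads found.");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.println(info);                          // name, state, lock owner, stack frames
        }
    }
}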
Related
We have an issue with our server at work and I'm trying to understand what is happening. It's a Java application that runs on a Linux server; the application receives information from a TCP socket, analyses it, and after the analysis writes to the database.
Sometimes there are too many packets and the Java application needs to write to the database many times per second (around 100 to 500 times).
I tried to reproduce the issue on my own computer and looked at how the application works with JProfiler.
The memory always seems to be going up; is it a memory leak? (Sorry, I'm not a Java programmer, I'm a C++ programmer.)
After 133 minutes: [JProfiler memory screenshot]
After 158 minutes: [JProfiler memory screenshot]
I have many locked threads; does this mean that the application was not programmed correctly?
Is it too many connections to the database? (The application uses the BasicDataSource class for a connection pool.)
The program doesn't have a FIFO to manage database writes for the continuous stream of information arriving on the TCP port. My questions are (remember that I'm not a Java programmer and I don't know whether this is the way a Java application should work or whether the program could be written more efficiently):
Do you think something is wrong with the code, i.e. it is not correctly managing writes, reads and updates on the database and consumes too much memory and CPU time, or is this just the way the BasicDataSource class works?
How do you think I can improve this (if you think it's an issue): by creating a FIFO and removing the part of the code that creates too many threads? Or are those not the application's own threads but the BasicDataSource's threads?
There are several areas to dig into, but first I would try and find what is actually blocking the threads in question. I'll assume everything before the app is being looked at as well, so this is from the app down.
I know the graphs show free memory, but they are just points in time, so I can't see a trend. GC logging is available; I haven't used JProfiler much, though, so I am not sure how to point you to it in that tool. I know in DynaTrace I can see GC events and their duration, as well as any other blocking events and their root cause. If this isn't available, there are command-line switches to log GC activity so you can see its duration and frequency. That is one area that could block.
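For reference, on a Java 8 era HotSpot JVM those switches look something like the following (the log path is a placeholder; Java 9+ replaces them with the unified -Xlog:gc* option):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log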
I would also look at how many connections you have in your pool. If there are 100-500 requests/second trying to write and they are stacking up because you don't have enough connections to work them then that could be a problem as well. The image shows all transactions but doesn't speak to the pool size. Transactions blocked with nowhere to go could lead to your memory jumps as well.
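If the pool is Commons DBCP's BasicDataSource, the limits are worth checking explicitly. A rough sketch of the relevant knobs (the values are invented, and the setter names shown are the DBCP 2.x ones; DBCP 1.x uses setMaxActive/setMaxWait instead):

import org.apache.commons.dbcp2.BasicDataSource;

public class PoolConfig {
    // Illustrative values only; tune them against the measured write rate.
    static BasicDataSource createPool() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:mysql://dbhost/mydb");   // placeholder JDBC URL
        ds.setUsername("app");
        ds.setPassword("secret");
        ds.setMaxTotal(50);           // maximum simultaneous connections
        ds.setMaxIdle(10);            // connections kept ready between bursts
        ds.setMaxWaitMillis(2_000);   // fail fast instead of letting callers queue forever
        return ds;
    }
}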
There is also the flip side: your database may not be able to handle the traffic and is pegged, and that is what is blocking the connections, so you would want to monitor that end of things and see if it is a possible cause of the blocking.
There is also the chance that the blocking is occurring in the SQL being run, e.g. waiting for page locks to be released.
Lots of areas to look at, but I would address and verify one layer at a time starting with the app and working down.
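On the FIFO idea from the question: the usual shape is a bounded queue drained by a single writer thread, which also gives you natural back-pressure when the database falls behind. A rough sketch (the Packet type and writeToDatabase method are made-up placeholders):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DbWriterFifo {
    private final BlockingQueue<Packet> queue = new ArrayBlockingQueue<>(10_000); // bounded FIFO

    public void start() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Packet p = queue.take();   // blocks until a packet is available
                    writeToDatabase(p);        // one writer, so DB writes are serialized
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "db-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called by the TCP-receiving threads; blocks (back-pressure) if the queue is full.
    public void submit(Packet p) throws InterruptedException {
        queue.put(p);
    }

    private void writeToDatabase(Packet p) { /* JDBC insert/update goes here */ }

    static class Packet { /* parsed TCP payload */ }
}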
I am using some 3rd party code, which may hang indefinitely in some cases. This leads to hanging threads which keep holding the resources and clogging a thread pool.
Eventually the thread pool becomes full of useless threads and the system effectively fails.
In Java one can't forcefully kill a thread (the equivalent of kill -9). But how, then, do you manage such edge cases?
Obviously fixing the bug would be better; however, alternatives include:
Only run the 3rd party code/library in a sub-process; just killing the thread is unlikely to be enough (see the sketch after this list).
You could hack the 3rd party code to check for interrupts in the sections you find run for too long. You can take a stack trace to find out where this is.
Use Thread.stop(), though it is inherently unsafe and deprecated (the Thread.stop(Throwable) variant was disabled in Java 8).
When you detect there is a hung thread, increase the size of the thread pool by one. This way you still have the correct number of active threads.
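A minimal sketch of the sub-process option from the first point above (the jar, main class and timeout are placeholders): because the 3rd party code runs in its own JVM, a hang can be killed at the operating-system level.

import java.util.concurrent.TimeUnit;

public class IsolatedRunner {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("java", "-cp", "thirdparty.jar", "com.example.ThirdPartyMain")
                .inheritIO()    // share stdout/stderr with the parent for logging
                .start();
        if (!p.waitFor(60, TimeUnit.SECONDS)) {   // treat anything longer than a minute as a hang
            p.destroyForcibly();                  // the OS-level equivalent of kill -9
            System.err.println("Third-party task hung; subprocess killed.");
        }
    }
}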
I am using the Samurai tool to analyze a thread dump. It looks like there are many blocked threads, but I have no clue how to derive anything from the dump.
I have an SQL query in my Java application, which runs on WebLogic, that takes an enormous amount of time to complete. Running this query by clicking the button in my Java application several times hangs my JVM.
Thread dumps can be found at: http://www.megafileupload.com/en/file/379103/biserver2-txt.html
Can you help me understand what the thread dump says?
The amount of data you provide is a bit overwhelming, so let me just give you a hint on how to proceed. For the analysis I use the open-source ThreadLogic application, based on TDA. It takes a few seconds to parse 3 MiB worth of data, and it nicely shows the 22 different stack trace dumps contained in the one file:
Drilling down reveals a really disturbing list of warnings and alerts.
I don't have time to examine all of them, but here is a list of those marked as FATAL (keep in mind that false-positives are also to be expected):
Wait for SLSB Beans
Description: Waiting for Stateless Session Bean (SLSB) instance from the SLSB Free pool
Advice: Beans all in use, free pool size insufficient
DEADLOCK
Description: Circular Lock Dependency Detected leading to Deadlock
Advice: Deadlock detected with circular dependency in locks; blocked threads will not recover without a server restart. Fix the order of locking and/or try to avoid locks or change the order of locking at the code level. Report with an SR for Server/Product Code
Finalizer Thread Blocked
Description: Finalizer Thread Blocked
Advice: Check if the Finalizer Thread is blocked for a lock which can lead to wasted memory waiting to be reclaimed from Finalizer Queue
WLS Unicast Clustering unhealthy
Description: Unicast messaging among Cluster members is not healthy
Advice: Unicast group members are unable to communicate properly, apply latest Unicast related patches and enable Message Ordering or switch to Multicast
WLS Muxer is processing server requests
Description: WLS Muxer is handling subsystem requests
Advice: WLS Server health is unhealthy as some subsystems are overwhelmed with requests, which leads to the Muxer threads handling requests directly instead of dispatching them to the relevant subsystems. There is likely a bug here.
Stuck Thread
Description: Thread is Stuck, request taking very long time to finish
Advice: Check why the thread or call is taking so long. Is it blocked on an unavailable or bad resource, or contending for a lock? Can be ignored if it is doing repeated work in a loop (like adapter threads polling for events in an infinite loop).
The issue was with WLDF logging information to a log file. Once disabled, it improved performance enormously. I am not a fan of ThreadLogic as a tool for thread dump analysis; it reports a circular deadlock whenever you have stuck threads, no matter what the actual issue is.
Thread dumps are a snapshot of all the threads running in the application at a given moment. A thread dump can contain hundreds or thousands of application threads, and it would be hard to scroll through every single line of the stack trace in every single thread. A Call Stack Tree consolidates all the threads' stack traces into one single tree and gives you a single view. It makes thread dump navigation much simpler and easier. Below is a sample call stack tree generated by fastThread.io.
Fig 1: Call stack Tree
You can keep drilling down to see the code execution path. Fig 2 shows the drilled-down version of a particular branch of the Call Stack Tree diagram.
Fig 2: Drilled down Call Stack Tree
Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?
Do the various JVM implementations differ in this aspect?
My question is about the JVM recovering and not the user program trying to recover by catching the error. In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it? Or can I let it run if further requests seem to work without a problem.
It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:
There really may not be enough memory to do the requested tasks, even after taking recovery steps such as releasing a block of reserved memory. In this situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.
The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.
If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.
Suppose that a thread synchronizes with other threads using notify/wait or some higher level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifies (etc) that never come ... for example. Designing for this could make the application significantly more complicated.
In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
See also my answer to a related question:
EDIT - in response to this followup question:
In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it?
No you don't have to restart. But it is probably wise to, especially if you don't have a good / automated way of checking that the service is running correctly.
The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)
EDIT 2
In response to this comment:
"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?
Yes really! Consider this:
Thread #1 runs this:
synchronized (lock) {
    while (!someCondition) {
        lock.wait();
    }
}
// ...
Thread #2 runs this:
synchronized (lock) {
    // do something
    lock.notify();
}
If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do something section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!
If not, then the code run by the thread is not exception safe, which is a more general problem.
"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.
You'd need some mechanism whereby the failure of Thread #1 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #2. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!
(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)
The JVM will run the GC when it is on the edge of an OutOfMemoryError. If the GC doesn't help at all, then the JVM will throw the OOME.
You can, however, catch it and, if necessary, take an alternative path. Any objects allocated inside the try block will become eligible for GC.
Since the OOME is "just" an Error which you could just catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
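A minimal, deliberately artificial illustration of catching the error and taking an alternative path:

public class OomeCatchDemo {
    public static void main(String[] args) {
        int[][] data;
        try {
            data = new int[Integer.MAX_VALUE][];   // absurd allocation, guaranteed to fail
        } catch (OutOfMemoryError e) {
            // The failed allocation never became reachable, so nothing is left behind;
            // fall back to something smaller instead of dying.
            data = new int[1024][];
        }
        System.out.println("Continuing with " + data.length + " buckets");
    }
}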
See also:
Catching java.lang.OutOfMemoryError
Is it possible to catch out of memory exception in java?
I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, and with more memory if possible (or a more efficient app). However, I've seen a fair amount of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.
Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...
It depends on a lot of things. For example if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may have objects cleaned up in a later GC round that leave open chunks big enough for your purposes. In that situation you may be able to restart. It's probably worth at least retrying once as a result. But...
You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.
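For reference, raising the memory ceiling mentioned above is just the -Xmx switch when launching the JVM, e.g. (the jar name is made up):
java -Xms512m -Xmx2g -jar yourserver.jar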
You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to your recovery work; when you catch the OOM, null out that pre-allocated reference, and you're more likely to have some memory available for your recovery sequence.
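A minimal sketch of that pre-allocation trick (the 2 MB reserve size is arbitrary):

public class OomeParachute {
    private static byte[] reserve = new byte[2 * 1024 * 1024];  // pre-allocated at startup

    public static void runWithParachute(Runnable work) {
        try {
            work.run();
        } catch (OutOfMemoryError e) {
            reserve = null;   // hand the reserved block back to the GC
            // ... best-effort logging / cleanup now has a little headroom ...
        }
    }
}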
I don't know about different JVM implementations.
Any sane JVM will throw an OutOfMemoryError only if there is nothing the garbage collector can do. However, if you catch the OutOfMemoryError early enough up the stack, it is quite likely that whatever caused it has itself become unreachable and been garbage collected (unless the problem is not in the current thread).
Generally, for frameworks that run other code, such as application servers, attempting to continue in the face of an OOME makes sense (as long as the framework can reasonably release the third-party code), but otherwise, in the general case, recovery should probably consist of bailing out and telling the user why, rather than trying to go on as if nothing happened.
To answer your newly updated question: there is no reason to think you need to shut down the server if all is working well. My experience with JBoss is that as long as the OOME didn't affect a deployment, things work fine. Sometimes JBoss runs out of PermGen space if you do a lot of hot deployment; then indeed the situation is hopeless, and an immediate restart (which will have to be forced with a kill) is inevitable.
Of course each app server (and deployment scenario) will vary and it is really something learned from experience in each case.
You cannot fully recover a JVM that has had an OutOfMemoryError. At least with the Oracle JVM you can add -XX:OnOutOfMemoryError="cmd args;cmd args" and take recovery actions, like killing the JVM or sending the event somewhere.
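For example, on HotSpot (%p expands to the JVM's process id; the jar name is a placeholder):
java -XX:OnOutOfMemoryError="kill -9 %p" -jar yourserver.jar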
Reference: https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
I'm chasing a production bug that's intermittent enough to be a real bastich to diagnose properly but frequent enough to be a legitimate nuisance for our customers. While I'm waiting for it to happen again on a machine set to spam the logfile with trace output, I'm trying to come up with a theory on what it could be.
Is there any way for competing file read/writes to create what amounts to a deadlock condition? For instance, let's say I have Thread A that occasionally writes to config.xml, and Thread B that occasionally reads from it. Is there a set of circumstances that would cause Thread B to prevent Thread A from proceeding?
My thanks in advance to anybody who helps with this theoretical fishing expedition.
Edit: To answer Pyrolistical's questions: the code isn't using FileLock, and is running on a WinXP machine. Not asked, but probably worth noting: The production machines are running Java 1.5.
Temporarily set up your production process to start with debugging support; add this to however you're starting your Java program, or to, say, the Tomcat startup script:
-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n
Then attach to it:
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=8000
And take a look at your stack(s).
FileLock is an inter-process locking mechanism. It does nothing within the same JVM, so that isn't it. I would look at your synchronizations, and specifically at making sure you always acquire multiple locks in the same order.
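A minimal sketch of the "same order" rule, with made-up lock names: every code path that needs both locks takes them in the same global order, so no cycle can form.

public class OrderedLocks {
    private final Object configLock = new Object();
    private final Object cacheLock  = new Object();

    public void writeConfig() {
        synchronized (configLock) {        // always configLock first ...
            synchronized (cacheLock) {     // ... then cacheLock
                // update config.xml and invalidate the cache
            }
        }
    }

    public void readConfig() {
        synchronized (configLock) {        // same order here, so the two methods cannot deadlock
            synchronized (cacheLock) {
                // read config.xml into the cache
            }
        }
    }
}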
I've gotten some useful tips for chasing the underlying bug, but based on the responses so far, it would seem the correct answer to the actual question is:
No.
Damn. That was anti-climactic.
I know this is old, but to add some clarity on a "No" answer (for those of us who need to know why):
Deadlock happens when two (or more) distinct processes (transactions) update mutually dependent rows or records, but in opposite orders. Basically, both hang waiting for the other to complete an action which will never occur (as each is waiting on the other). This is commonly found in faulty database design.
If I recall, Wikipedia has a good definition here: http://en.wikipedia.org/wiki/Deadlock
Simple file access should not create dependencies like this. A more common issue would be your resource being used by another process and unavailable to the one trying to access it.