Can the JVM recover from an OutOfMemoryError without a restart - java

Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?
Do the various JVM implementations differ in this aspect?
My question is about the JVM recovering and not the user program trying to recover by catching the error. In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it? Or can I let it run if further requests seem to work without a problem.

It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:
There really may not be enough memory to do the requested tasks, even after taking recovery steps such as releasing a block of reserved memory. In that situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.
The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.
If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.
Suppose that a thread synchronizes with other threads using notify/wait or some higher-level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifies (etc.) that never come. Designing for this could make the application significantly more complicated.
In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
See also my answer to a related question:
EDIT - in response to this followup question:
In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it?
No you don't have to restart. But it is probably wise to, especially if you don't have a good / automated way of checking that the service is running correctly.
The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)
EDIT 2
In response to this comment:
"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?
Yes really! Consider this:
Thread #1 runs this:
synchronized (lock) {
    while (!someCondition) {
        lock.wait();
    }
}
// ...
Thread #2 runs this:
synchronized (lock) {
    // do something
    lock.notify();
}
If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do something section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!
If not, the code run by the thread is not exception safe, which is a more general problem.
"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.
You'd need some mechanism whereby the failure of Thread #2 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #1. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!
(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)

The JVM will run the GC when it's on the edge of an OutOfMemoryError. If the GC doesn't free enough memory, the JVM throws the OOME.
You can, however, catch it and if necessary take an alternative path. Any objects allocated inside the try block become eligible for GC.
Since the OOME is "just" an Error which you could just catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
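As a purely illustrative sketch (not from the original answer), this is roughly what catching the error around one oversized allocation and falling back can look like; the class name, sizes and fallback are all made up for the example:

// Sketch only: catch OutOfMemoryError around a single large allocation and fall back.
// The 500_000_000-element array is simply chosen to exceed a typical default heap.
public class OomeFallback {

    static int[] allocateHistogram() {
        try {
            return new int[500_000_000];          // the "fast path" needing roughly 2 GB
        } catch (OutOfMemoryError e) {
            // The failed allocation is unreachable here, so whatever was allocated
            // inside the try block is eligible for GC before we retry smaller.
            System.err.println("Large allocation failed, using a small buffer instead");
            return new int[1_000_000];
        }
    }

    public static void main(String[] args) {
        System.out.println("Allocated " + allocateHistogram().length + " slots");
    }
}

Whether the fallback actually succeeds still depends on how much free heap is left, which is exactly the caveat raised in the first answer.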
See also:
Catching java.lang.OutOfMemoryError
Is it possible to catch out of memory exception in java?

I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, with more memory if possible (or a more efficient app). However, I've seen a fair number of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.

Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...
It depends on a lot of things. For example, if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may let objects be cleaned up in a later GC round, leaving open chunks big enough for your purposes. In that situation you may be able to restart, so it's probably worth at least retrying once. But...
You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.

You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to your recovery work; when you catch the OOM, null out that pre-allocated reference and you're more likely to have some memory available for your recovery sequence.
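A minimal sketch of that trick, assuming it is acceptable to burn a few megabytes as a reserve (the field name and sizes are illustrative only):

// Sketch of the pre-allocated "rescue buffer" trick described above.
public class RescueBuffer {

    // Reserved at startup purely so it can be dropped when an OOME hits.
    private static byte[] reserve = new byte[4 * 1024 * 1024];

    public static void main(String[] args) {
        try {
            byte[][] hog = new byte[1_000_000][];
            for (int i = 0; i < hog.length; i++) {
                hog[i] = new byte[100_000];       // eventually exhausts the heap
            }
        } catch (OutOfMemoryError e) {
            reserve = null;                       // release the reserved block...
            // ...so the logging / cleanup below has some heap to work with.
            System.err.println("Out of memory; running recovery sequence");
        }
    }
}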
I don't know about different JVM implementations.

Any sane JVM will throw an OutOfMemoryError only if there is nothing the garbage collector can do. However, if you catch the OutOfMemoryError early enough up the stack, it is likely enough that whatever caused it has itself become unreachable and been garbage collected (unless the problem is not in the current thread).
For frameworks that run other code, like application servers, attempting to continue in the face of an OOME makes sense (as long as the framework can reasonably release the third-party code). Otherwise, in the general case, recovery should probably consist of bailing out and telling the user why, rather than trying to carry on as if nothing happened.
To answer your newly updated question: there is no reason to think you need to shut down the server if all is working well. My experience with JBoss is that as long as the OOME didn't affect a deployment, things work fine. Sometimes JBoss runs out of PermGen space if you do a lot of hot deployment. Then indeed the situation is hopeless and an immediate restart (which will have to be forced with a kill) is inevitable.
Of course each app server (and deployment scenario) will vary and it is really something learned from experience in each case.

You cannot fully recover a JVM that had an OutOfMemoryError. But at least with the Oracle JVM you can add -XX:OnOutOfMemoryError="cmd args;cmd args" and take recovery actions, like killing the JVM or sending the event somewhere.
Reference: https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

Related

How to handle OutOfMemoryError in multi threading? [duplicate]

We are using heavy multi-threading in a Swing application for extensive calculations. From time to time the application runs into an OOME and cannot create any more native threads. I absolutely understand that the application has to be aware of this and that this is bad design, but it cannot be avoided 100%. The problem is that in such a case the JVM is absolutely lost, because it cannot handle the error, and the system behaves unpredictably. Usually we log every memory error and restart the application via -XX:OnOutOfMemoryError="kill -9 %p", but this does not work for obvious reasons. On the other hand it is a bit frustrating that the JVM has no control any more. So what might be a good way to get around this kind of problem?
PS: I am not looking for a solution like extending the system's process limits or reducing the thread stack size via -Xss. I am looking for a general approach to handling this.
The JVM has perfect control over OutOfMemoryErrors and handles them gracefully; what does not handle them gracefully is your program. You can catch and handle an OutOfMemoryError the same way as any other error, it's just that most programs never do.
To solve your problem you should first try to pinpoint the root cause of those memory errors, for example by logging them or by using performance/memory analysis tools. Forcing a core dump in these cases can also be useful, as it allows you to analyze the root cause at the moment it happened.
In the end, redesigning the application will be necessary to avoid OOM errors by limiting the amount of memory used. This can be done either by testing how many threads the program can gracefully handle and then enforcing that limit, or by checking free memory before creating a new thread. Architectural changes might also help, but you posted no details about the internals, so I can't give any advice here.
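For example, the thread-limit option can be enforced with a bounded pool instead of creating raw threads (a sketch only; the limit of 50 is a placeholder you would determine by testing):

// Sketch: replace unbounded "new Thread(...).start()" calls with a fixed-size pool
// so the application never tries to create more native threads than it can afford.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedWorkers {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(50);

    public static void submitCalculation(Runnable calculation) {
        POOL.submit(calculation);   // queued if all 50 worker threads are busy
    }
}

Excess calculations then wait in the pool's queue instead of failing with "unable to create new native thread".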

What is the best way to handle out of memory conditions in Java?

We have an application that spawns new JVMs and executes code on behalf of our users. Sometimes those run out of memory, and in that case they behave in very different ways. Sometimes they throw an OutOfMemoryError, sometimes they freeze. I can detect the latter with a very lightweight background thread that stops sending heartbeat signals when it runs low on memory. In that case we kill the JVM, but we can never be absolutely sure what the real reason for the missing heartbeat was. (It could just as well have been a network issue or a segmentation fault.)
What is the best way to reliably detect out of memory conditions in a JVM?
In theory, the -XX:OnOutOfMemoryError option looks promising, but it is effectively unusable due to this bug: https://bugs.openjdk.java.net/browse/JDK-8027434
Catching an OutOfMemoryError is actually not a good alternative for well-known reasons (e.g. you never know where it happens), though it does work in many cases.
The cases that remain are those where the JVM freezes and does not throw an OutOfMemoryError. I'm still sure the memory is the reason for this issue.
Are there any alternatives or workarounds? Garbage collection settings to make the JVM terminate itself rather than freezing?
EDIT: I'm in full control of both the forking and the forked JVM as well as the code being executed within those, both are running on Linux, and it's ok to use OS specific utilities if that helps.
The only real option is (unfortunately) to terminate the JVM as soon as possible.
Since you probably can't change all your code to catch the error and respond, and if you don't trust OnOutOfMemoryError (I wonder why it should not use vfork, which is used by Java 8, and it works on Windows), you can at least trigger a heap dump and monitor externally for those files:
java .... -XX:+HeapDumpOnOutOfMemoryError "-XX:OnOutOfMemoryError=kill %p"
After experimenting with this for quite some time, this is the solution that worked for us:
In the spawned JVM, catch an OutOfMemoryError and exit immediately, signalling the out of memory condition with an exit code to the controller JVM.
In the spawned JVM, periodically check the amount of consumed memory of the current Runtime. When the amount of memory used is close to critical, create a flag file that signals the out of memory condition to the controller JVM. If we recover from this condition and exit normally, delete that file before we exit.
After the controlling JVM joins the forked JVM, it checks the exit code generated in step (1) and the flag file generated in step (2). In addition to that, it checks whether the file hs_err_pidXXX.log exists and contains the line "Out of Memory Error". (This file is generated by java in case it crashes.)
Only after implementing all of those checks were we able to handle all cases where the forked JVM ran out of memory. We believe that since then, we have not missed a case where this happened.
The java flag -XX:OnOutOfMemoryError was not used because of the fork problem, and -XX:+HeapDumpOnOutOfMemoryError was not used because a heap dump is more than we need.
The solution is certainly not the most elegant piece of code ever written, but it did the job for us. A rough sketch of steps (1) and (2) follows.
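This is only an illustration of what steps (1) and (2) can look like inside the spawned JVM; the exit code, the 90% threshold and the flag file name are arbitrary choices for the example:

// Sketch of steps (1) and (2) as run inside the spawned JVM.
import java.io.File;
import java.io.IOException;

public class SpawnedMain {

    static final int OOM_EXIT_CODE = 42;                // checked by the controller JVM
    static final File OOM_FLAG = new File("oom.flag");  // also checked by the controller

    public static void main(String[] args) {
        startMemoryWatcher();
        try {
            runUserCode(args);
            OOM_FLAG.delete();                           // normal exit: remove the flag again
        } catch (OutOfMemoryError e) {
            Runtime.getRuntime().halt(OOM_EXIT_CODE);    // step (1): exit immediately
        }
    }

    // Step (2): periodically check consumed memory and drop a flag file when it gets critical.
    static void startMemoryWatcher() {
        Thread watcher = new Thread(() -> {
            Runtime rt = Runtime.getRuntime();
            while (true) {
                long used = rt.totalMemory() - rt.freeMemory();
                if (used > rt.maxMemory() * 0.9) {
                    try {
                        OOM_FLAG.createNewFile();
                    } catch (IOException ignored) {
                        // nothing sensible to do here
                    }
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        watcher.setDaemon(true);
        watcher.start();
    }

    static void runUserCode(String[] args) {
        // placeholder for the code executed on behalf of the users
    }
}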
In case you do have control over both the application and the configuration, the best solution would be to find the underlying cause of the OutOfMemoryError and fix it, instead of trying to hide the symptoms either by catching the error or by just restarting JVMs.
From what you describe, it definitely looks like the application running on the JVM either is leaking memory, is running with under-provisioned resources (memory in your case), or occasionally processes transactions that require abnormally large chunks of heap. The solutions for those cases are different:
In case of a memory leak, find the underlying cause and have engineers fix it. Tools for this include heap dump analyzers, profilers and leak detectors.
In case of under-provisioned resources, monitor the application's memory consumption, for example via garbage collection logs (example flags after this list), and adjust the sizes of the different memory pools based on what you see.
In case of surge allocations during user transactions, trace down the code causing the surge and have engineers fix it, for example by disabling certain user inputs or by loading and processing the data in smaller batches. Either thread dumps or heap dumps from the processes can guide you towards the solution.
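For the GC-log route, the classic HotSpot flags of that era look roughly like this (Java 9 and later replace them with -Xlog:gc); MyApp and the heap sizes are placeholders:

java -Xms512m -Xmx512m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyApp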

Thread status TimedWait. How to debug?

My application runs some complex threads that fetch maps in a background thread and draw them. Sometimes, if I run the app for a couple of hours on a slow network, I seem to get it into a weird state where all my threads' statuses show TimedWait or Wait (except the ones that are Native, such as main).
What is the cause of this? How can I debug it? I am absolutely lost, and I know this is a bit of a general question, but I would appreciate it if someone could point me in the right direction. E.g.:
How to pinpoint the cause of the problem.
What kind of issues generally cause all the threads to lock up?
Anybody seen anything similar?
Thanks
A timed wait is simply a thread which is blocked on some O/S level call which has a timeout specified, such as a simple wait primitive (Object.wait()) or a socket operation (Socket read()/write()), a thread queue etc. It's quite normal for any complex program to have several or many of these - I have an application server which routinely has hundreds, even thousands.
Your threads may be backing up on non-responsive connections and may not be misbehaving at all, per se. It may simply be that you need to program them to detect and abort an idle connection.
Click on each of the threads which you are concerned about and analyze their stack trace for how they got there.
Most decent profiling tools (and application containers) will have the option of printing a full stack trace, and more modern ones will do a dead-lock and live-lock analysis for you. The JVisualVM tool distributed with Sun's JDK and available on the net as VisualVM will do this and it's very effective. Most decent profilers will also show lock acquisition in the stack trace (yours, above, is not in that view).
Otherwise, you are looking for two or more threads contending for the same lock or acquiring the same locks in a different order. You may need to do this manually by actually examining the source and annotating your stack trace, but you should be able to whittle down likely candidates if your tool doesn't point right to the conflicting threads.
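If you cannot attach a profiler to the running app, one quick-and-dirty alternative (just a sketch) is to dump every live thread's state and stack trace from inside the JVM yourself:

// Sketch: print the state and stack trace of every live thread, roughly what a
// profiler's thread view (or jstack) would show.
import java.util.Map;

public class ThreadDump {
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.println(t.getName() + " [" + t.getState() + "]");
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}

Equivalently, running jstack against the process (or sending it SIGQUIT on Unix) prints the same kind of dump, including lock ownership.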

java first encounter with heap space error server data logger

I built my first Java program which is built on top of the Interactive Brokers Java API. That may or may not be important. I just extended the main API classes with a couple new classes.
The program is making data queries to a remote server. When the server responds, I log the received data to a local MySQL data base. Once the program finishes logging the data, the program will make the next data request.
I am having a problem after leaving the program running for some time, after making a couple hundred server requests. I will see this error, then the program doesn't continue to execute:
java.lang.OutOfMemoryError: Java heap space
I did some research, and from what I read, I conclude that the program is creating many new variables, and not destroying old worthless ones. Since I am using Netbeans for development, I used the Netbeans profiler to inspect if this was the case. See the picture here:
After running the program for quite some time, more and more of the memory is used up by Byte. So it seems that my theory is still true.
I don't really know where to go from here. There is no reference to a class or specific variable, just a variable type. How can I pinpoint where the problem is coming from?
UPDATE
I corrected a specific problem that was mentioned by BigMike in the comments. Previously, I was creating many Statements with the JDBC MySQL Connector API and calling .execute() to execute them, but I wasn't closing each statement with .close().
I made sure to add the statement.close() call after each execution, and the program runs much better now. Judging by the program's RAM usage, this seems to have solved the problem. I am also not seeing the Java heap space error anymore, which is nice.
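For reference, the close-after-use pattern described in this update looks roughly like the following; the SQL and table are placeholders, and try-with-resources needs Java 7+ (a finally block calling close() works the same way on older versions):

// Sketch of the fix from the update: the statement is closed even if execution throws.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TickLogger {
    static void logRow(Connection connection, long timestamp, double price) throws SQLException {
        String sql = "INSERT INTO ticks (ts, price) VALUES (?, ?)";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, timestamp);
            statement.setDouble(2, price);
            statement.execute();
        } // statement.close() happens here automatically
    }
}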
Thanks!
It's very hard to say what might be wrong from that alone.
It might have to do with Streams that you are opening that aren't being closed when you no longer need them.
Double check methods that allocate resources (reading from files, database, etc), especially if they read data into streams, and make sure you close those streams in a finally clause.
Apart from that, you can try and profile what methods are being called more often, etc, to try and narrow down the problem to a specific part of your code.
I found a site with a reasonable explanation of how Garbage Collection works, and what can cause OutOfMemoryErrors:
http://www.kdgregory.com/index.php?page=java.outOfMemory
If you read through that, there's a specific reference to high allocation of Object[] and byte[], that might point you in the right direction.
Generally speaking, this comes about for one of two reasons:
There is a memory leak in the application, such that the application fails to release items for garbage collection, leading to the JVM running out of memory over time.
The application attempted a one-off operation that would require more memory than is available, leading to the JVM running out of memory due to the operation.
Since your output seems to indicate that the bulk of the memory is consumed by literally a million-plus small byte arrays, my guess is that #1 is probably the culprit; however, to verify this, restart your application and watch its memory consumption over time. It will bounce up and down, but really you only need to watch the trend of consumption. If the average consumption continues to climb over time, you have a memory leak.
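One low-tech way to watch that trend (a sketch; the one-minute interval and the plain stdout output are placeholder choices) is to log used heap on a timer and graph the resulting log:

// Sketch: log used heap once a minute so the long-term trend can be graphed.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapTrendLogger {
    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println(System.currentTimeMillis() + " used=" + usedMb + "MB");
        }, 0, 1, TimeUnit.MINUTES);
    }
}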
To solve this issue, you typically need the source code, and need to find the parts of the code where the troubling objects are being created, used, and then "stored" far beyond the last time that they will ever be used. The solution is to correct the code to no longer store them. HashMaps, Lists, and other Collections are often accomplices in memory leak problems.
If you lack the source code, you can attempt to measure the trend of the memory consumption, and schedule shutdowns and restarts of the application to effectively "reset the clock" such that you choose your downtime instead of watching the application choose it for you.
If it is a one-off operation (not likely considering your data) then you won't see an upward trend in memory consumption until the triggering event occurs. In such a case, with access to the source code, you should protect your application from processing data that grows very far outside of normal operating parameters. For example, reading a message from the network typically takes only a few KB, but in exceptional circumstances a client might transmit forever. In such a case, kill the message processing and throw the message away with an error if you exceed a maximum message size limit of 10 MB.
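A sketch of that kind of guard; the 10 MB cap, the stream type and the exception are all illustrative choices:

// Sketch: abort and discard a message that grows past a hard cap, instead of letting
// one runaway client exhaust the heap.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedMessageReader {
    static final int MAX_MESSAGE_BYTES = 10 * 1024 * 1024;

    static byte[] readMessage(InputStream in) throws IOException {
        ByteArrayOutputStream message = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (message.size() + n > MAX_MESSAGE_BYTES) {
                throw new IOException("Message exceeds " + MAX_MESSAGE_BYTES + " bytes; discarded");
            }
            message.write(chunk, 0, n);
        }
        return message.toByteArray();
    }
}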
Without access to the source code in the latter scenario, the only hope is to identify the incoming upset, hunt down the source of the errant transmission, and attempt to manipulate it to prevent the overload of output.
The variations on how to approach these techniques are vast, but now you have a few ideas.

Handling Errors (e.g. OutOfMemoryError) within servers

What is the best practice when dealing with Errors within a server application?
In particular, how do you think an application should handle errors like OutOfMemoryError?
I'm particularly interested in Java applications running within Tomcat, but I think that is a more general problem.
The reason I'm asking is that I am reviewing a web application that frequently throws OOMEs, but usually just logs them and then proceeds with execution. That, obviously, results in more OOMEs.
While that is certainly bad practice in my opinion, I'm not entirely sure that stopping the server would be the best solution.
There is not much you can do to fix an OutOfMemoryError except clean up the code and adjust the JVM memory settings (but if you have a leak somewhere, that's just a band-aid).
If you don't have access to the source code and/or are not willing to fix it, an external solution is to use some sort of watchdog program that monitors the Java application and restarts it automatically when it detects OOMEs. Here is a link to one such program.
Of course it assumes that the application will survive restarts.
The application shouldn't handle OOM at all - that should be the server's responsibility.
Next step: Check if memory settings are appropriate. If they aren't, fix them; if they are, fix the application. :)
Well, if you have an OOME then the best approach would be to release as many resources (especially cached ones) as possible. Restarting the web app (in case it's the web app's fault) or the web server itself (in case something else in the server causes this) would do for recovering from this state. On the development front, though, it'd be good to profile the app and see what is taking up the space; sometimes there are resources attached to a class variable and hence never collected, sometimes something else. In the past we had problems where Tomcat wouldn't release the classes of previous versions of the same app when you replaced the app with a newer version. We somewhat solved the problem by nulling out class variables or refactoring not to use them at all, but some leaks still remained.
An OutOfMemoryError is by no means always unrecoverable - it may well be the result of a single bad request, and depending on the app's structure it may just abandon processing the request and continue processing others without any problems.
So if your architecture supports it, catch the Error at a point where you have a chance to stop doing what caused it and continue doing something else - for an app server, this would be at the point that dispatches requests to individual app instances.
Of course, you should also make sure that this does not go unnoticed and a real fix can be implemented as soon as possible, so the app should log the error AND send out some sort of warning (e.g. email, but preferably something harder to ignore or get lost). If something goes wrong during that, then shutting down is the only sensible thing left to do.
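As a rough sketch of what that can look like at the dispatch point, with every type and the alerting hook being placeholders rather than any particular app server's API:

// Sketch: catch the Error where requests are dispatched, fail only that request,
// and make sure the failure cannot go unnoticed.
public class Dispatcher {

    interface Handler { void handle(String request) throws Exception; }
    interface Alerter { void raise(String subject, Throwable cause); }

    private final Alerter alerter;

    Dispatcher(Alerter alerter) {
        this.alerter = alerter;
    }

    void dispatch(String request, Handler handler) {
        try {
            handler.handle(request);
        } catch (OutOfMemoryError e) {
            // Abandon only this request; log AND alert so a real fix follows quickly.
            System.err.println("OutOfMemoryError while handling " + request);
            alerter.raise("OutOfMemoryError in app server", e);
        } catch (Exception e) {
            System.err.println("Request failed: " + request);
        }
    }
}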
@Michael Borgwardt, you can't recover from an OutOfMemoryError in Java. For other errors it might not stop the application, but an OutOfMemoryError literally hangs applications.
In our application, which deals heavily with documents, we do catch OOM errors where one bad request can result in an OOM, but we don't want to bring down the application because of this. We catch the OOM and log it.
Not sure if this is best practice, but it seems to be working.
I'm not an expert in such things, but I'll take a chance to give my vague opinion on this problem.
Generally, I think there are two main ways:
The server is stopped.
The server stays alive but degrades gracefully, reducing throughput and memory consumption. For this case the application must have an appropriate architecture, I think.
According to the Javadoc for java.lang.Error:
An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a "normal" condition, is also a subclass of Error because most applications should not try to catch it.
A method is not required to declare in its throws clause any subclasses of Error that might be thrown during the execution of the method but not caught, since these errors are abnormal conditions that should never occur.
So, the best practice when dealing with subclasses of Error is to fix the problem that is causing them, not to "handle" them. As it's clearly stated, they should never occur.
In the case of an OutOfMemoryError, maybe you have a process that consumes lots of memory (e.g. generating reports) and your JVM is not well sized, maybe you have a memory leak somewhere in your application, etc. Whatever it is, find the problem and fix it, don't handle it.
I strongly disagree with the notion that you should never handle an OutOfMemoryError.
Yes, it tends to be unrecoverable most of the time. However, one of my servers got one a few days ago and was still mostly working for more than an hour and a half. Nobody complained, so I didn't notice until my monitoring software detected a failure an hour and a half after the first OutOfMemoryError. I need to know as soon as possible when there is an OutOfMemoryError on my server. I need to handle it so that I can set up a notification and know to restart my server ASAP.
I'm still trying to figure out how to get Tomcat to do something when it gets an Error. error-page doesn't seem to be working for it.
