What is the best practice when dealing with Errors within a server application?
In particular, how do you think an application should handle errors like OutOfMemoryError?
I'm particularly interested in Java applications running within Tomcat, but I think that is a more general problem.
I'm asking because I am reviewing a web application that frequently throws OOMEs; usually it just logs them and then proceeds with execution, which obviously results in more OOMEs.
While that is certainly bad practice in my opinion, I'm not entirely sure that stopping the server would be the best solution.
There is not much you can do to fix an OutOfMemoryError except clean up the code and adjust the JVM memory settings (and if you have a leak somewhere, the latter is just a band-aid).
If you don't have access to the source code and/or are not willing to fix it, an external solution is to use some sort of watchdog program that monitors the Java application and restarts it automatically when it detects OOMEs. Here is a link to one such program.
Of course it assumes that the application will survive restarts.
The application shouldn't handle OOM at all - that should be the server's responsibility.
Next step: Check if memory settings are appropriate. If they aren't, fix them; if they are, fix the application. :)
Well, if you hit an OOME, the best approach is to release as many resources (especially cached ones) as possible. Restarting the web app (if the web app is at fault) or the web server itself (if something else in the server is responsible) should recover from this state. On the development front, it is worth profiling the app to see what is taking up the space; sometimes there are resources attached to a class variable and hence never collected, sometimes something else. In the past we had problems where Tomcat wouldn't release the classes of previous versions of the same app when we replaced the app with a newer version. We somewhat solved the problem by nulling out class variables or refactoring so as not to use them at all, but some leaks still remained.
An OutOfMemoryError is by no means always unrecoverable - it may well be the result of a single bad request, and depending on the app's structure it may just abandon processing the request and continue processing others without any problems.
So if your architecture supports it, catch the Error at a point where you have a chance to stop doing what caused it and continue doing something else - for an app server, this would be at the point that dispatches requests to individual app instances.
Of course, you should also make sure that this does not go unnoticed and that a real fix can be implemented as soon as possible, so the app should log the error AND send out some sort of warning (e.g. email, but preferably something harder to ignore or to lose track of). If something goes wrong during that, then shutting down is the only sensible thing left to do.
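As a rough sketch of catching at the dispatch level (all names here are hypothetical, not taken from any particular app server), the idea is simply to abandon the one request that blew up and keep the error loudly visible:

import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical dispatcher-level guard: abandon the request that blew up,
// keep serving the others, and make sure the error is loudly reported.
public class RequestDispatcherGuard {

    private static final Logger LOG = Logger.getLogger("dispatcher");

    public void dispatch(Runnable requestHandler) {
        try {
            requestHandler.run();
        } catch (OutOfMemoryError oome) {
            // The failed request is abandoned; other requests may still succeed.
            LOG.log(Level.SEVERE, "OutOfMemoryError while handling a request", oome);
            notifyOperators(oome);
        }
    }

    private void notifyOperators(OutOfMemoryError oome) {
        // Placeholder for an alerting mechanism (email, pager, monitoring system, ...).
    }
}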
@Michael Borgwardt, you can't recover from an OutOfMemoryError in Java. Other errors might not stop the application, but an OutOfMemoryError literally hangs applications.
In our application, which deals heavily with documents, we do catch OOM errors where one bad request can result in an OOM, because we don't want to bring down the application over it. We catch the OOM and log it.
Not sure if this is best practice, but it seems to be working.
I'm not an expert in such things, but I'll take a chance to give my vague opinion on this problem.
Generally, I think there are two main ways:
The server is stopped.
The server degrades gracefully: it reduces throughput and memory consumption but stays alive. For this to work the application must have an appropriate architecture, I think.
According to the javadoc about a java.lang.Error:
An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a "normal" condition, is also a subclass of Error because most applications should not try to catch it.
A method is not required to declare in its throws clause any subclasses of Error that might be thrown during the execution of the method but not caught, since these errors are abnormal conditions that should never occur.
So, the best practice when dealing with subclasses of Error is to fix the problem that is causing them, not to "handle" them. As it's clearly stated, they should never occur.
In the case of an OutOfMemoryError, maybe you have a process that consumes lots of memory (e.g. generating reports) and your JVM is not well sized, maybe you have a memory leak somewhere in your application, etc. Whatever it is, find the problem and fix it, don't handle it.
I strongly disagree with the notion that you should never handle an OutOfMemoryError.
Yes, it tends to be unrecoverable most of the time. However, one of my servers got one a few days ago and the server was still mostly working for more than an hour and a half. Nobody complained, so I didn't notice until my monitoring software reported a failure an hour and a half after the first OutOfMemoryError. I need to know as soon as possible when there is an OutOfMemoryError on my server. I need to handle it so that I can set up a notification and know to restart my server ASAP.
I'm still trying to figure out how to get Tomcat to do something when it gets an Error. error-page doesn't seem to be working for it.
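One option (just a sketch, not something Tomcat gives you out of the box) is to install a default uncaught-exception handler so that an Error escaping any thread at least triggers an alert:

// Sketch: hook uncaught Throwables so an escaping OutOfMemoryError
// produces a notification instead of dying silently in a log file.
public class OomeAlertingHook {

    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            if (throwable instanceof OutOfMemoryError) {
                // sendAlert() is a placeholder for whatever notification
                // channel you actually use (email, pager, monitoring API).
                sendAlert("OutOfMemoryError on thread " + thread.getName());
            }
        });
    }

    private static void sendAlert(String message) {
        System.err.println("ALERT: " + message);
    }
}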
We are using heavy multithreading in a Swing application for extensive calculations. From time to time the application runs into an OOME and cannot create any more native threads. I absolutely understand that the application has to be aware of this and that it is badly designed in this respect, but it cannot be avoided 100%. The problem is that in such a case the JVM is completely lost: it cannot handle the error and the system behaves unpredictably. Usually we log every memory error and restart the application via -XX:OnOutOfMemoryError="kill -9 %p", but for obvious reasons this does not work here. On the other hand, it is a bit frustrating that the JVM has no control any more. So what might be a good way to work around this kind of problem?
PS: I am not looking for solutions like raising the system's process limits or reducing the thread stack size via -Xss. I am looking for a general approach to handling this.
The JVM has perfect control over OutOfMemoryErrors and handles them gracefully; what does not handle them gracefully is your program. You can catch and handle an OutOfMemoryError in the same way as any other error, it's just that most programs never do.
To solve your problem you should first try to pinpoint the root cause of those memory errors, for example by logging them or by using performance/memory analysis tools. Forcing a core dump in these cases can also be useful, as it allows you to analyze the root cause at the moment it happened.
In the end, redesigning the application will be necessary to avoid OOM errors by limiting the amount of memory used. This can be done either by testing how many threads the program can gracefully handle and then enforcing that limit, or by checking free memory before creating a new thread. Architectural changes might also help, but you posted no details about the internals, so I can't give any advice there.
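A rough sketch of the second idea (the threshold and names are made up, and freeMemory() only gives an estimate, so treat this as illustrative rather than a recipe):

// Illustrative only: refuse to start new worker threads when the heap
// looks too full, instead of letting thread creation fail with an OOME.
public class GuardedThreadStarter {

    private static final long MIN_FREE_BYTES = 32L * 1024 * 1024; // arbitrary threshold

    public static boolean tryStart(Runnable task, String name) {
        Runtime rt = Runtime.getRuntime();
        long estimatedFree = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
        if (estimatedFree < MIN_FREE_BYTES) {
            return false; // caller can queue the task or degrade gracefully
        }
        try {
            new Thread(task, name).start();
            return true;
        } catch (OutOfMemoryError oome) {
            // "unable to create new native thread" can still happen; back off
            return false;
        }
    }
}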
I'm working on a physics program, and as time goes on I keep tying up more and more loose ends with exception handling to prevent freezes and lock-ups. For now I am getting certain exceptions in my console, such as StringFormatExceptions, but these errors do not freeze the program or affect run time in any way; they simply show up in the IDE's terminal (Eclipse IDE, JRE 7). When dealing with errors such as this one, which (seemingly) do not affect run time, is it still important to handle the exception even if the program works fine? If I were to export the program as a .jar with only these types of errors unhandled, would users even notice? Why do some errors cause large problems while others don't?
Side Note:
I'm asking this primarily because in the future, when taking on projects much larger than this one, I believe there will be times when a vast number of spots in my code throw exceptions that are not yet handled. On very large projects involving tens of thousands of lines of code, do many programmers dismiss these types of errors that do not affect logic or run time, concluding that going back to fix all of them would not be necessary or worth the time? Or is it essential to develop a habit of keeping track of every situation that might throw an error and staying on top of all of them?
Examples of software with similar errors, ones that allow the program to function as if nothing has happened, would be greatly appreciated if you have the source code available and have taken note of this sort of thing.
is it still important to handle the exception even if the program works fine?
Even if it appears to work fine it may not be. If you get into the habit of ignoring Exceptions you can easily ignore an Exception which is causing a problem.
If I were to export the program as a .jar with only these types of errors not handled, would users even notice?
It depends if they see the output.
Why do some errors cause large problems while others don't?
Some errors are more serious than others. Ideally you want only serious errors, or none at all. Errors which are not too serious are more likely to have subtle effects that are harder, and less likely, to get fixed.
On very large projects involving tens of thousands of lines of code do many programmers dismiss these types of errors that do not affect logic or run-time and conclude that to go back and fix all of them would not be necessary or worth the time?
It's best not to handle or catch an exception unless you are going to do something useful with it. Otherwise it is better to re-throw it, provided it will always get logged at the very least.
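For example, a common pattern (sketched here with made-up names, not taken from the original answer) is to wrap the exception with context and re-throw it rather than swallow it:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigLoader {

    // Re-throw with context instead of silently ignoring the failure;
    // the problem stays visible to callers and to whatever logs it.
    public static String load(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException("Could not read config " + path, e);
        }
    }
}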
ones that allow the program to function as if nothing has happened would be greatly appreciated
FileNotFoundException, EOFException, and NumberFormatException are common exceptions which are handled in code (but rarely just ignored); Google for examples.
It's almost always a bad idea to pretend an exception didn't occur. If you don't care whether something worked or not, you usually don't need it in the first place.
There can be runtime errors that don't affect the execution of a program but I would strongly urge you to handle or fix the cause of every error you possibly can.
If you don't fix or handle runtime errors, your logs (if you have any!) will be full of them, and you won't be able to see the errors you actually should care about through all the noise.
When I start a JVM in debug mode, things naturally slow down.
Is there a way to state that I am interested in debugging only a single application instead of the 15 (making up a number here) applications that run on this JVM?
An approach that facilitates this might make things faster, particularly when we already know from the logs and other trace facilities that the likely issue is with a single application.
I would appreciate any thoughts and comments.
Thanks
Manglu
I am going to make a lot of assumptions here, especially as your question is missing a lot of contextual information.
Is there a way to state that i am interested in only debugging a single application instead of the 15 (making up a number here) applications that run on this JVM.
Firstly, I will assume that you are attempting to do this in production. If so, step back and think about what could go wrong. You might be setting only a single breakpoint, but that will queue up all the requests arriving at that breakpoint, and by doing so you've thrown any SLA requirements out of the window. And, if your application handles any sensitive data, you may end up seeing something that you are not supposed to see.
Secondly, even if you were doing this on a shared development or testing environment, it would still be a bad idea, especially if you are unsure of what you are looking for. If you are hunting a synchronization bug, this is probably the wrong way to go about it; other threads will obviously be sharing the data that you are reading, making it less likely that you find the culprit.
The best alternative is to switch on trace logging in your application. This will, of course, be useless unless you have embedded the appropriate logger calls in your application (especially to trace method arguments and return values). With trace logs at your disposal, you should be able to create an integration or unit test that reproduces the exact conditions of failure on your local developer installation; that is where you ought to be doing your debugging. Sometimes even a functional test will suffice.
There is no faster approach in general, as it is simply not applicable to all situations. You could establish a selected number of breakpoints in any of the other environments, but it simply isn't worth the trouble unless you know that only your requests are being intercepted by the debuggee process.
Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?
Do the various JVM implementations differ in this aspect?
My question is about the JVM recovering, not about the user program trying to recover by catching the error. In other words, if an OOME is thrown in an application server (JBoss/WebSphere/...), do I have to restart it? Or can I let it run if further requests seem to work without a problem?
It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:
There really may not be enough memory to do the requested tasks, even after taking recovery steps such as releasing a block of reserved memory. In this situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.
The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.
If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.
Suppose that a thread synchronizes with other threads using notify/wait or some higher level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifies (etc) that never come ... for example. Designing for this could make the application significantly more complicated.
In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
See also my answer to a related question:
EDIT - in response to this followup question:
In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it?
No you don't have to restart. But it is probably wise to, especially if you don't have a good / automated way of checking that the service is running correctly.
The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)
EDIT 2
In response to this comment:
"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?
Yes really! Consider this:
Thread #1 runs this:
synchronized (lock) {
    while (!someCondition) {
        lock.wait();
    }
}
// ...
Thread #2 runs this:
synchronized (lock) {
    // do something
    lock.notify();
}
If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do something section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!
If not, the code run by the thread is not exception safe, which is a more general problem.
"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.
You'd need some mechanism whereby the failure of Thread #1 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #2. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!
(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)
The JVM will run the GC when it's on the edge of an OutOfMemoryError. If the GC doesn't help at all, then the JVM throws the OOME.
You can, however, catch it and, if necessary, take an alternative path. Any allocations made inside the try block will become eligible for GC.
Since the OOME is "just" an Error which you can simply catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
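As a toy illustration of that (the array sizes are arbitrary), a large allocation can be retried with a smaller footprint after the catch:

// Toy example: fall back to a smaller buffer if the large allocation fails.
// The failed allocation never becomes reachable, so it costs nothing afterwards.
public class FallbackAllocation {

    public static byte[] allocateBuffer() {
        try {
            return new byte[512 * 1024 * 1024]; // optimistic large buffer
        } catch (OutOfMemoryError oome) {
            return new byte[16 * 1024 * 1024];  // degraded but workable
        }
    }
}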
See also:
Catching java.lang.OutOfMemoryError
Is it possible to catch out of memory exception in java?
I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, and with more memory if possible (or a more efficient app). However, I've seen a fair amount of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.
Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...
It depends on a lot of things. For example, if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may let objects be cleaned up in a later GC round, leaving open chunks big enough for your purposes. In that situation you may be able to restart. It's probably worth at least retrying once as a result. But...
You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.
You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to your recovery work; when you catch the OOM, null out that pre-allocated reference and you're more likely to have some memory to use in your recovery sequence.
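A sketch of that "memory reserve" trick (the size and names are illustrative only):

// Illustrative "rainy day fund": a reserve that is dropped on OOME so the
// recovery code has some headroom to log, notify, and shut down cleanly.
public class MemoryReserve {

    private static byte[] reserve = new byte[4 * 1024 * 1024]; // pre-allocated at startup

    public static void runWithReserve(Runnable work) {
        try {
            work.run();
        } catch (OutOfMemoryError oome) {
            reserve = null;          // release the reserve so recovery code can allocate
            oome.printStackTrace();  // recovery/notification logic would go here
        }
    }
}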
I don't know about different JVM implementations.
Any sane JVM will throw an OutOfMemoryError only if there is nothing the garbage collector can do. However, if you catch the OutOfMemoryError early enough up the stack, it is quite likely that whatever caused it has itself become unreachable and has been garbage collected (unless the problem is not in the current thread).
Generally, for frameworks that run other code, such as application servers, attempting to continue in the face of an OOME makes sense (as long as the framework can reasonably release the third-party code), but otherwise, in the general case, recovery should probably consist of bailing out and telling the user why, rather than trying to carry on as if nothing happened.
To answer your newly updated question: there is no reason to think you need to shut down the server if all is working well. My experience with JBoss is that as long as the OOME didn't affect a deployment, things work fine. Sometimes JBoss runs out of PermGen space if you do a lot of hot deployment; then the situation really is hopeless and an immediate restart (which will have to be forced with a kill) is inevitable.
Of course each app server (and deployment scenario) will vary and it is really something learned from experience in each case.
You cannot fully recover a JVM that has had an OutOfMemoryError. At least with the Oracle JVM you can add -XX:OnOutOfMemoryError="cmd args;cmd args" and take recovery actions, like killing the JVM or sending the event somewhere.
Reference: https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
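For instance, a launch line along these lines (the jar name is just a placeholder) kills the process as soon as the first OOME is thrown, so an external supervisor can restart it:

java -Xmx512m -XX:OnOutOfMemoryError="kill -9 %p" -jar myapp.jar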
I am trying to reproduce a java.lang.OutOfMemoryError in JBoss 4, which one of our clients got, presumably from running the J2EE applications over days/weeks.
I am trying to find a way for the webapp to spit out a java.lang.OutOfMemoryError in a matter of minutes (instead of days/weeks).
One thing that comes to mind is to write a Selenium script and have it bombard the webapp.
One other thing we could do is reduce the JVM heap size, but we would prefer not to, as we want to see the limit of our system.
Any suggestions?
PS: I don't have access to the source code, as we just provide a hosting service (of course I could decompile the class files...).
If you don't have access to the source code of the J2EE app in question, the options that come to mind are:
Reduce the amount of RAM available to the JVM. You've already identified this one and said you don't want to do it.
Create a J2EE app (it could probably just be a JSP), configure it to run within the same JVM as the target app, and have it allocate a ridiculous amount of memory. That will reduce the memory available to the target app, hopefully enough that it fails in the way you're trying to force (a rough sketch follows below).
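A crude sketch of such a memory hog (entirely hypothetical; a JSP or servlet running in the same JVM could simply call MemoryHog.hog(...)):

import java.util.ArrayList;
import java.util.List;

// Crude memory hog: hold onto chunks so the GC cannot reclaim them,
// starving the target application that shares the same heap.
public class MemoryHog {

    private static final List<byte[]> HOARD = new ArrayList<byte[]>();

    public static void hog(int megabytes) {
        for (int i = 0; i < megabytes; i++) {
            HOARD.add(new byte[1024 * 1024]); // 1 MB per chunk, never released
        }
    }
}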
Try using some profiling tools to investigate memory leakage. It is also good to look at memory dumps taken after the OOM happens, and at the logs. IMHO, reducing memory is not the right way to investigate, because you can run into issues not connected with the real production one.
Do both, but in a controlled fashion:
Reduce the available memory to the absolute minimum (using -Xms1M -Xmx2M, as an example, but I fear your app won't even load with such limitations)
Do controlled "nuclear irradiation": run Selenium scripts against each of the known working URLs before attacking the presumed guilty one.
Finally, unleash the power that shall not be raised : start VisualVM and any other monitoring software you can think of (DB execution is a usual suspect).
If you are using Sun Java 6, you may want to consider attaching to the application with jvisualvm in the JDK. This will allow you to do in-place profiling without needing to alter anything in your scenario, and may possibly immediately reveal the culprit.
If you don't have the source code, decompile it, at least if you think the terms of use allow this and you live in a free country. You can use:
Java Decompiler or JAD.
In addition to all the other answers, I must say that even if you can reproduce an OutOfMemoryError and find out where it occurred, you probably haven't found out anything worth knowing.
The trouble is that an OOM occurs when an allocation cannot take place. The real problem, however, is not that allocation, but the fact that other allocations, in other parts of the code, have not been de-allocated (de-referenced and garbage collected). The failed allocation may have nothing to do with the source of the trouble (no pun intended).
This problem is larger in your case, as it might take weeks before trouble starts, suggesting either a sparsely used application, an abnormal code path, or a relatively HUGE amount of memory in relation to what would be necessary if the code were OK.
It might be a good idea to ask around why this amount of memory is configured for JBoss and not something different. If it's recommended by the supplier, then maybe they already know about the leak and require this to mitigate the effects of the bug.
For these kind of errors it really pays to have some idea in which code path the problem occurs so you can do targeted tests. And test with a profiler so you can see during run-time which objects (Lists, Maps and such) are growing without shrinking.
That would give you a chance to decompile the correct classes and see what's wrong with them (closing or cleaning up in a try block instead of a finally block, perhaps).
In any case, good luck. I think I'd prefer to look for a needle in a haystack; at least when you find the needle, you know you have found it. :)
The root of the problem is most likely a memory leak in the webapp that the client is running. In order to track it down, you need to run the app with a representative workload with memory profiling enabled. Take some snapshots, and then use the profiler to compare the snapshots to see where objects are leaking. While source-code would be ideal, you should be able to at least figure out where the leaking objects are being allocated. Then you need to track down the cause.
However, if your customer won't release the binaries so that you can run an identical system to the one he is running, you are kind of stuck, and you'll need to get the customer to do the profiling and leak detection himself.
BTW, there is not a lot of point in making the webapp throw an OutOfMemoryError. It won't tell you why it is happening, and without understanding "why" you cannot do much about it.
EDIT
There is no point "measuring the limits" if the root cause of the memory leak is in the client's code. Assuming that you are providing a servlet hosting service, the best thing to do is to provide the client with instructions on how to debug memory leaks ... and step out of the way. And if they have a support contract that requires you to (in effect) debug their code, they ought to provide you with the source code to do your job.