Related
Recently I came across a situation where I could not tell what value a variable was being assigned, or whether a thread had thrown an exception, because the exception was swallowed in a catch block and no logging of the stack trace was implemented (it is an old codebase). I can obviously find such critical sections and add logging, but since it is a large codebase that gets difficult, and I can't attach a remote debugger because the code is running in production. Is there any solution available for this?
You could use a Java profiler, for example YourKit.
The YourKit profiler is one of the most established leaders among Java
profilers. As a mature and a versatile profiler, YourKit can do both
CPU and memory profiling for you, with integrations across major Java
application servers, JDBC and other frameworks for high-level
performance analysis like finding synchronization issues and excessive
database accesses. YourKit can run in both sampling and tracing
profiling modes, and the mixed approach helps it make the most from
both worlds: the precision of tracing the actual code execution while
being able to precisely control the profiling overhead.
source
It seems like you need to apply tracing. Typically, mission critical applications require that application code is covered with tracing statements, e.g.:
logger.entering(className, methodName, args);
Each method needs to trace its entry / return as well as intermediate work. Implementing tracing manually is error prone and costly; it may easily account for 20% of all code. For production code, applications apply a tracing specification to filter which code is logged, so the performance impact is minimal: =all:com.company.MyClass.=finer
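For reference, manually written tracing with java.util.logging looks roughly like this (the class, method and values are illustrative, not taken from the question); the entering/exiting/throwing calls log at FINER, which is what the =finer specification above enables per class:

import java.util.logging.Logger;

public class OrderService {                       // illustrative class and method names
    private static final Logger logger = Logger.getLogger(OrderService.class.getName());

    double computeTotal(String orderId, int quantity) {
        logger.entering("OrderService", "computeTotal", new Object[] { orderId, quantity });
        try {
            double total = quantity * 9.99;       // the actual business logic goes here
            logger.exiting("OrderService", "computeTotal", total);
            return total;
        } catch (RuntimeException e) {
            logger.throwing("OrderService", "computeTotal", e);   // keep the stack trace
            throw e;
        }
    }
}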
To easily enable your Java application for tracing and avoid writing tracing code, use the enterprise-aspects AOP Java library:
https://github.com/akovac35/enterprise-aspects
A great deal has been written about the wisdom of exceptions in general and the use of checked vs. unchecked exceptions in Java in particular, but I'm interested in seeing a defense of the decision to make thread termination the default policy instead of application termination, the way it is in C++. This choice seems extremely dangerous to me: some condition that the programmer didn't plan for randomly causes some part of the program to die after logging a stack trace, while the rest of the program soldiers resolutely on. What could go wrong? My intuition and experience say that a lot of things can go wrong here, and that the default policy is the sort of thing that should only be selected by someone who has a specific reason for choosing it. So what's the upside to this strategy, which has such a seemingly large downside? Am I overestimating the risk?
EDIT: based on the answers so far, I feel that I need to be more focused in my description of the dangers that I perceive; I'm talking about the case of an application which uses multiple threads (e.g. in a thread pool) to update shared state. I recognize that this policy does not present a problem for single-threaded applications.
EDIT2: You can see that there is an awareness of these risks among the language maintainers from the explanation for why the Thread.stop() method was deprecated (found here: http://docs.oracle.com/javase/7/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html). The exact same issues apply when a thread dies unexpectedly due to uncaught exceptions. They must have designed the JVM so that all monitors are automatically unlocked when a thread dies, which seems like a poor implementation choice; having a thread die while it holds a monitor should be an indication that the entire program should die, because the alternative is almost certain to be internal inconsistency in some shared state.
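To make the danger concrete, here is a small illustrative sketch (hypothetical names; any unchecked exception would do): the dying thread releases its monitor automatically, but the invariant it was in the middle of maintaining stays broken for every other thread.

public class Counters {
    private final Object lock = new Object();
    private long credits;
    private long debits;
    private long[] history = new long[0];

    // Invariant: credits == debits whenever the lock is free.
    void record(long amount, int slot) {
        synchronized (lock) {
            credits += amount;
            history[slot] = amount;   // an out-of-range slot throws an unplanned unchecked exception here...
            debits += amount;         // ...so this line is skipped
        }
        // The monitor is released as the stack unwinds, the thread dies after printing
        // a stack trace, and every other thread now observes credits != debits.
    }
}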
#BD, I'm not sure what your experience says about this because you haven't explained it here. But here is what I have experienced as a developer:
Generally, it is a bad idea to make an application fail if one of its components has failed (temporarily or permanently) for any reason, such as a DB restart or some file being replaced. For example, if I introduce a new type of trade into the system and some issue comes up, it shouldn't shut down my application.
Applications like web/application servers should be able to keep working and responding to users even if one of their deployments is throwing weird exceptions.
As for your worry about exceptions, generally all applications have a health monitoring system which watches things like CPU/disk/RAM usage or errors in logs and fires alerts accordingly.
I hope this resolves your confusion.
From discussing this issue with a co-worker and also reviewing the answers received so far, I have formed a supposition here, and would like to get some feedback.
I suspect that the decision to make this behavior the default has its roots in the philosophy that defined the early development of the language, as well as its early environment.
As part of the original philosophy, programmers/designers were expected to use checked exceptions, and the language enforces that checked exceptions which may be emitted by a method call (i.e. have been declared in the method definition) must be handled in the calling method, or else be declared by it to "officially" pass the responsibility to higher-level callers. Common practice has moved sharply away from the use of checked exceptions, not to mention the fact that one of the most commonly occurring exceptions in practice, NullPointerException, is unchecked. As a result of this, programmers must now assume that any method call can generate an unchecked exception, and the corollary to this is that any code which updates shared data in a concurrent context must implement transactional semantics for these updates in order to be fully correct. My experience is that most developers don't really understand this even if they do understand the basics of multi-threaded development, such as avoiding deadlock when managing critical sections with synchronization. The default uncaught exception handler behavior exacerbates the problem by masking its effects: in C++, it wouldn't matter if an uncaught exception would result in corrupted shared state because the program is dead anyway, but in Java the program will continue to limp along as best it can despite the fact that it is very likely to no longer be operating correctly.
The environmental factor is that single-threaded programs were likely the norm when the language was first developed, so the default behavior masqueraded as the correct one. The rise of multi-core architectures and increased usage of thread pools exposes the threat more broadly, and commonly applied approaches such as use of immutable objects can only go so far to solve it (hint for some: ConcurrentMap is probably not as safe as you think it is). My experience so far is that people who deny this risk are not being paranoid enough relative to the actual safety requirements of their code, but I would love to be proved wrong.
I suspect that modifying uncaught exception handlers to terminate the program should be the standard procedure required by most development organizations; at the very least this should be done for thread pools which are known to update shared state based on incoming inputs.
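A minimal sketch of that policy, assuming you are free to install the process-wide handler at startup:

public class FailFastPolicy {
    public static void main(String[] args) {
        // Installed once at startup: log the failure, then take the whole process down
        // rather than let the remaining threads keep mutating possibly-corrupted shared state.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Fatal: uncaught exception in thread " + thread.getName());
            error.printStackTrace();
            Runtime.getRuntime().halt(1);   // halt() skips shutdown hooks that might read the bad state
        });
        // ... start thread pools and the rest of the application ...
    }
}

Be aware that tasks handed to an ExecutorService via submit() have their exceptions captured in the returned Future and never reach this handler, so a pool that should fail fast needs to use execute() or inspect its Futures.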
Normal (no GUI, no container) applications exit on uncaught exceptions - the default behavior is fine, just as you wanted.
For a GUI-based application, it is nicer to show an error message and handle the error more usefully - for example, we could submit a defect report with some additional information.
The behavior is fully changeable by providing a thread-specific exception handler - and that could include exiting the application.
Here are some useful notes
We are considering developing a mission-critical application in Java EE, and one thing that really strikes me is the lack of session isolation in the platform. Let me explain the scenario.
We have a native Windows application (a complete ERP solution) that receives about 2k LoC and 50 bug-fixes per month from sparse contributors. It also supports scripting, so the customer can add their own logic, and we have no clue what such logic does. Instead of using a thread pool, each server node has a broker and a process pool. The broker receives a client request, enqueues it until a pooled instance is free, sends the request to that instance, delivers the response to the client, and releases the instance back to the process pool.
This architecture is robust because with so many sparse contributions and custom scripting, it's not uncommon for a deployed version to have some serious bug such as an infinite loop, a long-waiting pessimistic lock, a memory corruption or memory leakage. We implemented a memory limit, a timeout for requests, and a simple watchdog. Whenever some process fails to answer correctly and on time, the broker simply kills it, so the watchdog detects and starts another instance. If a process crashes before it started to answer a request, the broker sends the same request to another pooled instance, and the user doesn't know about any failure on the server side (except in admin logs). This is nice because some instances are slowly trashed by bogus code as they work on requests. Because most session data is held at the client or (in rare cases) at a shared storage, it seems to work perfectly.
Now considering a move to Java EE, I couldn't find anything similar in the spec or in popular application servers such as Glassfish and JBoss. Yes, I know that most cluster implementations do transparent fail-over with session replication, but we have small companies that use our system on a simple 2-node cluster (and we also have adventurers who run the system on a 1-node server). With a thread pool, I understand that a buggy thread can bring an entire node down, because the server cannot detect and safely kill it. Bringing an entire node down is much worse than killing a single process - we have deployments where each node has about 100 pooled process instances.
I know that IBM and SAP are aware of this problem, based on http://www.trl.ibm.com/people/kawatiya/pub/Kawachiya07vee.pdf and http://java.sys-con.com/node/47362, respectively. But based on recent JSRs, forums and open-source tools, there isn't much activity in the community.
Now come the questions!
1. If you have a similar scenario and use Java EE, how did you solve it?
2. Do you know about an upcoming open-source product or change in the Java EE spec that can address this issue?
3. Does .NET have the same problem? Can you explain or cite references?
4. Do you know about some modern and open platform that can address this issue and is worth the task of doing ERP business logic?
Please, I have to ask you not to suggest more testing or any kind of QA investment, because we cannot force our customers to do that on their own scripts. We also have cases where urgent bug-fixes must bypass QA, and while we can force the customer to accept this, we cannot make them accept that a buggy part of the software can affect a range of unrelated features. This issue is about robust architectures, not about development process.
Thanks for your attention!
What you have stumbled upon is a fundamental issue regarding the use of Java and "hostile" applications.
It's a fundamental issue not just at the Java EE level, but at the core JVM level. The typical JVMs available have all sorts of issues with loading "unsafe code". From memory leaks, class loader leaks, resource exhaustion, and unclean thread kills, the typical JVM is simply not robust enough to handle badly behaving code well in a shared environment.
A simple example is exhaustion of the Java heap. As a basic rule, NOBODY (and by nobody, I specifically mean the core Java library and just about every other 3rd-party library out there) catches OutOfMemoryError. There are the rare few who do, but even they can do little about it. Typical code handles the exceptions it "expects" to handle and lets everything else fall through. Unchecked throwables (and OOM, being an Error, is one) will happily bubble up through the call stack all the way to the top, leaving behind a wreckage of unchecked critical-path code and leaving all sorts of things in an unknown state.
Things such as constructors or static initializers that "can't fail" leave behind uninitialized class members that are "never null". These damaged classes simply don't know they're damaged. Nobody knows they're damaged, and there's no way to clean them up. A heap that hits OOM is an unsafe image and pretty much needs to be restarted (unless, of course, you wrote or audited ALL of the code yourself, which, naturally, you won't -- who would?).
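A tiny illustration of the "damaged class" point; a forced RuntimeException stands in for the OOM, since a real one is harder to reproduce on demand:

public class DamagedClassDemo {
    static class Config {
        static final String VALUE;
        static { VALUE = load(); }
        static String load() {
            throw new RuntimeException("standing in for an OOM during static initialization");
        }
    }

    public static void main(String[] args) {
        try { System.out.println(Config.VALUE); }
        catch (Throwable t) { System.out.println("first use:  " + t); }  // ExceptionInInitializerError
        try { System.out.println(Config.VALUE); }
        catch (Throwable t) { System.out.println("later uses: " + t); }  // NoClassDefFoundError, forever
    }
}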
Now, there may well be vendor specific JVMs which are better behaved and give you better control. The ones based on the Sun/Oracle JVM (i.e. most of them) do not.
So, it's not necessarily a Java EE issue, it's a JVM issue.
Hosting hostile code in the JVM is a bad idea. The only way it's practical is if you host a scripting language, and that scripting language implements some kind of resource control. That could be done, and you can tweak the existing ones as a start (JavaScript, Groovy, Jython, JRuby). The fact that these languages give users direct access to Java libraries makes them potentially dangerous, so you may have to restrict that as well to only aspects wrapped by script handlers. At this point, though, the "why use Java at all" question floats up.
You'll note Google App Engine does none of these. It spools up a separate JVM for each application that's being run, but even then it greatly restricts what can be done within those JVMs, notably through the existing Java security model. The distinction here is that these instances tend to be "long lived" so as not to endure the processing costs of startup and shutdown. I should say, they SHOULD be long lived, and those that are not do incur those costs.
You can make several instances of the JVM yourself, give them a bit of infrastructure to handle requests for logic, give them custom class loader logic to try and protect from class loader leaks, and minimally let you kill the instances off (they're simply processes) if you want. That can work, and probably work "ok" depending on the granularity of the calls and the "start up" time for your logic. The start-up time will minimally be the loading of the classes for the logic from run to run; that alone may make this a bad idea. And it certainly WON'T be "Java EE". Java EE is not set up to do this kind of thing. But you haven't made clear which Java EE features you're looking for either.
Effectively, this is what Apache and "mod_php" do. Several instances, as processes, individually handling requests, with badly behaving ones being killed off as necessary. This is why PHP is common in the shared hosting business. In this structure, it's basically "safe".
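For what it's worth, the bare JDK pieces for that "JVM instances as killable processes" approach look roughly like this; WorkerMain, the heap limit and the 30-second budget are all assumptions here, and a real broker would keep workers pooled rather than spawning one per request:

import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

public class ProcessPerRequest {
    // Spawn a worker JVM for one request and kill it if it exceeds its time budget.
    // "WorkerMain" is hypothetical: it reads the request from stdin and writes its reply elsewhere.
    static boolean dispatch(byte[] requestBytes) throws Exception {
        Process worker = new ProcessBuilder("java", "-Xmx256m", "WorkerMain").start();
        try (OutputStream toWorker = worker.getOutputStream()) {
            toWorker.write(requestBytes);
        }
        if (worker.waitFor(30, TimeUnit.SECONDS)) {
            return worker.exitValue() == 0;      // worker finished; success iff exit code 0
        }
        worker.destroyForcibly();                // hung, looping, or leaking: only this process dies
        return false;                            // the broker re-dispatches to another instance
    }
}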
I believe your scenario is highly untypical, thus it is improbable that there is a ready-made framework/platform addressing this need. Java EE sort of assumes that the request processing code is written by the same team as the rest of the app, thus it need not be isolated, watched and reset that often, and bug fixes would be handled the same way in all parts of the system. This assumption greatly simplifies development, deployment, testing etc. for most projects, not forcing them to pay for something they don't need. And yes, it isn't suitable for everyone. If you want something fundamentally different, you probably need to implement a fair amount of failover logic yourself. Java EE does provide the fundamental building blocks for this, though.
I believe (although have no concrete experience to prove it) that .NET or other platforms are basically built on similar assumptions.
We had a similar - though not so severe - situation in a port of a really enormous Perl site to Java. On receiving an HTTP request we instantiate a class and call its processRequest method, surrounded by try-catch and time measurement. Adding a timer and a watchdog thread would suffice to be able to kill the runaway thread. This is probably sufficient in real life.
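Something like the following is the rough shape of that idea (handleRequest is a stand-in for the real processRequest); note that Future.cancel(true) only interrupts the thread, which works for cooperative code but not for a genuinely runaway loop:

import java.util.concurrent.*;

public class DispatchWithTimeout {
    private static final ExecutorService pool = Executors.newFixedThreadPool(100);

    // Stand-in for the real processRequest method described above.
    private static String handleRequest(String request) {
        return "response to " + request;
    }

    static String dispatch(String request) throws Exception {
        long start = System.nanoTime();
        Callable<String> task = () -> handleRequest(request);
        Future<String> result = pool.submit(task);
        try {
            return result.get(30, TimeUnit.SECONDS);     // per-request time budget
        } catch (TimeoutException e) {
            result.cancel(true);                          // interrupt the worker thread
            throw e;
        } finally {
            System.out.printf("request took %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }
}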
A Java EE server like GlassFish is an OSGi container, so you might have more means of isolation there.
You could also run an array of (web or local) applications to which you dispatch your requests via a central web application. Those applications are then isolated.
Even more isolated are serialized sessions and operating system processes starting a new JVM.
Can the JVM recover from an OutOfMemoryError without a restart if it gets a chance to run the GC before more object allocation requests come in?
Do the various JVM implementations differ in this aspect?
My question is about the JVM recovering and not the user program trying to recover by catching the error. In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it? Or can I let it run if further requests seem to work without a problem.
It may work, but it is generally a bad idea. There is no guarantee that your application will succeed in recovering, or that it will know if it has not succeeded. For example:
There really may not be enough memory to do the requested tasks, even after taking recovery steps like releasing a block of reserved memory. In this situation, your application may get stuck in a loop where it repeatedly appears to recover and then runs out of memory again.
The OOME may be thrown on any thread. If an application thread or library is not designed to cope with it, this might leave some long-lived data structure in an incomplete or inconsistent state.
If threads die as a result of the OOME, the application may need to restart them as part of the OOME recovery. At the very least, this makes the application more complicated.
Suppose that a thread synchronizes with other threads using notify/wait or some higher level mechanism. If that thread dies from an OOME, other threads may be left waiting for notifies (etc) that never come ... for example. Designing for this could make the application significantly more complicated.
In summary, designing, implementing and testing an application to recover from OOMEs can be difficult, especially if the application (or the framework in which it runs, or any of the libraries it uses) is multi-threaded. It is a better idea to treat OOME as a fatal error.
See also my answer to a related question:
EDIT - in response to this followup question:
In other words if an OOME is thrown in an application server (jboss/websphere/..) do I have to restart it?
No you don't have to restart. But it is probably wise to, especially if you don't have a good / automated way of checking that the service is running correctly.
The JVM will recover just fine. But the application server and the application itself may or may not recover, depending on how well they are designed to cope with this situation. (My experience is that some app servers are not designed to cope with this, and that designing and implementing a complicated application to recover from OOMEs is hard, and testing it properly is even harder.)
EDIT 2
In response to this comment:
"other threads may be left waiting for notifies (etc) that never come" Really? Wouldn't the killed thread unwind its stacks, releasing resources as it goes, including held locks?
Yes really! Consider this:
Thread #1 runs this:
synchronized (lock) {
    while (!someCondition) {
        lock.wait();
    }
}
// ...
Thread #2 runs this:
synchronized (lock) {
    // do something
    lock.notify();
}
If Thread #1 is waiting on the notify, and Thread #2 gets an OOME in the // do something section, then Thread #2 won't make the notify() call, and Thread #1 may get stuck forever waiting for a notification that won't ever occur. Sure, Thread #2 is guaranteed to release the mutex on the lock object ... but that is not sufficient!
If it is not, then the code run by the thread is not exception safe, which is a more general problem.
"Exception safe" is not a term I've heard of (though I know what you mean). Java programs are not normally designed to be resilient to unexpected exceptions. Indeed, in a scenario like the above, it is likely to be somewhere between hard and impossible to make the application exception safe.
You'd need some mechanism whereby the failure of Thread #1 (due to the OOME) gets turned into an inter-thread communication failure notification to Thread #2. Erlang does this ... but not Java. The reason they can do this in Erlang is that Erlang processes communicate using strict CSP-like primitives; i.e. there is no sharing of data structures!
(Note that you could get the above problem for just about any unexpected exception ... not just Error exceptions. There are certain kinds of Java code where attempting to recover from an unexpected exception is likely to end badly.)
The JVM will run the GC when it is on the edge of an OutOfMemoryError. If the GC doesn't help at all, then the JVM throws the OOME.
You can however catch it and, if necessary, take an alternative path. Any objects allocated inside the try block become eligible for GC once they are no longer reachable.
Since the OOME is "just" an Error which you could just catch, I would expect the different JVM implementations to behave the same. I can at least confirm from experience that the above is true for the Sun JVM.
See also:
Catching java.lang.OutOfMemoryError
Is it possible to catch out of memory exception in java?
I'd say it depends partly on what caused the OutOfMemoryError. If the JVM truly is running low on memory, it might be a good idea to restart it, and with more memory if possible (or a more efficient app). However, I've seen a fair amount of OOMEs that were caused by allocating 2GB arrays and such. In that case, if it's something like a J2EE web app, the effects of the error should be constrained to that particular app, and a JVM-wide restart wouldn't do any good.
Can it recover? Possibly. Any well-written JVM is only going to throw an OOME after it's tried everything it can to reclaim enough memory to do what you tell it to do. There's a very good chance that this means you can't recover. But...
It depends on a lot of things. For example if the garbage collector isn't a copying collector, the "out of memory" condition may actually be "no chunk big enough left to allocate". The very act of unwinding the stack may have objects cleaned up in a later GC round that leave open chunks big enough for your purposes. In that situation you may be able to restart. It's probably worth at least retrying once as a result. But...
You probably don't want to rely on this. If you're getting an OOME with any regularity, you'd better look over your server and find out what's going on and why. Maybe you have to clean up your code (you could be leaking or making too many temporary objects). Maybe you have to raise your memory ceiling when invoking the JVM. Treat the OOME, even if it's recoverable, as a sign that something bad has hit the fan somewhere in your code and act accordingly. Maybe your server doesn't have to come down NOWNOWNOWNOWNOW, but you will have to fix something before you get into deeper trouble.
You can increase your odds of recovering from this scenario, although it's not recommended that you try. What you do is pre-allocate some fixed amount of memory on startup that's dedicated to your recovery work; when you catch the OOM, null out that pre-allocated reference and you're more likely to have some memory available for your recovery sequence.
I don't know about different JVM implementations.
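A minimal sketch of the trick described above (the buffer size and names are arbitrary):

public class RescueBuffer {
    // Reserved at startup purely so it can be released when memory runs out.
    private static byte[] reserve = new byte[2 * 1024 * 1024];

    static void doWork() {
        try {
            // ... normal allocation-heavy work ...
        } catch (OutOfMemoryError e) {
            reserve = null;   // give the recovery path some headroom
            // log, notify, shed load, or shut down cleanly while memory is available
        }
    }
}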
Any sane JVM will throw an OutOfMemoryError only if there is nothing the garbage collector can do. However, if you catch the OutOfMemoryError early enough in the stack, it is likely enough that whatever caused it has itself become unreachable and can be garbage collected (unless the problem is not in the current thread).
For frameworks that run other code, like application servers, attempting to continue in the face of an OOME makes sense (as long as the framework can reasonably release the third-party code), but otherwise, in the general case, recovery should probably consist of bailing out and telling the user why, rather than trying to carry on as if nothing had happened.
To answer your newly updated question: There is no reason to think you need to shut down the server if all is working well. My experience with JBoss is that as long as the OOME didn't affect a deployment, things work fine. Sometimes JBoss runs out of permgen space if you do a lot of hot deployment; then the situation is indeed hopeless and an immediate restart (which will have to be forced with a kill) is inevitable.
Of course each app server (and deployment scenario) will vary and it is really something learned from experience in each case.
You cannot fully recover a JVM that has had an OutOfMemoryError. At least with the Oracle JVM you can add -XX:OnOutOfMemoryError="cmd args;cmd args" and take recovery actions, like killing the JVM or sending the event somewhere.
Reference: https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
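For example, a typical invocation looks something like this (both flags are standard HotSpot options; the jar name is a placeholder and %p is replaced with the process id):

java -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError="kill -9 %p" -jar myapp.jar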
What is the best practice when dealing with Errors within a server application?
In particular, how do you think an application should handle errors like OutOfMemoryError?
I'm particularly interested in Java applications running within Tomcat, but I think that is a more general problem.
The reason I'm asking is because I am reviewing a web application that frequently throws OOME, but usually it just logs them and then proceeds with execution. That results, obviously, in more OOMEs.
While that is certainly bad practice, in my opinion, I'm not entirely sure that stopping the Server would be the best solution.
There is not much you can do to fix an OutOfMemoryError except clean up the code and adjust the JVM memory settings (but if you have a leak somewhere, that's just a band-aid).
If you don't have access to the source code and/or are not willing to fix it, an external solution is to use some sort of watchdog program that monitors the Java application and restarts it automatically when it detects OOMEs. Here is a link to one such program.
Of course it assumes that the application will survive restarts.
The application shouldn't handle OOM at all - that should be the server's responsibility.
Next step: Check if memory settings are appropriate. If they aren't, fix them; if they are, fix the application. :)
Well, if you get an OOME then the best approach is to release as many resources (especially cached ones) as possible. Restarting the web app (in case it's the web app's fault) or the web server itself (in case something else in the server causes it) will do for recovering from this state. On the development front, though, it's good to profile the app and see what is taking up the space; sometimes there are resources attached to a class variable and hence never collected, sometimes something else. In the past we had problems where Tomcat wouldn't release the classes of previous versions of the same app when you replaced the app with a newer version. We somewhat solved the problem by nulling out class variables or refactoring not to use them at all, but some leaks still remained.
An OutOfMemoryError is by no means always unrecoverable - it may well be the result of a single bad request, and depending on the app's structure it may just abandon processing the request and continue processing others without any problems.
So if your architecture supports it, catch the Error at a point where you have a chance to stop doing what caused it and continue doing something else - for an app server, this would be at the point that dispatches requests to individual app instances.
Of course, you should also make sure that this does not go unnoticed and a real fix can be implemented as soon as possible, so the app should log the error AND send out some sort of warning (e.g. email, but preferably something harder to ignore or get lost). If something goes wrong during that, then shutting down is the only sensible thing left to do.
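A sketch of catching the Error at the dispatch point, here as a servlet filter; the alerting hook is a placeholder you would wire to whatever channel is hardest to ignore:

import java.io.IOException;
import javax.servlet.*;

public class OomAlertFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        try {
            chain.doFilter(req, resp);            // one request's worth of work
        } catch (OutOfMemoryError e) {
            // abandon this request only; the container keeps serving others
            sendAlert(e);                         // placeholder: email, pager, monitoring hook
            throw new ServletException("request aborted after OutOfMemoryError", e);
        }
    }

    private void sendAlert(OutOfMemoryError e) {
        // hypothetical notification call goes here
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}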
#Michael Borgwardt, you can't recover from an OutOfMemoryError in Java. For other errors it might not stop the application, but an OutOfMemoryError literally hangs applications.
In our application, which deals heavily with documents, we do catch OOM errors where one bad request can result in OOM, but we don't want to bring down the application because of this. We catch the OOM and log it.
Not sure if this is best practice, but it seems to be working.
I'm not an expert in such things, but I'll take a chance to give my vague opinion on this problem.
Generally, I think there are two main ways:
The server is stopped.
The server degrades gracefully: it releases resources, reducing throughput and memory consumption, but stays alive. For this case the application must have an appropriate architecture, I think.
According to the javadoc about a java.lang.Error:
An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a "normal" condition, is also a subclass of Error because most applications should not try to catch it.
A method is not required to declare in its throws clause any subclasses of Error that might be thrown during the execution of the method but not caught, since these errors are abnormal conditions that should never occur.
So, the best practice when dealing with subclasses of Error is to fix the problem that is causing them, not to "handle" them. As it's clearly stated, they should never occur.
In the case of an OutOfMemoryError, maybe you have a process that consumes lots of memory (e.g. generating reports) and your JVM is not well sized, maybe you have a memory leak somewhere in your application, etc. Whatever it is, find the problem and fix it, don't handle it.
I strongly disagree with the notion that you should never handle an OutOfMemoryError.
Yes, it tends to be unrecoverable most of the time. However, one of my servers got one a few days ago and the server was still mostly working for more than an hour and a half. Nobody complained, so I didn't notice until my monitoring software reported a failure an hour and a half after the first OutOfMemoryError. I need to know as soon as possible when there is an OutOfMemoryError on my server, so I need to handle it in order to set up a notification that tells me to restart my server ASAP.
I'm still trying to figure out how to get Tomcat to do something when it gets an Error. error-page doesn't seem to be working for it.