I'm really out of my comfort zone when I have to monitor memory, but I'm the only one here and I'm left clueless: I have a Java 8 application (a CMS) running on a Tomcat application server, and we're in some trouble. After a while the server crashes.
After some research I found out it is memory related, so I attached VisualVM to my environment and started monitoring.
I see that the memory is slowly filling up. Garbage collection does its job, but not thoroughly; it always leaves some more memory in the used heap. When I do a manual 'Perform Garbage Collection' in VisualVM, the garbage collection is performed much better. (See screenshots.)
It takes several hours, but the used heap grows larger and larger after each garbage collection. The moment I manually perform GC again, the minimum used heap is back to 'normal'.
I have noticed that the heap fills up with byte[]; those take up most of the space. Could someone help me out with this?
I see that the memory is slowly filling up. Garbage collection does its job, but not thoroughly. It always leaves some more memory in the used heap. When I do a manual 'Perform Garbage Collection' in VisualVM, the garbage collection is performed much better.
A full GC gets triggered only when the JVM feels it is necessary, because it is costly: it is stop-the-world for the parallel collector, and the concurrent mark-sweep collector has two stop-the-world sub-phases as well. When it runs also depends on various factors like the -Xms and -Xmx parameters (see the JVM heap parameters). So you should not worry about it unless and until you get an out-of-memory exception; the JVM will trigger a full GC when necessary.
For the server crash, I can think of two problems:
A memory leak. In that case the memory footprint will keep increasing even after each GC.
Maybe you are building some cache without an eviction policy, and it is getting close to full.
If neither applies, I see a case for increasing the heap and giving it a try.
I've had a few problems like this before. One was our app's fault, one was the app server's fault, and one I wasn't able to figure out but was able to mitigate.
In each case I used JProfiler to watch memory usage on a local server and ran a variety of happy-path and exception tests to try to figure out what was causing the problem. Doing this testing wasn't a quick and easy process - on average I spent about a week each time.
In the first case (our app's fault), I found that we were not closing SQL connections for a web service when exceptions were thrown. Testing the happy paths showed no problems, but when I started testing exceptions I could exhaust the server's memory with about 100 consecutive exceptions. Adding code to manually clean up resources in the exception handler solved the problem.
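For illustration, here is a minimal sketch of that fix pattern, assuming a plain JDBC DAO behind the web service (the class, DataSource and query are invented for the example, not the actual code). try-with-resources closes the connection, statement and result set on the exception path as well as the happy path, so failed calls no longer leak pooled connections; on pre-Java-7 code the same effect needs explicit close() calls in a finally block.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class CustomerDao {
    private final DataSource dataSource;

    public CustomerDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public String findName(long id) throws SQLException {
        // Resources are closed automatically, whether the query succeeds or throws.
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "SELECT name FROM customer WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}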
In the second case (WebSphere's fault), I verified that our app was closing all resources correctly, but the problem persisted. So I started reading through WebSphere documentation and found that it was a known issue with JAX-WS clients. Luckily there was a patch to WebSphere which fixed the problem.
In the third case (couldn't determine the cause), I was unable to find any reason why it was happening. So the problem was mitigated by increasing JVM memory allocation to an amount where the OOM exceptions would take greater than 1 week to happen, and configuring the servers to restart every weekend.
There might be some simple technical workarounds to mitigate the problem, like simply adding more memory to the JVM and/or the underlying machine.
Or, if you really can prove that running System.gc() manually helps (as the comments indicate, most people think it will not), you could automate that step; see the sketch below.
If any of that is good enough to prevent those crashes, you are buying yourself more time to work on a real solution.
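If you do go down that road, a minimal sketch of automating the call could look like this (the class name and interval are made up; in a Tomcat webapp you would typically wire it up from a ServletContextListener). System.gc() is only a hint to the JVM and is ignored entirely when -XX:+DisableExplicitGC is set, so treat this strictly as a stopgap.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGcWorkaround {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Request a full GC every 6 hours. This buys time; it does not fix a leak.
        scheduler.scheduleAtFixedRate(System::gc, 6, 6, TimeUnit.HOURS);
    }
}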
Beyond that, a meta, non-technical perspective. There are really two options:
You turn to your manager and tell him that you will need anything between 2 and 8 weeks to learn this stuff, so that you can identify the root cause and get it fixed.
You turn to your manager and tell him that he should look for some external consulting service to come in, identify the root cause, and help fix it.
In other words: your team/product needs some "expert" knowledge. Either you invest in building that knowledge internally, or you have to buy it from somewhere.
Related
I'm running a fairly classical Postgres/Hibernate/Spring MVC webapp, with the usual layers/frameworks.
Everything looks fine, except when I look at the memory graph in JavaMelody:
Periodically, it seems like memory grows, GC is called, then it grows again:
memory graph
When I dump the memory, it's always a 60-80 MB file, showing that the total memory used is around 60-80 MB, and no leak is detected.
If I remove JavaMelody and use JConsole, it shows pretty much the same problem: the memory keeps growing (a bit slower, though).
How can I see what these 100+ MB of objects are, constantly growing and then GC'ed? How can I fix this problem?
Any help or explanation regarding this kind of problem would be greatly appreciated!
Thanks in advance
EDIT: I forgot to mention that the graph comes from an isolated environment with absolutely NO user activity on it (no HTTP requests, no scheduled jobs).
That is the expected behavior of the Java garbage collector. Short-lived objects accumulate in memory until the garbage collection algorithm decides it is worth spending time reclaiming that memory.
You can analyze the memory dump (for instance, with the Eclipse Memory Analyzer) to discover where those objects are, but remember that this situation is not a problem (unless they eat all of your memory and an OutOfMemoryError is thrown).
It seems that the application server or web container on which the application is deployed is running some background process (JBoss, for example, has a batch process that tries to recover distributed transactions). Enable trace logging and see if it says something, but it's probably nothing you need to worry about.
I am having a really weird issue with a Java application.
Essentially it is a website that uses Magnolia (a CMS), with 4 instances available in the production environment. Sometimes the CPU goes to 100% in a Java process.
So the first approach was to take a thread dump and check the offending threads; what I found was weird:
"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000000ce37800 nid=0x7dcb runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000000ce39000 nid=0x7dcc runnable
OK, that is pretty weird; I have never had a problem with the garbage collector like that. So the next thing we did was to activate JMX and inspect the machine using jvisualvm: the heap memory usage was really high (95%).
Naive approach: increase memory so the problem takes more time to appear. Result: on the restarted server with increased memory (6 GB!) the problem appeared 20 hours after restart, while on other servers with less memory (4 GB!) that had been running for 10 days, the problem still took a few more days to reappear. I also tried to take the Apache access log from the failing server and replay the requests with JMeter against a local server in an attempt to reproduce the error... it did not work either.
Then I investigated the logs a little bit more and found these errors:
info.magnolia.module.data.importer.ImportException: Error while importing with handler [brightcoveplaylist]:GC overhead limit exceeded
at info.magnolia.module.data.importer.ImportHandler.execute(ImportHandler.java:464)
at info.magnolia.module.data.commands.ImportCommand.execute(ImportCommand.java:83)
at info.magnolia.commands.MgnlCommand.executePooledOrSynchronized(MgnlCommand.java:174)
at info.magnolia.commands.MgnlCommand.execute(MgnlCommand.java:161)
at info.magnolia.module.scheduler.CommandJob.execute(CommandJob.java:91)
at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Another example
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StackTraceElement.toString(StackTraceElement.java:175)
at java.lang.String.valueOf(String.java:2838)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at java.lang.Throwable.printStackTrace(Throwable.java:529)
at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
at info.magnolia.module.templatingkit.functions.STKTemplatingFunctions.getReferencedContent(STKTemplatingFunctions.java:417)
at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLinkNode(InternalLinkModel.java:90)
at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLink(InternalLinkModel.java:66)
at sun.reflect.GeneratedMethodAccessor174.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at freemarker.ext.beans.BeansWrapper.invokeMethod(BeansWrapper.java:866)
at freemarker.ext.beans.BeanModel.invokeThroughDescriptor(BeanModel.java:277)
at freemarker.ext.beans.BeanModel.get(BeanModel.java:184)
at freemarker.core.Dot._getAsTemplateModel(Dot.java:76)
at freemarker.core.Expression.getAsTemplateModel(Expression.java:89)
at freemarker.core.BuiltIn$existsBI._getAsTemplateModel(BuiltIn.java:709)
at freemarker.core.BuiltIn$existsBI.isTrue(BuiltIn.java:720)
at freemarker.core.OrExpression.isTrue(OrExpression.java:68)
Then I found out that this problem occurs because the garbage collector is using a ton of CPU but is not able to free much memory.
OK, so it is a problem with the MEMORY that manifests itself in the CPU, so if the memory usage problem is solved, the CPU should be fine. I took a heap dump, but unfortunately it was just too big to open (the file was 10 GB). Instead I ran the server locally, loaded it a little bit, and took a heap dump; after opening it, I found something interesting:
There are a TON of instances of
AbstractReferenceMap$WeakRef ==> Takes 21.6% of the memory, 9 million instances
AbstractReferenceMap$ReferenceEntry ==> Takes 9.6% of the memory, 3 million instances
In addition, I have found a Map which seems to be used as a "cache" (horrible but true). The problem is that this map is NOT synchronized and it is shared among threads (it is static). The problem could be not only concurrent writes but also the fact that, with no synchronization, there is no guarantee that thread A will see the changes made to the map by thread B. However, I am unable to link this suspicious map to the findings in the Eclipse memory analyzer, as it does not use the AbstractReferenceMap; it is just a normal HashMap.
Unfortunately, we do not use those classes directly (obviously the code uses them, but not directly), so I seem to have hit a dead end.
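For what it's worth, here is a minimal sketch of a safer shape for such a shared static cache (the class and key/value types are invented; the point is only that ConcurrentHashMap gives both thread-safe updates and visibility between threads, which the plain static HashMap described above does not):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TemplateCache {
    // Unsafe original shape: a plain HashMap mutated by many request threads.
    // private static final Map<String, Object> CACHE = new HashMap<String, Object>();

    // Thread-safe replacement: atomic updates plus guaranteed visibility.
    private static final Map<String, Object> CACHE =
            new ConcurrentHashMap<String, Object>();

    public static Object get(String key) {
        return CACHE.get(key);
    }

    public static void put(String key, Object value) {
        CACHE.put(key, value);
    }
}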
Problems for me are:
I cannot reproduce the error
I cannot figure out where the hell the memory is leaking (if that is the case)
Any ideas at all?
The 'no-op' finalize() methods should definitely be removed as they are likely to make any GC performance problems worse. But I suspect that you have other memory leak issues as well.
Advice:
First get rid of the useless finalize() methods.
If you have other finalize() methods, consider getting rid of them. (Depending on finalization to do things is generally a bad idea ...)
Use a memory profiler to try to identify the objects that are being leaked, and what is causing the leakage. There are lots of SO questions ... and other resources on finding leaks in Java code. For example:
How to find a Java Memory Leak
Troubleshooting Guide for Java SE 6 with HotSpot VM, Chapter 3.
Now to your particular symptoms.
First of all, the place where the OutOfMemoryErrors were thrown is probably irrelevant.
However, the fact that you have huge numbers of AbstractReferenceMap$WeakRef and AbstractReferenceMap$ReferenceEntry objects is a strong indication that something in your application or the libraries it is using is doing a huge amount of caching ... and that that caching is implicated in the problem. (The AbstractReferenceMap class is part of the Apache Commons Collections library. It is the superclass of ReferenceMap and ReferenceIdentityMap.)
You need to track down the map object (or objects) that those WeakRef and ReferenceEntry objects belong to, and the (target) objects that they refer to. Then you need to figure out what is creating it / them and figure out why the entries are not being cleared in response to the high memory demand.
Do you have strong references to the target objects elsewhere (which would stop the WeakRefs from being broken)?
Is/are the map(s) being used incorrectly so as to cause a leak? (Read the javadocs carefully ...)
Are the maps being used by multiple threads without external synchronization? That could result in corruption, which potentially could manifest as a massive storage leak.
Unfortunately, these are only theories and there could be other things causing this. And indeed, it is conceivable that this is not a memory leak at all.
Finally, there is your observation that the problem is worse when the heap is bigger. To me, this is still consistent with a Reference / cache-related issue.
Reference objects are more work for the GC than regular references.
When the GC needs to "break" a Reference, that creates more work; e.g. processing the Reference queues.
Even when that happens, the resulting unreachable objects still can't be collected until the next GC cycle at the earliest.
So I can see how a 6Gb heap full of References would significantly increase the percentage of time spent in the GC ... compared to a 4Gb heap, and that could cause the "GC Overhead Limit" mechanism to kick in earlier.
But I reckon that this is an incidental symptom rather than the root cause.
With a difficult debugging problem, you need to find a way to reproduce it. Only then will you be able to test experimental changes and determine if they make the problem better or worse. In this case, I'd try writing loops that rapidly create & delete server connections, that create a server connection and rapidly send it memory-expensive requests, etc.
After you can reproduce it, try reducing the heap size to see if you can reproduce it faster. But do that second since a small heap might not hit the "GC overhead limit" which means the GC is spending excessive time (98% by some measure) trying to recover memory.
For a memory leak, you need to figure out where in the code it's accumulating references to objects. E.g. does it build a Map of all incoming network requests?
A web search https://www.google.com/search?q=how+to+debug+java+memory+leaks shows many helpful articles on how to debug Java memory leaks, including tips on using tools like the Eclipse Memory Analyzer that you're using. A search for the specific error message https://www.google.com/search?q=GC+overhead+limit+exceeded is also helpful.
The no-op finalize() methods shouldn't cause this problem but they may well exacerbate it. The doc on finalize() reveals that having a finalize() method forces the GC to twice determine that the instance is unreferenced (before and after calling finalize()).
So once you can reproduce the problem, try deleting those no-op finalize() methods and see if the problem takes longer to reproduce.
It's significant that there are many AbstractReferenceMap$WeakRef instances in memory. The point of a weak reference is to refer to an object without forcing it to stay in memory. AbstractReferenceMap is a Map that lets one make the keys and/or values be weak references or soft references. (The point of a soft reference is to try to keep an object in memory but let the GC free it when memory gets low.) Anyway, all those WeakRef instances in memory are probably exacerbating the problem but shouldn't keep the referenced Map keys/values in memory. What are they referring to? What else is referring to those objects?
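To make the weak-reference behaviour concrete, here is a small standalone demo (not Magnolia or Commons Collections code) showing that the GC may clear a WeakReference whose target has no strong references left, but never one whose target is still strongly reachable:

import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        Object onlyWeaklyHeld = new Object();
        Object alsoStronglyHeld = new Object();

        WeakReference<Object> weak1 = new WeakReference<Object>(onlyWeaklyHeld);
        WeakReference<Object> weak2 = new WeakReference<Object>(alsoStronglyHeld);

        // Drop our strong reference to the first object only.
        onlyWeaklyHeld = null;

        System.gc(); // only a hint, but usually enough for this demo

        System.out.println("weak1 cleared: " + (weak1.get() == null)); // typically true
        System.out.println("weak2 cleared: " + (weak2.get() == null)); // false: strong ref remains
        System.out.println(alsoStronglyHeld); // keeps the second object strongly reachable
    }
}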
Try a tool that locates the leaks in your source code, such as Plumbr.
There are a number of possibilities, perhaps some of which you've explored.
It's definitely a memory leak of some sort.
If your server has user sessions, and your user sessions aren't expiring or being disposed of properly when the user is inactive for more than X minutes/hours, you will get a buildup of used memory.
If you have one or more maps of something that your program generates, and you don't clear old/unneeded entries out of them, you can again get a buildup of used memory. For example, I once considered adding a map to keep track of process threads so that a user could get info from each thread, until my boss pointed out that finished threads were never removed from the map, so if the user stayed logged in and active, they would hold onto those threads forever. A sketch of the kind of eviction that was missing follows below.
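As a rough illustration, a size-bounded LRU map built on LinkedHashMap.removeEldestEntry is often enough; the capacity is arbitrary and the class is invented for the example (wrap it with Collections.synchronizedMap if several threads use it):

import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 1000;

    public BoundedCache() {
        // accessOrder = true turns this into a simple LRU cache.
        super(16, 0.75f, true);
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cap is exceeded,
        // so old/unneeded entries cannot accumulate forever.
        return size() > MAX_ENTRIES;
    }
}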
You should try doing a load test on a non-production server where you simulate normal usage of your app by large numbers of users. Maybe even limit the server's memory to less than usual.
Good luck, memory issues are a pain to track down.
You say that you have already tried jvisualvm to inspect the machine. Maybe try it again, like this:
This time look at the "Sampler -> Memory" tab.
It should tell you which (types of) objects occupy the most memory.
Then find out where such objects are usually created and removed.
A lot of times 'weird' errors can be caused by java agents plugged into the JVM. If you have any agents running (e.g. jrebel/liverebel, newrelic, jprofiler), try running without them first.
Weird things can also happen when running JVM with non-standard parameters (-XX); certain combinations are known to cause problems; which parameters are you using currently?
The memory leak can also be in Magnolia itself; have you tried googling "magnolia leak"? Are you using any 3rd-party Magnolia modules? If possible, try disabling/removing them.
The problem might be connected to just one part of your application. You can try reproducing the problem by "replaying" your access logs on your staging/development server.
If nothing else works, if it were me, I would do the following:
- try to replicate the problem on an "empty" Magnolia instance (without any of my code)
- try to replicate the problem on an "empty" Magnolia instance (without 3rd-party modules)
- try to upgrade all software (Magnolia, 3rd-party modules, JVM)
- finally, try to run the production site with YourKit and try to find the leak
My guess is that you have an automated import running which invokes some instance of ImportHandler. That handler is configured to make a backup of all the nodes it's going to update (I think this is the default option), and since you probably have a lot of data of that data type, and since all of this is done in a session, you run out of memory. Try to find out which import job it is and disable backup for it.
HTH,
Jan
It appears that your memory leaks are emanating from your arrays. The garbage collector cannot collect object instances that were logically "removed" from an array but are still referenced by an array slot, so their memory is never released. My advice is, when you remove an object from an array, to assign null to the former object's position, so the garbage collector can see that the slot no longer references it and reclaim it. I doubt this is your exact problem, but it is always good to know these things and check whether it applies.
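A minimal sketch of that advice, using the classic array-backed stack example (the class is invented for illustration; java.util.ArrayList does the same nulling internally when elements are removed):

import java.util.EmptyStackException;

public class ObjectStack {
    private Object[] elements = new Object[16];
    private int size;

    public void push(Object e) {
        if (size == elements.length) {
            Object[] bigger = new Object[elements.length * 2];
            System.arraycopy(elements, 0, bigger, 0, size);
            elements = bigger;
        }
        elements[size++] = e;
    }

    public Object pop() {
        if (size == 0) {
            throw new EmptyStackException();
        }
        Object result = elements[--size];
        // Without this line the popped object stays reachable through the
        // array slot and cannot be garbage collected (the classic
        // "obsolete reference" example from Effective Java).
        elements[size] = null;
        return result;
    }
}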
It is also good to assign null to an object reference when you need to remove it / clean it up. This is because the finalize() method is sketchy and evil, and sometimes will not be called by the garbage collector. The best workaround is to call your cleanup (or another similar method) yourself; that way, you are assured that the cleanup was actually performed. As Joshua Bloch said in his book Effective Java, 2nd Edition, Item 7, page 27: avoid finalizers. "Finalizers are unpredictable, often dangerous and generally unnecessary." You can see the section here.
Because there is no code displayed, I cannot see if any of these methods can be useful, but it is still worth knowing these things. Hope these tips help you!
As recommended above, I'd get in touch with the devs of Magnolia, but meanwhile:
You are getting this error because the GC doesn't collect much on each run:
The concurrent collector will throw an OutOfMemoryError if too much
time is being spent in garbage collection: if more than 98% of the
total time is spent in garbage collection and less than 2% of the heap
is recovered, an OutOfMemoryError will be thrown.
Since you can't change the implementation, I would recommend changing the GC configuration so that it runs less frequently, making it less likely to fail in this way.
Here is an example config just to get you started on the parameters; you will have to figure out your own sweet spot. The GC logs will probably be of help for that.
My VM params are as follows:
-Xms6G
-Xmx6G
-XX:MaxPermSize=1G
-XX:NewSize=2G
-XX:MaxTenuringThreshold=8
-XX:SurvivorRatio=7
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
-XX:+CMSPermGenSweepingEnabled
-XX:CMSInitiatingOccupancyFraction=60
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-Xloggc:logs/gc.log
I'm experiencing a very odd problem with a Java application running under Tomcat.
We tried to update the production code with a fresh build produced in a one-week sprint. The application had been running for months without hiccups, and then this new code makes our Linux servers start swapping after some time.
The very strange thing is that, looking at memory usage in VisualVM, it never exceeds the maximum heap size; the JVM does not throw an OutOfMemoryError, the machine just starts swapping, and the JVM keeps running even after that.
So it seems memory is leaking from somewhere; it seems like it's from the new code, but it's odd that it's not inside the JVM heap. Any ideas on how to debug that?
Thanks!
Swapping is not a conclusive indicator of leakage; it results from low physical memory. Use vmstat on Linux to get swap usage. Try using a different machine and experiment with configurations: swap size, physical memory size, address space.
If you are confident that the problem is in your program try this:
Estimate the median and peak memory that your program should use. You must be able to account for all deviations from these metrics. If you cannot, proceed to step 3.
Assuming you did step 1 correctly and were able to account for all deviations, you can rule out a leak (sorry about such vague suggestions, but debugging is only as good as the detective). You should now focus on GC tuning. First, enable GC logging. See if your heap is actually full and where the GC is spending most of its time collecting (a small sketch after these steps shows how to read the same numbers programmatically). This may be a good starting point for optimizations. Try to see if adjusting GC options helps: experiment with collection algorithms, max/min heap sizes, generation ratios, etc. Only experiment once you have ruled out a leak (step 1).
Assuming you did step 1 correctly and were not able to account for all deviations, you can assume that you have a leak somewhere. Use a memory profiler to see which objects contribute most to the heap size growth. Leave the profiler running for an extended period of time: have your program handle the requests it routinely expects to get and then leave it relatively idle after that. If the memory level keeps growing, you may have a leak; if not, then it is probably not a memory leak. Can you pinpoint the part of your program that may be creating those objects? If yes, try sending several requests that only target that part of your program. Does that reproduce the problem deterministically? If no, repeat step 3. If yes, use divide and conquer and reapply step 3 until you can find the class/method that is the culprit. It can be a certain combination of multiple parts as well (meaning that individually they may look innocent, but together they may form a brilliant crime syndicate).
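As a complement to the GC logs mentioned in step 2, here is a small sketch that reads heap occupancy and cumulative GC time programmatically through the standard management beans (the class name is made up; run it inside, or alongside, the JVM you are investigating):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcStats {
    public static void main(String[] args) {
        MemoryUsage heap =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("Heap used/max (MB): "
                + heap.getUsed() / (1024 * 1024) + " / "
                + heap.getMax() / (1024 * 1024));

        // One bean per collector (e.g. young and old generation collectors).
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total time ms=" + gc.getCollectionTime());
        }
    }
}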
Hope this helps, if not then please leave a comment to my post.
All the very best on your exercise!
I would suggest you look into creating heap dumps without using jvisualvm. For Unix-based Oracle JVMs this is normally done by sending signal 3 (SIGQUIT) to the JVM using kill.
For full details see http://www.startux.de/index.php/java/45-java-heap-dumpyvComment45
You can then see if the pattern changes.
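If you prefer to trigger the dump from inside the application (for instance from an admin servlet), a sketch using the HotSpot-specific diagnostic MXBean looks like this; it assumes an Oracle/OpenJDK HotSpot JVM, and the output path is just an example:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, which keeps the file smaller.
        diagnostic.dumpHeap("/tmp/app-heap.hprof", true);
    }
}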
If you do not get an idea from this, it might be because you are storing a substring of a very large original string (which drags the underlying character array around with it), or because you are holding on to operating system resources like open database connections, etc.
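The substring case is easy to demonstrate. A minimal sketch (class and sizes invented), relevant for JVMs before Java 7u6 where String.substring shared the original backing char[]:

import java.util.ArrayList;
import java.util.List;

public class SubstringRetention {
    private static final List<String> ids = new ArrayList<String>();

    static void remember(String hugeDocument) {
        // Potentially retains the whole document's char[] on older JVMs:
        // ids.add(hugeDocument.substring(0, 8));

        // Defensive copy breaks the link to the large backing array
        // (harmless no-op on modern JVMs, which copy in substring anyway).
        ids.add(new String(hugeDocument.substring(0, 8)));
    }
}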
Have you checked that your connection pool looks good?
If you aren't already, I'd recommend using VisualVM version 1.3.2 with all the plug-ins. It's a big jump up from earlier versions.
What happens to the perm gen space?
What are the memory settings you're using? Min and max, of course, but what about perm space size?
I have a mobile application that is suffering from slow-down over time. My hunch (in part fed by this article) is that this is due to memory fragmentation slowing the app down, but I'm not sure. Here's a pretty graph of the app's memory use over time:
fraggle rock http://kupio.com/image-dump/fragmented.png
The 4 peaks on the graph are 4 executions of the exact same task on the app. I start the task, it allocates a bunch of memory, it sits for a bit (The flat line on top) and then I stop the task. At that point it calls System.gc(); and the memory gets cleaned up.
As can be seen, each of the 4 runs of the exact same task take longer to execute. The low-points in the graph all return to the same level so there do not seem to be any memory leaks between task runs.
What I want to know is, is memory fragmentation a feasible explanation or should I look elsewhere first, bearing in mind that I've already done a lot of looking? The low-points on the graph are relatively low so my assumption is that in this state the memory would not be very fragmented since there can't be a lot of small memory holes to be causing problems.
I don't know how the j2me memory allocator works though, so I really don't know. Can anyone advise? Has anyone else had problems with this and recognises the memory profile of the app?
If you've got a little bit of time, you could test your theory by re-using memory with a memory-pool technique: each run of the task uses the 'same' chunks of memory by getting them from the pool and returning them when it releases them, as sketched below.
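A very rough sketch of such a pool for a MIDP/CLDC environment could look like this (no generics or java.util.concurrent there; the buffer size and the idea of pooling byte[] buffers are assumptions about what your task allocates):

import java.util.Vector;

public class BufferPool {
    private static final int BUFFER_SIZE = 4096;
    private final Vector free = new Vector();

    public synchronized byte[] acquire() {
        if (free.isEmpty()) {
            return new byte[BUFFER_SIZE];
        }
        byte[] buffer = (byte[]) free.lastElement();
        free.removeElementAt(free.size() - 1);
        return buffer;
    }

    public synchronized void release(byte[] buffer) {
        // Returning the buffer keeps the same chunks of memory in use from
        // run to run, which is what the fragmentation experiment needs.
        free.addElement(buffer);
    }
}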
If you're still seeing the degrading performance after doing this investigation, it's not memory fragmentation causing the problem. Let us all know your results and we can help troubleshoot further.
Memory fragmentation would account for it... What is not clear is whether the app's use of memory is causing paging; that would also slow things down and could cause the same issues.
If the problem really is memory fragmentation, there is not much you can do about it.
But before you give up in despair, try running your app with an execution profiler to see if it is spending a lot of time executing in an unexpected place. It is possible that the slow-down is actually due to a problem in your algorithms and has nothing to do with memory fragmentation. As people have already said, J2ME garbage collectors should not suffer from fragmentation issues.
Consider looking at garbage collection statistics. You should see a lot more collections on the last run than the first if your theory is to hold. Another thought might be that something else is eating your memory, so your application has less.
In other words, profiler time :)
What OS are you running this on? I have some experience with Windows CE5 (or Windows Mobile) devices. CE5's operating system level memory architecture is quite broken and will fail soon for memory intensive applications. Your graph does not have any scales, but every process only gets 32MB of address space on CE5. The VM and shared libraries will take their fair share of that as well, leaving you with quite little left.
The only way around this is to re-use the memory you allocated instead of giving it back to the collector and re-allocating later. This is, of course, much more low-level programming than you would usually want to do in Java, but on this platform you might be out of luck.
I've been tasked with debugging a Java (J2SE) application which, after some period of activity, begins to throw OutOfMemory exceptions. I am new to Java, but have programming experience. I'm interested in your opinions on what a good approach to diagnosing a problem like this might be.
Thus far I've employed JConsole to get a picture of what's going on. I have a hunch that there are objects which are not being released properly and therefore not being cleaned up during garbage collection.
Are there any tools I might use to get a picture of the object ecosystem? Where would you start?
I'd start with a proper Java profiler. JConsole is free, but it's nowhere near as full featured as the ones that cost money. I used JProfiler, and it was well worth the money. See https://stackoverflow.com/questions/14762/please-recommend-a-java-profiler for more options and opinions.
Try the Eclipse Memory Analyzer, or any other tool that can process a Java heap dump, and then run your app with the flag that generates a heap dump when you run out of memory (-XX:+HeapDumpOnOutOfMemoryError).
Then analyze the heap dump and look for suspiciously high object counts.
See this article for more information on the heap dump.
EDIT: Also, please note that your app may just legitimately require more memory than you initially thought. You might try increasing the java minimum and maximum memory allocation to something significantly larger first and see if your application runs indefinitely or simply gets slightly further.
The latest version of the Sun JDK includes VisualVM, which is essentially the NetBeans profiler by itself. It works really well.
http://www.yourkit.com/download/index.jsp is the only tool you'll need.
You can take snapshots at (1) app start time, and (2) after running app for N amount of time, then comparing the snapshots to see where memory gets allocated. It will also take a snapshot on OutOfMemoryError so you can compare this snapshot with (1).
For instance, the latest project I had to troubleshoot threw OutOfMemoryError exceptions, and after firing up YourKit I realised that most memory was in fact being allocated to some Ehcache "LFU" class; the point being that we had specified loads of a certain POJO to be cached in memory, but had not specified large enough -Xms and -Xmx values (starting and maximum JVM memory allocation).
I've also used Linux's vmstat (e.g. some Linux platforms just don't have enough swap enabled, or don't allocate contiguous blocks of memory), and then there's jstat (bundled with the JDK).
UPDATE see https://stackoverflow.com/questions/14762/please-recommend-a-java-profiler
You can also add an uncaught exception handler to your application's threads (the actual API is Thread.UncaughtExceptionHandler). This will catch 'uncaught' exceptions, like an OutOfMemoryError, and you will at least have an idea of where the exception was thrown. Usually this is not where the problem is, but rather the 'new' that couldn't be satisfied. As a rule I always add an uncaught exception handler to a Thread, if nothing else to add logging.
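A minimal sketch of wiring that up with the standard API (Thread.setDefaultUncaughtExceptionHandler, available since Java 5); the logging here is just System.err for brevity, so swap in your own logger:

public class GlobalExceptionLogging {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                // Records which thread died and the stack of the failed allocation
                // (for an OutOfMemoryError) or other uncaught exception.
                System.err.println("Uncaught exception in thread " + t.getName());
                e.printStackTrace();
            }
        });
    }
}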