I'm experiencing repeated Full GCs even when the heap is not fully used.
This is what the GC logs look like: http://d.pr/i/iFug (the blue line is the used heap and the grey rectangles are full GCs).
It seems to be a problem similar to the one posted in this question: Frequent full GC with empty heap
However, that thread didn't provide any actual answers to the problem. My application does use RMI, and the production servers are indeed running Java 1.6 prior to Update 45, which increased the RMI GC interval from 1 minute to 1 hour (http://docs.oracle.com/javase/7/docs/technotes/guides/rmi/relnotes.html). However, in the rest of the log I can't see that full-GC-every-minute pattern.
What could possibly be causing this?
Most likely the cause is that you have reached the current size of the heap. The heap starts smaller than the maximum you set and is resized as the program runs.
e.g. say you set a maximum of 1 GB; the initial heap size might be 256 MB, and when you reach 256 MB a full GC is performed. After this GC the JVM might decide that 400 MB would be a better size, and when that is reached another full GC is performed, and so on.
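The resizing described above can be observed from inside the JVM with the standard Runtime API; a minimal sketch (the actual sizes printed are whatever your JVM happens to have chosen):

```java
public class HeapSize {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the currently committed heap, which grows toward -Xmx
        // as the program allocates; maxMemory() is the ceiling set by -Xmx.
        long committed = rt.totalMemory() / (1024 * 1024);
        long max = rt.maxMemory() / (1024 * 1024);
        System.out.println("committed: " + committed + " MB, max: " + max + " MB");
    }
}
```

Run it right after startup and again under load and you will typically see the committed figure climb toward the maximum, with full GCs at each resize point.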
You get a major collection when the tenured space fills or the collector fails to find free space, e.g. if it is fragmented.
You also get full collections if your survivor spaces are too small.
In short, the most likely cause is the gc tuning parameters you used. I suggest you simplify your tuning parameters until your system behaves in a manner you expect.
As noted in the linked thread, disable explicit GC (-XX:+DisableExplicitGC) and see if the full GC pattern occurs again. The RMI code may be triggering an explicit GC at a fixed interval, which may not be desirable in some cases.
If the full GCs still occur, I would take thread dumps and possibly a heap dump to analyze the issue.
Also, use jstat to see the occupancy of the Eden, Survivor, and old gen spaces.
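If attaching jstat is inconvenient, the same pool occupancies can be read from inside the JVM via the standard java.lang.management API; a small sketch (pool names vary by collector, e.g. "PS Eden Space" vs. "G1 Eden Space", so nothing is hard-coded here):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PoolUsage {
    public static void main(String[] args) {
        // Prints used/committed bytes for every memory pool this JVM exposes.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (u == null) continue; // pool no longer valid
            System.out.printf("%-30s used=%,d committed=%,d%n",
                    pool.getName(), u.getUsed(), u.getCommitted());
        }
    }
}
```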
I'm trying to understand what the -XX:G1ReservePercent actually does. The descriptions I found in the official documentation are not really comprehensive:
Sets the percentage of reserve memory to keep free so as to reduce the risk of to-space overflows. The default is 10 percent. When you increase or decrease the percentage, make sure to adjust the total Java heap by the same amount.
and the description of the to-space exhausted log entry is this:
When you see to-space overflow/exhausted messages in your logs, the G1 GC does not have enough memory for either survivor or promoted objects, or for both.
[...]
To alleviate the problem, try the following adjustments:
Increase the value of the -XX:G1ReservePercent option (and the total heap accordingly) to increase the amount of reserve memory for "to-space".
[...]
Judging by the quote, to-space exhausted means that when performing a mixed evacuation we do not have enough free regions to move survivors into.
But this then contradicts the following official tuning advice for the case of full GCs (emphasis mine):
Force G1 to start marking earlier. G1 automatically determines the Initiating Heap Occupancy Percent (IHOP) threshold based on earlier application behavior. If the application behavior changes, these predictions might be wrong. There are two options: Lower the target occupancy for when to start space-reclamation by increasing the buffer used in an adaptive IHOP calculation by modifying -XX:G1ReservePercent;
So what is the buffer, and what does setting -XX:G1ReservePercent do? (At first glance, AdaptiveIHOP has nothing to do with it...)
Does it keep some heap space always reserved, so that when a mixed evacuation occurs we always have free regions to move survivors into?
Or is the space used for G1's internal housekeeping tasks? If so, it is not clear what data the to-space contains such that it gets exhausted.
To me, understanding what it really does means going to the source code. I chose jdk-15.
The best description of this flag is here:
It determines the minimum reserve we should have in the heap to minimize the probability of promotion failure.
Excellent, so this has to do with "promotion failures" (whatever those are). But according to your quote in bold, this also has something to do with AdaptiveIHOP? Well, yes: this parameter matters only with AdaptiveIHOP (which is on by default).
Another thing to notice is that G1ReservePercent is not guaranteed to be maintained at all times; it is best effort.
If you look at how it is used in the very next line:
if (_free_regions_at_end_of_collection > _reserve_regions) {
absolute_max_length = _free_regions_at_end_of_collection - _reserve_regions;
}
things start to make some sense (for that bold statement). Notice how _reserve_regions is subtracted in that computation. G1 will reserve that space for "promotion failures" (we will get to that).
If G1 reserves that space, it means less space is available for actual allocations. So if you "increase this buffer" (you make G1ReservePercent bigger, as the quote suggests), the space for new objects becomes smaller, the point when GC needs to kick in comes sooner, and so the point when space reclamation needs to happen comes sooner too ("Lower the target occupancy for when to start space-reclamation..."). It is one complicated sentence, but in simple words it means:
If you increase G1ReservePercent, space will need to be reclaimed sooner (more frequent GC calls).
Which, to be fair, is obvious. Not that I agree that you should increase that value, but this is what that sentence says.
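To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The heap and region sizes are hypothetical, and the real G1 computation rounds differently; the point is only how the reserve scales:

```java
public class ReserveRegions {
    public static void main(String[] args) {
        // Hypothetical numbers: a 4 GB heap divided into 2 MB G1 regions.
        long totalRegions = (4L * 1024) / 2;   // 2048 regions
        int reservePercent = 10;               // default -XX:G1ReservePercent
        long reserveRegions = totalRegions * reservePercent / 100;
        System.out.println("regions held in reserve: " + reserveRegions); // 204
        // Doubling G1ReservePercent to 20 doubles the reserve, which shrinks
        // the space usable for allocations and so pulls reclamation earlier.
    }
}
```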
After a certain GC cycle, be that minor, mixed or major, G1 knows how many regions are free:
_free_regions_at_end_of_collection = _g1h->num_free_regions();
Which of course, is the length of "free list":
return _free_list.length();
Based on this value and _reserve_regions (G1ReservePercent), plus the target pause time (200 ms by default), it can compute how many regions it needs for the next cycle. By the time the next cycle ends, there might be no empty regions left (or the ones that are empty cannot take all the live objects that are supposed to be moved). Where is Eden supposed to move live objects (to Survivor)? Or, if an old region is fragmented, where are live objects supposed to be moved to defragment it? This is what this buffer is for.
It acts as a safety net (the comments in the code make this far easier to understand), in the hope of avoiding a full GC. Because if there are no free regions (or not enough) to move live objects into, a full GC needs to happen (most probably followed by a young GC as well).
This value is usually known to be too small when the to-space exhausted message is present in the logs. Either increase it, or, much better, give more heap to the application.
I am using an Infinispan cache to store values. The code writes to the cache every 10 minutes and the cache reaches a size of about 400mb.
It has a time to live of about 2 hours, and the maximum entries is 16 million although currently in my tests the number of entries doesn't go above 2 million or so (I can see this by checking the mbeans/metrics in jconsole).
When I start jboss the java heap size is 1.5Gb to 2Gb. The -Xmx setting for the maximum allocated memory to jboss is 4Gb.
When I disable the Infinispan cache the heap memory usage stays flat at around 1.5Gb to 2Gb. It is very constant and stays at that level.
=> The problem is: when I have the Infinispan cache enabled the java heap size grows to about 3.5Gb/4Gb which is way more than expected.
I have done a heap dump to check the size of the cache in Eclipse MAT and it is only 300 or 400mb (which is ok).
So I would expect the memory usage to go to 2.5Gb and stay steady at that level, since the initial heap size is 2Gb and the maximum cache size should only be around 500mb.
However it continues to grow and grow over time. Every 2 or 3 hours a garbage collection is done and that brings the usage down to about 1 or 1.5Gb but it then increases again within 30 minutes up to 3.5Gb.
The number of entries stays steady at about 2 million so it is not due to just more entries going in to the cache. (Also the number of evictions stays at 0).
What could be holding on to this amount of memory if the cache is only 400-500mb?
Is it a problem with my garbage collection settings? Or should I look at Infinispan settings?
Thanks!
Edit: you can see the heap size over time here.
What is strange is that even after what looks like a full GC, the memory shoots back up again to 3Gb. This corresponds to more entries going into the cache.
Edit: It turns out this has nothing to do with Infinispan. I narrowed down the problem to a single line of code that is using a lot of memory (about 1Gb more than without the call).
But I do think more and more memory is being taken by the Infinispan cache, naturally because more entries are being added over the 2 hour time to live.
I also need to support upwards of 50 users querying Infinispan. When the heap reaches a high value like this (even without the memory leak mentioned above), I know it's not an error scenario in Java; however, I need as much memory available as possible.
Is there any way to "encourage" a heap dump past a certain point? I have tried using GC options to collect at a given proportion of heap for the old gen but in general the heap usage tends to creep up.
Probably what you're seeing is the JVM not collecting objects which have been evicted from the cache. Caches in general have a curious relationship with the prevailing idea of generational GC.
The generational GC idea is that, broadly speaking, there are two types of objects in the JVM - short lived ones, which are used and thrown away quickly, and longer lived ones, which are usually used throughout the lifetime of the application. In this model you want to tune your GC so that you put most of your effort attempting to identify the short lived objects. This means that you avoid looking at the long-lived objects as much as possible.
Caches disrupt this pattern by holding objects with intermediate lifespans (i.e. a few seconds / minutes / hours, depending on your cache). These objects often get promoted to the tenured generation, where they're not usually looked at until a full GC becomes necessary, even after they've been evicted from the cache.
If this is what's happening then you have a couple of choices:
ignore it, let the full GC do its thing and just be aware that this is what's happening.
try to tune the GC so that it takes longer for objects to get promoted to the tenured generation. There are some GC flags which can help with that.
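The intermediate-lifespan pattern this answer describes can be sketched with a toy TTL cache. This is not Infinispan's API, just an illustration of why evicted entries tend to die in the tenured generation: they live long enough to be promoted, then become garbage that only an old-gen/full collection will reclaim.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal TTL cache sketch (hypothetical, not Infinispan). Entries survive for
// ttlMillis, easily long enough to be tenured, then expire and become garbage
// that sits in the old generation until a full GC runs.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new HashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || e.expiresAt < System.currentTimeMillis()) {
            map.remove(key); // expired entry becomes garbage, likely already tenured
            return null;
        }
        return e.value;
    }
}
```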
I'm currently monitoring my running Java application with VisualVM: http://visualvm.java.net/
I'm stressing the memory usage by limiting it with -Xmx128m.
When running, I see the heap size increasing to 128m (as expected); however, the used heap converges to approximately 105m before I run into a Java heap space error.
Why are these remaining 20m not used?
You need to understand a central fact about garbage collector ergonomics:
The costly part of garbage collection is finding and dealing with the objects that are NOT garbage.
This means: as the heap gets close to its maximum capacity, the GC will spend more and more time for less and less return in reclaimed space. If the GC was to try and use every last byte of memory, the net result would be that your JVM would spend more and more time garbage collecting, until ... eventually ... almost no useful work was being done.
To avoid this pathological situation, the JVM monitors the ratio of time spent GC'ing to time spent doing useful work. When the ratio exceeds a configurable threshold value, the GC raises an OutOfMemoryError ... even though (technically) there is free memory available. This is probably what you are seeing, though the other explanations are equally plausible.
You can change the GC thresholds, generation sizes, etc. via JVM options, but it is probably better not to. A better idea is to figure out why your application's memory usage is continually creeping upwards. There are most likely memory leaks ... i.e. bugs ... in your code that are causing this. Spend your effort finding and fixing those bugs, rather than worrying about why you are not using all of the memory.
(In fact, you are using it ... but not all of the time.)
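The overhead rule mentioned in this answer can be sketched numerically. In HotSpot the defaults are roughly 98% of time in GC while less than 2% of the heap is recovered (tunable via -XX:GCTimeLimit and -XX:GCHeapFreeLimit); the measurements below are hypothetical, and the JVM's exact bookkeeping is an implementation detail:

```java
public class GcOverheadCheck {
    public static void main(String[] args) {
        // Hypothetical measurements mirroring the "GC overhead limit" rule.
        double gcTimeFraction = 0.985;  // fraction of recent time spent in GC
        double heapRecovered = 0.01;    // fraction of heap freed by the last GC
        boolean raiseOom = gcTimeFraction > 0.98 && heapRecovered < 0.02;
        System.out.println("would raise OutOfMemoryError: " + raiseOom); // true
    }
}
```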
The heap is split up in Young-Generation (Eden-Space, and two Survivor-Spaces of identical size usually called From and To), Old Generation (Tenured) and Permanent Space.
The Xmx/Xms options set the overall heap size. So a region (with a default size) is actually the Permanent Space, and maybe (we don't know the details of your stress test) no objects are actually moved from Eden to tenured or permanent, so those regions remain empty while Eden runs out of space.
Java splits its memory into generations. You can get a heap space error if the tenured generation fills. Normally the generations resize dynamically, but if you have set a fixed size they won't.
I always had a question about heap memory behaviour.
Profiling my app I get the above graph. All seems fine, but what I don't understand is why, at GC time, the heap grows a little bit even though there is enough memory (red circle).
Does that mean that a long-running app will run out of heap space at some point?
Not necessarily. The garbage collector is free to use up to the maximum allocated heap in any way it sees fit. Extrapolating future GC behaviour based on current behaviour (but with different memory conditions) is in no way guaranteed to be accurate.
This does have the unfortunate side effect that it's very difficult to determine whether an OutOfMemoryError is going to happen unless it does. A legal (yet probably quite inefficient) garbage collector could simply do nothing until the memory ceiling was hit, then do a stop-the-world mark and sweep of the entire heap. With this implementation, you'd see your memory constantly increasing, and might be tempted to say that an OOME was imminent, but you just can't tell.
With such small heap sizes, the increase here is likely just due to bookkeeping/cache size alignment/etc. You're talking about less than 50KB or so, looking at the resolution of the scale, so I wouldn't be worried.
If you do think there's a legitimate risk of OutOfMemoryErrors, the only way to show this is to put a stress test together and show that the application really does run out of heap space.
The HotSpot garbage collectors decide to increase the total heap size immediately after a full GC has completed if the ratio of free space to total heap size falls below a certain threshold. This ratio can be tuned using one of the many -XX options for the garbage collector(s).
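As a sketch of that rule, with hypothetical numbers and -XX:MinHeapFreeRatio as the tunable threshold (its default is 40 for the throughput collector; other collectors and versions may differ):

```java
public class HeapExpansionCheck {
    public static void main(String[] args) {
        // Hypothetical post-full-GC state: 512 MB committed, 400 MB still used.
        long committedMb = 512;
        long usedAfterFullGcMb = 400;
        int minHeapFreeRatio = 40; // -XX:MinHeapFreeRatio (assumed default)
        long freePercent = (committedMb - usedAfterFullGcMb) * 100 / committedMb;
        boolean expand = freePercent < minHeapFreeRatio;
        System.out.println("free=" + freePercent + "% -> grow heap: " + expand);
    }
}
```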
Looking at the memory graph, you will see that the heap size increases occur at the "saw points"; i.e. the local maxima. Each of these correspond to running a full GC. If you look really carefully at the "points" where the heap gets expanded, you will see that in each case the amount of free space immediately following the full GC is a bit higher than the previous such "point".
I imagine that what is happening is that your application's memory usage is cyclical. If the GC runs at or near a high point of the cycle, it won't be able to free as much memory as when it runs at or near a low point. This variability may be enough to cause the GC to expand the heap.
(Another possibility is that your application has a slow memory leak.)
Does that mean that a long-running app will run out of heap space at some point?
No. Assuming that your application's memory usage (i.e. the total space occupied by reachable objects) is cyclic, the heap size will approach a fixed high limit and never exceed it. Certainly OOMEs are not inevitable.
In Java, a concurrent mode failure means that the concurrent collector failed to free up enough memory from the tenured and permanent generations, so it has to give up and let a full stop-the-world GC kick in. The end result can be very expensive.
I understand this concept but never had a good comprehensive understanding of
A) what could cause a concurrent mode failure and
B) what's the solution?
This lack of clarity leads me to write and debug code without much to go on, and I often end up shopping around among performance flags from Foo to Bar without a particular reason, just trying things.
I'd like to learn from the developers here: what has your experience been? If you have encountered such performance issues, what was the cause and how did you address it?
If you have coding recommendations, please don't be too general. Thanks!
The first thing I learned about CMS is that it needs more memory than the other collectors; about 25 to 50% more is a good starting point. This helps you avoid fragmentation, since CMS does not do any compaction like the stop-the-world collectors would. Second, do things that help the garbage collector: use Integer.valueOf instead of new Integer, get rid of anonymous classes, make sure inner classes are not accessing inaccessible things (private members of the outer class), and so on. The less garbage the better. FindBugs, and not ignoring its warnings, will help a lot with this.
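The Integer.valueOf advice is easy to demonstrate. The integer cache for values in -128..127 is guaranteed by the JLS, so the first comparison below is reliably true:

```java
public class BoxingGarbage {
    public static void main(String[] args) {
        // Integer.valueOf returns cached instances for -128..127 (guaranteed by
        // the JLS), so boxing small values creates no new garbage.
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        System.out.println(a == b); // true: same cached instance

        // new Integer(...) always allocates (and is deprecated since Java 9).
        // Outside the cache range valueOf allocates too, so compare with equals():
        Integer c = Integer.valueOf(1000);
        Integer d = Integer.valueOf(1000);
        System.out.println(c.equals(d)); // true
    }
}
```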
As far as tuning, I have found that you need to try several things:
-XX:+UseConcMarkSweepGC
Tells the JVM to use CMS in the tenured generation.
Fix the size of your heap: -Xmx2048m -Xms2048m. This prevents the GC from having to do things like grow and shrink the heap.
-XX:+UseParNewGC
Uses parallel instead of serial collection in the young generation. This will speed up your minor collections, especially if you have a very large young gen configured. A large young generation is generally good, but don't go above half of the old gen size.
-XX:ParallelCMSThreads=X
Sets the number of threads that CMS will use for the phases that can be done in parallel.
-XX:+CMSParallelRemarkEnabled The remark phase is serial by default; this can speed you up.
-XX:+CMSIncrementalMode Allows the application to run more by pausing GC between phases.
-XX:+CMSIncrementalPacing Allows the JVM to change how often it collects over time.
-XX:CMSIncrementalDutyCycleMin=X Minimum amount of time spent doing GC.
-XX:CMSIncrementalDutyCycle=X Start by doing GC this % of the time.
-XX:CMSIncrementalSafetyFactor=X
I have found that you can get generally low pause times if you set it up so that it is basically always collecting. Since most of the work is done in parallel, you end up with basically regular predictable pauses.
-XX:CMSFullGCsBeforeCompaction=1
This one is very important. It tells the CMS collector to always complete the collection before it starts a new one. Without this, you can run into the situation where it throws a bunch of work away and starts again.
-XX:+CMSClassUnloadingEnabled
By default, CMS will let your PermGen grow until it kills your app a few weeks from now. This stops that. Your PermGen would only be growing, though, if you make use of reflection, misuse String.intern, do something bad with a class loader, or a few other things.
Survivor ratio and tenuring threshold can also be played with, depending on whether you have long- or short-lived objects, and how much object copying between survivor spaces you can live with. If you know all your objects are going to stick around, you can configure zero-sized survivor spaces, and anything that survives one young gen collection will be immediately tenured.
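As a rough illustration of how SurvivorRatio carves up the young generation (HotSpot's real sizing adds alignment; the flag values here are hypothetical, matching the -Xmn1536m / SurvivorRatio=3 setup mentioned later in this thread):

```java
public class YoungGenSizing {
    public static void main(String[] args) {
        // With -Xmn1536m and -XX:SurvivorRatio=3, eden is 3x each survivor
        // space, and young = eden + 2 survivors = survivor * (ratio + 2).
        long youngMb = 1536;
        long ratio = 3;
        long survivorMb = youngMb / (ratio + 2); // 307 MB per survivor space
        long edenMb = youngMb - 2 * survivorMb;  // 922 MB
        System.out.println("eden=" + edenMb + "MB, survivors=2x" + survivorMb + "MB");
    }
}
```

A higher ratio shrinks the survivor spaces; a ratio with zero-sized survivors means everything surviving one young collection is tenured immediately.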
Quoted from "Understanding Concurrent Mark Sweep Garbage Collector Logs"
The concurrent mode failure can either be avoided by increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value
However, if there is really a memory leak in your application, you're just buying time.
If you need fast restart and recovery and prefer a 'die fast' approach, I would suggest not using CMS at all. I would stick with '-XX:+UseParallelGC'.
From "Garbage Collector Ergonomics"
The parallel garbage collector (UseParallelGC) throws an out-of-memory exception if an excessive amount of time is being spent collecting a small amount of the heap. To avoid this exception, you can increase the size of the heap. You can also set the parameters -XX:GCTimeLimit=time-limit and -XX:GCHeapFreeLimit=space-limit
Sometimes it OOMs pretty quickly and gets killed; sometimes it suffers a long GC period (last time it was over 10 hours).
It sounds to me like a memory leak is at the root of your problems.
A CMS failure won't (as I understand it) cause an OOM. Rather, a CMS failure happens because the JVM needs to do too many collections too quickly and CMS could not keep up. One situation where lots of collection cycles happen in a short period is when your heap is nearly full.
The really long GC time sounds weird ... but it is theoretically possible if your machine was thrashing horribly. However, a long period of repeated GCs is quite plausible if your heap is very nearly full.
You can configure the GC to give up when the heap is 1) at max size and 2) still close to full after a full GC has completed. Try doing this if you haven't already. It won't cure your problems, but at least your JVM will hit the OOM quickly, allowing a faster service restart and recovery.
EDIT - the option to do this is -XX:GCHeapFreeLimit=nnn where nnn is a number between 0 and 100 giving the minimum percentage of the heap that must be free after the GC. The default is 2. The option is listed in the aptly titled "The most complete list of -XX options for Java 6 JVM" page. (There are lots of -XX options listed there that don't appear in the Sun documentation. Unfortunately the page provides few details on what the options actually do.)
You should probably start looking to see if your application / webapp has memory leaks. If it has, your problems won't go away unless those leaks are found and fixed. In the long term, fiddling with the Hotspot GC options won't fix memory leaks.
I've found that using -XX:PretenureSizeThreshold=1m to make 'large' objects go immediately to tenured space greatly reduced my young GCs and concurrent mode failures, since the collector tends not to try to dump the young + one survivor's worth of data (xmn=1536m survivorratio=3 maxTenuringThreshold=5) before a full CMS cycle can complete. Yes, my survivor space is large, but about once every 2 days something comes into the app that needs it (and we run 12 app servers each day for 1 app).