What is the best way to measure GC pause times? - java

What is the best way to track the GC pause/stall times in a Java instance?
Can it be retrieved from the GarbageCollectorMXBean?
Can it be read from gc.log?
Does gc.log have any additional information that the MXBean doesn't have? I would prefer the first option because I want the instance to emit that metric to our monitoring system.
I have read through a few posts like this on SO, but I don't seem to be getting the right answer. I am specifically looking for the GC stall times and not the total time spent on GC.

Garbage collection is not the only reason for JVM stop-the-world pauses.
You may want to count other reasons, too.
The first way to monitor safepoint pauses is to parse VM logs:
for JDK 8 and earlier, add the -XX:+PrintGCApplicationStoppedTime JVM option;
starting from JDK 9, add -Xlog:safepoint.
Then look for "Total time for which application threads were stopped" messages in the log file.
The second way is to use the undocumented HotSpot internal MXBean:
// Note: sun.management.* is HotSpot-internal and undocumented; it may change or be unavailable on other JVMs.
sun.management.HotspotRuntimeMBean runtime =
        sun.management.ManagementFactoryHelper.getHotspotRuntimeMBean();
System.out.println("Safepoint time: " + runtime.getTotalSafepointTime() + " ms");
System.out.println("Safepoint count: " + runtime.getSafepointCount());
It gives you the cumulative time of all JVM pauses. See the discussion in this answer.
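If you specifically want per-collection numbers from standard-ish MXBeans (the first option in the question), a minimal sketch using the com.sun.management GC notification API is below. The class name and println sink are just illustrative; also note that for concurrent collectors the reported duration covers the whole collection, not only the stop-the-world part.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcPauseReporter {
    public static void install() {
        NotificationListener listener = (notification, handback) -> {
            if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                // Duration of this collection in milliseconds; emit it to your monitoring system here.
                long millis = info.getGcInfo().getDuration();
                System.out.println(info.getGcName() + " / " + info.getGcAction()
                        + " (" + info.getGcCause() + "): " + millis + " ms");
            }
        };
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // The cast to NotificationEmitter works on HotSpot's platform MXBean implementations.
            ((NotificationEmitter) gc).addNotificationListener(listener, null, null);
        }
    }
}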

I am specifically looking for the GC stall times
There is more to stall times than the GC itself. The time to reach the safepoint is also an application stall, and it is only available through logging, not through MXBeans.
But really, if you're concerned about application stalls, then neither GC pauses nor overall safepoint time is what you should actually measure. You should measure the stalls themselves, e.g. via jHiccup.

Related

How to deal with long Full Garbage Collection cycle in Java

We inherited a system which runs in production and recently started to fail every 10 hours. Basically, our internal software marks the system as failed if it is unresponsive for a minute. We found that our Full GC cycles last for 1.5 minutes; we use a 30 GB heap. Now the problem is that we cannot optimize a lot in a short period of time and we cannot partition our service quickly, but we need to get rid of the 1.5 minute pauses as soon as possible because our system fails in production due to these pauses. For us, an acceptable delay is 20 milliseconds but not more. What would be the quickest way to tweak the system? Reduce the heap to trigger GCs more frequently? Use System.gc() hints? Any other solutions? We use Java 8 default settings and we have more and more users - i.e. more and more objects created.
Some GC stats:
You have a lot of retained data. There are a few options worth considering.
increase the heap to 32 GB; this has little impact if you have free memory. Looking again at your totals, it appears you are using 32 GB rather than 30 GB, so this might not help.
if you don't have plenty of free memory, it is possible a small portion of your heap is being swapped, as this can increase full GC times dramatically.
there might be some simple ways to make the data structures more compact, e.g. use compact strings, or use primitives instead of wrappers, such as a long for a timestamp instead of a Date or LocalDateTime (a long is about 1/8th the size); see the sketch after this list.
if neither of these helps, try moving some of the data off heap. e.g. Chronicle Map is a ConcurrentMap which uses off-heap memory and can reduce your GC times dramatically, i.e. there is no GC overhead for data stored off heap. How easy this is to add depends highly on how your data is structured.
I suggest analysing how your data is structured to see if there are any easy ways to make it more efficient.
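As a rough illustration of the primitives-instead-of-wrappers point (the array names and sizes are made up), storing timestamps as epoch-millis longs avoids creating one object per value:

// Boxed/object form: one Date object (header + fields) plus a reference per event.
java.util.Date[] eventTimesAsObjects = new java.util.Date[1_000_000];
// Primitive form: 8 bytes per event in a single array, nothing extra for the GC to trace.
long[] eventTimesMillis = new long[1_000_000];
eventTimesMillis[0] = System.currentTimeMillis();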
There is no one-size-fits-all magic bullet solution to your problem: you'll need to have a good handle on your application's allocation and liveness patterns, and you'll need to know how that interacts with the specific garbage collection algorithm you are running (function of version of Java and command line flags passed to java).
Broadly speaking, a Full GC (that succeeds in reclaiming lots of space) means that lots of objects are surviving the minor collections (but aren't being leaked). Start by looking at the size of your Eden and Survivor spaces: if the Eden is too small, minor collections will run very frequently, and perhaps you aren't giving an object a chance to die before its tenuring threshold is reached. If the Survivors are too small, objects are going to be promoted into the Old gen prematurely.
GC tuning is a bit of an art: you run your app, study the results, tweak some parameters, and run it again. As such, you will need a benchmark version of your application, one which behaves as close as possible to the production one but which hopefully doesn't need 10 hours to cause a full GC.
As you stated that you are running Java 8 with the default settings, the Old generation is most likely already collected by the Parallel collector on server-class hardware, but it is worth confirming that -XX:+UseParallelOldGC is in effect, since a single-threaded Old collection on a 30 GB heap is far slower. Even fully parallel, while this might bring the 1.5 minute pause down to some number of seconds (depending on the number of cores on your box and the number of threads you specify for GC), it will not reduce your max pause to 20 ms.
When this happened to me, it was due to a memory leak caused by a static variable eating up memory. I would go through all recent code changes and look for any possible memory leaks.

Explanation needed about Parallel Full GC for G1

As part of Java JDK 10, JEP 307 (Parallel Full GC for G1) was released.
I've tried to grasp its description, but I am still not confident that I got the idea properly.
My doubt is: is it related to concurrent garbage collection?
As a simplified explanation - garbage collectors have two possible collection types, "incremental" and "full". Incremental collection is the better of the two for staying out of the way, as it'll do a little bit of work every so often. Full collection is often more disruptive, as it takes longer and often has to halt the entire program execution while it's running.
Because of this, most modern GCs (including G1) will generally try to ensure that in normal circumstances, the incremental collection will be enough and a full collection will never be required. However, if lots of objects across different generations are being made eligible for garbage collection in unpredictable ways, then occasionally a full GC may be inevitable.
At present, the G1 full collection implementation is only single threaded. And that's where that JEP comes in - it aims to parallelize it, so that when a full GC does occur it's faster on systems that can support parallel execution.
Finally I understood Parallel Full GC for G1:
G1 was introduced in JDK 7 and made the default collector in JDK 9.
It is designed to deal efficiently and concurrently with very large heaps, but sometimes a full garbage collection is still inevitable.
A traditional GC divides the heap into a young generation (eden and survivor spaces) and an old generation (a logical separation); G1 instead splits the heap into many small regions. This splitting enables G1 to select a small set of regions to collect and finish quickly.
In JDK 9, G1 uses a single thread for full GC.
In JDK 10, it uses multiple threads (in parallel) for full GC.
The G1 garbage collector was infamous for doing a single-threaded full GC cycle. At the time when you need all the hardware that you can muster to scrounge for unused objects, we bottlenecked on a single thread. In Java 10 they fixed this. The full GC now runs with all the resources that we throw at it.
To demonstrate this, I wrote ObjectChurner, which creates a bunch of different sized byte arrays. It holds onto the objects for some time. The sizes are randomized, but in a controlled, repeatable way.
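The ObjectChurner code itself isn't included here; a minimal sketch of that kind of workload (the class name, sizes, and retention window are illustrative) could look like this:

import java.util.ArrayDeque;
import java.util.Random;

public class ChurnSketch {
    public static void main(String[] args) throws InterruptedException {
        Random random = new Random(42);                  // fixed seed keeps the sizes repeatable
        ArrayDeque<byte[]> retained = new ArrayDeque<>();
        while (true) {
            // allocate byte arrays of randomized size between 1 KB and 512 KB
            retained.add(new byte[1024 * (1 + random.nextInt(512))]);
            if (retained.size() > 4_000) {
                retained.poll();                         // hold objects for a while, then drop the oldest
            }
            Thread.sleep(1);
        }
    }
}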
Java 10 reduces Full GC pause times by iteratively improving on its existing algorithm. Until Java 10, G1 Full GCs ran in a single thread. That’s right - your 32 core server and its 128 GB will stop and pause until a single thread takes out the garbage. In Java 10 this has been improved to run in parallel. This means that the Full GCs will run on multiple threads in parallel, albeit still pausing the JVM’s progress while they complete. The number of threads can be optionally configured using -XX:ParallelGCThreads.
This is a nice improvement to the G1 algorithm in Java 10 that should reduce worst case pause times for most users. Does that mean that Java GC pauses are a thing of the past? No - it reduces the problem but since G1 doesn’t run its collection cycles concurrently with your application it will still pause the application periodically and Full GC pauses still increase with larger heap sizes. We’ve talked about some other Garbage Collectors in our last blog post that may solve this problem in future.
G1 Collector
Another beautiful optimization which came out with Java 8 update 20 is G1 Collector string deduplication. Since strings (and their internal char[] arrays) take up much of our heap, a new optimization has been made that enables the G1 collector to identify strings which are duplicated more than once across your heap and fix them to point to the same internal char[] array, to avoid multiple copies of the same string from residing inefficiently within the heap. You can use the -XX:+UseStringDeduplication JVM argument to try this out.
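As a hedged illustration (the string content and loop are made up), this is the kind of duplication that feature targets; with -XX:+UseG1GC -XX:+UseStringDeduplication on Java 8u20+, long-lived identical strings can end up sharing one backing array:

import java.util.ArrayList;
import java.util.List;

public class DedupCandidate {
    public static void main(String[] args) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            // new String(...) forces a distinct String instance each time, all with identical contents
            tags.add(new String("region-us-east-1"));
        }
        System.out.println(tags.size() + " strings with identical contents retained");
    }
}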
Parallel Full GC in G1GC
JVM parameters: -Xlog:gc,gc+cpu::uptime -Xmx4g -Xms4g -Xlog:gc*:details.vgc
This will output each GC event and its CPU usage to stdout, showing only the uptime as a tag. The setting -Xlog:gc* is like the -XX:+PrintGCDetails of previous Java versions.

What is approximate worst case garbage collection duration on big heaps

I need a rule of thumb regarding the maximal time of full garbage collection. The motivation is to be able to distinguish between a faulty JVM process and a process which is under GC.
Suppose I have a regular general-purpose server hardware, HotSpot JVM 8, the heap size is 20G-40G, no specific GC and memory options are set. What is a reasonable time limit for GC to complete? Is it 5 minutes, 20 minutes or up to hours?
Update:
My application is a memory-intensive offline job dealing with big data structures. I don't need to tune GC at all. 10 second and 10 minute pauses are both fine if this limit is known.
It is very difficult to quantify how long GC "should" take because it depends on a number of factors:
How big the heap is.
How full the heap is; i.e. the ratio of garbage to non-garbage when you run the GC.
How many pointers / references there are to traverse.
Which GC you are using.
Whether this is minor "new generation" collection, a major "old generation collection" or a "full" collection. The last is typically performed by the fall-back collector when a low-latency collector cannot keep up with the rate of garbage generation.
Whether there is physical <-> virtual memory thrashing occurring.
There are a couple of pathological situations that can cause excessive GC times:
If the heap is nearly full, the GC uses an increasing proportion of time trying to reclaim the last bit of free space.
If you have a heap that is larger than the physical memory available, you can get into virtual memory "thrashing" behavior. This is most pronounced during a major or full GC.
If you do need to pick a number, I suggest that you use one that "feels" right to you, and make it a configuration parameter so that it is easy to tweak. Also, turn on GC logging and look at the typical GC times that are reported there. (Particularly when the server load is high.)
Firstly, the GC pause time is counted in milliseconds most of the time. If a GC takes more than one second, I think your application needs to be tuned anyway.
And then, just as the comment said, the GC pause time depends on the characteristics of your application. So if you want a rule of thumb regarding the maximal time of full garbage collection for your application, I advise you to collect the gc.log and compute statistics on it; then you will know how long the pauses are when GC behaves poorly.
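A minimal sketch of such statistics gathering, assuming the JDK 8 -XX:+PrintGCApplicationStoppedTime log lines mentioned in the first answer above (the class name and output format are just illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PauseStats {
    // Matches lines like: "Total time for which application threads were stopped: 0.0012345 seconds, ..."
    private static final Pattern STOPPED =
            Pattern.compile("Total time for which application threads were stopped: ([0-9.]+) seconds");

    public static void main(String[] args) throws IOException {
        double max = 0, total = 0;
        long count = 0;
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double secs = Double.parseDouble(m.group(1));
                max = Math.max(max, secs);
                total += secs;
                count++;
            }
        }
        System.out.printf("pauses=%d, total=%.3f s, max=%.3f s%n", count, total, max);
    }
}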
For batch jobs where latency does not matter much there are better measures than pause times:
a) MB of garbage collected / time / cpu core
Low collection rates usually indicate some pathological cases like swapping, transparent huge page consolidation or some edge-cases in the GC such as humongous reference arrays being scanned.
b) application throughput - the ratio of wall time spent in application code vs. wall time spent in GC.
Long GCs aren't a big issue if they occur infrequently.
Both can be obtained by running GC logs through GCViewer
My recommendation is:
1. Configure the JVM parameters to turn on GC logs and check them; you will see how long the GC takes.
2. GC shouldn't take minutes; I have seen pause times of around 13 seconds, and the customer was very badly affected.

jHiccup analysis doesn't add up

I have the following jHiccup result.
Obviously there are huge peaks of few secs in the graph. My app outputs logs every 100 ms or so. When I read my logs I never see such huge pauses. Also I can check the total time spent in GC from the JVM diagnostics and it says the following:
Time: 2013-03-12 01:09:04
Used: 1,465,483 kbytes
Committed: 2,080,128 kbytes
Max: 2,080,128 kbytes
GC time: 2 minutes on ParNew (4,329 collections)
         8.212 seconds on ConcurrentMarkSweep (72 collections)
The total big-GC time is around 8 seconds spread over 72 separate collections. All of them are below 200ms per my JVM hint to limit the pauses.
On the other hand I observed exactly one instance of network response time of 5 seconds in my independent network logs (wireshark). That implies the pauses exist, but they are not GC and they are not blocked threads or something that can be observed in profiler or thread dumps.
My question is what would be the best way to debug or tune this behavior?
Additionally, I'd like to understand how jHiccup does the measurement. Obviously it is not GC pause time.
Glad to see you are using jHiccup, and that it seems to show reality-based hiccups.
jHiccup observes "hiccups" that would also be seen by application threads running on the JVM. It does not glean the reason - it just reports the fact. Reasons can be anything that would cause a process to not run perfectly ready-to-run code: GC pauses are a common cause, but a temporary ^Z at the keyboard, or one of those "live migration" things across virtualized hosts, would be observed just as well. There are a multitude of possible reasons, including scheduling pressure at the OS or hypervisor level (if one exists), power management craziness, swapping, and many others. I've seen Linux file system pressure and Transparent Huge Page "background" defragmentation cause multi-second hiccups as well...
A good first step at isolating the cause of the pause is to use the "-c" option in jHiccup: it launches a separate control process (with an otherwise idle workload). If both your application and the control process show hiccups that are roughly correlated in size and time, you'll know you are looking for a system-level (as opposed to process-local) reason. If they do not correlate, you'll know to suspect the insides of your JVM - which most likely indicates your JVM paused for something big; either GC or something else, like lock debiasing or a class-loading-driven deoptimization, which can take a really long [and often unreported in logs] time on some JVMs if time-to-safepoint is long for some reason (and on most JVMs, there are many possible causes for a long time-to-safepoint).
jHiccup's measurement is so dirt-simple that it's hard to get wrong. The entire thing is less than 650 lines of Java code, so you can look at the logic for yourself. jHiccup's HiccupRecorder thread repeatedly goes to sleep for 1 msec, and when it wakes up it records any difference in time (from before the sleep) that is greater than 1 msec as a hiccup. The simple assumption is that if one ready-to-run thread (the HiccupRecorder) did not get to run for 5 seconds, other threads in the same process also saw a similar sized hiccup.
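A minimal sketch of that measurement idea (not jHiccup's actual code; the threshold and output are illustrative):

public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long before = System.nanoTime();
            Thread.sleep(1);                                   // ask to sleep for 1 msec
            long extraNanos = System.nanoTime() - before - 1_000_000;
            if (extraNanos > 1_000_000) {                      // woke up more than 1 msec late => hiccup
                System.out.printf("hiccup: %.1f ms%n", extraNanos / 1_000_000.0);
            }
        }
    }
}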
As you note above, jHiccup's observations seem to be corroborated by your independent network logs, where you saw a 5 second response time. Note that not all hiccups would have been observed in the network logs, as only requests actually made during the hiccups would have been seen by a network logger. In contrast, no hiccup larger than ~1 msec can hide from jHiccup, since it will attempt a wakeup 1,000 times per second even with no other activity.
This may not be GC, but before you rule out GC, I'd suggest you look into the GC logging a bit more. To start with, a JVM hint to limit pauses to 200 msec is useless on all known JVMs. A pause hint is the equivalent of saying "please". In addition, don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options (and suspect them even then). There are pauses and parts of pauses that can be very long and go unreported unless you include this flag. E.g. I've seen pauses caused by the occasional long-running counted loop taking 15 seconds to reach a safepoint, where the GC log reported only the 0.08 second part of the pause where it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by GC logging flags.
-- Gil. [jHiccup's author]

How to tune a jvm to crash instead of heroically GC till 100% CPU utilization?

We have a JVM process that infrequently pegs the CPU at 100%, with what appears to be (according to visualgc) a very nearly exhausted heap. Our supposition is that the process is heroically GC'ing causing a CPU spike, which is affecting the overall health of the entire system (consisting of other JVMs doing different things).
This process is not critical and can be restarted. Is there a way to tune the JVM via the command line which starts it to make it fall on its own sword rather than it keep GC'ing and causing the entire box to suffer?
Of note is that we are not getting OOMExceptions, so the heap isn't TOTALLY exhausted, but just barely not, we think.
Alternatively, something to give us some insight as to what in the JVM is actually using the CPU in the way that it is to confirm/deny our GC supposition?
We can get the statistics from:
1) The option -XX:+PrintGCTimeStamps will add a time stamp at the start of each collection. This is useful to see how frequently garbage collections occur.
With the above option we can get a rough estimate of whether your supposition that the process is heroically GC'ing, causing a CPU spike, is right or not.
If your supposition is right, then start tuning your GC.
Both the parallel collector and the concurrent collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. The option -XX:+UseGCOverheadLimit is enabled by default for both the parallel and concurrent collectors; check whether this option has been disabled on your system.
For more information about GC tuning in the JVM refer to this, and for VM debugging options check this.
The parallel and concurrent collectors have an "overhead limit" that might do what you want:
if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown
See http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html for more information.
The best thing to do is to find out the memory leak and fix it.
A simple way to exit on high memory usage:
// Runtime.totalMemory() is the memory currently committed to the heap, not the amount in use
// (used memory would be totalMemory() - freeMemory()).
if (Runtime.getRuntime().totalMemory() > 100 * 1024 * 1024) {
    System.exit(0);
}
Try to look at what is currently running in the JVM:
with jstack you can take a thread dump (there are other ways to do that as well);
with jvisualvm you can peek into the current state of the JVM (this takes some resources);
also turn on verbose GC logging (-verbose:gc) to confirm your assumption that GC is running frequently.
You need to find a way to gather some statistics about GC activity. There are a few methods to do this. I will not copy-paste, just give you the link to a similar question:
Can you get basic GC stats in Java?
I believe you can work out how to analyze these statistics and decide when GC is constantly active.
Because this question adds the idea of applying GC statistics, I don't think it is a duplicate.
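For reference, the basic cumulative GC statistics referred to above are available from the standard GarbageCollectorMXBean; a minimal sketch (the class name and sampling strategy are left to the reader):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsSketch {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Cumulative values since JVM start; sample periodically and diff to see current GC activity.
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}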
