jHiccup analysis doesn't add up - java

I have the following jHiccup result.
Obviously there are huge peaks of a few seconds in the graph. My app writes logs every 100 ms or so, and when I read them I never see such huge pauses. I can also check the total time spent in GC from the JVM diagnostics, which reports the following:
Time:      2013-03-12 01:09:04
Used:      1,465,483 kbytes
Committed: 2,080,128 kbytes
Max:       2,080,128 kbytes
GC time:   2 minutes on ParNew (4,329 collections)
           8.212 seconds on ConcurrentMarkSweep (72 collections)
The total time in the big (ConcurrentMarkSweep) collections is around 8 seconds, spread over 72 separate collections, and all of them stay below 200 ms, in line with the JVM hint I use to limit pause times.
On the other hand, I observed exactly one instance of a 5-second network response time in my independent network logs (Wireshark). That implies the pauses exist, but they are not GC, and they are not blocked threads or anything else I can observe in a profiler or in thread dumps.
My question is what would be the best way to debug or tune this behavior?
Additionally, I'd like to understand how jHiccup does the measurement. Obviously it is not GC pause time.

Glad to see you are using jHiccup, and that it seems to show reality-based hiccups.
jHiccup observes "hiccups" that would also be seen by application threads running on the JVM. It does not glean the reason - it just reports the fact. Reasons can be anything that would cause a process to not run perfectly ready-to-run code: GC pauses are a common cause, but a temporary ^Z at the keyboard, or one of those "live migration" things across virtualized hosts, would be observed just as well. There are a multitude of possible reasons, including scheduling pressure at the OS or hypervisor level (if one exists), power management craziness, swapping, and many others. I've seen Linux file system pressure and Transparent Huge Page "background" defragmentation cause multi-second hiccups as well...
A good first step at isolating the cause of the pause is to use the "-c" option in jHiccup: it launches a separate control process (with an otherwise idle workload). If both your application and the control process show hiccups that are roughly correlated in size and time, you'll know you are looking for a system-level (as opposed to process-local) reason. If they do not correlate, you'll know to suspect the insides of your JVM - which most likely indicates your JVM paused for something big; either GC or something else, like a lock debiasing or a class-loading-driven deoptimization, which can take a really long [and often unreported in logs] time on some JVMs if time-to-safepoint is long for some reason (and on most JVMs, there are many possible causes for a long time-to-safepoint).
jHiccup's measurement is so dirt-simple that it's hard to get wrong. The entire thing is less than 650 lines of Java code, so you can look at the logic for yourself. jHiccup's HiccupRecorder thread repeatedly goes to sleep for 1 msec, and when it wakes up it records any difference in time (from before the sleep) that is greater than 1 msec as a hiccup. The simple assumption is that if one ready-to-run thread (the HiccupRecorder) did not get to run for 5 seconds, other threads in the same process also saw a similarly sized hiccup.
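A minimal sketch of that loop, just to illustrate the idea (this is not jHiccup's actual source; the real tool records into histograms and hiccup log files rather than printing):

// Sketch of the measurement loop described above: sleep ~1 msec, then record any
// extra delay beyond the requested sleep as a hiccup. The 1 msec threshold and the
// printing are illustrative only.
public class HiccupSketch implements Runnable {
    @Override
    public void run() {
        final long intendedSleepNanos = 1_000_000L; // 1 msec
        while (!Thread.currentThread().isInterrupted()) {
            long before = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                return; // stop recording if interrupted
            }
            long extraNanos = (System.nanoTime() - before) - intendedSleepNanos;
            if (extraNanos > 1_000_000L) { // more than ~1 msec beyond the sleep
                System.out.printf("hiccup of %.1f ms%n", extraNanos / 1_000_000.0);
            }
        }
    }
}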
As you note above, jHiccup's observations seem to be corroborated by your independent network logs, where you saw a 5-second response time. Note that not all hiccups would have been visible in the network logs, since only requests actually made during a hiccup would have been affected. In contrast, no hiccup larger than ~1 msec can hide from jHiccup, since it attempts a wakeup 1,000 times per second even with no other activity.
This may not be GC, but before you rule out GC, I'd suggest you look into the GC logging a bit more. To start with, a JVM hint to limit pauses to 200 msec is useless on all known JVMs. A pause hint is the equivalent of saying "please". In addition, don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options (and suspect them even then). There are pauses, and parts of pauses, that can be very long and go unreported unless you include this flag. E.g. I've seen pauses caused by an occasional long-running counted loop taking 15 seconds to reach a safepoint, where GC reported only the 0.08-second part of the pause in which it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by GC logging flags.
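For reference, a typical HotSpot invocation with that sort of logging might look like the line below (a hedged example; gc.log and yourapp.jar are just placeholders):

java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -jar yourapp.jar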
-- Gil. [jHiccup's author]

Related

A use case for a manual GC invocation?

I've read why it is bad practice to call System.gc(), and many other discussions, e.g. this one describing a really disastrous misuse of System.gc(). However, there are cases when the GC takes too long, and avoiding long pauses (e.g., by avoiding garbage) is not exactly trivial and makes the code harder to maintain.
IMHO calling GC manually is fine in the following common scenario:
There are multiple interchangeable webservers with a failover in front of them.
Every server uses a few gigabytes of heap and the STW pauses take much longer than an average request.
The failover has no idea when GC is going to happen.
The failover can exempt a server when told to.
The algorithm seems to be trivial: periodically select a server, let no more requests be sent to it, let it finish its running requests, let it do its GC, and re-activate the server.
I wonder if I am missing something?
What are the alternatives?
Long-running requests could be a problem, but let's assume there are none, or simply limit the wait to some period comparable with what the GC takes. Making a slow request even slower doesn't sound too bad.
An option like -XX:+DisableExplicitGC could make the algorithm useless, but just don't use it (my use case includes dedicated servers I'm in charge of).
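To make the algorithm concrete, here is a minimal sketch of the rotation loop. The Balancer and Server types are hypothetical stand-ins for whatever the failover and the nodes actually expose; they are not a real API.

import java.util.List;

// Hypothetical sketch of the rotation described above. Balancer and Server are
// stand-ins for whatever the failover and the nodes actually expose.
interface Balancer {
    void remove(Server s);                                // stop routing new requests to s
    void awaitIdle(Server s) throws InterruptedException; // wait for in-flight requests to drain (or time out)
    void add(Server s);                                   // put s back into rotation
}

interface Server {
    void invokeGc();                                      // e.g. trigger System.gc() on that node via JMX
}

class GcRotation {
    static void rotate(Balancer balancer, List<Server> servers) throws InterruptedException {
        for (Server server : servers) {
            balancer.remove(server);
            balancer.awaitIdle(server);
            server.invokeGc();
            balancer.add(server);
        }
    }
}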
For low latency trading systems I use the GC in an atypical manner.
You want to avoid any collection, even minor ones, during the trading day. A way to do this is to create less than 300 KB of garbage per second. That is around 1 GB per hour, or up to 24 GB per day, so with a 24 GB Eden space there are no minor or major GCs during the day. However, to ensure a GC occurs at a time which is planned and acceptable, System.gc() is called at, say, 5 AM each morning, and you have a clean Eden space for the next day.
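A minimal sketch of that scheduled collection (assuming Java 8+ for java.time; 05:00 is just the example time from above, and real code would also handle DST changes and shutdown):

import java.time.Duration;
import java.time.LocalTime;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: trigger a full collection once a day at a quiet hour so Eden starts
// the trading day empty.
class DailyGc {
    static void schedule() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ZonedDateTime now = ZonedDateTime.now();
        ZonedDateTime next = now.with(LocalTime.of(5, 0)); // today at 05:00
        if (!next.isAfter(now)) {
            next = next.plusDays(1);                       // already past 05:00, use tomorrow
        }
        long initialDelayMs = Duration.between(now, next).toMillis();
        ses.scheduleAtFixedRate(System::gc, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}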
There are times when you create more garbage than expected, e.g. when failing to reconnect to a data source, and you might get a small number of minor collections. However, this only happens when something is going wrong.
For more details, see http://vanillajava.blogspot.co.uk/2011/06/how-to-avoid-garbage-collection.html
by avoiding garbage is not exactly trivial and makes the code harder to maintain.
Avoiding garbage entirely is near impossible. However, 300 KB/s is not so hard for a JVM. (You can have more than one JVM on a machine these days, each with a 24 GB Eden space.)
Note that if you can keep below 50 KB/s of garbage, you can run all week without a GC.
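As a rough check of that arithmetic: 50 KB/s is about 4.3 GB of garbage per day, so a 24 GB Eden like the one above lasts roughly five and a half days between collections, about a full trading week.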
Periodically select a server, let no more requests be sent to it, let it finish its running requests, let it do its GC, and re-activate the server.
You can treat a GC as a failure to meet your SLA. In this case you can remove a server from your cluster when you determine this is about to happen, full-GC it, and return it to the cluster.
However, there are cases when the GC takes too long and avoiding long pauses
You have to distinguish between pauses caused by young-only collections, the mixed/concurrent phases, and full GCs.
In most cases it's the full GCs that you want to avoid while the other ones are acceptable, which can often be achieved with some GC-tuning and optimizing the code to avoid large allocation bursts.
In principle G1 should be able to run forever on young/mixed cycles and a full GC could be considered a soft failure. CMS can at least do so for many days with careful tuning, but may eventually succumb to fragmentation and require a full GC for compacting.
In cases where even the young-GC pauses are not acceptable, or garbage piles up too fast for the concurrent phase to keep up with acceptable pause times, the strategy you outline may be a viable workaround.
Note that there are also other use-cases for manually triggering GCs, such as GC-managed native resources (e.g. direct byte buffers), although those are fairly troublesome in general.
Also note that not all System.gc() calls are created equal; there is the ExplicitGCInvokesConcurrent option too.
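For example, with CMS the following combination makes a System.gc() call start a concurrent cycle instead of a stop-the-world full collection (a hedged note: the exact behaviour varies by collector and JVM version):
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent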

How to measure Java GC Stop The World time?

I know we can get the GC duration from the GarbageCollectionNotificationInfo object, but the duration there seems to be the entire duration (e.g., I once saw 5+ seconds), which can be much larger than the actual stop-the-world pause (typically less than 1 second in my experience). Is there any way to get the actual stop-the-world pause duration, either calculated from the available sources (I don't think GarbageCollectionNotificationInfo provides those details, but I could be wrong) or in some other way? I know the jstat tool prints the FGCT column, which seems to reflect exactly the stop-the-world pause time; how does it do that? Thanks in advance!
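For reference, a minimal sketch of how that duration is read via GC notifications (as noted, for concurrent collectors it covers the whole cycle, not just the stop-the-world parts):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

// Sketch: subscribe to GC notifications and print each collection's duration.
// On HotSpot the GarbageCollectorMXBean implementations also implement
// NotificationEmitter, hence the cast.
class GcDurationListener {
    static void install() {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            NotificationEmitter emitter = (NotificationEmitter) gcBean;
            emitter.addNotificationListener((Notification n, Object handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                    GarbageCollectionNotificationInfo info =
                            GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
                    // For concurrent collectors this spans the whole cycle, not just STW phases.
                    System.out.println(info.getGcName() + " / " + info.getGcAction()
                            + " took " + info.getGcInfo().getDuration() + " ms");
                }
            }, null, null);
        }
    }
}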
To get all STW pauses in the VM log output you need to pass the following two options. This includes non-GC safepoints.
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
Alternatively there's -XX:+PrintGCApplicationStoppedTime
Keep in mind that non-safepoint things can induce pauses too (e.g. the kernel's thread scheduler). There's jHiccup to measure those.

Java's Serial garbage collector performing far better than other garbage collectors?

I'm testing an API, written in Java, that is expected to minimize latency in processing messages received over a network. To achieve these goals, I'm playing around with the different garbage collectors that are available.
I'm trying four different techniques, which utilize the following flags to control garbage collection:
1) Serial: -XX:+UseSerialGC
2) Parallel: -XX:+UseParallelOldGC
3) Concurrent: -XX:+UseConcMarkSweepGC
4) Concurrent/incremental: -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
I ran each technique over the course of five hours. I periodically used the list of GarbageCollectorMXBean provided by ManagementFactory.getGarbageCollectorMXBeans() to retrieve the total time spent collecting garbage.
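A minimal sketch of that sampling approach, for reference (cumulative counts and times summed across all collectors):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch of the periodic GC-time sampling described above. Note that the beans
// report cumulative values (and may return -1 if a value is unsupported).
class GcSample {
    static long[] sample() {
        long count = 0, timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += gc.getCollectionCount(); // total number of collections so far
            timeMs += gc.getCollectionTime(); // total elapsed collection time in ms
        }
        return new long[] { count, timeMs };
    }
}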
My results are below. Note that "latency" here means the amount of time that my application plus the API spent processing each message plucked off the network.
Serial: 789 GC events totaling 1309 ms; mean latency 47.45 us, median latency 8.704 us, max latency 1197 us
Parallel: 1715 GC events totaling 122518 ms; mean latency 450.8 us, median latency 8.448 us, max latency 8292 us
Concurrent: 4629 GC events totaling 116229 ms; mean latency 707.2 us, median latency 9.216 us, max latency 9151 us
Incremental: 5066 GC events totaling 200213 ms; mean latency 515.9 us, median latency 9.472 us, max latency 14209 us
I find these results to be so improbable that they border on absurd. Does anyone know why I might be having these kinds of results?
Oh, and for the record, I'm using Java HotSpot(TM) 64-Bit Server VM.
I'm working on a Java application that is expected to maximize throughput and minimize latency
Two problems with that:
Those are often contradictory goals, so you need to decide how important each is against the other (would you sacrifice 10% latency to get 20% throughput gain or vice versa? Are you aiming for some specific latency target, beyond which it doesn't matter whether it's any faster? Things like that.)
You haven't given any results for either of these.
All you've shown is how much time is spent in the garbage collector. If you actually achieve more throughput, you would probably expect to see more time spent in the garbage collector. Or to put it another way, I can make a change in the code to minimize the values you're reporting really easily:
// Avoid generating any garbage
Thread.sleep(10000000);
You need to work out what's actually important to you. Measure everything that's important, then work out where the trade-off lies. So the first thing to do is re-run your tests and measure latency and throughput. You may also care about total CPU usage (which isn't the same as CPU in GC of course) but while you're not measuring your primary aims, your results aren't giving you particularly useful information.
I don't find this surprising at all.
The problem with serial garbage collection is that while it's running, nothing else can run at all (aka "stops the world"). That has a good point though: it keeps the amount of work spent on garbage collection to just about its bare minimum.
Almost any sort of parallel or concurrent garbage collection has to do a fair amount of extra work to ensure that all modifications to the heap appear atomic to the rest of the code. Instead of just stopping everything for a while, it has to stop just those things that depend on a particular change, and then for just long enough to carry out that specific change. It then lets that code start running again, gets to the next point that it's going to make a change, stops other pieces of code that depend on it, and so on.
The other point (though in this case, probably a fairly minor one) is that as you process more data, you generally expect to generate more garbage, and therefore spend more time doing garbage collection. Since the serial collector does stop all other processing while it does its job, that not only makes the garbage collection fast, but also prevents any more garbage from being generated during that time.
Now, why do I say that's probably a minor contributor in this case? That's pretty simple: the serial collector only used up a little over a second out of five hours. Even though nothing else got done during that ~1.3 seconds, that's such a small percentage of five hours that it probably didn't make much (if any) real difference to your overall throughput.
Summary: the problem with serial garbage collection isn't that it uses excessive time overall -- it's that it can be very inconvenient if it stops the world right when you happen to need fast response. At the same time, I should add that as long as your collection cycles are short, this can still be fairly minimal. In theory, the other forms of GC mostly limit your worst case, but in fact (e.g., by limiting the heap size) you can often limit your maximum latency with a serial collector as well.
There was an excellent talk by a Twitter engineer at the 2012 QCon Conference on this topic - you can watch it here.
It discussed the various "generations" in the HotSpot JVM's memory and garbage collection (Eden, Survivor, Old). In particular, note that the "Concurrent" in ConcurrentMarkSweep only applies to the Old generation, i.e. objects that hang around for a while.
Short-lived objects are GCd from the "Eden" generation - this is cheap, but is a "stop-the-world" GC event regardless of which GC algorithm you have chosen!
The advice was to tune the young generation first, e.g. allocate lots of Eden so there's more chance for objects to die young and be reclaimed cheaply. Use -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution... If the survivor space overflows (more than 100% used), there wasn't room, so objects get promoted to Old too quickly - this is bad.
When tuning for the Old generation, if latency is the top priority, it was recommended to try ParallelOld with auto-tuning first (-XX:+UseAdaptiveSizePolicy etc.), then try CMS, then maybe the new G1GC.
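A hedged example of what such a starting point might look like on HotSpot (the sizes are placeholders to be tuned against your own measurements, not recommendations):

-Xms4g -Xmx4g -Xmn2g -XX:SurvivorRatio=8
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution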
You cannot say that one GC is better than another; it depends on your requirements and your application.
But if you want to maximize throughput and minimize latency, GC is your enemy! You should not call the GC at all, and you should also try to prevent the JVM from triggering GC.
Go with the serial collector and use object pools.
With serial collection, only one thing happens at a time. For example, even when multiple CPUs are available, only one is utilized to perform the collection. When parallel collection is used, the task of garbage collection is split into parts and those subparts are executed simultaneously, on different CPUs. The simultaneous operation enables the collection to be done more quickly, at the expense of some additional complexity and potential fragmentation.
While the serial GC uses only one thread to process a collection, the parallel GC uses several threads, and is therefore faster. This GC is useful when there is enough memory and a large number of cores; it is also called the "throughput GC."

Methods of limiting emulated cpu speed

I'm writing a MOS 6502 processor emulator as part of a larger project I've undertaken in my spare time. The emulator is written in Java, and before you say it: I know it's not going to be as efficient and optimized as if it were written in C or assembly, but the goal is to make it run on various platforms, and it's pulling 2.5 MHz on a 1 GHz processor, which is pretty good for an interpreted emulator. My problem is quite the contrary: I need to limit the number of cycles to 1 MHz. I've looked around but not seen many strategies for doing this. I've tried a few things, including checking the time after a number of cycles and sleeping for the difference between the expected time and the actual time elapsed, but checking the time slows down the emulation by a factor of 8. Does anyone have any better suggestions, or perhaps ways to optimize time polling in Java to reduce the slowdown?
The problem with using sleep() is that you generally only get a granularity of 1ms, and the actual sleep that you will get isn't necessarily even accurate to the nearest 1ms as it depends on what the rest of the system is doing. A couple of suggestions to try (off the top of my head-- I've not actually written a CPU emulator in Java):
stick to your idea, but check the time between a large-ish number of emulated instructions (execution is going to be a bit "lumpy" anyway especially on a uniprocessor machine, because the OS can potentially take away the CPU from your thread for several milliseconds at a time);
as you want to execute in the order of 1000 emulated instructions per millisecond, you could also try just hanging on to the CPU between "instructions": have your program periodically work out by trial and error how many runs through a loop it needs to go between instructions to "waste" enough CPU to make the timing work out at 1 million emulated instructions / sec on average (you may want to see if setting your thread to low priority helps system performance in this case).
I would use System.nanoTime() in a busy wait as #pst suggested earlier.
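A minimal sketch of that nanoTime-based throttle (the target clock and batch size are illustrative values, not tuned numbers):

// Sketch: run a batch of emulated cycles, then busy-wait until wall-clock time
// catches up with the emulated clock.
class CycleThrottle {
    static final long TARGET_HZ = 1_000_000L;    // 1 MHz
    static final long CYCLES_PER_BATCH = 1_000L; // check the clock every ~1 ms of emulated time

    static void run(Runnable emulateOneCycle) {
        final long nanosPerBatch = CYCLES_PER_BATCH * 1_000_000_000L / TARGET_HZ;
        long nextDeadline = System.nanoTime() + nanosPerBatch;
        while (true) {
            for (long i = 0; i < CYCLES_PER_BATCH; i++) {
                emulateOneCycle.run();
            }
            while (System.nanoTime() < nextDeadline) {
                // busy-wait; a short LockSupport.parkNanos() here would be kinder
                // to the rest of the system at the cost of some timing accuracy
            }
            nextDeadline += nanosPerBatch;
        }
    }
}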
You can speed up the emulation by generating bytecode. Most instructions should translate quite well, and you can add a busy-wait call so each instruction takes the amount of time the original instruction would have taken. You also have the option to increase the delay so you can watch each instruction being executed.
To make it really cool you could generate 6502 assembly code as text with matching line numbers in the byte code. This would allow you to use the debugger to step through the code, breakpoint it and see what the application is doing. ;)
A simple way to emulate the memory is to use a direct ByteBuffer, or native memory accessed via the Unsafe class. This gives you a block of memory that you can read and write as any data type, in any order.
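For the ByteBuffer variant, a minimal sketch might look like this (the 6502's 64 KB address space fits in one direct buffer; wrap-around of a 16-bit read at $FFFF is not handled here):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: back the emulated 64 KB address space with one direct buffer.
// The 6502 stores 16-bit values little-endian, so set the byte order accordingly.
class EmulatedMemory {
    private final ByteBuffer ram = ByteBuffer.allocateDirect(0x10000).order(ByteOrder.LITTLE_ENDIAN);

    int readByte(int addr)            { return ram.get(addr & 0xFFFF) & 0xFF; }
    void writeByte(int addr, int val) { ram.put(addr & 0xFFFF, (byte) val); }
    int readWord(int addr)            { return ram.getShort(addr & 0xFFFF) & 0xFFFF; }
}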
You might be interested in examining the Java Apple Computer Emulator (JACE), which incorporates 6502 emulation. It uses Thread.sleep() in its TimedDevice class.
Have you looked into creating a Timer object that goes off at the cycle length you need? You could have the timer itself initiate the next loop.
Here is the documentation for the Java 6 version:
http://download.oracle.com/javase/6/docs/api/java/util/Timer.html
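Since a per-cycle timer at 1 MHz isn't feasible with millisecond granularity, a hedged sketch of this idea fires every 1 ms and runs a batch of ~1,000 emulated cycles per tick:

import java.util.Timer;
import java.util.TimerTask;

// Sketch: a 1 ms fixed-rate timer driving batches of emulated cycles.
class TimerDrivenEmulator {
    static void start(Runnable emulateOneCycle) {
        Timer timer = new Timer("cpu-clock", true);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override public void run() {
                for (int i = 0; i < 1_000; i++) {
                    emulateOneCycle.run();
                }
            }
        }, 0, 1); // delay 0 ms, period 1 ms
    }
}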

Minimum size for a piece of work to be beneficially executed on another thread?

I have a low latency system that receives UDP messages. Depending on the message, the system responds by sending out 0 to 5 messages. Figuring out each possible response takes 50 us (microseconds), so if we have to send 5 responses, it takes 250 us.
I'm considering splitting the system up so that each possible response is calculated by a different thread, but I'm curious about the minimum "work time" needed to make that better. While I know I need to benchmark this to be sure, I'm interested in opinions about the minimum piece of work that should be done on a separate thread.
If I have 5 threads waiting on a signal to do 50 us of work, and they don't contend much, will the total time before all 5 are done be more or less than 250 us?
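To make the comparison concrete, the fan-out version might look roughly like the sketch below (each Callable stands in for one ~50 us response computation; this is illustrative, not a recommendation):

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the fan-out variant: up to 5 response computations submitted to a
// pre-started pool; invokeAll blocks until all of them have completed.
class ResponseFanOut {
    private final ExecutorService pool = Executors.newFixedThreadPool(5);

    List<Future<byte[]>> computeAll(List<Callable<byte[]>> responses) throws InterruptedException {
        return pool.invokeAll(responses);
    }
}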
Passing data from one thread to another is very fast, 1-4 us, provided the receiving thread is already running on a core (and not sleeping/waiting/yielding). If your thread has to wake up, it can take 15 us, and the task itself will also take longer, as the cache is likely to have loads of misses. This means the task can take 2-3x longer.
Is that 50 us compute-bound or IO-bound? If compute-bound, do you have multiple cores available to run these in parallel?
Sorry - lots of questions, but your particular environment will affect the answer. You need to profile and determine what makes a difference in your particular scenario (perhaps run tests with differently sized thread pools?).
Also, don't forget that threads take up a significant amount of memory for their stacks by default (512 KB, IIRC), and that could affect performance too (through paging requests etc.).
If you have more cores than threads, and if the threads are truly independent, then I would not be surprised if the multi-threaded approach took less than 250 us. Whether it does or not will depend on the overhead of creating and destroying threads. Your situation seems ideal, however.
