I'm wondering what the real/typical overhead is when NMT is enabled via -XX:NativeMemoryTracking=summary (the full set of options I'm after is -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics).
I could not find much information anywhere - either on SO, blog posts or the official docs.
The docs say:
Note: Enabling NMT causes a 5%-10% performance overhead.
But they do not say which mode is expected to have this performance overhead (both summary and detail?), nor what this overhead really is (CPU, memory, ...).
In the Native Memory Tracking guide they claim:
Enabling NMT will result in a 5-10 percent JVM performance drop, and memory usage for NMT adds 2 machine words to all malloc memory as a malloc header. NMT memory usage is also tracked by NMT.
But again, is this true for both summary and detail mode?
What I'm after is basically whether it's safe to add -XX:NativeMemoryTracking=summary permanently for a production app (similar to continuous JFR recording) and what the potential costs are.
So far, when testing this on our app, I didn't spot a difference, but it's difficult to tell whether any small difference is real or just noise.
Is there an authoritative source of information containing more details about this performance overhead?
Does somebody have experience with enabling this permanently for production apps?
Disclaimer: I base my answer on JDK 18. Most of what I write should be valid for older releases. When in doubt, you need to measure for yourself.
Background:
NMT tracks hotspot VM memory usage and memory usage done via Direct Byte Buffers. Basically, it hooks into malloc/free and mmap/munmap calls and does accounting.
It does not track other JDK memory usage (outside hotspot VM) or usage by third-party libraries. That matters here, since it makes NMT behavior somewhat predictable. Hotspot tries to avoid fine-grained allocations via malloc. Instead, it relies on custom memory managers like Arenas, code heap or Metaspace.
Therefore, for most applications, mallocs/mmap from hotspot are not that "hot" and the number of mallocs is not that large.
I'll focus on malloc/free for the rest of this writeup, since malloc/free calls far outnumber mmap/munmap calls:
Memory cost:
1. (detail+summary): 2 words per malloc()
2. (detail only): the malloc site table
3. (detail+summary): mapping lists for VM mappings and thread stacks
Here, (1) completely dwarfs (2) and (3). Therefore the difference in memory overhead between summary and detail mode is not significant.
Note that even (1) may not matter that much, since the underlying libc allocator already dictates a minimal allocation size that may be larger than (pure allocation size + 16 byte malloc header). So, how much of the NMT memory overhead actually translates into RSS increase needs to be measured.
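To put (1) into rough numbers (purely hypothetical figures, just for scale): if the JVM holds 100,000 live malloc'ed blocks at once, the NMT headers add 100,000 × 16 bytes ≈ 1.6 MB, which is negligible next to a typical multi-gigabyte heap.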
How much total memory overhead this means cannot be answered in general, since JVM memory costs are usually dominated by heap size, so comparing RSS with NMT cost is almost meaningless. But just to give an example: for Spring Petclinic with a 1 GB pretouched heap, the NMT memory overhead is about 0.5%.
In my experience, NMT memory overhead only matters in pathological situations or in corner cases that cause the JVM to do tons of fine-grained allocations, e.g. if one does an insane amount of class loading. But often these are exactly the cases where you want NMT enabled, to see what's going on.
Performance cost:
NMT does need some synchronization. It atomically increases counters on each malloc/free in summary mode.
In detail mode, it does a lot more:
capture call stack on malloc site
look for call stack in hash map
increase hash map entry counters
This requires more cycles. The hash map is lock-free, but it is still modified with atomic operations. It looks expensive, especially if hotspot does many mallocs from different threads. How bad is it really?
Worst case example
Micro-benchmark: 64 million malloc allocations (via Unsafe.allocateMemory()) done by 100 concurrent threads, on a 24-core machine:
NMT off: 6 secs
NMT summary: 34 secs
NMT detail: 46 secs
That looks insane. However, it may not matter in practice, since this is not a real-life example.
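For context, here is a rough sketch of what such a worst-case microbenchmark could look like (my own illustration, not the exact code behind the numbers above; class name, thread count and allocation size are arbitrary). Run it with and without -XX:NativeMemoryTracking to compare:

    import java.lang.reflect.Field;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import sun.misc.Unsafe;

    // Many threads hammering Unsafe.allocateMemory()/freeMemory(), i.e. raw
    // malloc/free from Java, so that NMT's per-malloc accounting sits on the
    // hottest possible path.
    public class MallocStress {
        static final int THREADS = 100;
        static final long TOTAL_ALLOCATIONS = 64L * 1024 * 1024; // ~64 million

        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            long start = System.nanoTime();
            for (int t = 0; t < THREADS; t++) {
                pool.submit(() -> {
                    for (long i = 0; i < TOTAL_ALLOCATIONS / THREADS; i++) {
                        long addr = unsafe.allocateMemory(64); // small native allocation
                        unsafe.freeMemory(addr);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            System.out.println("took " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }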
Spring petclinic bootup, average of ten runs:
NMT off: 3.79 secs
NMT summary: 3.79 secs (+0%)
NMT detail: 3.91 secs (+3%)
So, here, not so bad. Cost of summary mode actually disappeared in test noise.
Renaissance, philosophers benchmark
Interesting example, since this does a lot of synchronization, leads to many object monitors being inflated, and those are malloced:
Average benchmark scores:
NMT off: 4697
NMT summary: 4599 (-2%)
NMT detail: 4190 (-11%)
Somewhat in the middle between the two other examples.
Conclusion
There is no clear answer.
Both the memory and the performance cost depend on how many allocations the JVM does.
This number is small for normal well-behaved applications, but it may be large in pathological situations (e.g. JVM bugs) as well as in some corner case scenarios caused by user programs, e.g., lots of class loading or synchronization. To be really sure, you need to measure yourself.
The overhead of Native Memory Tracking obviously depends on how often the application allocates native memory. Usually, this is not something too frequent in a Java application, but cases may differ. Since you've already tried and didn't notice performance difference, your application is apparently not an exception.
In the summary mode, Native Memory Tracking roughly does the following things (a conceptual sketch in code follows the list):
increases every malloc request in the JVM by 2 machine words (16 bytes);
records the allocation size and flags in these 2 words;
atomically increments (or decrements on free) the counter corresponding to the given memory type;
besides malloc and free, it also handles changes in virtual memory reservation and allocations of new arenas, but these are even less frequent than malloc/free calls.
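Conceptually, and only conceptually (this is not HotSpot code and the names are made up), the summary-mode bookkeeping described above boils down to a pair of per-memory-type counters updated atomically on every malloc and free:

    import java.util.concurrent.atomic.AtomicLongArray;

    // Sketch of summary-mode accounting: one bytes counter and one block
    // counter per memory type (class, thread, code, ...), bumped atomically.
    final class NmtSummarySketch {
        static final int NUM_TYPES = 32;                // arbitrary number of memory types
        final AtomicLongArray mallocBytes = new AtomicLongArray(NUM_TYPES);
        final AtomicLongArray mallocCount = new AtomicLongArray(NUM_TYPES);

        void onMalloc(int memType, long size) {
            mallocBytes.addAndGet(memType, size + 16);  // payload + 2-word malloc header
            mallocCount.incrementAndGet(memType);
        }

        void onFree(int memType, long size) {
            mallocBytes.addAndGet(memType, -(size + 16));
            mallocCount.decrementAndGet(memType);
        }
    }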
So, to me, the overhead is quite small; 5-10% is definitely a large overestimation (the numbers would make sense for detail mode which collects and stores stack traces, which is expensive, but summary doesn't do that).
When many threads concurrently allocate/free native memory, the update of an atomic counter could become a bottleneck, but again, that's more like an extreme case. In short, if you measured a real application and didn't notice any degradation, you're likely safe to enable NMT summary in production.
Related
We inherited a system which runs in production and recently started to fail every 10 hours. Basically, our internal software marks the system as failed if it is unresponsive for a minute. We found that the problem is that our Full GC cycles last for 1.5 minutes; we use a 30 GB heap. Now the problem is that we cannot optimize a lot in a short period of time and we cannot partition our service quickly, but we need to get rid of the 1.5-minute pauses as soon as possible as our system fails because of these pauses in production. For us, an acceptable delay is 20 milliseconds but not more. What will be the quickest way to tweak the system? Reduce the heap to trigger GCs frequently? Use System.gc() hints? Any other solutions? We use Java 8 default settings and we have more and more users - i.e. more and more objects created.
Some GC stats:
You have a lot of retained data. There are a few options worth considering:
Increase the heap to 32 GB; this has little impact if you have free memory. Looking again at your totals, it appears you are using 32 GB rather than 30 GB, so this might not help.
If you don't have plenty of free memory, it is possible a small portion of your heap is being swapped out, which can increase full GC times dramatically.
There might be some simple ways to make the data structures more compact, e.g. use compact strings, or primitives instead of wrappers, such as a long for a timestamp instead of Date or LocalDateTime (a long is about 1/8th the size); see the sketch after this list.
If neither of these helps, try moving some of the data off heap. E.g. Chronicle Map is a ConcurrentMap which uses off-heap memory and can reduce your GC times dramatically, i.e. there is no GC overhead for data stored off heap. How easy this is to add depends highly on how your data is structured.
I suggest analysing how your data is structured to see if there are any easy ways to make it more efficient.
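To illustrate the "primitives instead of wrappers" idea from the list above (a made-up class, not taken from the question): a million timestamps stored as boxed objects versus as a flat long[] look very different to the garbage collector:

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class TimestampLayoutSketch {
        // Heavy variant: ~n Instant objects plus list overhead on the heap,
        // all of which the GC has to trace on every full collection.
        static List<Instant> boxed(int n) {
            List<Instant> list = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                list.add(Instant.ofEpochMilli(System.currentTimeMillis()));
            }
            return list;
        }

        // Compact variant: a single long[] of epoch millis, roughly 8 bytes
        // per entry and essentially nothing for the GC to trace.
        static long[] primitive(int n) {
            long[] millis = new long[n];
            for (int i = 0; i < n; i++) {
                millis[i] = System.currentTimeMillis();
            }
            return millis;
        }
    }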
There is no one-size-fits-all magic bullet solution to your problem: you'll need to have a good handle on your application's allocation and liveness patterns, and you'll need to know how that interacts with the specific garbage collection algorithm you are running (function of version of Java and command line flags passed to java).
Broadly speaking, a Full GC (that succeeds in reclaiming lots of space) means that lots of objects are surviving the minor collections (but aren't being leaked). Start by looking at the size of your Eden and Survivor spaces: if the Eden is too small, minor collections will run very frequently, and perhaps you aren't giving an object a chance to die before its tenuring threshold is reached. If the Survivors are too small, objects are going to be promoted into the Old gen prematurely.
GC tuning is a bit of an art: you run your app, study the results, tweak some parameters, and run it again. As such, you will need a benchmark version of your application, one which behaves as close as possible to the production one but which hopefully doesn't need 10 hours to cause a full GC.
As you stated that you are running Java 8 with the default settings, I believe that means that your Old collections are running with a Serial collector. You might see some very quick improvements by switching to a Parallel collector for the Old generation (-XX:+UseParallelOldGC). While this might reduce the 1.5 minute pause to some number of seconds (depending on the number of cores on your box, and the number of threads you specify for GC), this will not reduce your max pause to 20 ms.
When this happened to me, it was due to a memory leak caused by a static variable eating up memory. I would go through all recent code changes and look for any possible memory leaks.
I am performing analysis of different sort algorithms. Currently I am analysing the Insertion Sort and Quick Sort. And as part of the analysis, I need to measure the memory consumption.
I am using Visual VM for profiling. However, when I execute the Insertion Sort for a random data set of, let's say, 70,000 elements, I get a different range of heap memory usage on each run. For example, in the first run the heap memory consumption was 75 kbytes and then in the next round it dropped to 35 kbytes. And if I execute it a few more times, this value fluctuates randomly.
Is this normal, or am I missing something here? I have to plot a graph of data size versus memory consumption, and with this fluctuation I won't be able to draw a meaningful chart.
java version "1.8.0_65"
This is Java's garbage collector at work; it kicks in at its own pace and does its job. Perhaps it would be best for you to measure the amount of memory used after explicitly calling System.gc(), so that you're not measuring the garbage as well.
EDIT:
System.gc() should be called after you perform your tests, to explicitly request that the garbage collector kicks in. While it is true that System.gc() is treated only as a request and it is not 100% guaranteed that the JVM will honor it, this is most probably fine for your analysis, especially if you perform several runs.
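A minimal sketch of that kind of measurement (my own example; the sleep and the repeated gc() calls are just crude attempts to let the collector settle):

    public class HeapUsageProbe {
        static long usedHeapBytes() throws InterruptedException {
            Runtime rt = Runtime.getRuntime();
            System.gc();                 // request (not guarantee) a collection
            Thread.sleep(200);           // give the collector a moment
            System.gc();
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) throws InterruptedException {
            long before = usedHeapBytes();
            int[] data = new java.util.Random(42).ints(70_000).toArray(); // the data set under test
            long after = usedHeapBytes();
            System.out.println("approx. bytes retained by the data set: " + (after - before));
            System.out.println(data.length); // keep 'data' reachable so it is not collected early
        }
    }

The numbers are still only estimates, so average them over several runs.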
With regards to measuring memory usage, it is quite tricky, especially for low values. Please see this answer which contains some nice details:
You may find JMH useful for running benchmarks while isolating side effects from the JVM.
Read through the code samples to understand how to use it.
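For example, a minimal JMH benchmark for the insertion sort case might look roughly like this (class, method and parameter names are mine; adjust the sizes to your data sets):

    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class SortBenchmark {
        @Param({"1000", "70000"})
        int size;

        int[] data;

        @Setup(Level.Invocation)          // fresh, unsorted data for every call
        public void setUp() {
            data = new Random(42).ints(size).toArray();
        }

        @Benchmark
        public int[] insertionSort() {
            int[] a = data;
            for (int i = 1; i < a.length; i++) {
                int key = a[i];
                int j = i - 1;
                while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
                a[j + 1] = key;
            }
            return a;                     // return the result so it is not dead-code eliminated
        }
    }

Running it with JMH's GC profiler (-prof gc) also reports allocation rates, which tends to be steadier than eyeballing heap graphs in Visual VM.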
I just want to understand more about the currently popular garbage collection, malloc/free, and reference-counting ("counter") approaches to memory management.
From my understanding, GC is the most popular because it relieves developers of the burden of managing memory manually and is also more bulletproof; with malloc/free it is easy to make a mistake and cause memory leaks.
From http://ocaml.org/learn/tutorials/garbage_collection.html:
Why would garbage collection be faster than explicit memory allocation as in C? It's often assumed that calling free costs nothing. In fact free is an expensive operation which involves navigating over the complex data structures used by the memory allocator. If your program calls free intermittently, then all of that code and data needs to be loaded into the cache, displacing your program code and data, each time you free a single memory allocation. A collection strategy which frees multiple memory areas in one go (such as either a pool allocator or a GC) pays this penalty only once for multiple allocations (thus the cost per allocation is much reduced).
Is it true that GC is faster than malloc/free?
Also, what if counter-style memory management (reference counting, as Objective-C uses) joins the party?
I hope someone can summarize the comparisons with deeper insights.
Is it true that GC is faster than malloc/free?
It can be. It depends on the memory usage patterns. It also depends on how you measure "faster". (For example, are you measuring overall memory management efficiency, individual calls to malloc / free, or ... pause times.)
But conversely, malloc / free typically makes better use of memory than a modern copying GC ... provided that you don't run into heap fragmentation problems. And malloc / free "works" when the programming language doesn't provide enough information to allow a GC to distinguish heap pointers from other values.
Also, what if counter-style memory management (reference counting, as Objective-C uses) joins the party?
The overheads of reference counting make pointer assignment more expensive, and you have to somehow deal with reference cycles.
On the other hand, reference counting does offer a way to control memory management pauses ... which can be a significant issue for interactive games / apps. And memory usage is also better; see above.
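A tiny sketch of naive reference counting (illustrative only; this is not how Objective-C/ARC is actually implemented, and real implementations use atomic counters, which is part of the cost mentioned above):

    public class RefCounted {
        private int refCount = 1;      // the creator holds the first reference
        RefCounted child;              // may point back at the owner -> cycle

        void retain()  { refCount++; } // every counted pointer assignment pays for this...
        void release() {               // ...and for this
            if (--refCount == 0) {
                // "free" the object and drop the reference it holds
                if (child != null) child.release();
            }
        }

        public static void main(String[] args) {
            RefCounted a = new RefCounted();
            RefCounted b = new RefCounted();
            a.child = b; b.retain();   // a references b
            b.child = a; a.retain();   // b references a -> cycle
            a.release();               // owners let go...
            b.release();               // ...but both counts stay at 1: the classic cycle leak
        }
    }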
FWIW, the points made in the source that you quoted are true. But it is not the whole picture.
The problem is that the whole picture is ... too complicated to be covered properly in a StackOverflow answer.
In the case of Java, there is no competition for any lock when the object is small enough to fit into the Thread Local Allocation Buffer (TLAB). This is an internal design and it has proven to work really well. From my understanding, allocating a new object in a TLAB is just a pointer bump ("bump the pointer" allocation), which is pretty fast.
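To make "pointer bump" concrete, here is an illustrative sketch (not HotSpot source): the fast path only touches two thread-local fields, so no lock or atomic operation is needed:

    final class TlabSketch {
        private long top;   // next free address in this thread's buffer
        private long end;   // end of this thread's buffer

        long allocate(long size) {
            long newTop = top + size;
            if (newTop <= end) {        // fast path: the object fits in the TLAB
                long obj = top;
                top = newTop;           // "bump the pointer"
                return obj;
            }
            return slowPath(size);      // TLAB exhausted
        }

        private long slowPath(long size) {
            // Would grab a new TLAB from the shared eden space, which does
            // require synchronization -- omitted in this sketch.
            return 0L;
        }
    }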
I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these operations to find out what takes so much time, I have created a test program that creates only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously only uses very little heap space, compared to the original program.
It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection. So I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and only about 20 collections in the young generation, which each take less than 1 second.
So my question is: what else could there be that explains such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer for a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously)?
I am using Oracle JDK 1.7, if that's of any importance.
While bigger memory might mean bigger problems, I'd say there's nothing (except the GC, which you've excluded) that could extend 9 seconds to 48 minutes (a factor of 320).
A big heap can make spatial locality worse, but I don't think it matters here. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".
There's also the TLB, which is a cache for virtual address translation and which could cause some problems with very large memory. But again, not a factor of 320.
I don't think there's anything in the JVM which could cause such problems.
The only reason I can imagine is that you have some swap space which gets used - despite the fact that you have enough physical memory. Even slight swapping can be the cause for a huge slowdown. Make sure it's off (and possibly check swappiness).
Even when things are in memory you have multiple levels of data caching on modern CPUs. Every time you have to leave the cache to fetch data, things get slower. Having 50 GB of RAM could well mean that it is having to leave the cache for everything.
The symptoms and differences you describe are just massive though and I don't see something as simple as cache coherency making that much difference.
The best advice I can give you is to try running a profiler against it both when it's running slow and when it's running fast and compare the difference.
You need solid numbers and timings. "In this environment doing X took Y time". From that you can start narrowing things down.
I'm testing an API, written in Java, that is expected to minimize latency in processing messages received over a network. To achieve these goals, I'm playing around with the different garbage collectors that are available.
I'm trying four different techniques, which utilize the following flags to control garbage collection:
1) Serial: -XX:+UseSerialGC
2) Parallel: -XX:+UseParallelOldGC
3) Concurrent: -XX:+UseConcMarkSweepGC
4) Concurrent/incremental: -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
I ran each technique over the course of five hours. I periodically used the list of GarbageCollectorMXBean provided by ManagementFactory.getGarbageCollectorMXBeans() to retrieve the total time spent collecting garbage.
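For reference, the polling described above can be as simple as this sketch (the class name is mine):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStatsProbe {
        public static void main(String[] args) {
            long totalCount = 0, totalTimeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                totalCount += gc.getCollectionCount();  // -1 if undefined for this collector
                totalTimeMs += gc.getCollectionTime();  // accumulated elapsed ms, -1 if undefined
            }
            System.out.println(totalCount + " GC events totaling " + totalTimeMs + " ms");
        }
    }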
My results? Note that "latency" here is "Amount of time that my application+the API spent processing each message plucked off the network."
Serial: 789 GC events totaling 1309 ms; mean latency 47.45 us, median latency 8.704 us, max latency 1197 us
Parallel: 1715 GC events totaling 122518 ms; mean latency 450.8 us, median latency 8.448 us, max latency 8292 us
Concurrent: 4629 GC events totaling 116229 ms; mean latency 707.2 us, median latency 9.216 us, max latency 9151 us
Incremental: 5066 GC events totaling 200213 ms; mean latency 515.9 us, median latency 9.472 us, max latency 14209 us
I find these results to be so improbable that they border on absurd. Does anyone know why I might be having these kinds of results?
Oh, and for the record, I'm using Java HotSpot(TM) 64-Bit Server VM.
I'm working on a Java application that is expected to maximize throughput and minimize latency
Two problems with that:
Those are often contradictory goals, so you need to decide how important each is against the other (would you sacrifice 10% latency to get 20% throughput gain or vice versa? Are you aiming for some specific latency target, beyond which it doesn't matter whether it's any faster? Things like that.)
You haven't given any results for either of these.
All you've shown is how much time is spent in the garbage collector. If you actually achieve more throughput, you would probably expect to see more time spent in the garbage collector. Or to put it another way, I can make a change in the code to minimize the values you're reporting really easily:
// Avoid generating any garbage
Thread.sleep(10000000);
You need to work out what's actually important to you. Measure everything that's important, then work out where the trade-off lies. So the first thing to do is re-run your tests and measure latency and throughput. You may also care about total CPU usage (which isn't the same as CPU in GC of course) but while you're not measuring your primary aims, your results aren't giving you particularly useful information.
I don't find this surprising at all.
The problem with serial garbage collection is that while it's running, nothing else can run at all (aka "stops the world"). That has a good point though: it keeps the amount of work spent on garbage collection to just about its bare minimum.
Almost any sort of parallel or concurrent garbage collection has to do a fair amount of extra work to ensure that all modifications to the heap appear atomic to the rest of the code. Instead of just stopping everything for a while, it has to stop just those things that depend on a particular change, and then for just long enough to carry out that specific change. It then lets that code start running again, gets to the next point that it's going to make a change, stops other pieces of code that depend on it, and so on.
The other point (though in this case, probably a fairly minor one) is that as you process more data, you generally expect to generate more garbage, and therefore spend more time doing garbage collection. Since the serial collector does stop all other processing while it does its job, that not only makes the garbage collection fast, but also prevents any more garbage from being generated during that time.
Now, why do I say that's probably a minor contributor in this case? That's pretty simple: the serial collector only used up a little over a second out of five hours. Even though nothing else got done during that ~1.3 seconds, that's such a small percentage of five hours that it probably didn't make much (if any) real difference to your overall throughput.
Summary: the problem with serial garbage collection isn't that it uses excessive time overall -- it's that it can be very inconvenient if it stops the world right when you happen to need fast response. At the same time, I should add that as long as your collection cycles are short, this can still be fairly minimal. In theory, the other forms of GC mostly limit your worst case, but in fact (e.g., by limiting the heap size) you can often limit your maximum latency with a serial collector as well.
There was an excellent talk by a Twitter engineer at the 2012 QCon Conference on this topic - you can watch it here.
It discussed the various "generations" in the Hotspot JVM memory and garbage collection (Eden, Survivor, Old). In particular note that the "Concurrent" in ConcurrentMarkAndSweep only applies to the Old generation, i.e. objects that hang around for a while.
Short-lived objects are GCd from the "Eden" generation - this is cheap, but is a "stop-the-world" GC event regardless of which GC algorithm you have chosen!
The advice was to tune the young generation first, e.g. allocate lots of new Eden so there's more chance for objects to die young and be reclaimed cheaply. Use -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution... If you get more than 100% survivor occupancy then there wasn't room, so objects get quickly promoted to Old - this is bad.
When tuning for the Old generation, if latency is top priority, it was recommended to try ParallelOld with auto-tuning first (-XX:+UseAdaptiveSizePolicy etc.), then try CMS, then maybe the new G1GC.
You cannot say one GC is better than the other; it depends on your requirements and your application.
But if you want to maximize throughput and minimize latency, GC is your enemy! You should not call GC at all and you should also try to prevent the JVM from calling GC.
Go with serial and use object pools.
With serial collection, only one thing happens at a time. For example, even when multiple CPUs are available, only one is utilized to perform the collection. When parallel collection is used, the task of garbage collection is split into parts and those subparts are executed simultaneously, on different CPUs. The simultaneous operation enables the collection to be done more quickly, at the expense of some additional complexity and potential fragmentation.
While the serial GC uses only one thread to process a GC, the parallel GC uses several threads, and is therefore faster. This GC is useful when there is enough memory and a large number of cores. It is also called the "throughput GC".