Comparisons between GC and two other memory management methods - java

I just want to understand more about the currently popular memory management approaches: garbage collection, malloc / free, and reference counting.
From my understanding, GC is the most popular because it relieves developers of the burden of managing memory manually, and it is also more bulletproof: with malloc / free it is easy to make mistakes and cause memory leaks.
From http://ocaml.org/learn/tutorials/garbage_collection.html:
Why would garbage collection be faster than explicit memory allocation
as in C? It's often assumed that calling free costs nothing. In fact
free is an expensive operation which involves navigating over the
complex data structures used by the memory allocator. If your program
calls free intermittently, then all of that code and data needs to be
loaded into the cache, displacing your program code and data, each
time you free a single memory allocation. A collection strategy which
frees multiple memory areas in one go (such as either a pool allocator
or a GC) pays this penalty only once for multiple allocations (thus
the cost per allocation is much reduced).
Is it true that GC is faster than malloc / free?
Also, what if reference-counting style memory management (which Objective-C uses) joins the party?
I hope someone can summarize the comparison with deeper insight.

Is it true that GC is faster than malloc / free?
It can be. It depends on the memory usage patterns. It also depends on how you measure "faster". (For example, are you measuring overall memory management efficiency, the cost of individual calls to malloc / free, or ... pause times?)
But conversely, malloc / free typically makes better use of memory than a modern copying GC ... provided that you don't run into heap fragmentation problems. And malloc / free "works" when the programming language doesn't provide enough information to allow a GC to distinguish heap pointers from other values.
Also, what if reference-counting style memory management (which Objective-C uses) joins the party?
The overheads of reference counting make pointer assignment more expensive, and you have to somehow deal with reference cycles.
On the other hand, reference counting does offer a way to control memory management pauses ... which can be a significant issue for interactive games / apps. And memory usage is also better; see above.
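To make both points concrete, here is a minimal sketch in plain Java of what naive reference counting looks like (a hypothetical Ref wrapper, nothing like Objective-C's actual runtime): every pointer assignment turns into counter updates, and a cycle keeps both counts above zero forever.

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: a hand-rolled reference-counted cell, not a real runtime.
    final class Ref<T> {
        private final AtomicInteger count = new AtomicInteger(1);
        private final Runnable onFree;   // runs when the count reaches zero
        final T value;

        Ref(T value, Runnable onFree) { this.value = value; this.onFree = onFree; }

        Ref<T> retain() { count.incrementAndGet(); return this; }

        void release() {
            if (count.decrementAndGet() == 0) onFree.run();
        }
    }

    class Node {
        Ref<Node> next;   // "pointer assignment" to this field costs retain/release
    }

    public class RefCountDemo {
        public static void main(String[] args) {
            Ref<Node> a = new Ref<>(new Node(), () -> System.out.println("freed a"));
            Ref<Node> b = new Ref<>(new Node(), () -> System.out.println("freed b"));

            a.value.next = b.retain();   // a -> b
            b.value.next = a.retain();   // b -> a  (a reference cycle)

            a.release();                 // drop our own handles...
            b.release();
            // ...but neither "freed" line ever prints: the cycle pins both counts at 1.
        }
    }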
FWIW, the points made in the source that you quoted are true. But they are not the whole picture.
The problem is that the whole picture is ... too complicated to be covered properly in a StackOverflow answer.

In the case of Java there is no contention on any lock when the object is small enough to fit into the Thread Local Allocation Buffer (TLAB).
This is an internal design and it has proven to work really well. From my understanding, allocating a new object is then just a pointer bump inside the TLAB ("bump the pointer"), which is pretty fast.
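A rough sketch of the idea in Java-flavoured pseudocode (an illustrative model, not HotSpot's actual allocator): each thread owns a private chunk of eden, so allocation is just a bounds check and a cursor bump, with no lock.

    // Illustrative model of TLAB "bump the pointer" allocation, not HotSpot code.
    final class Tlab {
        private long top;        // next free address within this thread's buffer
        private final long end;  // end of the buffer

        Tlab(long start, long size) { this.top = start; this.end = start + size; }

        /** Returns the address of the new object, or -1 if this TLAB is exhausted. */
        long allocate(long objectSize) {
            long newTop = top + objectSize;
            if (newTop > end) {
                return -1;       // slow path: request a new TLAB or allocate in shared eden
            }
            long address = top;
            top = newTop;        // the "bump": no lock needed, the buffer is thread-local
            return address;
        }
    }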

Related

What is the overhead of Java Native Memory Tracking in "summary" mode?

I'm wondering what the real/typical overhead is when NMT is enabled via -XX:NativeMemoryTracking=summary (the full command options I'm after are -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics).
I could not find much information anywhere - not on SO, in blog posts, or in the official docs.
The docs say:
Note: Enabling NMT causes a 5%-10% performance overhead.
But they do not say which mode is expected to have this performance overhead (both summary and detail?)
and what this overhead really is (CPU, memory, ...).
In the Native Memory Tracking guide they claim:
Enabling NMT will result in a 5-10 percent JVM performance drop, and memory usage for NMT adds 2 machine words to all malloc memory as a malloc header. NMT memory usage is also tracked by NMT.
But again, is this true for both summary and detail mode?
What I'm after is basically whether it's safe to add -XX:NativeMemoryTracking=summary permanently for a production app (similar to continuous JFR recording) and what the potential costs are.
So far, when testing this on our app, I didn't spot a difference, but it's difficult to measure reliably.
Is there an authoritative source of information containing more details about this performance overhead?
Does somebody have experience with enabling this permanently for production apps?
Disclaimer: I base my answer on JDK 18. Most of what I write should be valid for older releases. When in doubt, you need to measure for yourself.
Background:
NMT tracks hotspot VM memory usage and memory usage done via Direct Byte Buffers. Basically, it hooks into malloc/free and mmap/munmap calls and does accounting.
It does not track other JDK memory usage (outside hotspot VM) or usage by third-party libraries. That matters here, since it makes NMT behavior somewhat predictable. Hotspot tries to avoid fine-grained allocations via malloc. Instead, it relies on custom memory managers like Arenas, code heap or Metaspace.
Therefore, for most applications, mallocs/mmap from hotspot are not that "hot" and the number of mallocs is not that large.
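As an aside, once NMT is enabled, what it has recorded can be inspected at runtime with the standard jcmd tool (standard diagnostic commands; output omitted here):

    jcmd <pid> VM.native_memory summary
    # establish a baseline and later show only the delta:
    jcmd <pid> VM.native_memory baseline
    jcmd <pid> VM.native_memory summary.diff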
I'll focus on malloc/free for the rest of this writeup, since they outnumber mmaps/munmaps by far:
Memory cost:
(1) (detail+summary): 2 words per malloc()
(2) (detail only): malloc site table
(3) (detail+summary): mapping lists for VM mappings and thread stacks
Here, (1) completely dwarfs (2) and (3). Therefore the memory overhead between summary and detail mode is not significant.
Note that even (1) may not matter that much, since the underlying libc allocator already dictates a minimal allocation size that may be larger than (pure allocation size + 16 byte malloc header). So, how much of the NMT memory overhead actually translates into RSS increase needs to be measured.
How much memory overhead in total this means cannot be answered since JVM memory costs are usually dominated by heap size. So, comparing RSS with NMT cost is almost meaningless. But just to give an example, for spring petclinic with 1GB pretouched heap NMT memory overhead is about 0.5%.
In my experience, NMT memory overhead only matters in pathological situations or in corner cases that cause the JVM to do tons of fine-grained allocations, e.g. if one does an insane amount of class loading. But often these are exactly the cases where you want NMT enabled, to see what's going on.
Performance cost:
NMT does need some synchronization. It atomically increases counters on each malloc/free in summary mode.
In detail mode, it does a lot more:
capture call stack on malloc site
look for call stack in hash map
increase hash map entry counters
This requires more cycles. The hash map is lock-free, but it is still modified with atomic operations. That looks expensive, especially if hotspot does many mallocs from different threads. How bad is it really?
Worst case example
Micro-benchmark: 64 million malloc allocations (via Unsafe.allocateMemory()) done by 100 concurrent threads on a 24-core machine (a simplified sketch of such a harness is shown below):
NMT off: 6 secs
NMT summary: 34 secs
NMT detail: 46 secs
That looks insane. However, it may not matter in practice, since this is not a real-life workload.
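The exact harness is not part of the answer; a stress test of that shape could look roughly like this (illustrative sketch only, names and sizes chosen arbitrarily; run it once with NMT off, once with summary and once with detail to compare wall-clock times):

    import java.lang.reflect.Field;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import sun.misc.Unsafe;

    // Hypothetical reconstruction of a "worst case" stress test: many threads doing
    // nothing but native malloc/free through Unsafe, so every call hits the NMT hooks.
    public class NativeAllocStress {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            int threads = 100;
            long allocsPerThread = 64_000_000L / threads;
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    for (long i = 0; i < allocsPerThread; i++) {
                        long addr = unsafe.allocateMemory(64); // tracked malloc
                        unsafe.freeMemory(addr);               // tracked free
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            System.out.printf("took %.1f s%n", (System.nanoTime() - start) / 1e9);
        }
    }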
Spring petclinic bootup, average of ten runs:
NMT off: 3.79 secs
NMT summary: 3.79 secs (+0%)
NMT detail: 3.91 secs (+3%)
So, here, not so bad. The cost of summary mode actually disappeared in the test noise.
Renaissance, philosophers benchmark
Interesting example, since this does a lot of synchronization, leads to many object monitors being inflated, and those are malloced:
Average benchmark scores:
NMT off: 4697
NMT summary: 4599 (-2%)
NMT detail: 4190 (-11%)
Somewhat in the middle between the two other examples.
Conclusion
There is no clear answer.
Both memory- and performance cost depend on how many allocations the JVM does.
This number is small for normal well-behaved applications, but it may be large in pathological situations (e.g. JVM bugs) as well as in some corner case scenarios caused by user programs, e.g., lots of class loading or synchronization. To be really sure, you need to measure yourself.
The overhead of Native Memory Tracking obviously depends on how often the application allocates native memory. Usually, this is not something too frequent in a Java application, but cases may differ. Since you've already tried and didn't notice performance difference, your application is apparently not an exception.
In the summary mode, Native Memory Tracking roughly does the following things:
increases every malloc request in the JVM by 2 machine words (16 bytes);
records the allocation size and flags in these 2 words;
atomically increments (or decrements on free) the counter corresponding to the given memory type;
besides malloc and free, it also handles changes in virtual memory reservation and allocations of new arenas, but these are even less frequent than malloc/free calls.
So, to me, the overhead is quite small; 5-10% is definitely a large overestimation (the numbers would make sense for detail mode which collects and stores stack traces, which is expensive, but summary doesn't do that).
When many threads concurrently allocate/free native memory, the update of an atomic counter could become a bottleneck, but again, that's more like an extreme case. In short, if you measured a real application and didn't notice any degradation, you're likely safe to enable NMT summary in production.
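A rough mental model of that summary-mode bookkeeping, in illustrative Java (HotSpot actually implements this in C++ inside the VM; the memory types and names below are simplified):

    import java.util.concurrent.atomic.AtomicLong;

    // Simplified model of NMT summary accounting: one counter pair per memory type,
    // bumped atomically on every tracked malloc/free.
    enum MemType { JAVA_HEAP, CLASS, THREAD, CODE, GC, INTERNAL, OTHER }

    final class SummaryTracker {
        private final AtomicLong[] bytes  = new AtomicLong[MemType.values().length];
        private final AtomicLong[] counts = new AtomicLong[MemType.values().length];

        SummaryTracker() {
            for (int i = 0; i < bytes.length; i++) {
                bytes[i]  = new AtomicLong();
                counts[i] = new AtomicLong();
            }
        }

        // Called from the malloc hook; the 2-word header stores size and type so
        // the matching free() knows what to subtract.
        void recordMalloc(MemType type, long size) {
            bytes[type.ordinal()].addAndGet(size);
            counts[type.ordinal()].incrementAndGet();
        }

        void recordFree(MemType type, long size) {
            bytes[type.ordinal()].addAndGet(-size);
            counts[type.ordinal()].decrementAndGet();
        }
    }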

Why does the garbage collector stop all the threads before reclaiming the memory?

I was reading a performance-related post and encountered the sentence "Java's garbage collector stops all the threads before reclaiming the memory, which is also a performance issue". I tried to find more about it on Google, but I could not.
Could someone please share something so that I can be clear about this?
The short and not very informative answer is: because it's damn hard not to. So let's elaborate.
There are many built-in collectors in the HotSpot JVM (see https://blogs.oracle.com/jonthecollector/entry/our_collectors). The generational collectors have evolved significantly, but as of this writing they still cannot achieve full concurrency. They can concurrently mark which objects are dead and which are not, and they can concurrently sweep the dead objects, but they still cannot concurrently compact the fragmented living objects without stopping your program (*).
This is mostly because it's really, really hard to ensure that you are not breaking someone's program by changing an object's heap location and updating all references to it, without stopping the world and doing all in one clean swoop. It's also hard to ensure that all the living objects that you are moving are not being changed under your nose.
Generational collectors can still run for significant amounts of time without stopping the world and doing the necessary work, but still their algorithms are delaying the inevitable, not guaranteeing fully concurrent GC. Notice how phrases like mostly concurrent (i.e. not always concurrent) are used when describing many GC algorithms.
There are also region-based collectors like G1GC, and they can show awesome results (to the point that G1GC will become the default collector in HotSpot), but they still cannot guarantee that there will be no stop-the-world pauses. The problem here is again compaction, but for G1 specifically it is somewhat less of an issue, because it can perform region-based compaction incrementally rather than compacting the whole heap at once.
To say that this even scratches the surface would be generous - the area is gigantic, and I'd recommend going over some of the accessible materials on the topic, like Gil Tene's Understanding Garbage Collection for some of the theory behind it, or Emad Benjamin's Virtualizing and Tuning Large Scale JVMs for some practical problems and solutions.
(*) This is not an entirely unsolvable problem, though. Azul's Zing JVM with its C4 garbage collector claims fully concurrent collection (there's a whitepaper about it, but you may find the details here more interesting). OpenJDK's Shenandoah project also shows very promising results. Still, as the8472 has explained, you pay some price in throughput and significant price in complexity. The G1GC team considered going for a fully concurrent algorithm, but decided that the benefits of a STW collector outweigh that in the context of G1GC.
In principle they don't have to. But writing and using a fully concurrent garbage collector is
incredibly complex
requires more breathing room to operate efficiently, which may not be acceptable on memory-constrained devices
incurs significant performance overhead. You end up trading throughput (CPU cycles spent in mutators vs. cpu cycles spent in collector threads) for improved (zeroish) pause times. This may be an acceptable deal on very large heaps and many-core machines that provide interactive services, but may not be acceptable on small devices or services that do batch processing.
Conversely an implementation using stop the world pauses is simpler and more efficient in terms of CPU utilization with the downside that there are pauses.
Additionally you have to consider that pause times can be fairly low on small heaps if we're using humans as the yardstick. So low-pause/pauseless collectors are generally only worth it if you either have a system with tighter tolerances than humans or larger heaps, something that has historically been rare and has only become more common recently as computers keep growing.
To avoid diving into the details let's consider reference counting instead of mark-sweep-compact collectors. There fully concurrent reference counting incurs the overhead of atomic memory accesses and potentially more complex counting schemes with higher memory footprints.

Why does Java wait so long to run the garbage collector?

I am building a Java web app, using the Play! Framework. I'm hosting it on playapps.net. I have been puzzling for a while over the provided graphs of memory consumption. Here is a sample:
The graph comes from a period of consistent but nominal activity. I did nothing to trigger the falloff in memory, so I presume this occurred because the garbage collector ran as it has almost reached its allowable memory consumption.
My questions:
Is it fair for me to assume that my application does not have a memory leak, as it appears that all the memory is correctly reclaimed by the garbage collector when it does run?
(from the title) Why is Java waiting until the last possible second to run the garbage collector? I am seeing significant performance degradation as memory consumption grows into the top fourth of the graph.
If my assertions above are correct, then how can I go about fixing this issue? The other posts I have read on SO seem opposed to calls to System.gc(), ranging from neutral ("it's only a request to run GC, so the JVM may just ignore you") to outright opposed ("code that relies on System.gc() is fundamentally broken"). Or am I off base here, and should I be looking for defects in my own code that are causing this behavior and the intermittent performance loss?
UPDATE
I have opened a discussion on PlayApps.net pointing to this question and mentioning some of the points here; specifically @Affe's comment regarding the settings for a full GC being set very conservatively, and @G_H's comment about settings for the initial and max heap size.
Here's a link to the discussion, though you unfortunately need a playapps account to view it.
I will report the feedback here when I get it; thanks so much everyone for your answers, I've already learned a great deal from them!
Resolution
Playapps support, which is still great, didn't have many suggestions for me; their only thought was that if I was using the cache extensively this might be keeping objects alive longer than need be, but that isn't the case. I still learned a ton (woo hoo!), and I gave @Ryan Amos the green check as I took his suggestion of calling System.gc() every half day, which for now is working fine.
Any detailed answer is going to depend on which garbage collector you're using, but there are some things that are basically the same across all (modern, sun/oracle) GCs.
Every time you see the usage in the graph go down, that is a garbage collection. The only way the heap gets freed is through garbage collection. The thing is, there are two types of garbage collections: minor and full. The heap gets divided into two basic "areas": young and tenured. (There are lots more subgroups in reality.) Anything that is taking up space in young and is still in use when the minor GC comes along to free up some memory is going to get 'promoted' into tenured. Once something makes the leap into tenured, it sits around indefinitely until the heap has no free space and a full garbage collection is necessary.
So one interpretation of that graph is that your young generation is fairly small (by default it can be a fairly small percentage of total heap on some JVMs) and you're keeping objects "alive" for comparatively very long times. (Perhaps you're holding references to them in the web session?) So your objects are 'surviving' garbage collections until they get promoted into tenured space, where they stick around indefinitely until the JVM is well and truly out of memory.
Again, that's just one common situation that fits with the data you have. Would need full details about the JVM configuration and the GC logs to really tell for sure what's going on.
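To make that pattern concrete, here is a tiny hypothetical example (not the asker's code) of objects being kept reachable from a long-lived map, so they survive every minor GC and are promoted to the tenured generation, where only a full collection will reclaim them:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical example: a long-lived "session" cache keeps per-request data
    // reachable, so it survives minor GCs and gets promoted to the old generation.
    public class SessionCacheDemo {
        static final Map<String, byte[]> SESSIONS = new HashMap<>();

        public static void main(String[] args) {
            for (int i = 0; i < 100_000; i++) {
                byte[] temp = new byte[1024];               // dies young: cheap to collect
                SESSIONS.put("user-" + i, new byte[1024]);  // stays reachable: gets tenured
            }
            // Run with -verbose:gc (or -Xlog:gc on JDK 9+) to watch the old
            // generation fill up until a full collection becomes necessary.
            System.out.println("cached sessions: " + SESSIONS.size());
        }
    }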
Java won't run the garbage collector until it has to, because garbage collection slows things down quite a bit and shouldn't be run that frequently. I think you would be OK to schedule a collection more frequently, such as every 3 hours. If an application never consumes the full memory, there should be no reason to ever run the garbage collector, which is why Java only runs it when memory usage is very high.
So basically, don't worry about what others say: do what works best. If you find performance improvements from running the garbage collector at 66% memory, do it.
I am noticing that the graph isn't sloping strictly upward until the drop, but has smaller local variations. Although I'm not certain, I don't think memory use would show these small drops if there was no garbage collection going on.
There are minor and major collections in Java. Minor collections occur frequently, whereas major collections are rarer and diminish performance more. Minor collections probably tend to sweep up stuff like short-lived object instances created within methods. A major collection will remove a lot more, which is what probably happened at the end of your graph.
Now, some answers that were posted while I'm typing this give good explanations regarding the differences in garbage collectors, object generations and more. But that still doesn't explain why it would take so absurdly long (nearly 24 hours) before a serious cleaning is done.
Two things of interest that can be set for a JVM at startup are the maximum allowed heap size, and the initial heap size. The maximum is a hard limit, once you reach that, further garbage collection doesn't reduce memory usage and if you need to allocate new space for objects or other data, you'll get an OutOfMemoryError. However, internally there's a soft limit as well: the current heap size. A JVM doesn't immediately gobble up the maximum amount of memory. Instead, it starts at your initial heap size and then increases the heap when it's needed. Think of it a bit as the RAM of your JVM, that can increase dynamically.
If the actual memory use of your application starts to reach the current heap size, a garbage collection will typically be instigated. This might reduce the memory use, so an increase in heap size isn't needed. But it's also possible that the application currently does need all that memory and would exceed the heap size. In that case, it is increased provided that it hasn't already reached the maximum set limit.
Now, what might be happening in your case is that the initial heap size is set to the same value as the maximum. Suppose that is so; then the JVM will immediately claim all that memory. It will take a very long time before the application has accumulated enough garbage for its memory usage to reach the heap size, but at that moment you'll see a large collection. Starting with a small enough heap and allowing it to grow keeps the memory use limited to what's needed.
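For example (standard HotSpot flags; the application name and the sizes here are made up for illustration):

    # Heap may grow from 256 MB up to 2 GB; collections are triggered as the
    # current (soft) limit is approached:
    java -Xms256m -Xmx2g -verbose:gc MyApp

    # Initial size equals maximum size: the JVM claims the full 2 GB up front, so
    # it can take much longer before a big collection is observed:
    java -Xms2g -Xmx2g -verbose:gc MyApp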
This is assuming that your graph shows heap use and not allocated heap size. If that's not the case and you are actually seeing the heap itself grow like this, something else is going on. I'll admit I'm not savvy enough regarding the internals of garbage collection and its scheduling to be absolutely certain of what's happening here, most of this is from observation of leaking applications in profilers. So if I've provided faulty info, I'll take this answer down.
As you might have noticed, this does not affect you. The garbage collection only kicks in if the JVM feels there is a need for it to run, and this happens for the sake of optimization: there's no point in doing many small collections if you can do a single full collection and a complete cleanup.
The current JVM contains some really interesting algorithms, and garbage collection itself is divided across three different regions; you can find a lot more about this here, and here's a sample:
Three types of collection algorithms
The HotSpot JVM provides three GC algorithms, each tuned for a specific type of collection within a specific generation. The copy (also known as scavenge) collection quickly cleans up short-lived objects in the new generation heap. The mark-compact algorithm employs a slower, more robust technique to collect longer-lived objects in the old generation heap. The incremental algorithm attempts to improve old generation collection by performing robust GC while minimizing pauses.
Copy/scavenge collection
Using the copy algorithm, the JVM reclaims most objects in the new generation object space (also known as eden) simply by making small scavenges -- a Java term for collecting and removing refuse. Longer-lived objects are ultimately copied, or tenured, into the old object space.
Mark-compact collection
As more objects become tenured, the old object space begins to reach maximum occupancy. The mark-compact algorithm, used to collect objects in the old object space, has different requirements than the copy collection algorithm used in the new object space.
The mark-compact algorithm first scans all objects, marking all reachable objects. It then compacts all remaining gaps of dead objects. The mark-compact algorithm occupies more time than the copy collection algorithm; however, it requires less memory and eliminates memory fragmentation.
Incremental (train) collection
The new generation copy/scavenge and the old generation mark-compact algorithms can't eliminate all JVM pauses. Such pauses are proportional to the number of live objects. To address the need for pauseless GC, the HotSpot JVM also offers incremental, or train, collection.
Incremental collection breaks up old object collection pauses into many tiny pauses even with large object areas. Instead of just a new and an old generation, this algorithm has a middle generation comprising many small spaces. There is some overhead associated with incremental collection; you might see as much as a 10-percent speed degradation.
The -Xincgc and -Xnoincgc parameters control how you use incremental collection. The next release of HotSpot JVM, version 1.4, will attempt continuous, pauseless GC that will probably be a variation of the incremental algorithm. I won't discuss incremental collection since it will soon change.
This generational garbage collector is one of the most efficient solutions we have for the problem nowadays.
I had an app that produced a graph like that and acted as you describe. I was using the CMS collector (-XX:+UseConcMarkSweepGC). Here is what was going on in my case.
I did not have enough memory configured for the application, so over time I was running into fragmentation problems in the heap. This caused GCs with greater and greater frequency, but it did not actually throw an OOME or fail out of CMS to the serial collector (which it is supposed to do in that case), because the stats it keeps only count application paused time (GC blocks the world); application concurrent time (GC running alongside application threads) is ignored for those calculations. I tuned some parameters, mainly gave it a whole crap load more heap (with a very large new space), set -XX:CMSFullGCsBeforeCompaction=1, and the problem stopped occurring.
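In flag form, that tuning amounted to something along these lines (a hedged reconstruction; the exact sizes were not given, so the values here are placeholders):

    java -XX:+UseConcMarkSweepGC \
         -Xmx<bigger heap> -Xmn<large new space> \
         -XX:CMSFullGCsBeforeCompaction=1 \
         MyApp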
Probably you do have a memory leak that's cleared every 24 hours.

Repetitive allocation of same-size byte arrays, replace with pools?

As part of a memory analysis, we've found the following:
              percent            live              alloc'ed     stack   class
     rank   self   accum      bytes  objs        bytes     objs  trace   name
        3  3.98%  19.85%   24259392   808   3849949016  1129587 359697   byte[]
        4  3.98%  23.83%   24259392   808   3849949016  1129587 359698   byte[]
You'll notice that many objects are allocated, but few remain live. This is for a simple reason - the two byte arrays are allocated for each instance of a "client" that is generated. Clients are not reusable - each one can only handle one request and is then thrown away. The byte arrays always have the same size (30000).
We're considering moving to a pool (Apache's GenericObjectPool) of byte arrays, as normally there is a known number of active clients at any given moment (so the pool size shouldn't fluctuate much). This way, we can save on memory allocation and garbage collection. The question is: would the pool cause a severe CPU hit, and is this a good idea at all?
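For illustration, the kind of wrapper we have in mind would look roughly like this (just a sketch, written against the Commons Pool 2 API; the class and method names are made up):

    import org.apache.commons.pool2.BasePooledObjectFactory;
    import org.apache.commons.pool2.PooledObject;
    import org.apache.commons.pool2.impl.DefaultPooledObject;
    import org.apache.commons.pool2.impl.GenericObjectPool;

    // Illustrative sketch with Apache Commons Pool 2; the older commons-pool
    // GenericObjectPool mentioned above has a slightly different factory API.
    public class BufferPool {
        private static final int BUFFER_SIZE = 30_000;

        private final GenericObjectPool<byte[]> pool =
                new GenericObjectPool<>(new BasePooledObjectFactory<byte[]>() {
                    @Override public byte[] create() { return new byte[BUFFER_SIZE]; }
                    @Override public PooledObject<byte[]> wrap(byte[] buf) {
                        return new DefaultPooledObject<>(buf);
                    }
                });

        byte[] acquire() throws Exception { return pool.borrowObject(); }

        void release(byte[] buf) {
            java.util.Arrays.fill(buf, (byte) 0);  // wipe before reuse
            pool.returnObject(buf);
        }
    }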
Thanks for your help!
I think there are good GC-related reasons to avoid this sort of allocation behaviour. Depending on the size of the heap and the free space in eden at the time of allocation, simply allocating a 30000-element byte[] could be a serious performance hit: it could easily be bigger than the TLAB (so allocation is not a bump-the-pointer event), and there may not even be enough space available in eden, forcing allocation directly into tenured, which in itself is likely to cause another hit down the line due to increased full GC activity (particularly if using CMS, due to fragmentation).
Having said that, the comments from fdreger are completely valid too. A multithreaded object pool is a bit of a grim thing that is likely to cause headaches. You mention they handle a single request only; if each request is serviced by a single thread, then a ThreadLocal byte[] that is wiped at the end of the request could be a good option. If the request is short-lived relative to your typical young GC period, then the young->old reference issue may not be a big problem (as the probability of any given request being handled during a GC is small, even if you're guaranteed to hit it periodically).
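A minimal sketch of that ThreadLocal idea (hypothetical class and method names, assuming one request is handled entirely on one thread):

    import java.util.Arrays;

    // Hypothetical sketch: one reusable 30000-byte buffer per worker thread,
    // wiped at the end of each request so data never leaks between requests.
    public final class RequestBuffers {
        private static final int BUFFER_SIZE = 30_000;

        private static final ThreadLocal<byte[]> BUFFER =
                ThreadLocal.withInitial(() -> new byte[BUFFER_SIZE]);

        /** Borrow the current thread's buffer for the duration of one request. */
        public static byte[] get() {
            return BUFFER.get();
        }

        /** Call at the end of the request, before the thread handles the next one. */
        public static void reset() {
            Arrays.fill(BUFFER.get(), (byte) 0);
        }
    }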
Probably pooling will not help you much, if at all - possibly it will make things worse, although it depends on a number of factors (which GC you are using, how long the objects live, how much memory is available, etc.):
The time taken by a GC depends mostly on the number of live objects. The collector (I assume you run a vanilla Java JRE) does not visit dead objects and does not deallocate them one by one; it frees whole areas of memory after copying the live objects away (this keeps memory neat and compacted). 100 dead objects can be collected as fast as 100,000. On the other hand, all the live objects must be copied - so if you, say, have a pool of 100 objects and only 50 are used at a given time, keeping the unused objects around is going to cost you.
If your arrays currently tend to live shorter than the time needed to get tenured (copied to the old generation space), there is another problem: your pooled arrays will certainly live long enough. This will produce a situation where there are a lot of references from the old generation to the young one - and GCs are optimized with the reverse situation in mind.
Actually it is quite possible that pooling arrays will make your GC SLOWER than creating new ones; this is usually the case with cheap objects.
Another cost of pooling comes from synchronizing objects across threads and cleaning them up after use. Both are trickier than they sound.
Summing up, unless you are well aware of the internals of your GC and understand how it works under the hood, AND have results from a profiler that show that managing all the arrays is a bottleneck - DO NOT POOL. In most cases it is a bad idea.
If garbage collection in your case is really a performance hit (often cleaning up the eden space does not take much time if not many objects survive), and it is easy to plug in the object pool, try it, and measure it.
This certainly depends on your application's needs.
The pool would work out much better as long as you always keep a reference to it; that way the garbage collector simply ignores the pool, and it only needs to be declared once (you could always declare it static to be on the safe side). It would be persistent memory, but I doubt that will be a problem for your application.

What are the issues with preallocating objects in Java?

We've spent the last few months tuning our production application so that we experience no full GCs. We now experience young GCs only, with the rate of young GCs dependent on the rate of object allocation.
Our application needs to be as close to "real-time" as possible, so now we're trying to reduce the number of young GCs. As the old axiom goes, most of the data we allocate ends up being garbage and is discarded at the next young GC, so there is no need to preallocate that type of data. However, there is a good amount of objects (identified by type) that we know will make it from the young generation into the old generation.
Would it make sense to preallocate these objects during more ideal times (i.e. at startup) so we'll end up allocating less during our less-than-ideal times? I've read the literature that mentions how object pooling is not recommended with the latest JVMs because allocation is much cheaper. What are the drawbacks to preallocating objects that I know will make it to the old GC?
Reducing the rate of allocation makes GC "pauses" less frequent, but not shorter. For smoother "real-time" operation, you may actually want to increase the number of GC invocations: this is trading more GC-related CPU in order to get shorter pauses. Sun's JVM can be tuned with various options; I suggest trying -XX:NewRatio to make the young generation smaller.
The usual argument against pooling is that you are basically trying to write your own allocator, hoping that you will do a better job at it than the JVM allocator. It is justified in some specific cases where allocation is expensive, e.g. creating Thread instances.
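Thread pooling is the canonical example where the "allocation is expensive" argument really does hold; a minimal sketch using only the standard library:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Threads are genuinely expensive to create (native stack, kernel resources),
    // so reusing them through a pool is the standard, justified form of "pooling".
    public class WorkerPool {
        public static void main(String[] args) {
            ExecutorService workers = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 100; i++) {
                final int task = i;
                workers.submit(() -> System.out.println(
                        "task " + task + " on " + Thread.currentThread().getName()));
            }
            workers.shutdown();
        }
    }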
Just a note that there is a real-time JVM available. If your app needs predictable performance, then this is worth looking into.
