The Java Mission Control tool in the JDK provides statistics about object allocation in new TLABs and allocations outside TLABs (under Memory/Allocations). What is the significance of these statistics, and what is good for the performance of an application? Should I be worried if some objects are allocated outside TLABs, and if so, what can I do about it?
A TLAB is a Thread Local Allocation Buffer. The normal way objects are allocated in HotSpot is within a TLAB. TLAB allocations can be done without synchronization with other threads: since the allocation buffer is thread-local, synchronization is only needed when a new TLAB is fetched.
So, the ideal scenario is that as many allocations as possible are done in TLABs.
Some objects will be allocated outside TLABs, for example large objects. This is nothing to worry about as long as the percentage of allocations outside TLABs vs allocations in new TLABs is low.
TLABs are dynamically resized during execution for each thread individually. So, if a thread allocates a lot, the new TLABs it gets from the heap will increase in size. If you want, you can try the flag -XX:MinTLABSize to set a minimum TLAB size, for example
-XX:MinTLABSize=4k
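If you want to inspect the in-TLAB versus outside-TLAB ratio without Mission Control, HotSpot can also print per-thread TLAB statistics at each GC; a minimal invocation (JDK 8 style flag, and the jar name is just a placeholder):

java -XX:+PrintTLAB -jar MyApp.jar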
Answer provided by my colleague David Lindholm :)
Related
We have run into a problem with JVM GC. We have a high-QPS application whose JVM heap memory grows very fast: it uses more than 2 GB of heap within a few seconds, and then a GC triggers that collects more than 2 GB, every time, also very frequently. The GC behavior looks like the picture below, so there are two problems:

GC takes some time and, what is more, it is frequent.
The system will not be stable.

I have abstracted the problem into the code below. The system allocates large short-lived objects quickly.
public static void fun1() {
    for (int i = 0; i < 5000; i++) {
        // allocate a large short-lived array on every iteration
        Byte[] bs = new Byte[1024 * 1024 * 5];
        bs = null;
    }
}
So, I have some questions:
Many say that setting an object to null makes it easier for the GC thread to collect it. What does that mean? We all know that a minor GC is always triggered when the JVM is unable to allocate space for a new object. Thus, whether or not an object is set to null, GC is triggered only when space runs out, so setting the object to null seems meaningless.
How can I optimize when there are large short-lived objects? I mean, how can these objects be collected at some point other than when the young generation runs out of space?
Any suggestions will help me.
I would say you need to cache them. Instead of recreating them every time, try to find a way to reuse already created ones.
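For example, a minimal sketch of reusing one preallocated buffer instead of allocating a fresh one per iteration (it uses a primitive byte[], which is probably what the Byte[] in the question intended; all names are illustrative):

import java.util.Arrays;

public class BufferReuse {
    // allocated once, reused across iterations
    private static final byte[] SHARED = new byte[1024 * 1024 * 5];

    public static void fun1() {
        for (int i = 0; i < 5000; i++) {
            Arrays.fill(SHARED, (byte) 0); // reset the buffer instead of reallocating it
            // ... use SHARED as scratch space ...
        }
    }
}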
Try to allocate more memory to your app with the option
-Xms8g -Xmx8g
GC is called when there is not enough memory, so if you have more memory, GC won't be called so often.
It's hard to suggest something valuable without knowing the nature of the huge objects. Try to add an example of such an object as well as how you create/call them.
If you have big byte arrays inside the short-lived objects, try to place them outside of Java (e.g. in a file, keeping only the file path).
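A minimal sketch of that idea, assuming the payload is a plain byte[] (the class and method names are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OffHeapToFile {
    // write the big payload to disk and keep only the small Path object on the heap
    static Path store(byte[] payload) throws IOException {
        Path file = Files.createTempFile("payload", ".bin");
        Files.write(file, payload);
        return file;
    }

    static byte[] load(Path file) throws IOException {
        return Files.readAllBytes(file);
    }
}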
Many say that setting an object to null makes it easier for the GC thread to collect it.
This question is covered by several others on Stack Overflow, so I won't repeat those here. E.g. Is unused object available for garbage collection when it's still visible in stack?
How can I optimize when there are large short-lived objects? I mean, how can these objects be collected at some point other than when the young generation runs out of space?
Increase the size of the young generation so that it can contain the objects. Or use G1GC, which can selectively collect regions containing such objects instead of the entire old generation.
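For example (the sizes are illustrative; tune them for your heap):

-Xmn4g

or switch collectors with

-XX:+UseG1GC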
You could also try using direct byte buffers, which will allocate the memory outside the heap, thus reducing GC pressure. But the memory will stay allocated until the Buffers pointing to it get collected.
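A minimal sketch (standard java.nio API; the size is illustrative):

import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // 5 MB allocated outside the Java heap; only the small ByteBuffer
        // wrapper object itself lives on the heap
        ByteBuffer buf = ByteBuffer.allocateDirect(5 * 1024 * 1024);
        buf.put(0, (byte) 42);
        System.out.println(buf.get(0));
    }
}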
Other than that you should redesign your code to avoid such frequent, large allocations. Often one can use object pools or thread-local variables as scratch memory.
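A minimal sketch of the thread-local scratch idea (the size and names are illustrative):

public class Scratch {
    // one 5 MB buffer per thread, allocated once and then reused
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[1024 * 1024 * 5]);

    static void process() {
        byte[] buf = SCRATCH.get(); // no new allocation on the hot path
        // ... fill and use buf ...
    }
}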
Do not take my word on this; I am just repeating what I have pieced together from different sources. The HotSpot JVM uses Thread Local Allocation Buffers (TLABs). TLABs can be synchronized or not. Most of the time a TLAB is not synchronized, and hence a thread can allocate very quickly. There are a large number of these TLABs so that each active thread gets its own; the less active threads share a synchronized TLAB. When a thread exhausts its TLAB, it gets another one from a pool. When the pool runs out of TLABs, a young GC is triggered or needed.
When the pool runs out of TLABs, there are still going to be TLABs with space left in them. This "unused space" adds up and is significant. One can see this space because GC is triggered before the reserved heap size or the max heap size is reached. Thus, the heap is effectively 10-30% smaller. At least that is my guess from looking at heap usage graphs.
How do I tune the JVM to reduce the unused space?
You can tweak that setting with the command-line option -XX:TLABSize
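for example (the value is illustrative):

-XX:TLABSize=512k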
However as with most of these "deep down and dirty" settings, you should be very careful when changing those and monitor the effect of your changes closely.
You are correct that once there are no free TLABs left, there will be a young generation collection and they will be cleaned.
I can't tell much, but there is ResizeTLAB, which allows the JVM to resize TLABs based on allocation stats (I guess), eden size, etc. There is also a flag called TLABWasteTargetPercent (by default it is 1%). When the current TLAB cannot fit one more object, the JVM has to decide what to do: allocate directly in the heap, or allocate a new TLAB.
If this object's size is bigger than 1% of the current TLAB size, it is allocated directly; otherwise the current TLAB is retired.
So let's say the current size of the TLAB (TLABSize, by default zero, meaning it is adaptive) is 100 bytes (all numbers are theoretical), so 1% of that is 1 byte: that's the TLABWasteTargetPercent. Currently your TLAB is filled with 98 bytes and the object you want to allocate is 3 bytes. It will not fit in this TLAB, and at the same time it is bigger than the 1 byte threshold => it is allocated directly on the heap.
The other way around: your TLAB is filled with 99.7 bytes and you try to allocate a 1/2 byte object. It will not fit, but it is smaller than the 1 byte threshold; thus this TLAB is retired and a new one is given to you.
As far as I understand, there is one more parameter called TLABWasteIncrement: when you fail to allocate in the TLAB (and allocate directly in the heap), then, so that this story does not go on forever, the waste threshold is increased by this value (the default is 4) each time, increasing the chances of retiring this TLAB.
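Putting the pieces together, the decision roughly looks like this. This is only a conceptual sketch of my understanding, not HotSpot's actual code; all names and numbers are illustrative:

public class TlabSketch {
    long tlabSize = 100, tlabUsed = 0;
    long wasteLimit = 1;      // 1% of 100 bytes -> TLABWasteTargetPercent
    long wasteIncrement = 4;  // TLABWasteIncrement

    // returns true if the allocation went into a TLAB
    boolean allocate(long objectSize) {
        if (tlabUsed + objectSize <= tlabSize) {
            tlabUsed += objectSize;       // fast path: bump the pointer
            return true;
        }
        if (objectSize > wasteLimit) {
            wasteLimit += wasteIncrement; // raise the threshold so the TLAB retires eventually
            return false;                 // allocate directly in the heap
        }
        tlabUsed = objectSize;            // retire the TLAB, waste the tail, start a fresh one
        return true;
    }
}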
There are also TLABAllocationWeight and TLABRefillWasteFraction; I will probably update this post a bit later to cover them.
The allocation of TLABs when there is not enough space has a different algorithm but generally what you say about the free space is right.
The question now is: how can you be sure that the default TLAB config is not right for you? You need to start by getting some logs using -XX:+PrintTLAB, and if you see that too much space goes unused, then try to increase or reduce the TLAB size, or change -XX:TLABWasteTargetPercent or -XX:TLABWasteIncrement as people said.
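For example (JDK 8 style flag; on JDK 9+ I believe the unified-logging equivalent is -Xlog:gc+tlab=trace; the jar name is a placeholder):

java -XX:+PrintTLAB -jar MyApp.jar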
This is an article I find useful when I go through TLABs: https://alidg.me/blog/2019/6/21/tlab-jvm
I couldn't find a comprehensive source that explains the concept clearly. My understanding is that each thread is given a chunk of memory in eden where it allocates new objects, and a competing thread will end up with its own, roughly contiguous chunk of eden. What happens if the first thread runs out of free space in its TLAB? Would it request a new chunk of eden?
The idea of a TLAB is to reduce the need for synchronization between threads. Using TLABs, this need is reduced as every thread has an area it can use and expect to be the only thread using. Assuming that a TLAB can hold 100 objects, a thread would only need to acquire a lock to claim more memory when allocating the 101st object. Without TLABs, this would be required for every object. The downside is of course that you potentially waste space.
Large objects are typically allocated outside of a TLAB as they void the advantage of reducing the frequency of synchronized memory allocation. Some objects might not even fit inside of a TLAB.
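To make the fast path concrete, here is a compilable sketch of the idea; all names and sizes are illustrative, and this is not HotSpot code:

import java.util.concurrent.atomic.AtomicLong;

public class TlabIdea {
    static final AtomicLong EDEN_TOP = new AtomicLong(); // shared, needs atomic updates
    static final long TLAB_BYTES = 1024 * 1024;

    long tlabTop, tlabEnd; // per-thread, no locking needed

    long allocate(long size) {
        if (tlabTop + size <= tlabEnd) {
            long addr = tlabTop;  // fast path: a plain pointer bump
            tlabTop += size;
            return addr;
        }
        // slow path: claim a whole new TLAB from eden with one atomic operation
        long start = EDEN_TOP.getAndAdd(TLAB_BYTES);
        tlabTop = start + size;
        tlabEnd = start + TLAB_BYTES;
        return start;
    }
}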
You can set the size of a TLAB using the -XX:TLABSize flag, but generally I do not recommend messing with these settings unless you have really discovered a problem that you can solve that way.
I just want to understand more about the currently popular approaches: garbage collection, malloc / free, and reference counting.
From my understanding, GC is the most popular because it relieves developers of the burden of managing memory manually, and it is also more bulletproof: with malloc / free it is easy to make a mistake and cause a memory leak.
From http://ocaml.org/learn/tutorials/garbage_collection.html:
Why would garbage collection be faster than explicit memory allocation as in C? It's often assumed that calling free costs nothing. In fact free is an expensive operation which involves navigating over the complex data structures used by the memory allocator. If your program calls free intermittently, then all of that code and data needs to be loaded into the cache, displacing your program code and data, each time you free a single memory allocation. A collection strategy which frees multiple memory areas in one go (such as either a pool allocator or a GC) pays this penalty only once for multiple allocations (thus the cost per allocation is much reduced).
Is it true that GC is faster than malloc / free?
Also, what happens when reference-counting memory management (which Objective-C uses) joins the party?
I hope someone can summarize the comparisons with deeper insights.
Is it true that GC is faster than malloc / free?
It can be. It depends on the memory usage patterns. It also depends on how you measure "faster". (For example, are you measuring overall memory management efficiency, individual calls to malloc / free, or ... pause times.)
But conversely, malloc / free typically makes better use of memory than a modern copying GC ... provided that you don't run into heap fragmentation problems. And malloc / free "works" when the programming language doesn't provide enough information to allow a GC to distinguish heap pointers from other values.
Also, what happens when reference-counting memory management (which Objective-C uses) joins the party?
The overheads of reference counting make pointer assignment more expensive, and you have to somehow deal with reference cycles.
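To see why assignment gets more expensive, here is a tiny conceptual sketch of manual reference counting, written in Java for illustration only (this is not how Objective-C implements it):

import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    final AtomicInteger count = new AtomicInteger(1);

    void retain() { count.incrementAndGet(); }

    void release() {
        if (count.decrementAndGet() == 0) {
            // last reference gone: free the underlying resource here
        }
    }
}

// every re-assignment of a counted pointer pays two atomic updates:
//   newValue.retain();
//   oldValue.release();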
On the other hand, reference counting does offer a way to control memory management pauses ... which can be a significant issue for interactive games / apps. And memory usage is also better; see above.
FWIW, the points made in the source that you quoted are true. But it is not the whole picture.
The problem is that the whole picture is ... too complicated to be covered properly in a StackOverflow answer.
In the case of Java, there is no competition for any lock when the object is small enough to fit into the Thread Local Allocation Buffer (TLAB). This is an internal design and it has proven to work really well. From my understanding, allocating a new object is then just a pointer bump ("TLAB bump the pointer"), which is pretty fast.
I read somewhere that Java can allocate memory for objects in about 12 machine instructions. That's quite impressive to me. As far as I understand, one of the tricks the JVM uses is preallocating memory in chunks. This helps to minimize the number of requests to the operating system, which are quite expensive, I guess. But even CAS operations can cost up to 150 cycles on modern processors.
So, could anyone explain the real cost of memory allocation in Java and which tricks the JVM uses to speed up allocation?
The JVM pre-allocates an area of memory for each thread (TLA or Thread Local Area).
When a thread needs to allocate memory, it will use "bump the pointer allocation" within that area. (If the "free pointer" points to address 10 and the object to be allocated is of size 50, then we just bump the free pointer to 60 and tell the thread that it can use the memory between 10 and 59 for the object.)
The best trick is the generational garbage collector. This keeps the heap unfragmented, so allocating memory is just incrementing the pointer to the free space and returning the old value. If memory runs out, the garbage collector copies objects and in this way creates a new, unfragmented heap.
As different threads would have to synchronize over the pointer to the free memory when incrementing it, they preallocate chunks; this way a thread can allocate new memory without taking the lock.
All of this is explained in more detail here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
There is no single memory allocator for the JVM. IIRC, Sun's JVM and IBM's managed memory differently. However, generally the way a JVM operates is that it initially allocates one piece of memory; this segment is small enough to live in the processor's cache, making all access to it extremely fast.
As the application creates objects, the objects take memory from within this segment. Object allocation within the segment is simply pointer arithmetic.
Initially the offset address into the freshly minted segment is zero. The first object allocated has an "address" (actually an offset into the segment) of zero. When you allocate an object, the memory manager knows how big the object is, allocates that much space within the segment (16 bytes, say), and then increments its "offset address" by that amount, meaning that memory allocation is blindingly fast: it is just pointer arithmetic.
Sun has a whitepaper here, Memory Management in the Java HotSpot™ Virtual Machine, and IBM used to have a bunch of material on ibm.com/developerworks.