I couldn't find a comprehensive source that explains the concept clearly. My understanding is that each thread is given a chunk of memory in eden where it allocates new objects, and another thread gets its own, separate chunk of eden, so each thread's allocations are roughly contiguous. What happens if the first thread runs out of free space in its TLAB? Would it request a new chunk of eden?
The idea of a TLAB is to reduce the need for synchronization between threads. With TLABs, each thread has an area that it can use for allocation and expect to be the only thread using. Assuming that a TLAB can hold 100 objects, a thread would only need to acquire a lock to claim more memory when allocating the 101st object. Without TLABs, this would be required for every object. The downside is of course that you potentially waste space.
Large objects are typically allocated outside of a TLAB, as allocating them inside would defeat the purpose of reducing the frequency of synchronized memory allocation. Some objects might not even fit inside a TLAB.
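To make this concrete, here is a toy model of bump-the-pointer allocation inside a TLAB. It is purely illustrative (a real TLAB hands out raw heap addresses, not array offsets), but it shows why the fast path needs no lock:

final class ToyTlab {
    private final byte[] buffer; // the thread's private chunk of eden
    private int top;             // bump pointer: next free offset

    ToyTlab(int size) {
        this.buffer = new byte[size];
    }

    // Returns the offset of the new "object", or -1 if it does not fit
    // (the caller would then retire this TLAB or allocate in the shared heap).
    int allocate(int objectSize) {
        if (top + objectSize > buffer.length) {
            return -1;
        }
        int offset = top;
        top += objectSize; // no lock: only the owning thread touches this buffer
        return offset;
    }
}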
You can set the size of a TLAB using the -XX:TLABSize flag, but I generally do not recommend messing with these settings unless you have actually identified a problem that you can solve that way.
Do not take my word on this; I am just repeating what I have pieced together from different sources. The HotSpot JVM uses Thread Local Allocation Buffers (TLABs). TLABs can be synchronized or not. Most of the time a TLAB is not synchronized, so a thread can allocate very quickly. There are a large number of these TLABs, so the active threads each get their own; the less active threads share a synchronized TLAB. When a thread exhausts its TLAB, it gets another one from a pool. When the pool runs out of TLABs, a young GC is triggered or needed.
When the pool runs out of TLABs, there are still going to be TLABs with space left in them. This "unused space" adds up and is significant. One can see this space because GC is triggered before the reserved heap size or the max heap size is reached. Thus, the heap is effectively 10-30% smaller. At least that is my guess from looking at heap usage graphs.
How do I tune the JVM to reduce the unused space?
You can tweak that setting with the command-line option -XX:TLABSize
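For example (the size is purely illustrative, not a recommendation, and yourapp.jar stands for your own application):

java -XX:TLABSize=512k -jar yourapp.jar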
However, as with most of these "deep down and dirty" settings, you should be very careful when changing them and monitor the effect of your changes closely.
You are correct that once there are no free TLABs left, a young generation collection takes place and the TLABs are cleaned.
I can't tell much, but there is -XX:+ResizeTLAB, which allows the JVM to resize a TLAB based on allocation statistics (I guess), eden size, etc. There is also a flag called TLABWasteTargetPercent (by default it is 1%). When the current TLAB cannot fit one more object, the JVM has to decide what to do: allocate the object directly in the heap, or retire the current TLAB and allocate a new one.
If the object's size is bigger than 1% of the current TLAB size, it is allocated directly in the heap; otherwise the current TLAB is retired.
So let's say the current size of the TLAB (TLABSize; by default it is zero, meaning the size is adaptive) is 100 bytes (all numbers are theoretical). 1% of that is 1 byte - that's the TLABWasteTargetPercent. Currently your TLAB is filled with 98 bytes and the object you want to allocate is 3 bytes. It will not fit in this TLAB and at the same time it is bigger than the 1-byte threshold => it is allocated directly on the heap.
The other way around: your TLAB is filled with 99.7 bytes and you try to allocate a 0.5-byte object. It will not fit, but it is smaller than 1 byte; thus this TLAB is retired and a new one is given to you.
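The same decision, sketched as code using the theoretical numbers above (the class and method names are made up for illustration and do not correspond to HotSpot internals; fractions are rounded to whole bytes):

final class TlabDecisionDemo {
    static final int TLAB_SIZE = 100;               // the theoretical TLAB size
    static final int WASTE_LIMIT = TLAB_SIZE / 100; // 1% = TLABWasteTargetPercent

    static String decide(int usedBytes, int objectSize) {
        int free = TLAB_SIZE - usedBytes;
        if (objectSize <= free) {
            return "allocate in the current TLAB";
        }
        if (objectSize > WASTE_LIMIT) {
            return "allocate directly in the heap";
        }
        return "retire this TLAB and request a new one";
    }

    public static void main(String[] args) {
        System.out.println(decide(98, 3));  // does not fit, too big to waste -> heap
        System.out.println(decide(100, 1)); // does not fit, within waste limit -> retire
    }
}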
As far as I understand, there is one more parameter called TLABWasteIncrement: when you fail to allocate in the TLAB and allocate directly in the heap, the TLABWasteTargetPercent is increased by this value (default of 4%) so that this story does not happen forever, increasing the chances of retiring the TLAB.
There are also TLABAllocationWeight and TLABRefillWasteFraction; I will probably update this post a bit later to cover them.
The allocation of TLABs when there is not enough space follows a different algorithm, but generally what you say about the free space is right.
The question now is: how can you be sure that the default TLAB config is not right for you? Start by getting some logs using -XX:+PrintTLAB, and if you see that too much space goes unused, try increasing or reducing the TLAB size, or change -XX:TLABWasteTargetPercent or -XX:TLABWasteIncrement as others have said.
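For example (MyApp is a placeholder for your own main class; note that on JDK 9 and later this flag was replaced by unified logging, something like -Xlog:gc+tlab):

java -XX:+PrintTLAB MyApp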
This is an article I find useful when I go through TLABs: https://alidg.me/blog/2019/6/21/tlab-jvm
I know that Java's Runtime object can report the JVM's memory usage. However, I need the memory usage of a certain thread. Any idea how to get this?
I appreciate your answer!
A Thread shares everything with all other threads in the VM except its stack and its CPU cycles. All objects created by the Thread are pooled with all the other objects.
The problem is to define what the memory usage of a Thread is. Is it only those objects it created? What if these objects subsequently are referenced by other threads? Do they only count half, then? What about objects created somewhere else, but are now referenced by this Thread?
I know of no tool trying to measure the memory consumption of separate Threads.
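One caveat: while no tool measures "usage" in that sense, HotSpot does expose a cumulative per-thread allocation counter (bytes ever allocated by a thread, not live memory) through the com.sun.management extension of ThreadMXBean. A minimal, HotSpot-specific sketch:

import java.lang.management.ManagementFactory;

public class PerThreadAlloc {
    public static void main(String[] args) {
        // HotSpot-specific cast: com.sun.management.ThreadMXBean extends the
        // standard ThreadMXBean with allocation counters.
        com.sun.management.ThreadMXBean tmx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = tmx.getThreadAllocatedBytes(id);
        byte[] data = new byte[1_000_000]; // allocate something measurable
        long after = tmx.getThreadAllocatedBytes(id);
        System.out.println(data.length + " bytes requested; counter grew by "
                + (after - before));
    }
}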
The Java Mission Control tool in the JDK provides statistics about object allocation in new TLABs and allocations outside TLABs. (It's under Memory/Allocations.) What is the significance of these statistics, and what is good for the performance of an application? Should I be worried if some objects are allocated outside TLABs, and if yes, what can I do about it?
A TLAB is a Thread Local Allocation Buffer. The normal way objects are allocated in HotSpot is within a TLAB. TLAB allocations can be done without synchronization with other threads, since the allocation buffer is thread-local; synchronization is only needed when a new TLAB is fetched.
So, the ideal scenario is that as much as possible of the allocations are done in TLABs.
Some objects will be allocated outside TLABs, for example large objects. This is nothing to worry about as long as the percentage of allocations outside TLABs vs allocations in new TLABs is low.
TLABs are dynamically resized during execution for each thread individually. So, if a thread allocates very much, the new TLABs that it gets from the heap will increase in size. If you want, you can set the flag -XX:MinTLABSize to enforce a minimum TLAB size, for example
-XX:MinTLABSize=4k
Answer provided by my colleague David Lindholm :)
As part of a memory analysis, we've found the following:
          percent         live           alloc'ed       stack class
 rank   self  accum     bytes  objs      bytes    objs  trace name
    3  3.98% 19.85%  24259392   808 3849949016 1129587 359697 byte[]
    4  3.98% 23.83%  24259392   808 3849949016 1129587 359698 byte[]
You'll notice that many objects are allocated, but few remain live. This is for a simple reason - the two byte arrays are allocated for each instance of a "client" that is generated. Clients are not reusable - each one can only handle one request and is then thrown away. The byte arrays always have the same size (30000).
We're considering moving to a pool (Apache's GenericObjectPool) of byte arrays, as normally there is a known number of active clients at any given moment (so the pool size shouldn't fluctuate much). This way, we can save on memory allocation and garbage collection. The question is: would the pool cause a severe CPU hit, and is this a good idea at all?
Thanks for your help!
I think there are good GC-related reasons to avoid this sort of allocation behaviour. Depending on the size of the heap and the free space in eden at the time of allocation, simply allocating a 30000-element byte[] could be a serious performance hit: it could easily be bigger than the TLAB (so allocation is not a bump-the-pointer event), and there may not even be enough space in eden, in which case the array is allocated directly into tenured, which is itself likely to cause another hit down the line due to increased full GC activity (particularly with CMS, due to fragmentation).
Having said that, the comments from fdreger are completely valid too. A multithreaded object pool is a bit of a grim thing that is likely to cause headaches. You mention the clients handle a single request only; if that request is serviced by a single thread, then a ThreadLocal byte[] that is wiped at the end of the request could be a good option (see the sketch below). If the request is short-lived relative to your typical young GC period, then the young-to-old reference issue may not be a big problem (the probability of any given request being handled during a GC is small, even though you are guaranteed to hit this periodically).
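If you go the ThreadLocal route, a minimal sketch could look like this (the 30000 size comes from the question; the class and method names are made up):

final class RequestBuffer {
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[30000]);

    // Returns this thread's reusable buffer, wiped so that no data
    // leaks from one request into the next.
    static byte[] acquire() {
        byte[] b = BUFFER.get();
        java.util.Arrays.fill(b, (byte) 0);
        return b;
    }
}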
Probably pooling will not help you much, if at all - possibly it will make things worse - although it depends on a number of factors (which GC you are using, how long the objects live, how much memory is available, etc.):
The time of a GC depends mostly on the number of live objects. The collector (I assume you run a vanilla Java JRE) does not visit dead objects and does not deallocate them one by one; it frees whole areas of memory after copying the live objects away (this keeps memory neat and compacted). 100 dead objects can be collected as fast as 100000. On the other hand, all the live objects must be copied - so if you, say, have a pool of 100 objects and only 50 are used at a given time, keeping the unused objects around is going to cost you.
If your arrays currently tend to live shorter than the time needed to get tenured (copied to the old generation space), there is another problem: your pooled arrays will certainly live long enough. This will produce a situation where there are a lot of references from the old generation to the young one - and GCs are optimized with the reverse situation in mind.
Actually it is quite possible that pooling arrays will make your GC SLOWER than creating new ones; this is usually the case with cheap objects.
Another cost of pooling comes from synchronizing objects across threads and cleaning them up after use. Both are trickier than they sound.
Summing up: unless you are well aware of the internals of your GC and understand how it works under the hood, AND have results from a profiler that show that managing all the arrays is a bottleneck - DO NOT POOL. In most cases it is a bad idea.
If garbage collection in your case is really a performance hit (often cleaning up the eden space does not take much time if not many objects survive), and it is easy to plug in the object pool, try it, and measure it.
This certainly depends on your application's need.
The pool would work out much better as long as you always keep a reference to it; that way the garbage collector simply ignores the pool, and it only has to be declared once (you could always declare it static to be on the safe side). It would be persistent memory, but I doubt that will be a problem for your application.
I suppose that ThreadLocal variables are allocated in Thread Local Allocation Buffers (TLABs) - am I right?
I was not successful in finding any document stating what exactly makes a class get stored in a TLAB. If you know of one, please post a link.
Actually, the explanation is right there in the blog post you linked to:
A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.
Every thread allocates memory from its own chunk of Eden, the "generation 0" part of the heap. Pretty much everything is stored in a TLAB for a period of time - quite possibly your ThreadLocals, too - but objects get moved away from there after a gen0 garbage collection. TLABs are there to make allocations faster, not to make the memory inaccessible to other threads. A more accessible description from the same blog you linked to is A little thread privacy, please.
No. Here is how it works:
As of Java 1.4, each thread has a field called threadLocals where the map is kept. Each ThreadLocal has an index into that structure, so it doesn't use hashCode(). Imagine an array where each ThreadLocal keeps a slot index.
When the thread dies and there are no more references to it, the ThreadLocals are GC'd. Very simple idea.
You can implement your own ThreadLocal(s) by extending Thread and adding a field to hold the reference; then cast the Thread to your own class and take the data. A sketch follows below.
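A minimal sketch of that approach (all names are made up for the example):

class MyThread extends Thread {
    Object myLocalData; // the hand-rolled "thread local" slot

    MyThread(Runnable target) {
        super(target);
    }
}

public class MyThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        MyThread t = new MyThread(() -> {
            // Inside the thread, cast back to reach the field.
            MyThread self = (MyThread) Thread.currentThread();
            self.myLocalData = "visible only via the cast";
            System.out.println(self.myLocalData);
        });
        t.start();
        t.join();
    }
}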
So it's not TLAB, it's still the heap like any other object.
Historically there were implementations with a static WeakHashMap, which were very slow at accessing the data.
It is my understanding that the TLAB is used for allocation of all small to medium objects. Your ThreadLocal won't be allocated any differently.
I'm pretty sure that this is up to the discretion of the JVM implementer. They could put the data in TLABs if they wanted to, or in a global table keyed by the thread ID. The Java Language Specification tends to be mute about these sorts of issues so that JVM authors can deploy Java on as many and as diverse platforms as possible.
I think only the pointer to it is, while the data itself resides in some other memory area. See http://blogs.oracle.com/jonthecollector/entry/the_real_thing and http://wikis.sun.com/display/MaxineVM/Threads#Threads-Threadlocalvariables