Pool of byte arrays for serialization

Pool of byte arrays for serialization - java

Imagine a stateless web application that processes many requests, say a thousand per sec. All data that we create inside the processing cycle will be immediately discarded at the end. The app will use serialization/deserialization a lot, so we create and discard about 50-200kb on each request.
I think it could put a big pressure on garbage collector, because GC will have to discard a large amount of short-living objects. Does it make sense to implement a pool of byte arrays to reuse them for serialization/deserialization purposes? Did anybody have such experience?

The garbage collection mechanism is built on the premise that a lot of objects created exist for a very short period of time. The first pool of objects is called the Eden Space (see this SO answer for the origin of the name) and is monitored regularly. So I would expect the garbage collector would be able to handle this.
Like most (all?) performance questions, I would measure your particular use case before applying premature optimisations. Numerous configurations are available for tuning garbage collection including parallel GC strategies (noting your 'stop the world' comment below)

Related

Does it still make sense to avoid creating objects for the sake of garbage collection?

For example in a service adapter you might:
a. have an input data model and an output data model, maybe even immutable, with different classes and use Object Mappers to transform between classes and create some short-lived objects along the way
b. have a single data model, some of the classes might be mutable, but the same object that was created for the input is also sent as output
There are other use-cases when you'd have to choose between clear code with many objects and less clear code with less objects and I would like to know if Garbage Collection still has a weight in this decision.

I should make this a comment as IMO it does not qualify as an answer, but it will not fit.
Even if the answer(s) are going to most probably be - do whatever makes your code more readable (and to be honest I still follow that all the time); we have faced this issue of GC in our code base.
Suppose that you want to create a graph of users (we had to - around 1/2 million) and load all their properties in memory and do some aggregations on them and filtering, etc. (it was not my decision), because these graph objects where pretty heavy - once loaded even with 16GB of heap the JVM would fail with OOM or GC would take huge pauses. And it's understandable - lots of data requires lots of memory, you can't run away from it. The solution proposed and that actually worked was to model that with simple BitSets - where each bit would be a property and a potential linkage to some other data; this is by far not readable and extremely complicated to maintain to this day. Lots of shifts, lots of intrinsics of the data - you have to know at all time what the 3-bit means for example, there's no getter for usernameIncome let's say - you have to do quite a lot shifts and map that to a search table, etc. But it would keep the GC pretty low, at least in the ranges where we were OK with that.
So unless you can prove that GC is taken your app time so much - you probably are even safer simply adding more RAM and increasing it(unless you have a leak). I would still go for clear code like 99.(99) % of the time.

Newer versions of Java have quite sophisticated mechanisms to handle very short-living objects so it's not as bad as it was in the past. With a modern JVM I'd say that you don't need to worry about garbage collection times if you create many objects, which is a good thing since there are now many more of them being created on the fly that this was the case with older versions of Java.
What's still valid is to keep the number of created objects low if the creation is coming with high costs, e.g. accessing a database to retrieve data from, network operations, etc.

As other people have said I think it's better to write your code to solve the problem in an optimum way for that problem rather than thinking about what the garbage collector (GC) will do.
The key to working with the GC is to look at the lifespan of your objects. The heap is (typically) divided into two main regions called generations to signify how long objects have been alive (thus young and old generations). To minimise the impact of GC you want your objects to become eligible for collection while they are still in the young generation (either in the Eden space or a survivor space, but preferably Eden space). Collection of objects in the Eden space is effectively free, as the GC does nothing with them, it just ignores them and resets the allocation pointer(s) when a minor GC is finished.
Rather than explicitly calling the GC via System.gc() it's much better to tune your heap. For example, you can set the size of the young generation using command line options like -XX:NewRatio=n, where n signifies the ratio of new to old (e.g. setting it to 3 will make the ratio of new:old 1:3 so the young generation will be 1 quarter of the heap). Alternatively, you can set the size explicitly using -XX:NewSize=n and -XX:MaxNewSize=m. The GC may resize the heap during collections so setting these values to be the same will keep it at a fixed size.
You can profile your code to establish the rate of object creation and how long your objects typically live for. This will give you the information to (ideally) configure your heap to minimise the number of objects being promoted into the old generation. What you really don't want is objects being promoted and then becoming garbage shortly thereafter.
Alternatively, you may want to look at the Zing JVM from Azul (full disclosure, I work for them). This uses a different GC algorithm, called C4, which enables compaction of the heap concurrently with application threads and so eliminates most of the impact of the GC on application latency.

What does influence the most on the time of creating an object?

I know that creating an object takes time and that's why the flyweight pattern exists.
What I would like to know is what increases the time of creating a single object the most?
I thought it might be the search of the a slightly larger space in the memory, but I guess it is only slightly larger than each of the fields the object has. Then maybe it is the travel to the correct address in the memory while we are looking for a value of a specific field, but then again: the only thing we added is looking for the address of the object.

There are 3 ways object creation is costly:
1) the object allocation. This is actually pretty cheap (like some nanos), however take into account that
many objects have "embedded" objects which are implicitely also allocated and
Additionally often the time of the constructor running (initializing the object) is more costly than the actual allocation.
2) any allocation consumes Eden space, so the higher the allocation rate, the more CPU is consumed by GC (NewGen GC runs more frequent)
3) CPU caches. If you allocate temporary objects (e.g. Integer when putting to HashMap, those temp objects are are put in the L1 cache evicting some other data. If you use it only once, this does not payoff. Therefore high allocation rate (especially temporarys/immutables) lead to cache misses, causing significant slowdown in case (depending on what the app is actually trying to achieve).
Another issue is life cycle. The VM can handle best short lived or very long lived objects. If your application creates a lot of middle-age-dying objects (e.g. cache's), you will get more frequent Full GC's.
Regarding flyweight patterns. It depends. If its a very smallish object, flyweight frequently will not pay off. However if your usage patterns involves many allocations of the flyweight candidate obejct, flyweight'ing will pay off. That's the reason hotspot caches 10.000 Integer objects internally by default

In modern JVMs the object creation is not as costly as it were. It mostly needs to bump the pointer. In fact, in modern JVMs, many objects are actually secretly allocated on the machine stack, and that's basically free- it takes no time at all.
And regarding flyweight pattern: flyweight pattern is not used as the object creation is costly rather it is used to minimize memory use by sharing as much data as possible with other similar objects.

Repetitive allocation of same-size byte arrays, replace with pools?

As part of a memory analysis, we've found the following:
percent live alloc'ed stack class
rank self accum bytes objs bytes objs trace name
3 3.98% 19.85% 24259392 808 3849949016 1129587 359697 byte[]
4 3.98% 23.83% 24259392 808 3849949016 1129587 359698 byte[]
You'll notice that many objects are allocated, but few remain live. This is for a simple reason - the two byte arrays are allocated for each instance of a "client" that is generated. Clients are not reusable - each one can only handle one request and is then thrown away. The byte arrays always have the same size (30000).
We're considering moving to a pool (apache's GenericObjectPool) of byte arrays, as normally there are a known number of active clients at any given moment (so the pool size shouldn't fluctuate much). This way, we can save on memory allocation and garbage collection. The question is, would the pool cause a severe CPU hit? Is this idea a good idea at all?
Thanks for your help!

I think there are good gc related reasons to avoid this sort of allocation behaviour. Depending on the size of the heap & the free space in eden at the time of allocation, simply allocating a 30000 element byte[] could be a serious performance hit given that it could easily be bigger than the TLAB (hence allocation is not a bump the pointer event) & there may even not be enough space in eden available hence allocation directly into tenured which in itself likely to cause another hit down the line due to increased full gc activity (particularly if using cms due to fragmentation).
Having said that, the comments from fdreger are completely valid too. A multithreaded object pool is a bit of a grim thing that is likely to cause headaches. You mention they handle a single request only, if this request is serviced by a single thread only then a ThreadLocal byte[] that is wiped at the end of the request could be a good option. If the request is short lived relatively to your typical young gc period then the young->old reference issue may not be a big problem (as the probability of any given request being handled during a gc is small even if you're guaranteed to get this periodically).

Probably pooling will not help you much if at all - possibly it will make things worse, although it depends on a number of factors (what GC are you using, how long the objects live, how much memory is available, etc.):
The time of GC depends mostly on the number of live objects. Collector (I assume you run a vanilla Java JRE) does not visit dead objects and does not deallocate them one by one. It frees whole areas of memory after copying the live objects away (this keeps memory neat and compacted). 100 dead objects can collect as fast as 100000. On the other hand, all the live objects must be copied - so if you, say, have a pool of 100 objects and only 50 are used at a given time, keeping the unused object is going to cost you.
If your arrays currently tend to live shorter than the time needed to get tenured (copied to the old generation space), there is another problem: your pooled arrays will certainly live long enough. This will produce a situation where there is a lot of references from old generation to young - and GCs are optimized with a reverse situation in mind.
Actually it is quite possible that pooling arrays will make your GC SLOWER than creating new ones; this is usually the case with cheap objects.
Another cost of pooling comes from synchronizing objects across threads and cleaning them up after use. Both are trickier than they sound.
Summing up, unless you are well aware of the internals of your GC and understand how it works under the hood, AND have a results from a profiler that show that managing all the arrays is a bottleneck - DO NOT POOL. In most cases it is a bad idea.

If garbage collection in your case is really a performance hit (often cleaning up the eden space does not take much time if not many objects survive), and it is easy to plug in the object pool, try it, and measure it.
This certainly depends on your application's need.

The pool would work out much better as long as you always have a reference to it, this way the garbage collector simply ignores the pool and will only be declared once (you could always declare it static to be on the safe side). Although it would be persistent memory but I doubt that will be a problem for your application.

Thread-local object pooling

Going through the Goetz "Java Concurrency in Practice" book, he makes a case against using object pooling (section 11.4.7) - main arguments:
1) allocation in Java is faster than C's malloc
2) threads requesting objects from a pool require costly synchronization
My problem is not so much that allocation is slow, but that periodic garbage collection introduces outliers in response time that could be eliminated by reducing object pools.
Are there any issues that I am not seeing in using this approach? Essentially I am partitioning an object pool across the threads...

If its thread local then you can forget about this:
2) threads requesting objects from a pool require costly synchronization
Being thread-local you need not worry about synchronization to retrieve from the pool itself.

(sun's) GC scans live objects. the assumption is that there are way more dead objects than live objects in a typical java program runtime. it marks live objects, and dispose the rest.
if you cache a lot of objects, they are all live. and if you have several GBs of such objects, GC is going to waste a lot of time scanning them in vain. long GC pauses can paralyze your application.
cache something just to make it non-garbage is not helping GC.
that's not to say caching is wrong. if you have 15G memory, and your database is 10G, why not cache everything in memory, so responses are lighting fast. note this is to cache something that would otherwise be slow to fetch.
to prevent GC from fruitlessly scanning the 10G cache, the cache must be outside GC's control. For example, use 'memcached" which lives in another process, and has its own cache-optimized GC.
the latest news is Terracotta's BigMemory which is a pure java solution that does similar thing.
an example of thread local pooling is sun's direct ByteBuffer pooling. when we call
channel.read(byteBuffer)
if byteBuffer is not "direct", a "direct" one must be allocated under the hood, used to communicate data with OS. in a network application, such allocations could be very frequent, it seems to be a waste, to discard a just allocated one, and immediately allocate another one in the next statement. sun's engineers, apparently don't trust GC that much, created a thread local pool of "direct" ByteBuffers.

In Java 1.4, object allocation was relatively expensive so Object pools for even simple objects could help. However, in Java 5.0, Object allocation was significantly improved, however synchronization still had a way to go meaning that object allocation was faster than synchronization. i.e. removing object pools improved performance in many cases. In Java 6, synchronization has improved to the point where an object pool can make a little difference to performance in simple cases.
Avoiding simple object pools is a good idea because it is simpler, not for performance reasons.
For more complex/larger objects, object pools can be useful in Java 6, even if you use synchronization. e.g. a Socket, File stream, or Database connection.

I think your case is reasonable situation to use pooling. There is no evil in pooling, Goetz means that you should not use it when it is not necessary. Another example is connection pooling, because creation of connection is very expensive.

If it is threadlocal, it's very likely you may not even need pooling. Of course it would depend on the use cases, but the chances are, on a given thread you will likely need only one object of that type at a given time.
The caveat with threadlocals, however, is memory management. Note that threadlocal values don't go away easily until the thread that owns those threadlocals go away. Therefore, if you have a large number of threads and a large number of threadlocals, they may contribute to used memory quite a bit.

I'd definitely try it out. Although is now "common knowledge" that one should not care about object creation, in fact there may be a lot of performance gained from using object pools and specific classes. For a file processing framework, I gained 5% read performance from pooling object[] objects.
So try it out and time your executions to see if you gain anything.

Even it's an old question, point of 2 threads requesting objects from a pool require costly synchronization does not completely hold true.
It's possible to write a concurrent (no synchronization) object pool that doesn't even exhibit sharing (even false sharing) on the fast path. In the simplistic case, of course, each thread might have its own pool (more like an associated object) but then such a greedy approach can lead to resource waste (or starvation/error if the resource cannot be allocated)
Pools are good for heavy objects like ByteBuffers, esp. direct ones, connections, sockets, threads, etc. Overall any objects that require non-java intervention.

Why do finalizers have a "severe performance penalty"?

Effective Java says :
There is a severe performance penalty for using finalizers.
Why is it slower to destroy an object using the finalizers?

Because of the way the garbage collector works. For performance, most Java GCs use a copying collector, where short-lived objects are allocated into an "eden" block of memory, and when the it's time for that generation of objects to be collected, the GC just needs to copy the objects that are still "alive" to a more permanent storage space, and then it can wipe (free) the entire "eden" memory block at once. This is efficient because most Java code will create many thousands of instances of objects (boxed primitives, temporary arrays, etc.) with lifetimes of only a few seconds.
When you have finalizers in the mix, though, the GC can't simply wipe an entire generation at once. Instead, it needs to figure out all the objects in that generation that need to be finalized, and queue them on a thread that actually executes the finalizers. In the meantime, the GC can't finish cleaning up the objects efficiently. So it either has to keep them alive longer than they should be, or it has to delay collecting other objects, or both. Plus you have the arbitrary wait time of actually executing the finalizers.
All these factors add up to a significant runtime penalty, which is why deterministic finalization (using a close() method or similar to explicitly finalize the object's state) is usually preferred.

Having actually run into one such problem:
In the Sun HotSpot JVM, finalizers are processed on a thread that is given a fixed, low priority. In a high-load application, it's easy to create finalization-required objects faster than the low-priority finalization thread can process them. Meanwhile, the space on the heap used by the finalization-pending objects is unavailable for other uses. Eventually, your application may spend all of its time garbage collecting, because all of the available memory is in use by objects pending finalization.
This is, of course, in addition to the other many reasons to not use finalizers that are described in Effective Java.

I just picked up my copy Effective Java off my desk to see what he's referring to.
If you read Chapter 2, Section 6, he goes into good detail about the various performance hits.
You can't know when the finalizer will run, or even if it will at all. Because those resources may never be claimed, you will have to run with fewer resources.
I would recommend reading the entirety of the section - it explains things much better than I can parrot here.

If you read the documentation of finalize() closely, you will notice that finalizers enable an object to prevent being collected by the GC.
If no finalizer is present, the object simply can be removed and does not need any more attention. But if there is a finalizer, it needs to be checked afterwards, if the object didn't become "visible" again.
Without knowing exactly how the current Java garbage collection is implemented (actually, because there are different Java implementations out there, there are also different GCs), you can assume that the GC has to do some additional work if an object has a finalizer, because of this feature.

My thought is this:
Java is a garbage collected language, which deallocates memory based on its own internal algorithms. Every so often, the GC scans the heap, determines which objects are no longer referenced, and de-allocates the memory.
A finalizer interrupts this and forces the deallocation of memory outside of the GC cycle, potentially causing inefficiencies.
I think best practices are to use finalizers only when ABSOLUTELY necessary such as freeing file handles or closing DB connections which should be done deterministically.

One reason I can think of is that explicit memory cleanup is unnecessary if your resources are all Java Objects, and not native code.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.