How do I efficiently cache objects in Java using available RAM?

I need to cache objects in Java using a proportion of whatever RAM is available. I'm aware that others have asked this question, but none of the responses meet my requirements.
My requirements are:
Simple and lightweight
Not dramatically slower than a plain HashMap
Use LRU, or some deletion policy that approximates LRU
I tried LinkedHashMap; however, it requires you to specify a maximum number of elements, and I don't know how many elements it will take to fill up the available RAM (their sizes will vary significantly).
My current approach is to use Google Collection's MapMaker as follows:
Map<String, Object> cache = new MapMaker().softKeys().makeMap();
This seemed attractive as it should automatically delete elements when it needs more RAM; however, there is a serious problem: its behavior is to fill up all available RAM, at which point the GC begins to thrash and the whole app's performance deteriorates dramatically.
I've heard of stuff like EHCache, but it seems quite heavy-weight for what I need, and I'm not sure if it is fast enough for my application (remembering that the solution can't be dramatically slower than a HashMap).

I've got similar requirements to you - concurrency (on 2 hexacore CPUs) and LRU or similar - and also tried Guava MapMaker. I found softValues() much slower than weakValues(), but both made my app excruciatingly slow when memory filled up.
I tried WeakHashMap and it was less problematic, oddly even faster than using LinkedHashMap as an LRU cache via its removeEldestEntry() method.
But by far the fastest for me is ConcurrentLinkedHashMap, which made my app 3-4 (!!) times faster than any other cache I tried. Joy, after days of frustration! It's apparently been incorporated into Guava's MapMaker, but the LRU feature isn't in Guava r07 at any rate. Hope it works for you.
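In case it helps, this is roughly how I wired it up via the builder (the builder API below is from the com.googlecode.concurrentlinkedhashmap 1.x releases, so treat the exact method names as an assumption if you are on a different version):

import java.util.concurrent.ConcurrentMap;
import com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap;

public class ClhmCacheExample {
    public static void main(String[] args) {
        // Bounded, concurrent, approximately-LRU map. The capacity is an entry count,
        // so you still have to pick a number (e.g. derived from an average entry size).
        ConcurrentMap<String, Object> cache =
                new ConcurrentLinkedHashMap.Builder<String, Object>()
                        .maximumWeightedCapacity(100000)  // evicts least-recently-used entries beyond this
                        .build();

        cache.put("key", new Object());
        Object hit = cache.get("key");  // a read marks the entry as recently used
        System.out.println(hit != null);
    }
}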

I've implemented several caches, and it's probably as difficult as implementing a new data source or thread pool. My recommendation: use JBoss Cache or another well-known caching library.
That way you will sleep well, without issues.

I've heard of stuff like EHCache, but it seems quite heavy-weight for what I need, and I'm not sure if it is fast enough for my application (remembering that the solution can't be dramatically slower than a HashMap).
I really don't know if one can say that EHCache is heavy-weight. At least, I do not consider EHCache as such, especially when using a Memory Store (which is backed by an extended LinkedHashMap and is of course the fastest caching option). You should give it a try.
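As a rough illustration, a minimal in-memory-only cache with the Ehcache 2.x API could look like the sketch below (the constructor arguments are written from memory, so double-check them against the version you use):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheMemoryStoreExample {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
        Cache cache = new Cache("objects", 100000, false, false, 3600, 1800);
        manager.addCache(cache);

        cache.put(new Element("key", new Object()));
        Element element = cache.get("key");
        Object value = (element != null) ? element.getObjectValue() : null;
        System.out.println(value != null);

        manager.shutdown();
    }
}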

I believe MapMaker is going to be the only reasonable way to get what you're asking for. If "the GC begins to thrash and the whole app's performance deteriorates dramatically," you should spend some time properly setting the various tuning parameters. This document may seem a little intimidating at first, but it's actually written very clearly and is a goldmine of helpful information about GC:
https://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-150215.pdf

I don't know if this would be a simple solution, especially compared with EHCache or similar, but have you looked at the Javolution library? It is not designed for caching as such, but in the javolution.context package they have an Allocator pattern which can reuse objects without the need for garbage collection. This way they keep object creation and garbage collection to a minimum, an important feature for real-time programming. Perhaps you should take a look and try to adapt it to your problem.

This seemed attractive as it should automatically delete elements when it needs more RAM; however, there is a serious problem: its behavior is to fill up all available RAM
Using soft keys just allows the garbage collector to remove objects from the cache when no other objects reference them (i.e., when the only thing referring to the cache key is the cache itself). It does not guarantee any other kind of expulsion.
Most solutions you find will be features added on top of the java Map classes, including EhCache.
Have you looked at the commons-collections LRUMap?
Note that there is an open issue against MapMaker to provide LRU/MRU functionality. Perhaps you can voice your opinion there as well.

Using your existing cache, store WeakReference values rather than normal object references.
If the GC starts running out of free space, the objects held by the WeakReferences will be released.
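A minimal sketch of that idea using plain java.lang.ref; I use SoftReference here, which the GC clears only under memory pressure, and you can swap in WeakReference for more aggressive collection:

import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SoftValueCache<K, V> {
    private final ConcurrentMap<K, SoftReference<V>> map = new ConcurrentHashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();          // null if the GC has already reclaimed the object
        if (value == null) {
            map.remove(key, ref);     // drop the stale entry so the map doesn't grow forever
        }
        return value;
    }
}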

In the past I have used JCS. You can set up the configuration to try and meet you needs. I'm not sure if this will meet all of your requirements/needs but I found it to be pretty powerful when I used it.

You cannot "delete elements" you can only stop to hard reference them and wait for the GC to clean them, so go on with Google Collections...

I'm not aware of an easy way to find out an object's size in Java. Therefore, I don't think you'll find a way to limit a data structure by the amount of RAM it's taking.
Based on this assumption, you're stuck with limiting it by the number of cached objects. I'd suggest running simulations of a few real-life usage scenarios and gathering statistics on the types of objects that go into the cache. Then you can calculate the statistical average size and the number of objects you can afford to cache. Even though it's only an approximation of the amount of RAM you want to dedicate to the cache, it might be good enough.
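A minimal sketch of that approach: estimate an average entry size from your simulations, derive an entry cap from the share of the heap you are willing to spend, and let LinkedHashMap's access order plus removeEldestEntry() handle LRU eviction (the 25% heap share and 2 KB average below are made-up numbers, and the map is not thread-safe as written):

import java.util.LinkedHashMap;
import java.util.Map;

public class ApproximateLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public ApproximateLruCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the least recently used entry when over the cap
    }

    public static void main(String[] args) {
        long cacheBudgetBytes = Runtime.getRuntime().maxMemory() / 4;  // assumption: spend ~25% of heap
        long averageEntryBytes = 2048;                                 // assumption: from profiling/simulation
        int maxEntries = (int) (cacheBudgetBytes / averageEntryBytes);

        Map<String, Object> cache = new ApproximateLruCache<String, Object>(maxEntries);
        cache.put("a", new Object());
        System.out.println(cache.size());
    }
}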
As to the cache implementation, in my project (a performance-critical application) we're using EhCache, and personally I don't find it to be heavyweight at all.
In any case, run several tests with several different configurations (regarding size, eviction policy etc.) and find out what works best for you.

For caching, SoftReference is probably the best way I can think of so far.
Or you can reinvent an object pool: objects you are not currently using don't need to be destroyed, they can be handed back out for reuse. But that saves CPU rather than memory.
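If you do go the object-pool route, a tiny JDK-only sketch could look like this (class names are hypothetical); the point is to reuse instances instead of allocating new ones:

import java.util.ArrayDeque;
import java.util.Deque;

public class SimplePool<T> {
    public interface Factory<T> { T create(); }

    private final Deque<T> free = new ArrayDeque<T>();
    private final Factory<T> factory;

    public SimplePool(Factory<T> factory) {
        this.factory = factory;
    }

    public synchronized T acquire() {
        T obj = free.poll();
        return (obj != null) ? obj : factory.create();  // reuse if possible, otherwise allocate
    }

    public synchronized void release(T obj) {
        free.push(obj);  // caller promises not to touch obj after releasing it
    }
}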

Assuming you want the cache to be thread-safe, then you should examine the cache example in Brian Goetz's book "Java Concurrency in Practice". I can't recommend this highly enough.

Related

How to make an ideal cache?

In an application we are using an LRU (Least Recently Used) cache (a ConcurrentHashMap implementation) with a max-size constraint. I wonder whether I could improve the performance of this cache. Following are a few alternatives I found on the net:
Using the Google Guava pool library (since my implementation uses LRU, I don't see any benefit from the Guava library).
If I wrap the objects as soft references and store them as values in the LRU map (without any size constraint), can I see any benefit? (This is not an ideal way of caching: after a major GC run, all the soft references will be garbage collected.)
How about using a hybrid pool which combines an LRU map with a soft-reference map? (The idea is that whenever an object is pruned from the LRU map, it is stored in the soft-reference map. With this approach we can keep more objects in the cache, but it might be time consuming; a rough sketch of this idea follows the question.)
Are there any other methods to improve the performance of cache?
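Here is a rough, untested sketch of the hybrid idea (the class name is hypothetical; the LRU tier is a synchronized, access-ordered LinkedHashMap, and the overflow tier holds SoftReferences that the GC may clear at any time):

import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HybridLruSoftCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> overflow =   // GC-collectable cold tier
            new ConcurrentHashMap<K, SoftReference<V>>();
    private final Map<K, V> lru;                                      // strongly-referenced hot tier

    public HybridLruSoftCache(final int maxLruEntries) {
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxLruEntries) {
                    // demote the pruned entry to the soft-reference tier instead of losing it outright
                    overflow.put(eldest.getKey(), new SoftReference<V>(eldest.getValue()));
                    return true;
                }
                return false;
            }
        };
    }

    public synchronized void put(K key, V value) {
        lru.put(key, value);
    }

    public synchronized V get(K key) {
        V value = lru.get(key);
        if (value != null) {
            return value;
        }
        SoftReference<V> ref = overflow.remove(key);
        value = (ref != null) ? ref.get() : null;
        if (value != null) {
            lru.put(key, value);   // promote back into the hot tier on a hit
        }
        return value;
    }
}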
First of all, welcome to the club of cache implementers and improvers!
Don't use LRU. There are a lot of algorithms that are better than LRU and that are by now more than 10 years old. As a start, read these relevant research papers:
Wikipedia: LIRS Caching Algorithm
Wikipedia: Adaptive Replacement Cache
Within these papers you find also more basic papers about the idea of adaptive caching.
(e.g. 2Q, LRFU, LRU-K).
Wrapping objects: It depends on what you want to achieve. You actually have at least three additional objects per cache entry: the hashtable entry, the WeakReference object, and the cache entry object. With this approach you increase the memory footprint, and if the cache efficiency is low, e.g. because of short expiry, you get a lot of GC thrashing.
Adapt to available memory: If you want to adapt to the available memory, it is better to evict entries when memory runs low. This way you evict entries that are used very seldom, rather than evicting at random. However, this approach requires more coding. EHCache with Auto Resource Control has implemented something like this.
The reference wrappers are a nice and easy way to use more memory for the cache while avoiding low-heap conditions, but they are not high performance in terms of overall memory efficiency and access times.
Measure it! Whether you get a "performance improvement" depends heavily on the usage scenario. Ideally you need to know and consider the access pattern, the cached object sizes, expiry constraints and the expected parallelism. I put together a benchmark suite that you can find on GitHub: cache2k benchmarks.
However, for now, these benchmarks just focus on replacement policy efficiency and access times. Comparison of memory overhead and possible parallelism is missing. This will be added within roughly the next half year. The benchmark results are available on the cache2k benchmark results page.
If you are generally interested in the topic and do some research in the field consider contributing to cache2k. I am especially happy about more benchmarking code, or descriptions of usage scenarios and traces of access patterns to optimize and improve the replacement algorithm.

How B-Tree works in term of serialisation?

In Java, I know that if you are going to build a B-Tree index on hard disk, you probably should use serialisation where the B-Tree structure has to be written from RAM to HD. My question is, if later I'd like to query the value of a key from the index, is it possible to deserialise just part of the B-Tree back into RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index into RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it would be great if someone could provide some code. How do DBMSs do this, either in Java or C?
Thanks in advance.
you probably should use serialisation where the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.
You generally split the B-tree into several "pages," each with some of the key-value pairs, etc. Then you only need to load one page into memory at a time.
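A minimal sketch of the page idea with RandomAccessFile (the 4 KB fixed page size and the class name are made up; a real node format would also store key counts, child pointers, etc.):

import java.io.IOException;
import java.io.RandomAccessFile;

public class PagedFile {
    private static final int PAGE_SIZE = 4096;   // assumption: every B-tree node occupies one 4 KB page

    private final RandomAccessFile file;

    public PagedFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    /** Reads exactly one page into memory; only that node gets decoded, not the whole tree. */
    public byte[] readPage(long pageNumber) throws IOException {
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNumber * PAGE_SIZE);
        file.readFully(page);                    // throws EOFException if the page doesn't exist
        return page;
    }

    /** Writes one page back after a node has been modified in memory. */
    public void writePage(long pageNumber, byte[] page) throws IOException {
        file.seek(pageNumber * PAGE_SIZE);
        file.write(page, 0, PAGE_SIZE);
    }

    public void close() throws IOException {
        file.close();
    }
}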
For inspiration on how RDBMSs do it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSql, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server: the RDBMS code is part of the application's classpath, and the memory footprint is modest.
If that is a possibility for you, of course...
If the tree can easily fit into memory, I'd strongly advise keeping it there. The difference in performance will be huge. Not to mention the difficulty of keeping changes in sync on disk, reorganizing, etc.
When at some point you need to store it, check Externalizable instead of regular serialization. Serialization is notoriously slow and bulky, while Externalizable allows you to control each byte written to disk. Not to mention the difference in performance when reading the index back into memory.
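A small sketch of the Externalizable route (the node fields here are hypothetical): you write exactly the bytes you need instead of letting default serialization describe the whole object graph:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class IndexEntry implements Externalizable {
    private long key;
    private long valuePointer;   // e.g. offset of the record on disk

    public IndexEntry() {        // public no-arg constructor is required by Externalizable
    }

    public IndexEntry(long key, long valuePointer) {
        this.key = key;
        this.valuePointer = valuePointer;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(key);          // just the two longs, no per-field metadata
        out.writeLong(valuePointer);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        key = in.readLong();
        valuePointer = in.readLong();
    }
}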
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of in-memory caching, so that frequently accessed items are still served from memory. But then you'll need to take updates to the index into account. You'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch. But rather use the code that's out there. :-)

Keep serialized and compressed Objects in-memory

I'm currently working on a part of an application where "a lot" of data must be selected for further work, and I have the impression that the I/O is the limiting factor rather than the subsequent work.
My idea now is to keep all these objects in memory, but serialized and compressed. The question is whether accessing the objects like this would be faster than direct database access, and whether it is a good idea at all (and whether it is feasible in terms of memory consumption, i.e. whether the serialized form uses less memory than the normal object).
EDIT February 2011:
The creation of the objects is the slow part, not the database access itself. Having everything in memory is not possible, and using the ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard Java serialization is also unusable; it is a lot slower as well. So basically there is nothing I can do about it...
You're basically looking for an in-memory cache or an in-memory data grid. There are plenty of APIs/products for this sort of thing: Ehcache, Hibernate cache, GridGain, etc.
The compressed serialized form will use less memory if it is a large object. However, for smaller objects, e.g. ones that mostly use primitives, the original object will be much smaller.
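For a quick feasibility check, here is a minimal JDK-only sketch that serializes an object, gzips it into the byte[] you would actually keep in memory, and inflates it again on access; comparing the byte[] length against your real objects tells you whether the idea pays off:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedValue {

    /** Serializes and gzips an object into the byte[] that is actually kept in the cache. */
    public static byte[] compress(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes));
        out.writeObject(value);
        out.close();                         // flushes and finishes the GZIP stream
        return bytes.toByteArray();
    }

    /** Inflates and deserializes the cached bytes back into a usable object. */
    public static Object decompress(byte[] data) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }
}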
I would first check whether you really need to do this. E.g. can you just consume more memory, or restructure your objects so they use less memory?
"I have the impression that the I/O is limiting and not the following work." -> I would be very sure of this before starting to implement such a thing.
The simpler approach I can suggest is to use ehcache with the option to overflow to disk when the size of the cache gets too big.
Another completely different approach could be to use a document-based NoSQL DB like CouchDB to store the objects selected "for further work".

Implementing a file based queue

I have an in memory bounded queue in which multiple threads queue objects. Normally the queue should be emptied by a single reader thread that processes the items in the queue.
However, there is a possibility that the queue fills up. In such a case I would like to persist any additional items to disk, to be processed by another background reader thread that scans a directory for such files and processes the entries within them. I am familiar with ActiveMQ but prefer a more lightweight solution. It is OK if FIFO is not strictly followed (since the persisted entries may be processed out of order).
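Here is a minimal JDK-only sketch of the pattern I have in mind (names are hypothetical, no error handling): offer to the bounded in-memory queue first, and spill to a file in a spool directory only when the queue is full.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.UUID;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SpillingQueue<T extends Serializable> {
    private final BlockingQueue<T> memoryQueue;
    private final File spoolDir;

    public SpillingQueue(int capacity, File spoolDir) {
        this.memoryQueue = new ArrayBlockingQueue<T>(capacity);
        this.spoolDir = spoolDir;
        spoolDir.mkdirs();
    }

    /** Producers call this; items that don't fit in memory are written to the spool directory. */
    public void put(T item) throws IOException {
        if (!memoryQueue.offer(item)) {              // non-blocking: false means the queue is full
            File spillFile = new File(spoolDir, UUID.randomUUID() + ".item");
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(spillFile));
            try {
                out.writeObject(item);               // the background reader scans spoolDir later
            } finally {
                out.close();
            }
        }
    }

    /** The primary reader thread drains the in-memory queue as usual. */
    public T take() throws InterruptedException {
        return memoryQueue.take();
    }
}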
Are there any open source solutions out there? I did not find any but thought I would ping this list for suggestions before I embark on the implementation myself.
Thank you!
Take a look at http://square.github.io/tape/, and its impressive QueueFile.
(thanks to Brian McCallister's "The Long Tail Treasure Trove" for pointing me at that).
You could use something like SQLite to store the objects in.
EHCache can overflow to disk. It's also highly concurrent, though you don't really need that.
Why is the queue bounded? Why not use a dynamically expandable data structure? That seems much simpler than involving the disk.
Edit:
It's hard to answer your question without more context.
Can you clarify what you mean by "run out of memory"? How big is the queue? How much memory do you have?
Are you on an embedded system with very little memory? Or do you have 2 GB or more of stuff in the queue?
If either is true, you really ought to use a "swappable" data structure like a BTree. Implementing one yourself for one queue seems like overkill; I would just use an embedded database like SQLite.
If neither of those is true, then just use a vector or a linked list.
Edit 2:
You probably don't need a BTree or a database. You could just use a linked list of pages. But again, I have to ask: is this necessary?
Or, if you are willing to process things non serially, why not have multiple reader threads all the time?
Ultimately though I don't think your proposal is the way to go.
You could embed Berkeley DB Java Edition to keep the queue elements in files.
You can look at a working example here:
http://sysgears.com/articles/lightweight-fast-persistent-queue-in-java-using-berkley-db
Hope this helps
MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap-memory. It is a fast and easy to use embedded Java database engine.
https://github.com/jankotek/MapDB
http://www.mapdb.org/
The most performant and GC-friendly solution I've found so far is Chronicle Queue.
It has extremely low write latency, on the order of tens of nanoseconds, several orders of magnitude lower than MapDB or SQLite.

determining java memory usage

Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (Java Specialists' Newsletter and something from Javaworld) and the built-in method java.lang.instrument.Instrumentation.getObjectSize(), which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
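The core of the Instrumentation approach looks roughly like this (the class and jar names are made up; you also need a Premain-Class entry in the agent jar's manifest and a -javaagent flag on the command line):

import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {
    private static volatile Instrumentation instrumentation;

    /** Called by the JVM before main() when the jar is attached with -javaagent. */
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    /** Shallow size of one object (header + fields), not including objects it references. */
    public static long sizeOf(Object obj) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent not loaded; run with -javaagent:objectsize-agent.jar");
        }
        return instrumentation.getObjectSize(obj);
    }
}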
As a rough guide in Hotspot on 32 bit machines:
objects use 8 bytes for "housekeeping"
fields use what you'd expect them to use given their bit length (though booleans tend to be allocated an entire byte)
object references use 4 bytes
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus, if you're counting this, the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size grows by 1.5x whenever the list gets full. But either by asking the VM, or by applying the above metrics while looking at the collections' source code and "working it through", you will essentially get the answer.
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned though: this can require double the amount of memory for that object.
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers. Other Question about Memory
I believe the profiler included in Netbeans can monitor memory usage as well; you can try that.
