Java - Cache HashMap with dynamic object size

I have a HashMap whose values are Node objects, where Node is a class containing some info and, most importantly, a neighbors HashMap.
My algorithm does random gets from the outer HashMap and puts into the inner HashMaps, e.g.:
Node node = map.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
I have lots of entries in both the outer map and inner maps. The full dataset cannot be handled by memory alone, so I need to use some disk caching. I would like to use memory as much as possible (until almost full, or until it reaches a threshold I specify) and then evict entries to disk.
I have tried several caches (mostly MapDB and Ehcache) to solve my problem, but with no luck. I set a maximum memory size, but the cache just ignores it.
I am almost certain that the problem lies in the fact that my objects have a dynamic size.
Anyone got any ideas on how I could handle this problem?
Thanks in advance.

Note: I work on Ehcache
Ehcache, like most other caching products, cannot know that your object is growing unless you let it know by updating the mapping from the outside:
Node node = cache.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
cache.put(someInt, node);
In that context, Ehcache will properly track the object growth and trigger memory eviction.
Note that Ehcache no longer uses an overflow model between the heap and disk but instead always stores the mapping on disk while keeping a hotset on heap.
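For completeness, here is a rough sketch of how such a heap-plus-disk setup could be configured with the Ehcache 3 builders; the cache name, sizes and data directory below are purely illustrative, and Node would need to be Serializable for the disk tier:

import org.ehcache.Cache;
import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

// The heap tier keeps the hotset; the full mapping always lives on disk.
PersistentCacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
    .with(CacheManagerBuilder.persistence("/tmp/graph-cache"))
    .withCache("nodes",
        CacheConfigurationBuilder.newCacheConfigurationBuilder(
            Integer.class, Node.class,
            ResourcePoolsBuilder.newResourcePoolsBuilder()
                .heap(512, MemoryUnit.MB)
                .disk(8, MemoryUnit.GB, true)))
    .build(true);

Cache<Integer, Node> cache = cacheManager.getCache("nodes", Integer.class, Node.class);

Node node = cache.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
cache.put(someInt, node); // re-put so Ehcache can re-measure the grown object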

Related

Alternative to HashMap as a cache, other than in-memory databases?

I am using a HashMap as a cache to store ids and names, because they are frequently used and live through the lifetime of the application.
For every user of the application, around 5,000 or more (depending on the workspace) ids and names get stored in the HashMap. At some point a java.lang.OutOfMemoryError gets thrown, since I am saving a lot of (id, name) pairs in the HashMap.
I don't want to clear my HashMap cache values, but I know that to be efficient we have to evict entries using an LRU approach or something similar.
Note: I don't want to use Redis, Memcached, or any in-memory key-value store.
Use case: Slack returns an id in place of the user name in every message.
For example: Hello #john doe comes back as Hello #dxap123.
I don't want an API hit for every message to get the user name.
Can somebody suggest a more efficient approach, or correct me if I am doing something wrong in my current one?
Like others have said, 5,000 entries shouldn't give you an OutOfMemoryError, but if you don't put a limit on the size of the map you will eventually run out of memory. You should cache the most recently or most frequently used values to bound the size of the map.
The Google Guava library has cache implementations which I think would fit your use case:
https://github.com/google/guava/wiki/CachesExplained
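A minimal sketch of what that could look like with a bounded, self-loading Guava cache; fetchUserNameFromSlackApi is a hypothetical stand-in for however you resolve an id to a name, and the size and expiry values are arbitrary:

import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Keeps at most 10,000 id -> name mappings, evicting the least recently used.
LoadingCache<String, String> userNames = CacheBuilder.newBuilder()
    .maximumSize(10_000)
    .expireAfterAccess(12, TimeUnit.HOURS)
    .build(new CacheLoader<String, String>() {
        @Override
        public String load(String userId) throws Exception {
            return fetchUserNameFromSlackApi(userId); // only called on a cache miss
        }
    });

String name = userNames.getUnchecked("dxap123"); // loaded once, then served from memory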
For 5,000 key-value pairs, it should not throw an OutOfMemoryError. If it is, you are not managing the HashMap properly. If you have many more items and want an alternative to HashMap, you can use Ehcache, a widely adopted Java cache with tiered storage options, instead of going with in-memory cache technologies.
The memory areas supported by Ehcache include:
On-Heap Store: Uses the Java heap memory to store cache entries and shares the memory with the application. The cache is also subject to garbage collection. This memory is very fast, but also very limited.
Off-Heap Store: Uses the RAM to store cache entries. This memory is not subject to garbage collection. Still quite fast memory, but slower than the on-heap memory, because the cache entries have to be moved to the on-heap memory before they can be used.
Disk Store: Uses the hard disk to store cache entries. Much slower than RAM. It is recommended to use a dedicated SSD that is only used for caching.
You can find the documentation here: http://www.ehcache.org/documentation/
If you are using Spring Boot, you can follow this article to implement the same:
https://springframework.guru/using-ehcache-3-in-spring-boot/
If the "names" are not unique, then try with calling on it String.intern() before inserting the "name" to the map, this would then reduce the memory usage.

Memory overhead of shrinking collections

I have been studying the Java Collections recently. I noticed that ArrayList, ArrayDeque and HashMap contain helper functions which expand the capacity of the containers when necessary, but none of them has a function to shrink the capacity when the container empties out.
If I am correct, is the memory cost of the references (4 bytes each) really so irrelevant?
You're correct, most of the collections have an internal capacity that is expanded automatically and that never shrinks. The exception is ArrayList, which has methods ensureCapacity() and trimToSize() that let the application manage the list's internal capacity explicitly. In practice, I believe these methods are rarely used.
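For illustration, a trivial sketch of those two ArrayList methods:

ArrayList<String> list = new ArrayList<>();
list.ensureCapacity(1_000_000); // grow the backing array once, up front
// ... add many elements, later remove most of them ...
list.trimToSize();              // shrink the backing array to the current element count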
The policy of growing but not shrinking automatically is based on some assumptions about the usage model of collections:
applications often don't know how many elements they want to store, so the collections will expand themselves automatically as elements are added;
once a collection is fully populated, the number of elements will generally remain around that number, neither growing nor shrinking significantly;
the per-element overhead of a collection is generally small compared to the size of the elements themselves.
For applications that fit these assumptions, the policy seems to work out reasonably well. For example, suppose you insert a million key-value pairs into a HashMap. The default load factor is 0.75, so the internal table size would be 1.33 million. Table sizes are rounded up to the next power of two, which would be 2^21 (2,097,152). In a sense, that's a million or so "extra" slots in the map's internal table. Since each slot is typically a 4-byte object reference, that's 4MB of wasted space!
But consider, you're using this map to store a million key-value pairs. Suppose each key and value is 50 bytes (which seems like a pretty small object). That's 100MB to store the data. Compared to that, 4MB of extra map overhead isn't that big of a deal.
Suppose, though, that you've stored a million mappings, and you want to run through them all and delete all but a hundred mappings of interest. Now you're storing 10KB of data, but your map's table of 2^21 elements is occupying 8MB of space. That's a lot of waste.
But it also seems that performing 999,900 deletions from a map is kind of an unlikely thing to do. If you want to keep 100 mappings, you'd probably create a new map, insert just the 100 mappings you want to keep, and throw away the original map. That would eliminate the space wastage, and it would probably be a lot faster as well. Given this, the lack of an automatic shrinking policy for the collections is usually not a problem in practice.
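A minimal sketch of that approach, assuming String keys and values, the original large map in bigMap, and a keysToKeep collection:

// Copy only the mappings worth keeping into a fresh, right-sized map.
Map<String, String> kept = new HashMap<>();
for (String key : keysToKeep) {
    String value = bigMap.get(key);
    if (value != null) {
        kept.put(key, value);
    }
}
bigMap = kept; // the old map, with its 2^21-slot table, is now garbage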

Elasticsearch Field Data Cache Distributed?

Since upgrading to Elasticsearch 1.0.1 I've become aware of the field data cache and its circuit breaker.
https://www.elastic.co/guide/en/elasticsearch/reference/1.3/index-modules-fielddata.html
I use facets (and now aggregations) quite heavily and I was just wondering if the field data cache is distributed and, if so, how it is distributed?
I.e. if I use 2GB of field data cache on one node and then add 3 more nodes, will the 2GB be distributed over the 4 nodes, or will I see a 2GB cache on each node?
Thanks in advance,
J
You can think of the field data as a data structure that's loaded into memory per shard. You have field data potentially on any data node. It is correct that its memory footprint gets distributed if you scale out by adding more data nodes, although that depends on how many indices/shards you have and which ones you are using for faceting/sorting/scripting.

Algorithm to store Item-to-Item-Associations

I need some help to store some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact, I have about 1 million associations. I need fast access to these associations when referencing them by the two items. What I did is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all these Maps create a lot of memory overhead, concretely more than 300 MB for the given scenario. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store this kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, a small tuple array for a small number of associations, then promote to a full fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), then you can number each instance. Since you are talking about fewer than 2 billion instances, a single long can represent an Item pair with some bit manipulation, and the map gets much smaller if you use Trove's TLongObjectHashMap (see the sketch after this list).
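A minimal sketch of that numbering idea using a plain HashMap<Long, Float>; a primitive long-keyed map like the Trove one mentioned above would shave off even more overhead:

// Pack two non-negative 32-bit item ids into one long key.
// If the association is symmetric, normalize the order first (e.g. smaller id in the high bits).
static long pairKey(int idA, int idB) {
    return ((long) idA << 32) | (idB & 0xFFFFFFFFL);
}

Map<Long, Float> coefficients = new HashMap<>();
coefficients.put(pairKey(42, 4711), 0.83f);
Float c = coefficients.get(pairKey(42, 4711));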
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
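A rough, generic sketch of such a decoration; the loader function is whatever recomputes a missing coefficient, and note that an entry can only be reclaimed while nothing else holds a strong reference to its key:

import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// A WeakHashMap-backed cache that recomputes values the GC has discarded
// (or that were never computed in the first place).
class ComputingWeakCache<K, V> {
    private final Map<K, V> cache = new WeakHashMap<>();
    private final Function<K, V> loader;

    ComputingWeakCache(Function<K, V> loader) {
        this.loader = loader;
    }

    synchronized V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }
}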
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers (-XX:+UseCompressedOops). That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push the data out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or an indexed DB table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?

How to reduce the total memory hogging by compacting my Objects in Java?

I have a table with around 20 columns, consisting mostly of varchars and decimals. This table has almost 1.5M rows. But a few things are common among them: for example, column1 contains only 100 distinct strings, column2 has almost 1,000 and column3 almost 500.
Right now, I am storing all these column values in a map, with the first 5 columns as the key and the rest of the columns as the data. My task is such that I need to initialize all of this at the start of the task.
What pattern (like Flyweight, etc.) or data structure should I use to minimize my object storage?
Why do I need to pre-load all the data?
Think of the whole data of the table as a tree, where the matches ("victims") can be at any leaf, trunk or root. So for each entry [which comes from a different place], I need to see if there is any match in the tree.
Interning is not the best option. Garbage collecting from the PermGen space is possible, but it is not something the VM is optimized for.
You can implement your own CharSequence implementation that is backed by shared char[] arrays.
With a CharSequence implementation you'll be able to implement basic sharing semantics like interned strings, or more complicated ones that take substrings and other projections into account.
A custom CharSequence implementation can also be optimized to perform fewer memory allocations than the String class, which copies char[] around (for safety reasons that are not necessary if you have the backing char[] under your full control). Even new String("..").intern() will instantiate a new String instance (char[] array) that is rapidly garbage collected.
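A minimal sketch of such a CharSequence, where many "strings" are just views over one shared buffer (equals() and hashCode() would still need to be defined if instances are used as map keys):

// A CharSequence backed by a slice of a shared char[]; no copying on creation
// or on subSequence().
final class SharedChars implements CharSequence {
    private final char[] buffer; // shared, must not be mutated after publication
    private final int offset;
    private final int length;

    SharedChars(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
        return buffer[offset + index];
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new SharedChars(buffer, offset + start, end - start); // another view, no copy
    }

    @Override public String toString() { return new String(buffer, offset, length); }
}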
My first question would be: what does your task plan to do with the data in the table? Preloading a complete table into memory is not always the best approach; for instance, keeping your current setup but loading on demand might be a better solution. And you might want to investigate flushing data that isn't used for a while, i.e. a kind of least-recently-used map.
Could you elaborate on what your task tries to achieve with all that data cached in a map?
Is the "victim" identification part of the key or part of the object? If it is part of the object, how do you select the keys that select the objects you need? In other words, it sounds like you are trying to reproduce functionality that the database is very good at.
If your problem is that your table contents do not map easily onto a tree-like structure, you could add that information in a way that is usable through the DB interface.
If your data loading process can support it, then it isn't too difficult to implement something like String.intern() without the PermGen GC side effects.
For any hashable data element, you can simply have a Map<T,T> to look-up preexisting instances. So for String:
Map<String,String> stringCache = new HashMap<String,String>();
...
// Reuse the previously loaded, equal instance if there is one; otherwise cache this one.
String sharedValue = stringCache.computeIfAbsent(loadedValue, v -> v);
The process that loads the data from wherever will still be creating temporary strings but these will be rapidly GC'ed. Without knowing more about the specifics of where the data is coming from, it's difficult to comment on whether those temporary objects are necessary... though I have trouble seeing a way around it. They would be reclaimed rapidly during the load process anyway.
