I have been studying Java Collections recently. I noticed that ArrayList, ArrayDeque and HashMap contain helper functions that expand the capacity of the containers if necessary, but none of them has a function to shrink the capacity when the container becomes mostly empty.
If I am correct about that, is the memory cost of the references (4 bytes each) really so negligible?
You're correct: most of the collections have an internal capacity that is expanded automatically and that never shrinks. The exception is ArrayList, which has the methods ensureCapacity() and trimToSize() that let the application manage the list's internal capacity explicitly. In practice, I believe these methods are rarely used.
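For illustration, here is a minimal sketch of how an application could manage an ArrayList's capacity with those two methods; the element counts are made up:

import java.util.ArrayList;

public class CapacityDemo {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<>();
        // Pre-size the backing array so the adds below never trigger a grow.
        list.ensureCapacity(1_000_000);
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }
        // Remove most of the elements; the backing array keeps its old length...
        list.subList(100, list.size()).clear();
        // ...until the application explicitly asks for it to be shrunk.
        list.trimToSize();
    }
}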
The policy of growing but not shrinking automatically is based on some assumptions about the usage model of collections:
applications often don't know how many elements they want to store, so the collections will expand themselves automatically as elements are added;
once a collection is fully populated, the number of elements will generally remain around that number, neither growing nor shrinking significantly;
the per-element overhead of a collection is generally small compared to the size of the elements themselves.
For applications that fit these assumptions, the policy seems to work out reasonably well. For example, suppose you insert a million key-value pairs into a HashMap. The default load factor is 0.75, so the internal table size would be 1.33 million. Table sizes are rounded up to the next power of two, which would be 2^21 (2,097,152). In a sense, that's a million or so "extra" slots in the map's internal table. Since each slot is typically a 4-byte object reference, that's 4MB of wasted space!
But consider, you're using this map to store a million key-value pairs. Suppose each key and value is 50 bytes (which seems like a pretty small object). That's 100MB to store the data. Compared to that, 4MB of extra map overhead isn't that big of a deal.
Suppose, though, that you've stored a million mappings, and you want to run through them all and delete all but a hundred mappings of interest. Now you're storing 10KB of data, but your map's table of 2^21 elements is occupying 8MB of space. That's a lot of waste.
But it also seems that performing 999,900 deletions from a map is kind of an unlikely thing to do. If you want to keep 100 mappings, you'd probably create a new map, insert just the 100 mappings you want to keep, and throw away the original map. That would eliminate the space wastage, and it would probably be a lot faster as well. Given this, the lack of an automatic shrinking policy for the collections is usually not a problem in practice.
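A minimal sketch of that "copy the survivors" approach (retainOnly and keysToKeep are hypothetical names, not anything from the JDK):

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class MapTrim {
    // Rebuild a right-sized map containing only the entries worth keeping.
    static <K, V> Map<K, V> retainOnly(Map<K, V> bigMap, Collection<K> keysToKeep) {
        Map<K, V> trimmed = new HashMap<>();
        for (K key : keysToKeep) {
            V value = bigMap.get(key);
            if (value != null) {
                trimmed.put(key, value);
            }
        }
        // The caller drops its reference to bigMap, so the oversized table can be garbage collected.
        return trimmed;
    }
}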
When a HashMap reaches its allowed size (capacity * loadFactor), it is automatically grown, and after that all elements are relocated to new indices. Why does this relocation need to be performed?
Because it keeps the hash table sparse, allowing elements to sit in their own buckets instead of piling up in a small number of buckets.
When several elements hit the same bucket, HashMap has to create a list (and sometimes even a tree), which is bad for both the memory footprint and the performance of element retrieval. So, to keep the number of such collisions down, HashMap grows its internal hash table and rehashes.
The rehashing is required because the calculation used when mapping a key value to a bucket is dependent on the total number of buckets. When the number of buckets changes (to increase capacity), the new mapping calculation may map a given key to a different bucket.
In other words, lookups for some or all of the previous entries may fail to behave properly because the entries are in the wrong buckets after growing the backing store.
While this may seem unfortunate, you actually want the mapping function to take into account the total number of buckets that are available. In this way, all buckets can be utilized and no entries get mapped to buckets that do not exist.
There are other data structures that do not have this property, but this is the standard way that hash maps work.
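To make that concrete, here is a small sketch of the kind of index calculation involved; the real JDK code also spreads the hash bits, but the point is the same: the bucket index depends on the table length, so it can change when the table grows.

public class BucketIndexDemo {
    // Bucket index for a power-of-two table: equivalent to hash % tableLength.
    static int bucketIndex(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    public static void main(String[] args) {
        int h = "someKey".hashCode();
        // The same key can land in a different bucket once the table doubles,
        // which is why every entry has to be re-examined (rehashed) on resize.
        System.out.println(bucketIndex(h, 16));  // index in the old table
        System.out.println(bucketIndex(h, 32));  // index in the grown table
    }
}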
I have a HashMap<Integer, Node>, where Node is a class containing some info and, most importantly, a neighbors HashMap.
My algorithm does random gets from the outer HashMap and puts in the inner HashMap, e.g.
Node node = map.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
I have lots of entries in both the outer map and inner maps. The full dataset cannot be handled by memory alone, so I need to use some disk caching. I would like to use memory as much as possible (until almost full, or until it reaches a threshold I specify) and then evict entries to disk.
I have used several caches (mostly mapDb and EhCache) trying to solve my problem but with no luck. I am setting a maximum memory size, but the cache just ignores it.
I am almost certain that the problem lies in the fact that my object is of dynamic size.
Anyone got any ideas on how I could handle this problem?
Thanks in advance.
Note: I work on Ehcache
Ehcache, like most other caching products, cannot know about your object growing in size unless you let it know by updating the mapping from the outside:
Node node = cache.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
cache.put(someInt, node);
In that context, Ehcache will properly track the object growth and trigger memory eviction.
Note that Ehcache no longer uses an overflow model between the heap and disk but instead always stores the mapping on disk while keeping a hotset on heap.
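As a rough sketch of what such a setup might look like (this assumes the Ehcache 3 builder API; the cache name, sizes and directory are made up, and the heap/disk tiers should be adjusted to your actual needs):

import java.io.File;
import org.ehcache.Cache;
import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

public class NodeCacheSetup {
    public static void main(String[] args) {
        PersistentCacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(CacheManagerBuilder.persistence(new File("cache-data")))
                .withCache("nodes", CacheConfigurationBuilder.newCacheConfigurationBuilder(
                        Integer.class, Node.class,
                        ResourcePoolsBuilder.newResourcePoolsBuilder()
                                .heap(100, MemoryUnit.MB)      // hotset kept on heap
                                .disk(1, MemoryUnit.GB, true)  // full data set persisted on disk
                ))
                .build(true);
        // Node must be serializable (or have a registered serializer) for the disk tier.
        Cache<Integer, Node> cache = cacheManager.getCache("nodes", Integer.class, Node.class);
        // ... get/put as shown above, remembering to cache.put(key, node) after mutating node ...
        cacheManager.close();
    }
}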
I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog-standard 2D array byte[][] with a HashMap lookup table for my Long IDs. Will I take a noticeable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)
Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherently in SQL databases). It's a particularly easy method on Android, because of the built-in SQL functionality.
One performance drawback is that the initialisation of the matrix is extraordinarily slow. However, the approach I've taken is to assume that if an entry isn't found in the database, it takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error if trying to access an invalid index.

It is recommended that this approach is coupled with a pair of lists that list the valid rows and columns, and that these lists are referenced before attempting to access the database.

If you're trying to sum rows using the built-in SQL features, this will also not work correctly if your default is non-zero, although this can be remedied by returning the number of entries found in the row/column being summed, and multiplying the "missing" elements by the default value.
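A rough sketch of what the two-lookup-column table and the default-value read might look like with Android's built-in SQLiteDatabase (the table and column names are made up, and error handling is omitted):

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

class MatrixStore {
    // One database row per non-default matrix element.
    static void createTable(SQLiteDatabase db) {
        db.execSQL("CREATE TABLE IF NOT EXISTS matrix ("
                + "row_id INTEGER NOT NULL, col_id INTEGER NOT NULL, value INTEGER NOT NULL, "
                + "PRIMARY KEY (row_id, col_id))");
    }

    // Read one element, falling back to the default when no row is stored.
    static int get(SQLiteDatabase db, long rowId, long colId, int defaultValue) {
        Cursor c = db.rawQuery(
                "SELECT value FROM matrix WHERE row_id = ? AND col_id = ?",
                new String[] { String.valueOf(rowId), String.valueOf(colId) });
        try {
            return c.moveToFirst() ? c.getInt(0) : defaultValue;
        } finally {
            c.close();
        }
    }
}

Row sums can then use something like SELECT SUM(value), COUNT(*) FROM matrix WHERE row_id = ?, where the COUNT makes it possible to account for the "missing" default-valued elements as described above.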
I am wondering whether those dictionary-like data structures (Hashtable, HashMap, LinkedHashMap, TreeMap, ConcurrentHashMap, SortedMap and so on) need to perform a rehashing operation when their size reaches the threshold. Since resizing the table is really expensive, I am wondering whether there is anything that doesn't require rehashing when the table is resized, or any way to improve the performance of that operation.
SortedMap (TreeMap) doesn't need to rehash; it's implemented as a red-black tree and is thus self-balancing.
Hash-based structures might need to be rehashed in order to get a performance boost, but it's a trade-off: a table can also be left as it is and simply tolerate more collisions, and that is what the load factor parameter was introduced to control.
To improve performance, estimate an approximate initial capacity based on the application and choose the right load factor when allocating memory for the data structure.
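For example, here is a minimal sketch of pre-sizing a HashMap so it never has to resize for an expected number of entries (the figure of one million entries is purely illustrative):

import java.util.HashMap;
import java.util.Map;

public class PreSizedMapDemo {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;
        float loadFactor = 0.75f;
        // The capacity must be at least expectedEntries / loadFactor so the resize threshold
        // (capacity * loadFactor) is never crossed; HashMap rounds it up to a power of two.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);  // no rehashing happens during this loop
        }
    }
}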
No escape from re-hashing unless you implement your own HashMap.
A few other facts: rehashing happens only in hash-based data structures, i.e. HashMap, Hashtable and so on. So non-hash data structures like TreeMap (SortedMap is an interface, so we can take it out of the picture) are not rehashed on reaching a limit.
Resizing always happens on an ArrayList when it reaches its threshold. So use a LinkedList if you have a lot of insertions and deletions in the middle of the data structure.
I need some help to store some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact, I have about 1 million associations. I need fast access to these associations when referencing the two items. What I did is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead: concretely, for the given scenario, more than 300 MB. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store that kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, a small tuple array for a small number of associations, and then be promoted to a full-fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), then you can number each instance. Since you are talking about fewer than 2 billion instances, a single long can represent an Item pair with some bit manipulation. Then the map gets much smaller if you use Trove's TLongObjectHashMap.
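A small sketch of that bit-packing idea using plain JDK types (a primitive-keyed map such as Trove's would additionally avoid boxing the long key); the int IDs are assumed to have already been assigned, one per Item instance:

import java.util.HashMap;
import java.util.Map;

class PairKeyedCoefficients {
    private final Map<Long, Float> coefficients = new HashMap<>();

    // Pack two non-negative int IDs into one long key.
    private static long key(int itemIdA, int itemIdB) {
        return ((long) itemIdA << 32) | (itemIdB & 0xFFFFFFFFL);
    }

    void put(int itemIdA, int itemIdB, float coefficient) {
        coefficients.put(key(itemIdA, itemIdB), coefficient);
    }

    Float get(int itemIdA, int itemIdB) {
        return coefficients.get(key(itemIdA, itemIdB));
    }
}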
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
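A minimal sketch of that decorate-with-recomputation idea, assuming the values really can be recomputed from their keys (the recompute function is a placeholder for whatever that calculation is):

import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

class RecomputingCache<K, V> {
    // WeakHashMap entries may be discarded by the GC once the key is no longer referenced elsewhere.
    private final Map<K, V> cache = new WeakHashMap<>();
    private final Function<K, V> recompute;

    RecomputingCache(Function<K, V> recompute) {
        this.recompute = recompute;
    }

    V get(K key) {
        // Recalculates lost or absent key/value pairs on the fly.
        return cache.computeIfAbsent(key, recompute);
    }
}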
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers. That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push the data out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or an indexed database table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?