Algorithm to store item-to-item associations - Java

I need some help storing some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact, there are about 1 million associations. I need fast access to these associations when looking them up by the two items involved. What I did is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead - concretely, for the given scenario, more than 300 MB. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store this kind of data?

Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to Map<Item, ?>. The value can be a simple {Item, Float} tuple for the first association, a small array of such tuples for a handful of associations, and then be promoted to a full-fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), then you can number each instance. Since you are talking about < 2 billion instances, a single long can represent an Item pair with some bit manipulation. The map then gets much smaller if you use Trove's TLongObjectHashMap.
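A minimal sketch of that last idea, assuming each Item has been numbered with a sequential int id; a plain HashMap<Long, Float> is shown for brevity, while a primitive-keyed map such as Trove's would additionally avoid boxing the key:

import java.util.HashMap;
import java.util.Map;

public class PairKeyedCoefficients {
    private final Map<Long, Float> coefficients = new HashMap<Long, Float>();

    // Pack the two 32-bit ids into one 64-bit key.
    private static long key(int idA, int idB) {
        return ((long) idA << 32) | (idB & 0xFFFFFFFFL);
    }

    public void put(int idA, int idB, float coefficient) {
        coefficients.put(key(idA, idB), coefficient);
    }

    public Float get(int idA, int idB) {
        return coefficients.get(key(idA, idB)); // null if the two items are not associated
    }
}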

You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
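A rough sketch of that decoration, assuming the coefficient can be recomputed on demand (computeCoefficient is a hypothetical placeholder); note that a WeakHashMap only drops entries whose keys are no longer strongly referenced elsewhere:

import java.util.Map;
import java.util.WeakHashMap;

public class AssociationCache<K> {
    // Rows can be reclaimed by the GC once their key is no longer strongly referenced elsewhere.
    private final Map<K, Map<K, Float>> cache = new WeakHashMap<K, Map<K, Float>>();

    public float coefficient(K a, K b) {
        Map<K, Float> row = cache.get(a);
        if (row == null) {
            row = new WeakHashMap<K, Float>();
            cache.put(a, row);
        }
        Float value = row.get(b);
        if (value == null) {
            value = computeCoefficient(a, b); // recalculate a lost or absent pair on the fly
            row.put(b, value);
        }
        return value;
    }

    protected float computeCoefficient(K a, K b) {
        return 0f; // hypothetical placeholder: derive the coefficient from the source data
    }
}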
Another possibility that might trim a relatively small amount of RAM is to instruct your JVM to use compressed object pointers (-XX:+UseCompressedOops). That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push the data out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Alternatively, an indexed database table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?

Related

Memory overhead of shrinking collections

I have been studying the Java Collections framework recently. I noticed that ArrayList, ArrayDeque and HashMap contain helper functions which expand the capacity of the container when necessary, but none of them has a function to shrink the capacity when the container becomes mostly empty.
If I am right, is the memory cost of the references (4 bytes each) really so insignificant?
You're correct, most of the collections have an internal capacity that is expanded automatically and that never shrinks. The exception is ArrayList, which has methods ensureCapacity() and trimToSize() that let the application manage the list's internal capacity explicitly. In practice, I believe these methods are rarely used.
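For example, a minimal sketch of those two methods in use:

import java.util.ArrayList;

public class TrimExample {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        list.ensureCapacity(1000000);            // pre-size the backing array before a bulk insert
        for (int i = 0; i < 1000000; i++) {
            list.add(i);
        }
        list.subList(100, list.size()).clear();  // keep only the first 100 elements
        list.trimToSize();                       // shrink the backing array to the current size
    }
}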
The policy of growing but not shrinking automatically is based on some assumptions about the usage model of collections:
applications often don't know how many elements they want to store, so the collections will expand themselves automatically as elements are added;
once a collection is fully populated, the number of elements will generally remain around that number, neither growing nor shrinking significantly;
the per-element overhead of a collection is generally small compared to the size of the elements themselves.
For applications that fit these assumptions, the policy seems to work out reasonably well. For example, suppose you insert a million key-value pairs into a HashMap. The default load factor is 0.75, so the internal table size would be 1.33 million. Table sizes are rounded up to the next power of two, which would be 2^21 (2,097,152). In a sense, that's a million or so "extra" slots in the map's internal table. Since each slot is typically a 4-byte object reference, that's 4MB of wasted space!
But consider, you're using this map to store a million key-value pairs. Suppose each key and value is 50 bytes (which seems like a pretty small object). That's 100MB to store the data. Compared to that, 4MB of extra map overhead isn't that big of a deal.
Suppose, though, that you've stored a million mappings, and you want to run through them all and delete all but a hundred mappings of interest. Now you're storing 10KB of data, but your map's table of 2^21 elements is occupying 8MB of space. That's a lot of waste.
But it also seems that performing 999,900 deletions from a map is kind of an unlikely thing to do. If you want to keep 100 mappings, you'd probably create a new map, insert just the 100 mappings you want to keep, and throw away the original map. That would eliminate the space wastage, and it would probably be a lot faster as well. Given this, the lack of an automatic shrinking policy for the collections is usually not a problem in practice.
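A quick sketch of that approach (names are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ShrinkByCopying {
    // Instead of deleting 999,900 entries in place, copy the survivors into a fresh map
    // and let the old map (with its oversized internal table) become garbage.
    static <K, V> Map<K, V> retainOnly(Map<K, V> original, Set<K> keysToKeep) {
        Map<K, V> result = new HashMap<K, V>();
        for (K key : keysToKeep) {
            V value = original.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }
}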

App Engine + Cloud Datastore performance: order in query or in memory?

Question about Google App Engine + Datastore. We have some queries with several equality filters. For these, we don't need to define any composite index; Datastore maintains the necessary indexes automatically, as described here.
The built-in indexes can handle simple queries, including all entities of a given kind, filters and sort orders on a single property, and equality filters on any number of properties.
However, we need the result to be sorted on one of these properties. I can do that (using Objectify) with .sort("prop") on the datastore query, which requires me to add a composite index and will make for a huge index once deployed. The alternative I see is retrieving the unordered list (max 100 entities in the result set) and then sorting it in memory.
Since our entity implements Comparable, I can simply use Collections.sort(entities).
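For reference, the in-memory variant would look roughly like this (MyEntity, someProp and value are placeholder names, assuming the standard Objectify fluent API):

// assuming: import static com.googlecode.objectify.ObjectifyService.ofy;
List<MyEntity> entities = ofy().load().type(MyEntity.class)
        .filter("someProp", value)   // equality filter only, served by the built-in indexes
        .limit(100)
        .list();
Collections.sort(entities);          // in-memory sort via MyEntity's compareTo()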
My question is simple: which approach is preferable? And even if the Datastore composite index were more performant, is it worth creating all those indexes?
Thanks!
There is no right or wrong approach - the best solution depends on your requirements. There are several factors to consider:
Extra indexes take space and cost more, both in storage and in writes - you have to update every index on every update of an entity.
Sorting on an indexed property is faster, but with a small result set the difference is negligible.
You can store sorted results in Memcache and avoid sorting them on every request.
You will not be able to use pagination without a composite index, i.e. you will have to retrieve all results every time and sort them in memory.
It depends on your definition of "desired". IMO, if you know the result set is a "manageable" size, I would just do the in-memory sort. Adding lots of indexes will have a cost impact, so you can do a cost analysis first to check it.

Large 2D Array Storage in Java (Android)

I'm creating a matrix in Java, which:
Can be up to 10,000 x 10,000 elements in the worst case
May change size from time to time (assume on the order of days)
Stores an integer in the range 0-5 inclusive (presumably a byte)
Has elements accessed by referring to a pair of Long IDs (system-determined)
Is symmetrical (so can be done in half the space, if needed, although it makes things like summing the rows harder (or impossible if the array is unordered))
Doesn't necessarily need to be ordered (unless halved into a triangle, as explained above)
Needs to be persistent after the app closes (currently it's being written to file)
My current implementation is using a HashMap<Pair<Long,Long>,Integer>, which works fine on my small test matrix (10x10), but according to this article, is probably going to hit unmanageable memory usage when expanded to 10,000 x 10,000 elements.
I'm new to Java and Android and was wondering: what is the best practice for this sort of thing?
I'm thinking of switching back to a bog-standard 2D byte[][] array with a HashMap lookup table for my Long IDs (roughly sketched after the list below). Will I take a noticeable performance hit on matrix access? Also, I take it there's no way of modifying the array size without either:
Pre-allocating for the assumed worst-case (which may not even be the worst case, and would take an unnecessary amount of memory)
Copying the array into a new array if a size change is required (momentarily doubling my memory usage)
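For clarity, this is roughly the fallback I have in mind (the ID-to-index assignment and sizes are only illustrative):

import java.util.HashMap;
import java.util.Map;

public class DenseMatrix {
    private final byte[][] values;   // size x size bytes (100 MB at the 10,000 x 10,000 worst case)
    private final Map<Long, Integer> indexById = new HashMap<Long, Integer>();

    public DenseMatrix(int size) {
        values = new byte[size][size];
    }

    // Map a system-determined Long ID to a row/column index, assigning one on first sight.
    private int indexFor(Long id) {
        Integer index = indexById.get(id);
        if (index == null) {
            index = indexById.size();
            indexById.put(id, index);
        }
        return index;
    }

    public void set(Long rowId, Long colId, byte value) {
        int r = indexFor(rowId);
        int c = indexFor(colId);
        values[r][c] = value;
        values[c][r] = value;        // keep the matrix symmetrical
    }

    public byte get(Long rowId, Long colId) {
        return values[indexFor(rowId)][indexFor(colId)];
    }
}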
Thought I'd answer this for posterity. I've gone with Fildor's suggestion of using an SQL database with two look-up columns to represent the row and column indices of my "matrix". The value is stored in a third column.
The main benefit of this approach is that the entire matrix doesn't need to be loaded into RAM in order to read or update elements, with the added benefit of access to summing functions (and any other features inherent in SQL databases). It's a particularly easy method on Android because of the built-in SQLite support.
One performance drawback is that initialisation of the matrix is extraordinarily slow. However, the approach I've taken is to assume that if an entry isn't found in the database, it takes a default value. This eliminates the need to populate the entire matrix (and is especially useful for sparse matrices), but has the downside of not throwing an error when an invalid index is accessed. I'd recommend coupling this approach with a pair of lists that record the valid rows and columns, and checking these lists before querying the database. Summing rows with the built-in SQL functions will also not work correctly if your default is non-zero, although this can be remedied by returning the number of entries found in the row/column being summed and multiplying the "missing" elements by the default value.
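For anyone following the same route, a rough sketch of the kind of schema and access described above, using Android's bundled SQLite support (table and column names are illustrative):

import android.content.ContentValues;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public class MatrixStore {
    // One database row per non-default matrix element.
    static final String CREATE_TABLE =
            "CREATE TABLE matrix (row_id INTEGER, col_id INTEGER, val INTEGER, "
            + "PRIMARY KEY (row_id, col_id))";

    private final SQLiteDatabase db;

    public MatrixStore(SQLiteDatabase db) {
        this.db = db;
    }

    public void set(long rowId, long colId, int value) {
        ContentValues cv = new ContentValues();
        cv.put("row_id", rowId);
        cv.put("col_id", colId);
        cv.put("val", value);
        db.insertWithOnConflict("matrix", null, cv, SQLiteDatabase.CONFLICT_REPLACE);
    }

    public int get(long rowId, long colId, int defaultValue) {
        Cursor c = db.rawQuery("SELECT val FROM matrix WHERE row_id = ? AND col_id = ?",
                new String[] { String.valueOf(rowId), String.valueOf(colId) });
        try {
            return c.moveToFirst() ? c.getInt(0) : defaultValue;  // missing entry -> default value
        } finally {
            c.close();
        }
    }

    public long sumRow(long rowId) {
        // Only counts stored entries; adjust for "missing" elements if the default is non-zero.
        Cursor c = db.rawQuery("SELECT SUM(val) FROM matrix WHERE row_id = ?",
                new String[] { String.valueOf(rowId) });
        try {
            return c.moveToFirst() ? c.getLong(0) : 0L;
        } finally {
            c.close();
        }
    }
}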

Is keeping data in nested hash tables good for both memory management and performance?

I'd like to get some opinions or an answer on how to tackle a problem with handling data through collection objects and the resulting performance issues.
I am fetching around 500,000 to 600,000 rows and keeping them in a collection object; from there I need to apply very specific per-category filters to reach the selected data. Generally I have used a Vector, but to find any exact piece of data I have to traverse every index of it, which slows my performance.
Instead of that, my plan is to use a hash table that keeps a key and, as its value, another hash table, so that it grows into many nested hash tables. Is this a good solution? That is my general question.
Note: every row contains around 15 to 17 columns (possibly as an array) in an Oracle database (those ~600,000 entries).
To answer the question in the title, a hash table (either Hashtable or HashMap) gives good lookup performance (O(1)), but consumes a significant amount of memory. The memory overhead is in the region of 8 words per entry ... in addition to the space of the key and value.
Using a hash table to speed up lookup of records is a reasonable tradeoff. However, using a hashtable to represent the fields of a record is a bad idea. You would be better off using a custom class with a field for each column of the table.
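A small sketch of that shape, with purely illustrative column names:

import java.util.HashMap;
import java.util.Map;

public class RecordIndex {
    // One plain object per database row instead of a hash table of column values.
    static class Record {
        final long id;               // illustrative columns, not the real schema
        final String category;
        final String description;

        Record(long id, String category, String description) {
            this.id = id;
            this.category = category;
            this.description = description;
        }
    }

    // A hash table is still a good way to speed up lookup of whole records by key.
    private final Map<Long, Record> byId = new HashMap<Long, Record>();

    public void add(Record record) {
        byId.put(record.id, record);
    }

    public Record findById(long id) {
        return byId.get(id);         // O(1), without traversing every element
    }
}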
However, EJP's comment is also pertinent. You should consider performing your queries against the database. In many respects, this is better than building an in-memory copy of the data and indexes and implementing your own query infrastructure.

How to reduce total memory usage by compacting my objects in Java?

I have a table with around 20 columns, mostly consisting of varchars and decimals. This table has almost 1.5M rows. But a few things are common in them: column1 consists of only 100 distinct strings, column2 has almost 1,000 and column3 has almost 500.
Right now, I am storing all these column values in a map, with the first 5 columns as the key and the rest of the columns as the data. My task is such that I need to initialize all of this at the start.
What pattern (like Flyweight, etc.) or data structure should I use to minimize my object storage?
Why do I need to pre-load all the data?
Assume the whole data of the table as a tree, where the victims can be at any leaf, trunk or the root. For each entry [these come from a different place], I need to see whether there is any match in the tree.
Interning is not the best option. Garbage collecting from the PermGen space is possible, but it is nothing the VM is optimized for.
You can implement your own CharSequence implementation that is backed by shared char[] arrays.
With a CharSequence implementation you'll be able to implement basic sharing semantics like internalized strings or more complicated ones taking substrings and other projections into account.
A custom CharSequence implementation can also be optimized to perform fewer memory allocations than the String class, which copies char[] around (for safety reasons that are unnecessary if you have the backing char[] under your full control). Even new String("..").intern() will instantiate a new String instance (char[] array) that is rapidly garbage collected.
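A bare-bones sketch of such a CharSequence, backed by a slice of a shared char[] (equality, hashing and bounds checks are omitted):

public class SharedCharSequence implements CharSequence {
    private final char[] buffer;   // shared backing array, owned and reused by the loader
    private final int offset;
    private final int length;

    public SharedCharSequence(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    public int length() {
        return length;
    }

    public char charAt(int index) {
        return buffer[offset + index];
    }

    public CharSequence subSequence(int start, int end) {
        // A substring is just another view over the same shared array: no copying.
        return new SharedCharSequence(buffer, offset + start, end - start);
    }

    public String toString() {
        return new String(buffer, offset, length);  // copies only when a real String is required
    }
}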
My first question would be: what does your task plan to do with the data in the table? Preloading a complete table into memory is not always the best approach; for instance, keeping your current setup but loading on demand might be a better solution. You might also want to investigate flushing data that isn't used for a while, i.e. a kind of least-recently-used map.
Could you elaborate on what your task tries to achieve with all that data cached in a map?
Is the "victim" identification part of the key or part of the object? If part of the object, how do you select the keys that select the objects that you need? In other words; it sounds like you try to reproduce functionality that the database is very good at.
If your problem is that your table contents does not map easily on a tree-like structure, you could add that information in a way that is useable through the DB interface.
If your data loading process can support it, then it isn't too difficult to implement something like String.intern() without the GC permgen side effects.
For any hashable data element, you can simply keep a Map<T,T> to look up pre-existing instances. So for String:
Map<String,String> stringCache = new HashMap<String,String>();
...
// Re-use the cached instance if present; otherwise cache this value as the shared instance.
String sharedValue = stringCache.get(loadedValue);
if (sharedValue == null) { stringCache.put(loadedValue, sharedValue = loadedValue); }
The process that loads the data from wherever will still be creating temporary strings but these will be rapidly GC'ed. Without knowing more about the specifics of where the data is coming from, it's difficult to comment on whether those temporary objects are necessary... though I have trouble seeing a way around it. They would be reclaimed rapidly during the load process anyway.
