Storing a huge HashMap in a database - Java

I have something like:
import java.util.HashMap;
import java.util.List;

public class A {
    HashMap<Long, List<B>> hashMap = new HashMap<Long, List<B>>();
}

class B {
    int a;
    int b;
    int c;
}
And I want to store this in a database, because it will be huge.
I will have more than 250,000,000 keys in the HashMap, and each key maps to a large list of data (the list size may be around 1,000).
How can I do this for the best performance when retrieving the list of B objects for a given Long key from the database?
Any other suggestions?
Thanks in advance.

To me, this looks like a classical One-To-Many or Many-To-Many association between two tables.
If each B belongs to only one A, then you would have a table A and a table B containing a foreign key to A.
If a given B can belong to multiple As, then you would have a table A, a table B, and a join table between the two tables.
Make sure to index the foreign keys.
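For the one-to-many case, a possible schema created through plain JDBC could look like the sketch below (the table, column, and index names are illustrative and not from the question, and the H2 URL is just an example; any JDBC database would do):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSchema {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./catalog");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE A (ID BIGINT PRIMARY KEY)");
            st.execute("CREATE TABLE B (ID BIGINT PRIMARY KEY, A_ID BIGINT, "
                     + "A INT, B INT, C INT, FOREIGN KEY (A_ID) REFERENCES A(ID))");
            // Index the foreign key so fetching all Bs for a given A is fast
            st.execute("CREATE INDEX IDX_B_A_ID ON B(A_ID)");
        }
    }
}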

As you have a very large data set of up to 1/4 bn keys × ~1 k list entries × ~20 bytes per entry, or about 5 TB, the main problem you have is that it can't be held in memory and is too large to store on SSD, so you will have to access disk efficiently; otherwise you are looking at a latency of about 8 ms per key. This should be your primary concern, because otherwise it will take days just to access every key randomly once.
Unless you have a good understanding of how to implement this with memory-mapped files, you will need to use a database, preferably one designed to handle large numbers of records. You will also need a disk subsystem, not only for capacity but to give you more spindles, so you can increase the number of requests you can perform concurrently.

Using Infinispan, you could just keep working with your huge map and have parts of it (the entries not recently accessed) stored to disk to save RAM. This is easier than writing a whole DB layer, and (I think) faster, and it uses less memory at runtime, because the entire map is never in memory at once.
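As a rough sketch only (it assumes an infinispan.xml on the classpath that configures a file store and eviction for a cache named "bLists", and the B class from the question; the exact API varies between Infinispan versions):
import java.util.ArrayList;
import java.util.List;

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class InfinispanBackedMap {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager manager = new DefaultCacheManager("infinispan.xml");
        Cache<Long, List<B>> cache = manager.getCache("bLists");
        // Cache implements java.util.Map, so code written against the HashMap barely changes
        cache.put(42L, new ArrayList<B>());
        List<B> bs = cache.get(42L);
        manager.stop();
    }
}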

You can map this directly as a one-to-many relationship. You need two tables: one to hold the key (let's call it KeyTable), and the other to keep the B objects (BTable). In BTable, you need a foreign key to KeyTable. Then you can query something like this to get the objects with key 1234:
SELECT * FROM BTABLE WHERE key=1234;
For performance, you probably should code this using JDBC instead of something like Hibernate, to have better control of memory usage.
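A minimal JDBC sketch of that lookup (it assumes the BTABLE layout described above, with columns a, b, c for the B fields and key for the foreign key, and the B class from the question):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BLookup {
    static List<B> loadBs(Connection con, long key) throws SQLException {
        List<B> result = new ArrayList<B>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT a, b, c FROM BTABLE WHERE key = ?")) {
            ps.setLong(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    B b = new B();
                    b.a = rs.getInt("a");
                    b.b = rs.getInt("b");
                    b.c = rs.getInt("c");
                    result.add(b);
                }
            }
        }
        return result;
    }
}
Reading the rows through the ResultSet one at a time like this keeps memory usage flat no matter how long the list for a key is.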

Related

Non-serializing in-memory database

I have the following problem:
There is a Set<C> s of objects of class C. C is defined as follows:
class C {
    A a;
    B b;
    ...
}
Given A e, B f, ..., I want to find from s all objects o such that o.a = e, o.b = f, ....
Simplest solution: stream over s, filter, collect, return. But that takes a long time.
Half-assed solution: create a Map<A, Set<C>> indexA, which splits the set by a's value. Stream over indexA.get(e), filter for the other conditions, collect, return.
More-assed solution: create index maps for all fields, select for all criteria from the maps, stream over the shortest list, filter for other criteria, collect, return.
You see where this is going: we're accidentally building a database. The thing is that I don't want to serialize my objects. Sure I could grab H2 or HSQLDB and stick my objects in there, but I don't want to persist them. I basically just want indices on my regular old on-the-heap Java objects.
Surely there must be something out there that I can reuse.
Eventually, I found a couple of projects which tackle this problem including CQEngine, which seems like the most complete and mature library for this purpose.
HSQLDB provides the option of storing Java objects directly in an in-memory database without serializing them.
The property sql.live_object=true is set on the connection URL to a mem: database, for example jdbc:hsqldb:mem:test;sql.live_object=true. A table is created with a column of type OTHER to store the object. Extra columns in this table duplicate any fields in the object that need indexing.
For example:
CREATE TABLE OBJECTLIST (ID INTEGER IDENTITY, OBJ OTHER, TS_FIELD TIMESTAMP, INT_FIELD INTEGER)
CREATE INDEX IDX1 ON OBJECTLIST(TS_FIELD)
CREATE INDEX IDX2 ON OBJECTLIST(INT_FIELD)
The object is stored in the OBJ column, and the timestamp and integer values for the indexed fields are stored in the extra columns. SQL queries such as SELECT * FROM OBJECTLIST WHERE INT_FIELD = 1234 return the rows containing the relevant objects.
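A hedged JDBC sketch of how this could be used (the URL, table, and column names come from the example above; MyObject and its int field are placeholders, and the details should be checked against the HSQLDB guide linked below):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class LiveObjectExample {
    // Placeholder for whatever class you want to keep on the heap
    static class MyObject {
        int intField;
    }

    static void store(Connection con, MyObject obj) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO OBJECTLIST (OBJ, TS_FIELD, INT_FIELD) VALUES (?, ?, ?)")) {
            ps.setObject(1, obj);                            // kept as a live reference, not serialized
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setInt(3, obj.intField);                      // duplicate of the field being indexed
            ps.executeUpdate();
        }
    }

    static MyObject findByIntField(Connection con, int value) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT OBJ FROM OBJECTLIST WHERE INT_FIELD = ?")) {
            ps.setInt(1, value);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? (MyObject) rs.getObject(1) : null;  // the original heap object comes back
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:mem:test;sql.live_object=true", "SA", "");
        // create OBJECTLIST and its indexes as shown above, then call store(...) / findByIntField(...)
    }
}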
http://hsqldb.org/doc/2.0/guide/dbproperties-chapt.html#dpc_sql_conformance

What data structure should I use for a map with variable set as keys?

My dataset looks like this:
Task-1, Priority1, (SkillA, SkillB)
Task-2, Priority2, (SkillA)
Task-3, Priority3, (SkillB, SkillC)
Calling application (client) will send in a list of skills - say (SkillD, SkillA).
lookup:
Search through the dataset for SkillD first, and find nothing.
Search for SkillA. We will find two entries - Task-1 with Priority1, Task-2 with Priority2.
Identify the task with highest priority (in this case, Task-1)
Remove Task-1 from that dataset & return Task-1 to client
Design considerations:
there will be a lot of adds/updates/deletes to the dataset when the website goes live
There are only a few skills (about 10), but the list is not static, and for each skill there can be thousands of tasks. So, the lookup/retrieval will have to be extremely fast.
I have considered a simple List with binarySearch(comparator) or a Map<Skill, SortedSet<Task>>, but I am looking for more ideas.
What is the best way to design a data structure for this kind of dataset that allows a complex key and a sorted collection of tasks associated with that key?
How about changing the approach a bit?
You can use Guava, and a Multimap in particular.
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.
There are two ways to think of a Multimap conceptually: as a collection of mappings from single keys to single values, or as a mapping from unique keys to collections of values.
I would suggest having a Multimap of skills to tasks; the answer to your problem lies in a powerful feature of Multimap called Views.
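A rough sketch of how that could look (the Task class, its fields, and the assumption that a lower number means a higher priority are mine, not from the question):
import java.util.Comparator;
import java.util.SortedSet;

import com.google.common.collect.TreeMultimap;

public class SkillIndex {
    // Hypothetical task holder for illustration
    static class Task {
        final String id;
        final int priority;   // assumption: lower number = higher priority
        Task(String id, int priority) { this.id = id; this.priority = priority; }
    }

    public static void main(String[] args) {
        // Skills ordered naturally; tasks under each skill ordered by priority
        TreeMultimap<String, Task> bySkill = TreeMultimap.create(
                Comparator.<String>naturalOrder(),
                Comparator.comparingInt((Task t) -> t.priority).thenComparing(t -> t.id));

        Task t1 = new Task("Task-1", 1);
        Task t2 = new Task("Task-2", 2);
        bySkill.put("SkillA", t1);
        bySkill.put("SkillB", t1);
        bySkill.put("SkillA", t2);

        SortedSet<Task> forSkillA = bySkill.get("SkillA");  // already sorted by priority
        if (!forSkillA.isEmpty()) {
            Task best = forSkillA.first();                  // Task-1 here
            bySkill.values().removeIf(t -> t == best);      // drop it from every skill it is indexed under
        }
    }
}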
Good luck!
I would consider MongoDB. The data object for one of your rows sounds like a good fit for a JSON document, versus a row in a table, because the skill-set list may grow. In a classic relational DB you solve this in one of three ways: have ever-expanding columns to make sure you have the maximum number of skill-set columns (this is very ugly), have a separate table that groups skill sets matched to an ID, or store the skill sets as a comma-delimited list. Each of these sucks. In MongoDB you can have array fields, and the items in the array are indexable.
So with this in mind I would do all the querying in MongoDB and let it deal with it all. I would create a POJO that would look like this:
public class TaskPriority {
    String taskId;
    String priorityId;
    List<String> skillIds;
}
In MongoDB you can index all these fields to get fast searching and querying.
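For illustration, a hedged sketch with the MongoDB Java driver (the database and collection names are made up, and the driver API shown is the newer com.mongodb.client one; check the driver documentation for your version):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class TaskPriorityIndexes {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> taskPriorities =
                client.getDatabase("tasks").getCollection("taskPriorities");
        // Indexing the array field makes lookups by a single skill fast
        taskPriorities.createIndex(Indexes.ascending("skillIds"));
        taskPriorities.createIndex(Indexes.ascending("priorityId"));
        client.close();
    }
}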
If you have to cache these items locally and run these queries against Java data structures, then you can create an index, for the items you care about, that references instances of the TaskPriority object.
For example, to track skill sets to their TaskPrioritys, the following Map can be used:
Map<String, TaskPriority> skillSetToTaskPriority;
You can repeat this for taskId and priorityId. You would have to manage these indexes. This is usually the job of your DB to do.
Finally, you can then have POJOs and tables (or MongoDB collections) that map the taskId to a Task object that contains any metadata about that task that you may wish to have. The same is true for Priority and SkillSet. So that's four MongoDB collections: Tasks, Priorities, SkillSets, and TaskPriorities.

Ehcache - why are the entries so big?

I have a fairly simple data model like:
class MyParent {
    // 7 fields here, some numeric, some String, not longer than 50 chars total
    Set<MyChild> children;
}

class MyChild {
    int ownerId;
    // 3 more fields, numeric or dates
}
MyParent, MyChild and MyParent.children are all cached with read-only.
I have 40,000 instances of MyParent and 100,000 instances of MyChild. That yields 180,000 entries in cache (if you add 40,000 MyParent.children).
I want to cache everything, grouped by ownerId. Not wanting to reinvent the wheel, I wanted to use query cache like:
Query query = session.createQuery(
        "select distinct p from MyParent p join fetch p.children c where c.ownerId = :ownerId");
query.setParameter("ownerId", ownerId);
query.setCacheable(true);
query.setCacheRegion("MyRegion");
query.list();
For all 1,500 values of ownerId.
Cache works, but I noticed it's huge! Measured with Ehcache.calculateInMemorySize(), each entry is over one kilobyte on average. In order to cache ~180,000 entries I would need over 200 MB. That's outrageous, given that the entries themselves are much smaller.
Where does the overhead come from and how can I decrease it?
I'm not sure from the question which cache you used to do the math, but let me use the MyParent class as an example. Given what you explained about the class, on a 64-bit VM with CompressedOops enabled, a MyParent instance would be a little below 500 bytes in heap. And that is without the Set; I'll explain why later (it'd be another 128 bytes on top otherwise). The cache also needs to hold the key for that entry, which adds to the calculation...
Hibernate doesn't directly use the primary key as the key for what it stores in the cache, but a CacheKey entry. That instance holds the pk of the entity the value represents, as well as four other fields: type, the Hibernate type mapping; entityOrRoleName, the entity or collection-role name; tenantId, the tenant identifier associated with this data; and finally, the hashCode of the pk (see org.hibernate.type.Type.getHashCode).
Now sadly it doesn't all end here: the value for that entry isn't the MyParent instance, but a CacheEntry instance. This time, besides more metadata (subclass, the entity's name, which defaults to the FQCN; lazyPropertiesAreUnfetched, a boolean; and the optimistic locking value of the entity), that instance still doesn't hold the MyParent instance, but a disassembled representation of it. This representation is an array of the state (all properties) of the entity.
I guess that with this information, the "estimated" sizes of your Hibernate caches will make more sense. I'd like to stress that these are only estimations, and if I remember correctly how they are calculated, they are probably slightly above reality. Indeed, some information in the CacheKey, for instance, should probably be accounted for differently. As of Ehcache 2.5, you will be able to enable memory-based tuning on caches (and even at the CacheManager level). When that is done, cache entries are precisely measured and calculateInMemorySize() will give you the real measured size of the cache.
You can download the beta for 2.5 now from ehcache.org. Also note that when using byte-based sizing on your caches, the sizing engine will account for these shared instances across cached entries in Hibernate's cache types. You can read more on the way this all works here: http://ehcache.org/documentation/configuration.html#Memory_Based_Cache_Sizing_Ehcache_2.5_and_higher
Hope that helps you make more sense out of it all...
Alex

How to implement n:m relation in Java?

I need to implement an n:m relation in Java.
The use case is a catalog.
a product can be in multiple categories
a category can hold multiple products
My current solution is to have a mapping class that has two hashmaps.
The key of the first hashmap is the product id and the value is a list of category ids
The key to the second hashmap is the category id and the value is a list of product ids
This is totally redundant, and I need a managing class that always takes care that the data is stored/deleted in both hashmaps.
But this is the only way I found to make the following performant in O(1):
what products holds a category?
what categories is a product in?
I want to avoid full array scans or something like that in every way.
But there must be another, more elegant solution where I don't need to index the data twice.
Please enlighten me. I have only plain Java available, no database or SQLite or anything like that. I also don't really want to implement a B-tree structure if possible.
If you associate Categories with Products via a member collection, and vice versa, then you can accomplish the same thing:
public class Product {
    private Set<Category> categories = new HashSet<Category>();
    // implement hashCode and equals, potentially by id for extra performance
}

public class Category {
    private Set<Product> contents = new HashSet<Product>();
    // implement hashCode and equals, potentially by id for extra performance
}
The only difficult part is populating such a structure, where some intermediate maps might be needed.
But the approach of using auxiliary hashmaps/trees for indexing is not a bad one. After all, most indices placed on databases for example are auxiliary data structures: they coexist with the table of rows; the rows aren't necessarily organized in the structure of the index itself.
Using an external structure like this empowers you to keep optimizations and data separate from each other; that's not a bad thing. Especially if tomorrow you want to add O(1) look-ups for Products given a Vendor, e.g.
Edit: By the way, it looks like what you want is an implementation of a Multimap optimized to do reverse lookups in O(1) as well. I don't think Guava has something to do that, but you could implement the Multimap interface so at least you don't have to deal with maintaining the HashMaps separately. Actually it's more like a BiMap that is also a Multimap which is contradictory given their definitions. I agree with MStodd that you probably want to roll your own layer of abstraction to encapsulate the two maps.
Your solution is perfectly good. Remember that putting an object into a HashMap doesn't make a copy of the Object, it just stores a reference to it, so the cost in time and memory is quite small.
I would go with your first solution. Have a layer of abstraction around two hashmaps. If you're worried about concurrency, implement appropriate locking for CRUD.
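For example, a small sketch of such a wrapper (class and method names are mine; products and categories are referenced by id here, and locking is omitted for brevity):
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProductCategoryIndex {
    private final Map<Long, Set<Long>> categoriesByProduct = new HashMap<Long, Set<Long>>();
    private final Map<Long, Set<Long>> productsByCategory = new HashMap<Long, Set<Long>>();

    public void link(long productId, long categoryId) {
        categoriesByProduct.computeIfAbsent(productId, k -> new HashSet<Long>()).add(categoryId);
        productsByCategory.computeIfAbsent(categoryId, k -> new HashSet<Long>()).add(productId);
    }

    public void unlink(long productId, long categoryId) {
        Set<Long> categories = categoriesByProduct.get(productId);
        if (categories != null) categories.remove(categoryId);
        Set<Long> products = productsByCategory.get(categoryId);
        if (products != null) products.remove(productId);
    }

    public Set<Long> categoriesOf(long productId) {   // O(1) lookup
        return categoriesByProduct.getOrDefault(productId, Collections.<Long>emptySet());
    }

    public Set<Long> productsIn(long categoryId) {    // O(1) lookup
        return productsByCategory.getOrDefault(categoryId, Collections.<Long>emptySet());
    }
}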
If you're able to use an immutable data structure, Guava's ImmutableMultimap offers an inverse() method, which enables you to get a collection of keys by value.
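A short sketch of that (the product and category names are made up):
import com.google.common.collect.ImmutableMultimap;

public class InverseExample {
    public static void main(String[] args) {
        ImmutableMultimap<String, String> productToCategories =
                ImmutableMultimap.<String, String>builder()
                        .put("product-1", "category-a")
                        .put("product-1", "category-b")
                        .put("product-2", "category-a")
                        .build();

        // Forward lookup: categories of a product
        System.out.println(productToCategories.get("product-1"));

        // Reverse lookup: products in a category, via the inverted multimap
        System.out.println(productToCategories.inverse().get("category-a"));
    }
}
Being immutable, the multimap has to be rebuilt whenever the catalog changes, so this fits read-mostly data best.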

Hash join in Java

I have two relations A and B, both with all integer attributes (A {a1, a2, a3, ...}, B {b1, b2, b3, ...}). How would I hash-join these two in Java? The user will pick the two joining attributes. Do I make two hashtables and then proceed to join them?
Well, what form do your relations have? Are they in a relational database? If so, just use an SQL JOIN - the DBMS will probably do a hash join, but you don't have to care about that.
If they're not in a relational database, why not?
If some weird constraint prevents you from using the best tool for the job, then yes, doing a hash join is as simple as putting each tuple into a hashtable keyed on the join attribute and then iterating over the entries of one and looking up matches in the other. If all your data fits into main memory, that is.
Here is a way to do a hash join in Java.
The best approach is to hash one table, A, into a HashMap (assuming classes A and B with integer join attributes a1 and b1, and lists as and bs holding the rows):
// Build side: hash table A on its join attribute a1
HashMap<Integer, A> hash = new HashMap<Integer, A>();
for (A a : as) {
    hash.put(a.a1, a);
}
Then loop over B, probing the hash, and collect the matches:
// Probe side: look up each B row's join value and pair it with the matching A row
ArrayList<Object[]> joined = new ArrayList<Object[]>();
for (B b : bs) {
    A a = hash.get(b.b1);
    if (a != null) {
        joined.add(new Object[]{a, b});
    }
}
This will only work if each element of table A has a unique a1.
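If a1 is not unique, a map from the join value to a list of rows handles duplicates; a rough sketch under the same assumptions about the A/B classes and the as/bs lists as above:
// Build side: group the rows of A by their join attribute
HashMap<Integer, List<A>> hash = new HashMap<Integer, List<A>>();
for (A a : as) {
    hash.computeIfAbsent(a.a1, k -> new ArrayList<A>()).add(a);
}

// Probe side: emit one joined pair per matching A row
List<Object[]> joined = new ArrayList<Object[]>();
for (B b : bs) {
    List<A> matches = hash.get(b.b1);
    if (matches != null) {
        for (A a : matches) {
            joined.add(new Object[]{a, b});
        }
    }
}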
