Working with huge maps (putIfAbsent) - java

I have this Map definition:
TreeMap<String, Set<Integer>>
It may contain millions of entries, and I also need "natural order" (that's why I've chosen a TreeMap, though I could write a Comparator if needed).
So, what I have to do in order to add an element to the map is:
Check if the key already exists.
If not, create a new Set and add the value to it.
If it exists, add the value to the existing Set.
I have this implementation, which works fine:
private void addToMap(String key, Integer value) {
    Set<Integer> vs = dataMap.get(key);
    if (vs == null) {
        vs = new TreeSet<Integer>();
        dataMap.put(key, vs);
    }
    vs.add(value);
}
But I would like to avoid searching for the key and then putting the element when it doesn't exist (the put performs a second search over the huge map).
I think I could use the ConcurrentHashMap.putIfAbsent method, but then:
I will not have the natural ordering of the keys (and I will need to sort the millions of keys myself).
I may incur (I don't know) additional overhead from the synchronization in ConcurrentHashMap; my process is single-threaded, so that may hurt performance.
Reading this post: Java map.get(key) - automatically do put(key) and return if key doesn't exist?
there's an answer that talks about Guava's MapMaker.makeComputingMap, but it looks like the method is not there anymore.
Performance is critical in this situation (as always :D), so please let me know your recommendations.
Thanks in advance.
NOTE:
Thanks a lot for so many helpful answers in just a few minutes.
(I don't know which one to select as the best.)
I will do some performance tests on the suggestions (TreeMultiMap, ConcurrentSkipListMap, TreeSet + HashMap) and update with the results. I will select the one with the best performance then, as I'd like to select all three of them but I cannot.
NOTE2
So, I did some performance testing with 1.5 million entries, and these are the results :
ConcurrentSkipListMap: it doesn't work as I expected, because it replaces the existing value with the new empty set I provided. I thought it would set the value only if the key doesn't exist, so I cannot use this one. (My mistake.)
TreeSet + HashMap: works fine but doesn't give the best performance; it is about 1.5 times slower than TreeMap alone or TreeMultiMap.
TreeMultiMap gives the best performance, but it is almost the same as the TreeMap alone. I will mark this one as the answer.
Again, thanks a lot for your contributions and help.

A concurrent map does not do magic: it checks for existence and then inserts if the key is absent.
Guava has Multimaps; for example, TreeMultimap may be what you need.

If performance is critical I wouldn't use a TreeSet of Integer; I would find a more lightweight structure like TIntArrayList or something else that wraps int values. I would also use a HashMap, as its lookup is O(1) instead of O(log N). If you also need to keep the keys sorted, I would use a second collection for that.
I agree that putIfAbsent on ConcurrentHashMap is overkill and get/put on a HashMap is likely to be the fastest option.
ConcurrentSkipListMap might be a good option for using putIfAbsent, but I would make sure it's not slower.
BTW, even worse than doing a get/put is creating a HashSet you don't need.

putIfAbsent has the benefit of concurrency: if many threads call it at the same time, they don't have to wait (it doesn't use synchronized internally). However, this comes at a small cost in execution speed, so if you work only single-threaded, it will slow things down.
If you need this sorted, try the ConcurrentSkipListMap.
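To make that concrete, here is a minimal sketch of the question's addToMap written against a ConcurrentSkipListMap. The key point is that putIfAbsent never replaces an existing value; it returns the previous mapping (or null), so the set to add to is whichever one actually ended up in the map. Note that this allocates a TreeSet even when the key already exists, the wasted-allocation cost another answer points out:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

public class AddToMapDemo {
    // Keys stay in natural order, like a TreeMap.
    static final ConcurrentSkipListMap<String, Set<Integer>> dataMap =
            new ConcurrentSkipListMap<String, Set<Integer>>();

    static void addToMap(String key, Integer value) {
        Set<Integer> fresh = new TreeSet<Integer>();
        // putIfAbsent never overwrites: it returns the previous value,
        // or null if the key was absent (in which case `fresh` is now mapped).
        Set<Integer> existing = dataMap.putIfAbsent(key, fresh);
        (existing != null ? existing : fresh).add(value);
    }

    public static void main(String[] args) {
        addToMap("b", 3);
        addToMap("a", 1);
        addToMap("a", 2);
        System.out.println(dataMap); // natural key order: a before b
    }
}
```

On a single-threaded workload, the plain get/put from the question is still likely to be faster than paying for the concurrency machinery.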

Related

Does HashMap keep its order (not insertion order)?

Sorry if an answer to this is out there; I just could not find it.
I don't care about insertion order; I just want to ensure that a HashMap keeps its order when no puts happen in between.
If I have the following code:
StringBuilder keyChecker = new StringBuilder("");
for (String field : hashmap().keySet()) {
    keyChecker.append(field).append(",");
}
for (String field : hashmap().keySet()) {
    setAny(checker, x, hashmap().get(field));
    x++;
}
Will the (1st, 2nd, 3rd, etc.) field always match the same one the next time I iterate over the HashMap's keySet()?
From my tests it seems like it always does, but I am not sure about any edge cases that I may come across.
Yes. It will keep its order if no new items are added. An idle map does not just decide to rearrange itself. But that order is non-deterministic and can change once items are added.
WJS is correct. That said, it is very bad style to depend on this. If you actually depend on the order of the entries, I would suggest using a TreeMap or one of the Apache Commons implementations of OrderedMap.
You might be able to get by with your assumption that the order will be stable right now ... but if another developer works on the code, that assumption might not be known, and the code will break in unexpected ways that will be a big headache for somebody to solve.
If you depend on entry order, use a data structure that guarantees that order.
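A small sketch of the difference, assuming nothing beyond the standard library: iterating an unmodified HashMap twice yields the same (but unspecified) order, while a TreeMap guarantees natural key order outright:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<String, Integer>();
        hash.put("b", 2); hash.put("a", 1); hash.put("c", 3);

        // Two iterations with no modification in between: same order both times.
        List<String> first = new ArrayList<String>(hash.keySet());
        List<String> second = new ArrayList<String>(hash.keySet());
        System.out.println(first.equals(second)); // true

        // TreeMap guarantees natural key order, regardless of insertion order.
        Map<String, Integer> tree = new TreeMap<String, Integer>(hash);
        System.out.println(tree.keySet()); // [a, b, c]
    }
}
```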

Is it a good idea to store data as keys in HashMap with empty/null values?

I had originally written an ArrayList and stored unique values (usernames, i.e. Strings) in it. I later needed to use the ArrayList to search if a user existed in it. That's O(n) for the search.
My tech lead wanted me to change that to a HashMap and store the usernames as keys in the map, with empty Strings as values.
So, in Java -
hashmap.put("johndoe","");
I can see if this user exists later by running -
hashmap.containsKey("johndoe");
This is O(1) right?
My lead said this was a more efficient way to do this and it made sense to me, but it just seemed a bit off to put null/empty as values in the hashmap and store elements in it as keys.
My question is, is this a good approach? The efficiency beats ArrayList#contains or an array search in general. It works.
My worry is, I haven't seen anyone else do this after a search. I may be missing an obvious issue somewhere but I can't see it.
Since you have a set of unique values, a Set is the appropriate data structure. You can put your values inside HashSet, an implementation of the Set interface.
My lead said this was a more efficient way to do this and it made sense to me, but it just seemed a bit off to put null/empty as values in the hashmap and store elements in it as keys.
The advice of the lead is flawed. Map is not the right abstraction for this, Set is. A Map is appropriate for key-value pairs. But you don't have values, only keys.
Example usage:
Set<String> users = new HashSet<>(Arrays.asList("Alice", "Bob"));
System.out.println(users.contains("Alice"));
// -> prints true
System.out.println(users.contains("Jack"));
// -> prints false
Using a Map would be awkward, because what should be the type of the values? That question makes no sense in your use case, as you have just keys, not key-value pairs. With a Set, you don't need to ask it; the usage is perfectly natural.
This is O(1) right?
Yes, searching in a HashMap or a HashSet is O(1) amortized worst case, while searching in a List or an array is O(n) worst case.
Some comments point out that a HashSet is implemented in terms of HashMap. That's fine, at that level of abstraction. At the level of abstraction of the task at hand (storing a collection of unique usernames), using a set is a natural choice, more natural than a map.
This is basically how HashSet is implemented, so I guess you can say it's a good approach. You might as well use HashSet instead of your HashMap with empty values.
For example :
HashSet's implementation of add is:
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
where map is the backing HashMap and PRESENT is a dummy value.
My worry is, I haven't seen anyone else do this after a search. I may be missing an obvious issue somewhere but I can't see it.
As I mentioned, the developers of the JDK are using this same approach.

java concurrent map sorted by value

I'm looking for a way to have a concurrent map or similar key->value storage that can be sorted by value and not by key.
So far I was looking at ConcurrentSkipListMap, but I couldn't find a way to sort it by value (using a Comparator), since the compare method receives only the keys as parameters.
The map has String keys and Integer values. What I'm looking for is a way to retrieve the key with the smallest value (integer).
I was also thinking about using 2 maps: a separate map with Integer keys and String values would give me a map sorted by integer as I wanted. However, there can be more than one key with the same integer value, which could lead to more problems.
Example
"user1"=>3
"user2"=>1
"user3"=>3
sorted list:
"user2"=>1
"user1"=>3
"user3"=>3
Is there a way to do this, or are there any 3rd-party libraries that can do it?
Thanks
To sort by value where multiple keys can map to the same value, you need a MultiMap. This needs to be synchronized, as there is no concurrent version.
That doesn't mean the performance will be poor; it depends on how often you access this data structure, e.g. the locking could add up to a microsecond per call.
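Without a library, the same idea can be sketched with a synchronized TreeMap used as an inverse index from value to the keys that currently have it. All names here are illustrative, and updates that move a key from one value to another would need extra handling:

```java
import java.util.Collections;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class ValueIndexDemo {
    // value -> keys that currently map to that value; TreeMap keeps values sorted
    static final SortedMap<Integer, Set<String>> byValue =
            Collections.synchronizedSortedMap(new TreeMap<Integer, Set<String>>());

    static void put(String user, int score) {
        synchronized (byValue) { // compound check-then-act needs a single lock
            Set<String> users = byValue.get(score);
            if (users == null) {
                users = new TreeSet<String>();
                byValue.put(score, users);
            }
            users.add(user);
        }
    }

    // Returns one key with the smallest value.
    static String smallest() {
        synchronized (byValue) {
            return byValue.get(byValue.firstKey()).iterator().next();
        }
    }

    public static void main(String[] args) {
        put("user1", 3);
        put("user2", 1);
        put("user3", 3);
        System.out.println(smallest()); // user2
    }
}
```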
I recently had to do this and ended up using a ConcurrentSkipListMap where the keys contain a string and an integer, following the approach in the answer linked below. The core insight is that you can structure your code to allow a duplicate of a key with a different value to exist briefly, before removing the previous one.
Atomic way to reorder keys in a ConcurrentSkipListMap / ConcurrentSkipListSet?
The problem was to keep a dynamic set of strings which were associated with integers that could change concurrently from different threads, described below. It sounds very similar to what you wanted to do.
Is there an embeddable Java alternative to Redis?
Here's the code for my implementation:
https://github.com/HarvardEconCS/TurkServer/blob/master/turkserver/src/main/java/edu/harvard/econcs/turkserver/util/UserItemMatcher.java
The principle of a ConcurrentMap is that it can be accessed concurrently; if you want it sorted at all times, performance will suffer significantly, as the map would need to be fully synchronized (like a Hashtable), resulting in poor throughput.
So I think your best bet is to return a sorted view of your map, for example by putting all elements into an unmodifiable TreeMap (although sorting a TreeMap by values needs a bit of tweaking).
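That tweaking can look like the following sketch: take a snapshot of the entries and sort it by value with a Comparator, giving a one-off sorted view while the live map keeps its concurrency properties:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SortedSnapshotDemo {
    static List<Map.Entry<String, Integer>> sortedByValue(Map<String, Integer> map) {
        // Copy first: this gives a consistent snapshot to sort,
        // and leaves the live map untouched.
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return a.getValue().compareTo(b.getValue());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new ConcurrentHashMap<String, Integer>();
        scores.put("user1", 3);
        scores.put("user2", 1);
        scores.put("user3", 3);
        System.out.println(sortedByValue(scores).get(0).getKey()); // user2
    }
}
```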

Iterating through the union of several Java Map key sets efficiently

In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to the minimum.
Putting all the keys in the HashSet beforehand, using addAll(), proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
I can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code:
Consider initializing the hash set with an appropriate capacity (e.g. the sum of the sizes of all maps). This avoids/reduces resizing of the set during add operations.
Consider iterating over entrySet() instead of keySet(): each entry hands you the value directly, so you save the subsequent get() calls.
Have a look at the implementations of equals() and hashCode(): if they are "expensive", they have a negative impact on the add method.
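The first two suggestions can be sketched together (the names are illustrative and the per-key work is elided):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UnionDemo {
    static Set<Object> unionKeys(List<Map<Object, Object>> input) {
        int total = 0;
        for (Map<Object, Object> map : input) total += map.size();
        // Presize the set so add() never triggers an internal resize.
        Set<Object> seen = new HashSet<Object>(2 * total);
        for (Map<Object, Object> map : input) {
            // entrySet() gives key and value together, so no extra
            // map.get(key) is needed for the map being iterated.
            for (Map.Entry<Object, Object> entry : map.entrySet()) {
                if (seen.add(entry.getKey())) {
                    // first visit of this key; entry.getValue() is at hand
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<Object, Object> m1 = new LinkedHashMap<Object, Object>();
        m1.put("a", 1); m1.put("b", 2);
        Map<Object, Object> m2 = new LinkedHashMap<Object, Object>();
        m2.put("b", 3); m2.put("c", 4);
        System.out.println(unionKeys(Arrays.asList(m1, m2)).size()); // 3
    }
}
```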
How you avoid using a HashSet depends on what you are doing.
How you avoid using a HashSet depends on what you are doing. I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update:
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// on a lookup:
Value value = union.get(key);

// process each key once
for (Map.Entry<Key, Value> entry : union.entrySet()) {
    // do something with entry
}
Option A is to use the .values() method and iterate through it. But I suppose you have already thought of that.
If the code is called that often, then it might be worth creating additional structures (depending on how often the data changes). Create a new HashMap: every key occurring in any of your HashMaps is a key in this one, and its value is the list of HashMaps in which that key appears.
This will help if the data is somewhat static (relative to the frequency of queries), so the overhead of managing the structure stays relatively small, and if the key space is not very dense (keys do not repeat a lot across different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures it is better if you encapsulate all in your own data structure.
You could take a look at Guava's Sets.union() http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)

Should use TreeMap or HashMap for wrapping named parameters?

In most cases, there will be only 0-5 parameters in the map. I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap. But I'm not sure.
Or, maybe it's even better to write my own Map in such case?
The main difference is that TreeMap is a SortedMap, and HashMap is not. If you need your map to be sorted, use a TreeMap, if not then use a HashMap. The performance characteristics and memory usage can vary, but if you only have 0-5 entries then there will be no noticeable difference.
I would not recommend you write your own map unless you need functionality which is not available from the standard Maps, which it sounds like you don't.
I guess TreeMap might have a smaller footprint, because it's less sparse than HashMap.
That may actually be wrong: empty HashMap slots are null and thus take up little space, while TreeMap entries have a higher overhead than HashMap entries because of the child pointers and the color flag.
In any case, it's only a concern if you have hundreds of thousands of such maps.
I guess you don't need the entries in the Map to be ordered, so a HashMap is OK for you.
And 5 entries are no performance concern.
Writing your own Map means implementing a dozen methods; I don't think that is what you need.
If your roughly 5 keys are always the same (or drawn from a small set of keys), and you usually query them with string literals anyway, and you only seldom have to really parse the keys from user input or similar, then you may think about using an enum type as the key type of an EnumMap. This should be even more efficient than a HashMap. The difference will only matter if you have many of these maps, though.
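A minimal sketch of the EnumMap idea, with a made-up parameter enum (the names are purely illustrative):

```java
import java.util.EnumMap;
import java.util.Map;

public class ParamDemo {
    // Hypothetical parameter names, invented for illustration.
    enum Param { HOST, PORT, TIMEOUT }

    static Map<Param, String> buildParams() {
        // EnumMap is backed by a plain array indexed by the enum ordinal:
        // compact, fast, and iteration follows the enum declaration order.
        Map<Param, String> params = new EnumMap<Param, String>(Param.class);
        params.put(Param.PORT, "8080");
        params.put(Param.HOST, "localhost");
        return params;
    }

    public static void main(String[] args) {
        System.out.println(buildParams()); // {HOST=localhost, PORT=8080}
    }
}
```

Typos in keys become compile errors, which string-keyed maps can't offer.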
