I am looking into implementing a timestamp-based cache with multiple keys. What data structure other than a hash table could I use? Any suggestions...
For two values a pair could be used, but Java (un)fortunately doesn't have a Pair class.
If it has to be a triplet or quartet, what architecture is advised? Or just naming the best-practice data structure to use would also be sufficient...
Assuming that you want to retrieve the cached value only when given all of the keys, you can simply make a CacheKey object. A Map/Hashtable is still a decent candidate here:
map.put(new CacheKey(keyA, keyB, keyC), value);
map.get(new CacheKey(keyA, keyB, keyC));
//etc...
Just make sure to properly implement equals() and hashCode() in the CacheKey class.
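For illustration, a minimal sketch of such a composite key class, assuming three opaque key parts (the field names here are made up):

public final class CacheKey {
    private final Object keyA;
    private final Object keyB;
    private final Object keyC;

    public CacheKey(Object keyA, Object keyB, Object keyC) {
        this.keyA = keyA;
        this.keyB = keyB;
        this.keyC = keyC;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey other = (CacheKey) o;
        return eq(keyA, other.keyA) && eq(keyB, other.keyB) && eq(keyC, other.keyC);
    }

    private static boolean eq(Object a, Object b) {
        return a == null ? b == null : a.equals(b);
    }

    @Override
    public int hashCode() {
        // combine the parts' hash codes; 31 is the conventional multiplier
        int h = keyA == null ? 0 : keyA.hashCode();
        h = 31 * h + (keyB == null ? 0 : keyB.hashCode());
        h = 31 * h + (keyC == null ? 0 : keyC.hashCode());
        return h;
    }
}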
However, if you intend to use this map or hashtable heavily as a cache, you should seriously consider reusing an existing caching library, unless you want to deal with things like limiting the number of entries stored in the map, choosing which entries get evicted when you reach the limit, etc. EhCache is incredibly simple to use and has many configuration options: caches can have a maximum number of entries or a maximum memory size, caches can overflow to disk, and so on.
Make a hashtable where the value is a reference to the object, so that you don't have to store the object multiple times if it has multiple keys.
Luckily, this is the default in Java.
Just use a plain MultiKeyMap from Apache Commons Collections, or even use it to decorate an LRUMap:
MultiKeyMap cache = MultiKeyMap.decorate(new LRUMap());
cache.put(keyA, keyB, value);
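A brief self-contained sketch of the same idea, assuming Commons Collections 3.x is on the classpath (MultiKeyMap and LRUMap live in org.apache.commons.collections.map); the demo keys and values are made up:

import org.apache.commons.collections.map.LRUMap;
import org.apache.commons.collections.map.MultiKeyMap;

public class MultiKeyCacheDemo {
    public static void main(String[] args) {
        // the LRUMap evicts the least-recently-used entry once it holds 100 entries
        MultiKeyMap cache = MultiKeyMap.decorate(new LRUMap(100));
        cache.put("user42", "profile", "cached profile data");
        // the same pair of keys retrieves the value
        System.out.println(cache.get("user42", "profile")); // cached profile data
    }
}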
I have a couple of scenarios related to how a HashMap stores its entries, and I am not sure how to accomplish them.
Case 1: Objects are saved into buckets, and the hash code is taken into consideration when saving them. Now say there are 5 buckets and I want to have my own control over which bucket an object goes into. Is there a way to achieve that? Say, by the internal mechanism a particular object was going to be saved into bucket 4, but I wanted to save it into bucket 1.
Case 2: Similarly, if I see that out of the 5 buckets one bucket is getting much more load than the others, and I want to do a load-balancing kind of job by moving entries to different buckets, how can that be accomplished?
There is fundamentally no way to achieve load balancing in a hashtable. The quintessential property of this structure is direct access to exactly the bucket which must hold the requested key. Any balancing scheme would involve reshuffling the objects among buckets and destroy this property. This is the reason why good-quality hashcodes are vital to the proper operation of a hashtable.
Additionally note that you can't even control bucket selection by manipulating the hashCode() method of your objects, because hashcodes of any two equal objects must match, and because any self-respecting hashtable implementation will additionally shuffle the bits of the value retrieved from hashCode() to ensure better dispersion.
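As an illustration of that bit-shuffling, here is essentially the spreading step that the JDK 8 HashMap applies to every key before choosing a bucket (a sketch, not the full implementation):

public class HashSpreadSketch {
    // mirrors java.util.HashMap's hash() in JDK 8: XOR the high 16 bits
    // into the low 16 bits so that small tables see more of the hash code
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int tableLength = 16; // HashMap's bucket count is always a power of two
        // the bucket index is derived from the spread hash, not your raw hashCode()
        int bucket = (tableLength - 1) & spread("example");
        System.out.println(bucket);
    }
}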
The implementations are designed so that you shouldn't have to worry about these details.
If you want to control these details more carefully, you can create your own class implementing Map.
With HashMap, and with all collections whose names start with Hash, the most important part is the hashCode generated by the domain object that you are trying to store. That's why every object has a hashCode implementation (implicitly via Object.hashCode(), or explicitly).
First of all, HashMap tries to accomplish what you stated in case 2 (sort of). If your hashCode implementation is good, meaning it produces evenly dispersed hash code values for a variety of objects, then the load on the buckets of the HashMap is more or less evenly distributed, and you don't have to do anything (other than write a good hashCode function). You can also manipulate the balance to some extent by implementing your hashCode accordingly, e.g. by producing the same hash code for objects that you want to land in the same bucket.
If you want complete control over the internals of the HashMap, then you should implement your own by implementing the Map interface.
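As a deliberately degenerate sketch of that idea, a key class that returns a constant hash code forces all instances into the same bucket (and shows why you normally would not want to):

public final class SameBucketKey {
    private final String name;

    public SameBucketKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SameBucketKey && name.equals(((SameBucketKey) o).name);
    }

    @Override
    public int hashCode() {
        return 42; // constant: every instance lands in the same bucket,
                   // so lookups in that bucket degrade to a linear scan
    }
}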
The underlying mechanism for bucket creation and placement is abstracted away.
For case 1, you can simply use objects as your keys for the bucket placement. For case 2, you cannot see the actual placement of objects directly.
What you can do, however, is use a Multimap and treat the keys as if they were buckets. It's basically a map from keys to collections. Here you can take any given key (bucket) and see how many items you have placed in it, which satisfies the requirements of both cases. This is probably as close as you're going to get without actually tampering with the internal bucketing mechanism.
From the link, here is a snippet:
import java.util.Collection;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class MultiMapTest {
    public static void main(String... args) {
        Multimap<String, String> myMultimap = ArrayListMultimap.create();

        // Adding some key/value pairs
        myMultimap.put("Fruits", "Banana");
        myMultimap.put("Fruits", "Apple");
        myMultimap.put("Fruits", "Pear");
        myMultimap.put("Vegetables", "Carrot");

        // Getting the size
        int size = myMultimap.size();
        System.out.println(size); // 4

        // Getting values
        Collection<String> fruits = myMultimap.get("Fruits");
        System.out.println(fruits); // [Banana, Apple, Pear]

        Collection<String> vegetables = myMultimap.get("Vegetables");
        System.out.println(vegetables); // [Carrot]

        // Iterating over the entire Multimap
        for (String value : myMultimap.values()) {
            System.out.println(value);
        }

        // Removing a single value
        myMultimap.remove("Fruits", "Pear");
        System.out.println(myMultimap.get("Fruits")); // [Banana, Apple]

        // Removing all values for a key
        myMultimap.removeAll("Fruits");
        System.out.println(myMultimap.get("Fruits")); // [] (empty collection)
    }
}
Why does making hashCode unique make a hash-based collection work faster? And what is the point of not making hashCode mutable?
I read about it here but didn't understand, so I read some other resources and ended up with this question.
Thanks.
Hashcodes don't have to be unique, but they work better if distinct objects have distinct hashcodes.
A common use for hash codes is storing and looking up objects in data structures like HashMap. These collections store objects in "buckets", and the hash code of the object being stored is used to determine which bucket it's stored in. This speeds up retrieval: when looking for an object, instead of having to look through all of the objects, the HashMap uses the hash code to determine which bucket to look in, and it looks only in that bucket.
You asked about mutability. I think what you're asking about is the requirement that an object stored in a HashMap not be mutated while it's in the map, or preferably that the object be immutable. The reason is that, in general, mutating an object will change its hashcode. If an object were stored in a HashMap, its hashcode would be used to determine which bucket it gets stored in. If that object is mutated, its hashcode would change. If the object were looked up at this point, a different hashcode would result. This might point HashMap to the wrong bucket, and as a result the object might not be found even though it was previously stored in that HashMap.
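A small sketch of that failure mode, using a made-up MutableKey class whose hash code depends on mutable state:

import java.util.HashMap;
import java.util.Map;

public class MutatedKeyDemo {
    static final class MutableKey {
        int id;
        MutableKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && id == ((MutableKey) o).id;
        }
        @Override public int hashCode() { return id; }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<MutableKey, String>();
        MutableKey key = new MutableKey(1);
        map.put(key, "value");
        key.id = 2; // the hash code changes, but the entry stays in the old bucket
        System.out.println(map.get(key));               // null: probes the wrong bucket
        System.out.println(map.get(new MutableKey(1))); // null: right bucket, but equals() fails
    }
}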
Hash codes are not required to be unique; they just need a very low likelihood of collisions.
As to hash codes being immutable, that is required only if an object is going to be used as a key in a HashMap. The hash code tells the HashMap where to do its initial probe into the bucket array. If a key's hash code were to change, then the map would no longer look in the correct bucket and would be unable to find the entry.
hashCode() is basically a function that converts an object into a number. In the case of hash-based collections, this number is used to help look up the object. If this number changes, it means the hash-based collection may be storing the object incorrectly, and can no longer retrieve it.
Uniqueness of hash values allows a more even distribution of objects within the internals of the collection, which improves the performance. If everything hashed to the same value (worst case), performance may degrade.
The wikipedia article on hash tables provides a good read that may help explain some of this as well.
It has to do with the way items are stored in a hash table. A hash table will use the element's hash code to store and retrieve it. It's somewhat complicated to fully explain here but you can learn about it by reading this section: http://www.brpreiss.com/books/opus5/html/page206.html#SECTION009100000000000000000
Why is searching by hashing faster?
Let's say you have some unique objects as values and Strings as their keys. Each key should be unique, so that when the key is searched for, you find the relevant object it holds as its value.
Now let's say you have 1000 such key-value pairs and you want to search for a particular key and retrieve its value. If you don't have hashing, you would need to compare your key with all the entries in the table to find it.
But with hashing, you hash your key and put the corresponding object in a certain bucket on insertion. When you later search for a particular key, the key is hashed and its hash value determined. Then you can go straight to that hash bucket and pick your object, without having to search through all the key entries.
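To make the mechanism concrete, here is a deliberately tiny chained hash table sketch (the class and method names are made up for illustration):

import java.util.ArrayList;
import java.util.List;

public class BucketLookupSketch {
    // an array of buckets, each bucket a list of the keys that hashed to it
    @SuppressWarnings("unchecked")
    static List<String>[] buckets = new List[8];

    static int bucketOf(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length; // non-negative index
    }

    public static void main(String[] args) {
        for (int i = 0; i < buckets.length; i++) buckets[i] = new ArrayList<String>();
        for (String k : new String[] { "user1", "user2", "user3" }) {
            buckets[bucketOf(k)].add(k); // insertion: hash once, drop into one bucket
        }
        // lookup: hash once, then scan only that one bucket instead of every entry
        String wanted = "user2";
        System.out.println(buckets[bucketOf(wanted)].contains(wanted)); // true
    }
}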
hashCode is a tricky method. It is supposed to provide a shorthand to equality (which is what maps and sets care about). If many objects in your map share the same hashcode, the map will have to check equals frequently - which is generally much more expensive.
Check the javadoc for equals() - that method is very tricky to get right even for immutable objects, and using a mutable object as a map key is just asking for trouble (since the object is stored under its "old" hash code).
As long as you are working with collections whose elements you retrieve by index (0, 1, 2, ..., collection.size() - 1), you don't need hashCode. However, if we are talking about associative collections like maps, or simply asking a collection whether it contains some element, then we are talking about expensive operations.
A hash code is like a digest of the provided object. It should be robust and as close to unique as possible. Hash codes are generally used for cheap numeric comparisons: comparing the hash code of every collection member is far less expensive than comparing every object by its properties (more than one operation for sure). A hash code needs to be like a fingerprint: one entity, one immutable hash code.
The basic idea of hashing is that if one is looking in a collection for an object whose hash code differs from that of 99% of the objects in that collection, one only need examine the 1% of objects whose hash code matches. If the hashcode differs from that of 99.9% of the objects in the collection, one only need examine 0.1% of the objects. In many cases, even if a collection has a million objects, a typical object's hash code will only match a very tiny fraction of them (in many cases, less than a dozen). Thus, a single hash computation may eliminate the need for nearly a million comparisons.
Note that it's not necessary for hash values to be absolutely unique, but performance may be very bad if too many instances share the same hash code. Note that what's important for performance is not the total number of distinct hash values, but rather the extent to which they're "clumped". Searching for an object which is in a collection of a million things in which half the items all have one hash value and each remaining items each have a different value will require examining on average about 250,000 items. By contrast, if there were 100,000 different hash values, each returned by ten items, searching for an object would require examining about five.
You can define a customized class extending HashMap and override its methods (get, put, remove, containsKey, containsValue) so that keys and values are compared only with the equals method, and then add some constructors. Overriding the hashCode method correctly is very hard.
I hope I have helped everybody who wants an easy way to use a HashMap.
I'm looking for a way to have a concurrent map or similar key->value storage that can be sorted by value and not by key.
So far I was looking at ConcurrentSkipListMap but I couldn't find a way to sort it by value (using Comparator), since compare method receives only the keys as parameters.
The map has keys as String and values as Integer. What I'm looking is a way to retrieve the key with the smallest value(integer).
I was also thinking about using 2 maps: creating a separate map with Integer keys and String values, so that I would have a map sorted by integer as I wanted. However, there can be more than one entry with the same integer value, which could lead me into more problems.
Example
"user1"=>3
"user2"=>1
"user3"=>3
sorted list:
"user2"=>1
"user1"=>3
"user3"=>3
Is there a way to do this or are any 3rd party libraries that can do this?
Thanks
To sort by value where you can have multiple "value" to "key" mappings, you need a MultiMap. This needs to be synchronized, as there is no concurrent version.
This doesn't mean the performance will be poor, as that depends on how often you access this data structure; e.g. it could add up to one microsecond per access.
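One way to sketch this with Guava (inverting the map into a value-to-keys multimap is my assumption about how it would be wired up):

import com.google.common.collect.Multimaps;
import com.google.common.collect.SortedSetMultimap;
import com.google.common.collect.TreeMultimap;

public class SmallestValueDemo {
    public static void main(String[] args) {
        // invert the mapping: value -> keys; a TreeMultimap keeps its keys in
        // ascending order, and the wrapper synchronizes all access
        SortedSetMultimap<Integer, String> byValue =
                Multimaps.synchronizedSortedSetMultimap(
                        TreeMultimap.<Integer, String>create());
        byValue.put(3, "user1");
        byValue.put(1, "user2");
        byValue.put(3, "user3");
        synchronized (byValue) { // iteration over a synchronized multimap must be guarded
            Integer smallest = byValue.keySet().iterator().next();
            System.out.println(smallest + " -> " + byValue.get(smallest)); // 1 -> [user2]
        }
    }
}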
I recently had to do this and ended up using a ConcurrentSkipListMap where the keys contain a string and an integer, following the approach proposed in the question linked below. The core insight is that you can structure your code to allow a duplicate of a key with a different value to exist briefly before removing the previous one.
Atomic way to reorder keys in a ConcurrentSkipListMap / ConcurrentSkipListSet?
The problem was to keep a dynamic set of strings which were associated with integers that could change concurrently from different threads, described below. It sounds very similar to what you wanted to do.
Is there an embeddable Java alternative to Redis?
Here's the code for my implementation:
https://github.com/HarvardEconCS/TurkServer/blob/master/turkserver/src/main/java/edu/harvard/econcs/turkserver/util/UserItemMatcher.java
The principle of a ConcurrentMap is that it can be accessed concurrently - if you want it sorted at all times, performance will suffer significantly, as that map would need to be fully synchronized (like a Hashtable), resulting in poor throughput.
So I think your best bet is to return a sorted view of your map, for example by putting all elements in an unmodifiable TreeMap (although sorting a TreeMap by values needs a bit of tweaking).
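A sketch of such a sorted-by-value snapshot view (the surrounding class and the demo data are made up; the live map stays a ConcurrentHashMap):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SortedViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<String, Integer>();
        map.put("user1", 3);
        map.put("user2", 1);
        map.put("user3", 3);
        // snapshot the entries and sort the copy by value; concurrent access
        // to the original map is unaffected
        List<Map.Entry<String, Integer>> view =
                new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        Collections.sort(view, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return a.getValue().compareTo(b.getValue());
            }
        });
        System.out.println(view.get(0).getKey()); // user2 (smallest value)
    }
}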
In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to the minimum.
Putting all the keys in the HashSet before-hand, by using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
I can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code.
Consider initializing the hash set with a suitable capacity (the sum of the sizes of all maps); this avoids/reduces resizing of the set during an add operation.
Consider iterating over entrySet() rather than keySet(); you get each key and its value together, without a separate lookup per key, which should be faster. (Both of these are sketched after this list.)
Have a look at the implementations of equals() and hashCode() - if they are "expensive", then you have a negative impact on the add method.
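Putting the first two suggestions together, a sketch of how the loop from the question might look (it reuses the input variable and the imports from the question's code):

int total = 0;
for (Map<Object, Object> map : input) {
    total += map.size();
}
// pre-size: a capacity of 2 * total keeps the set below its 0.75 load factor
Set<Object> keyset = new HashSet<Object>(2 * total);
for (Map<Object, Object> map : input) {
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (keyset.add(entry.getKey())) {
            // ... work with entry.getValue() and the other maps' values for this key
        }
    }
}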
How you avoid using a HashSet depends on what you are doing.
I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update:
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// on a lookup:
Value value = union.get(key);

// process each key once:
for (Map.Entry<Key, Value> entry : union.entrySet()) {
    // do something.
}
Option A is to use the values() method and iterate through it. But I suppose you had already thought of that.
If the code is called that often, then it might be worth creating an additional structure (depending on how often the data is changed). Create a new HashMap: every key in any of your hash maps is a key in this one, and its value is the list of the HashMaps in which that key appears.
This will help if the data is somewhat static (relative to the frequency of queries), so the overhead of managing the structure is relatively small, and if the key space is not very dense (keys do not repeat themselves a lot across different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures it is better if you encapsulate everything in your own data structure.
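A sketch of that auxiliary index, built once whenever the input maps change (the method and variable names are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyIndexSketch {
    // key -> all maps that contain it; rebuilt only when the input changes
    static Map<Object, List<Map<Object, Object>>> buildIndex(List<Map<Object, Object>> input) {
        Map<Object, List<Map<Object, Object>>> index =
                new HashMap<Object, List<Map<Object, Object>>>();
        for (Map<Object, Object> map : input) {
            for (Object key : map.keySet()) {
                List<Map<Object, Object>> maps = index.get(key);
                if (maps == null) {
                    maps = new ArrayList<Map<Object, Object>>();
                    index.put(key, maps);
                }
                maps.add(map); // remember which map this key appears in
            }
        }
        return index;
    }
}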
You could take a look at Guava's Sets.union() http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)
Is there a theoretical limit for the number of key entries that can be stored in a HashMap or does the maximum purely depend on the heap memory available?
Also, which data structure is the best to store a very large number of objects (say several hundred thousand objects)?
Is there a theoretical limit for the number of key entries that can be stored in a HashMap, or does it purely depend on the heap memory available?
Looking at the documentation of that class, I would say that the theoretical limit is Integer.MAX_VALUE (2^31 - 1 = 2,147,483,647) elements.
This is because to properly implement this class, the size() method is obliged to return an int representing the number of key/value pairs.
From the documentation of HashMap.size()
Returns: the number of key-value mappings in this map
Note: This question is very similar to How many data a list can hold at the maximum.
Which data structure is the best to store a very large number of objects (say several hundred thousand objects)?
I would say it depends on what you need to store and what type of access you require. All built in collections are probably well optimized for large quantities.
HashMap holds the values in an array, which can hold up to Integer.MAX_VALUE elements. But this does not count collisions. Each Entry has a next field, which is also an Entry. This is how collisions (two or more objects with the same hash code) are resolved. So I wouldn't say there is any limit (apart from the available memory).
Note that if you exceed Integer.MAX_VALUE, you'll get unexpected behaviour from some methods, like size(), but get() and put() will still work. And they will work, because the hashCode() of any object will return an int, hence by definition each object will fit in the map. And then each object will collide with an existing one.
There is no theoretical limit, but there is a limit on the number of buckets available to store different entry chains (stored under different hash keys). Once you reach this limit, every new addition will result in a hash collision - but this is not a problem except for performance...
I agree with @Bozho and will also add that you should read the Javadoc on HashMap carefully. Note how it discusses the initial capacity and load factor and how they'll affect the performance of the HashMap.
HashMap is perfectly fine for holding large sets of data (as long as you don't run out of keys or memory) but performance can be an issue.
You may need to look in to distributed caches/data grids if you find you can't manipulate the datasets you need in a single Java/JVM program.