Java HashMap to store in different buckets internally

I have a couple of scenarios related to how a HashMap stores its entries, which I am not sure how to accomplish.
Case 1: Entries are saved into buckets, and the key's hash code is taken into consideration when saving them. Now say there are 5 buckets and I want my own control over which bucket an entry is saved in. Is there a way to achieve that? Say the internal mechanism would save a particular entry into bucket 4, but I want to save it into bucket 1 instead.
Case 2: Similarly, if I see that out of the 5 buckets one is getting much more load than the others, and I want to do a load-balancing kind of job by moving entries to different buckets, how can that be accomplished?

There is fundamentally no way to achieve load balancing in a hashtable. The quintessential property of this structure is direct access to exactly the bucket which must hold the requested key. Any balancing scheme would involve reshuffling the objects among buckets and destroy this property. This is the reason why good-quality hashcodes are vital to the proper operation of a hashtable.
Additionally note that you can't even control bucket selection by manipulating the hashCode() method of your objects, because the hash codes of any two equal objects must match, and because any self-respecting hashtable implementation will additionally shuffle the bits of the value returned by hashCode() to ensure better dispersion.
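For illustration, this is roughly what OpenJDK 8's HashMap does with a key's hashCode() before picking a bucket (a simplified sketch; the helper names here are mine, but the logic mirrors java.util.HashMap.hash() and the table-indexing expression):

// Simplified from OpenJDK 8's java.util.HashMap.
// The raw hashCode() is XOR-ed with its own upper 16 bits,
// so the high bits also influence the bucket index.
static int spread(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// The bucket index is the spread hash masked by (table.length - 1);
// the table length is always a power of two.
static int bucketIndex(Object key, int tableLength) {
    return (tableLength - 1) & spread(key);
}

Because of that extra bit-mixing and the power-of-two masking, even a hand-crafted hashCode() cannot reliably target a specific bucket.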

The implementations are designed so that you shouldn't have to worry about these details.
If you want to control these details more carefully, you can create your own class implementing Map.

With HashMap, and with all collections whose names start with Hash, the most important part is the hashCode generated by the domain object you are trying to store. That's why every object has a hashCode implementation (implicitly via Object.hashCode() or explicitly).
First of all, HashMap already tries to accomplish what you stated in Case 2 (sort of). If your hashCode implementation is good, meaning it produces evenly dispersed hash code values for a variety of objects, then the load on the buckets of the HashMap is more or less evenly distributed, and you don't have to do anything (other than write a good hashCode function). You can also manipulate the balance to some extent by implementing your hashCode accordingly, for instance by producing the same hash code for objects that you want to end up in the same bucket.
If you want complete control over the internals of the HashMap, then you should implement your own version by implementing the Map interface; a rough sketch follows.
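To make that concrete, here is a minimal, hypothetical sketch of such a hand-rolled structure; the class and its bucket-selector parameter are inventions for illustration, and only put/get are shown, so it is far from a complete Map implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch: a map with a fixed number of buckets and a
// caller-supplied bucket selector, giving "Case 1" style control.
class ControlledBucketMap<K, V> {
    private final List<Map<K, V>> buckets;
    private final ToIntFunction<K> bucketSelector;

    ControlledBucketMap(int bucketCount, ToIntFunction<K> bucketSelector) {
        this.buckets = new ArrayList<>(bucketCount);
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new HashMap<>());
        }
        this.bucketSelector = bucketSelector;
    }

    public V put(K key, V value) {
        // The caller's function, not the key's hashCode, picks the bucket.
        return buckets.get(bucketSelector.applyAsInt(key)).put(key, value);
    }

    public V get(K key) {
        return buckets.get(bucketSelector.applyAsInt(key)).get(key);
    }
}

Usage might look like new ControlledBucketMap<String, String>(5, k -> Math.abs(k.hashCode() % 5)). Note that lookups only work because the selector is deterministic; moving an entry to a different bucket afterwards (the question's Case 2) would make get() look in the wrong place, which is exactly the point of the first answer above.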

The underlying mechanisms for bucket creation and placement are abstracted away.
For case 1, the closest you can get is choosing the keys themselves, since bucket placement is derived from the key. For case 2, you cannot see the actual placement of objects directly.
What you can do, though, is use a Multimap and treat the keys as if they were buckets. It's basically a map from keys to collections. Here you can check any given key (bucket) and see how many items you have placed in it, which satisfies the requirements of both cases. This is probably as close as you're going to get without actually tampering with the internal bucketing mechanism.
From the link, here is a snippet:
import java.util.Collection;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class MultiMapTest {
    public static void main(String... args) {
        Multimap<String, String> myMultimap = ArrayListMultimap.create();

        // Adding some key/value pairs
        myMultimap.put("Fruits", "Banana");
        myMultimap.put("Fruits", "Apple");
        myMultimap.put("Fruits", "Pear");
        myMultimap.put("Vegetables", "Carrot");

        // Getting the size
        int size = myMultimap.size();
        System.out.println(size); // 4

        // Getting values
        Collection<String> fruits = myMultimap.get("Fruits");
        System.out.println(fruits); // [Banana, Apple, Pear]
        Collection<String> vegetables = myMultimap.get("Vegetables");
        System.out.println(vegetables); // [Carrot]

        // Iterating over the entire Multimap
        for (String value : myMultimap.values()) {
            System.out.println(value);
        }

        // Removing a single value
        myMultimap.remove("Fruits", "Pear");
        System.out.println(myMultimap.get("Fruits")); // [Banana, Apple]

        // Remove all values for a key
        myMultimap.removeAll("Fruits");
        System.out.println(myMultimap.get("Fruits")); // [] (empty collection)
    }
}

Related

Does a java hashmap bucket really contain a list?

I've always been certain that a 'bucket' in a Java hash map contains either a linked list or a tree of some kind. Indeed, you can read in many places on the web how the bucket holds this list, and the map then iterates over the entries using the equals function to find entries stored in the same bucket (i.e. entries whose keys hash to the same bucket). Bearing this in mind, can someone explain why the following trivial code doesn't work as expected:
private class MyString {
    String internalString;

    MyString(String string) {
        internalString = string;
    }

    @Override
    public int hashCode() {
        return internalString.length(); // rubbish hashcode but perfectly legal
    }
}
...
Map<MyString, String> map = new HashMap<>();
map.put(new MyString("key1"), "val1");
map.put(new MyString("key2"), "val2");
String retVal = map.get(new MyString("key1"));
System.out.println("Val returned = " + retVal);
In this example I would have expected the two map entries to be in the list in the (same) bucket, and for retVal to equal "val1"; however, it equals null?
A quick debug shows why: the bucket does not contain a list at all, just a single entry...
I thought I was going mad until I read this on the Baeldung website (https://www.baeldung.com/java-map-duplicate-keys):
...However, none of the existing Java core Map implementations allow a Map to handle multiple values for a single key.
What is going on? Does a bucket in a hash map contain a list or not?
Does a java hashmap bucket really contain a list?
It depends.
For older implementations (Java 7 and earlier), yes, it really does contain a list. (It is a singly linked list of an internal Node type.)
For newer implementations (Java 8 and later), it can contain either a list or a binary tree, depending on how many entries hash to the particular bucket. If the number is small, a singly linked list is used. If the number is larger than a hard-coded threshold (8 in Java 8), then the HashMap converts the list to a balanced binary tree ... so that bucket searches are O(logN) instead of O(N). This mitigates the effects of a hash code function that generates a lot of collisions (or one where this can be made to occur by choosing keys in a particular way.)
If you want to learn more about how HashMap works, read the source code. (It is well commented, and the comments explain the rationale as well as the nitty-gritty how-it-works details. It is worth the time ... if you are interested in this kind of thing.)
However, none of the existing Java core Map implementations allow a Map to handle multiple values for a single key.
That is something else entirely. That is about multiple values for a key rather than multiple keys in a bucket.
The article is correct. And this doesn't contradict my "a bucket contains a list or tree" statement.
Put simply, a HashMap bucket can contain multiple key / value pairs, where the keys are all different.
The only point on which I would fault the quoted text is that it seems to imply that it is implementations of Map that have the one value per key restriction. In reality, it is the Map API itself that imposes this restriction ... unless you use (say) a List as the map's value type.
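As an aside, here is a sketch of why the question's code returns null: both keys do land in the same bucket (their hash codes are both 4), but the lookup walks the bucket's list comparing keys with equals(), and MyString inherits Object's identity-based equals(), so a freshly constructed MyString("key1") never matches the stored one. Adding an equals() override consistent with the existing hashCode() fixes it:

@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof MyString)) return false;
    // Compare by content, consistent with the length-based hashCode,
    // so a fresh MyString("key1") matches the key stored in the bucket's list.
    return internalString.equals(((MyString) obj).internalString);
}

With that in place, map.get(new MyString("key1")) returns "val1" as expected.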

Why is a linked list required on hash collision when HashMap does not allow duplicate keys?

Why is a linked list required when a hash collision occurs, given that HashMap does not allow duplicate keys? I was trying to understand the following points about HashMap:
HashMap does not guarantee the order of its elements, but for the following elements I am getting insertion order; so how is LinkedHashMap different from HashMap?
Map<String, Integer> ht2 = new HashMap<String, Integer>();
ht2.put("A", 20);
ht2.put("B", 10);
ht2.put("C", 30);
ht2.put("D", 50);
ht2.put("E", 40);
ht2.put("F", 60);
ht2.put("G", 70);
for (Map.Entry<String, Integer> e : ht2.entrySet()) {
    System.out.println(e.getKey() + "<<<key HashMap value>>>" + e.getValue());
}
HashMap does not allow duplicate keys; yes, I can get the expected output there. When we store an object as a key we have to override the equals method based on its attributes, so the same object (or an object with the same information) will not be duplicated. So every bucket should have only one entry: if a new entry equals the previous one, it overwrites it. What I don't understand is how multiple entries end up in the same bucket; when a collision occurs, doesn't it just overwrite the previous value? Why is a linked list required when duplicates are not allowed? Please look at the example below.
HashMap<Employee, Integer> hashMap = new HashMap<Employee, Integer>(4);
for (int i = 0; i < 100; i++) {
    Employee myKey = new Employee(i, "XYZ", new Date());
    hashMap.put(myKey, i);
}
System.out.println("myKey Size ::" + hashMap.size());
Here I am creating 100 Employee objects, so I assumed 100 buckets are created. When the hash codes are printed I can see different values. So how do linked lists come into play here, and how do multiple entries end up in the same bucket?
There is a difference between the number of buckets and the number of entries in the HashMap.
Two keys of the HashMap may have the same hashCode, even if they are not equal to each other, which means both of them will be stored in the same bucket. Therefore the linked list (or some other structure that can hold multiple entries) is required.
Even two keys having different hashCode may be stored in the same bucket, since the number of buckets is much smaller than the number of possible hashCode values. For example, if the HashMap has 16 buckets, keys with hashCode 0 and 16 will be mapped to the same bucket. Therefore the bucket must be able to hold multiple entries.
The first part of your question is not clear. If you meant to ask why you see different iteration order in HashMap vs. LinkedHashMap, the reason is HashMap doesn't maintain insertion order, and LinkedHashMap does maintain insertion order. If for some input you are seeing an iteration order matching the insertion order in HashMap, that's just coincidence (depending on the buckets that the inserted keys happen to be mapped to).
When a HashMap collision occurs, the equals method is involved, as you said in your question. The linked list is used like this:
If a collision occurs and equals returns true, then the old entry's value (if the references are not identical, of course) is replaced by the new one.
If equals returns false against the existing key and only one entry is in the current bucket, the HashMap inserts the new entry into a linked list at index 0. Note that in Java's standard HashMap implementation, the entries in this linked list are entirely internal; you wouldn't even be able to access the list under normal circumstances.
If there is more than one entry in the current bucket, it continues down the list until it either finds an entry whose key makes equals return true, in which case it replaces that entry's value, or it reaches the end of the list, in which case step 2 occurs.
So technically you don't have to worry about the list; just make sure that your hashCode implementation minimizes the number of collisions. A demonstration follows below.
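Here is a minimal demonstration; the strings "Aa" and "BB" are a well-known pair of distinct keys with the same hash code (2112):

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are distinct keys with identical hash codes,
        // so they land in the same bucket and are chained in its list.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);

        // Both entries survive: equals() distinguishes them inside the bucket.
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.size());    // 2
    }
}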

Map and HashCode

What is the reason that unique hash codes make hash-based collections work faster? And what is the point of a hashCode that never changes (i.e. is not mutable)?
I read about it here but didn't understand, so I read some other resources and ended up with this question.
Thanks.
Hashcodes don't have to be unique, but they work better if distinct objects have distinct hashcodes.
A common use for hash codes is storing and looking up objects in data structures like HashMap. These collections store objects in "buckets", and the hash code of the object being stored is used to determine which bucket it is stored in. This speeds up retrieval: when looking for an object, instead of having to look through all of the objects, the HashMap uses the hash code to determine which bucket to look in, and it looks only in that bucket.
You asked about mutability. I think what you're asking about is the requirement that an object stored in a HashMap not be mutated while it's in the map, or preferably that the object be immutable. The reason is that, in general, mutating an object will change its hashcode. If an object were stored in a HashMap, its hashcode would be used to determine which bucket it gets stored in. If that object is mutated, its hashcode would change. If the object were looked up at this point, a different hashcode would result. This might point HashMap to the wrong bucket, and as a result the object might not be found even though it was previously stored in that HashMap.
Hash codes are not required to be unique; they just need a very low likelihood of collisions.
As to hash codes being immutable, that is required only if an object is going to be used as a key in a HashMap. The hash code tells the HashMap where to do its initial probe into the bucket array. If a key's hash code were to change, then the map would no longer look in the correct bucket and would be unable to find the entry.
hashCode() is basically a function that converts an object into a number. In the case of hash-based collections, this number is used to help look up the object. If this number changes, it means the hash-based collection may be storing the object incorrectly and can no longer retrieve it.
Uniqueness of hash values allows a more even distribution of objects within the internals of the collection, which improves performance. If everything hashed to the same value (the worst case), performance would degrade badly; a sketch of that case follows.
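Here is a sketch of that worst case, using a made-up BadKey class whose constant hashCode is legal but useless:

// Hypothetical key whose hashCode is legal but useless:
// all instances collide into the same bucket.
class BadKey {
    final int id;

    BadKey(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override
    public int hashCode() {
        return 42; // constant: every key maps to the same bucket
    }
}

// Usage sketch: correctness is preserved (equals() still distinguishes
// keys), but every get() must walk the single overloaded bucket, so
// lookups degrade from expected O(1) to O(n) scans (or O(log n) in
// Java 8+ once the bucket is treeified):
// Map<BadKey, String> m = new HashMap<>();
// for (int i = 0; i < 100_000; i++) m.put(new BadKey(i), "v" + i);
// m.get(new BadKey(99_999)); // much slower than with a good hashCode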
The wikipedia article on hash tables provides a good read that may help explain some of this as well.
It has to do with the way items are stored in a hash table. A hash table will use the element's hash code to store and retrieve it. It's somewhat complicated to fully explain here but you can learn about it by reading this section: http://www.brpreiss.com/books/opus5/html/page206.html#SECTION009100000000000000000
Why is searching by hashing faster?
Let's say you have some unique objects as values, with a String as the key for each. Each key should be unique, so that when the key is searched, you find the relevant object it holds as its value.
Now let's say you have 1000 such key-value pairs and you want to search for a particular key and retrieve its value. If you don't have hashing, you need to compare your key against every entry in the table to find it.
But with hashing, your key is hashed and the corresponding object is put into a certain bucket on insertion. When you later want to search for that key, the key is hashed again to determine its hash value, and you can go straight to that bucket and pick out your object without searching through all the entries. See the sketch below.
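A small sketch of the contrast; the method names and the String[]-pair representation are just illustrative:

// Without hashing: compare the key against every entry, O(n).
static String linearLookup(List<String[]> pairs, String key) {
    for (String[] pair : pairs) {   // pair[0] = key, pair[1] = value
        if (pair[0].equals(key)) {
            return pair[1];
        }
    }
    return null;
}

// With hashing: the key's hash selects the bucket directly, O(1) expected.
static String hashedLookup(Map<String, String> map, String key) {
    return map.get(key);
}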
hashCode is a tricky method. It is supposed to provide a cheap shorthand for equality (which is what maps and sets care about). If many objects in your map share the same hash code, the map will have to call equals frequently, which is generally much more expensive.
Check the javadoc for equals: that method is very tricky to get right even for immutable objects, and using a mutable object as a map key is just asking for trouble (since the object is stored under its "old" hash code).
As long as you are working with collections whose elements you retrieve by index (0, 1, 2, ..., collection.size() - 1), you don't need hash codes. However, if we are talking about associative collections like maps, or simply asking a collection whether it contains some element, then we are talking about expensive operations.
A hash code is like a digest of the provided object. It is generally used for cheap comparisons: it is far less expensive to compare the hash codes of every member of a collection (a single int comparison) than to compare every object by its properties (more than one operation, for sure). Hash codes are not guaranteed to be unique, but a good one behaves like a fingerprint: one entity, one stable, immutable hash code.
The basic idea of hashing is that if one is looking in a collection for an object whose hash code differs from that of 99% of the objects in that collection, one only need examine the 1% of objects whose hash code matches. If the hashcode differs from that of 99.9% of the objects in the collection, one only need examine 0.1% of the objects. In many cases, even if a collection has a million objects, a typical object's hash code will only match a very tiny fraction of them (in many cases, less than a dozen). Thus, a single hash computation may eliminate the need for nearly a million comparisons.
Note that it's not necessary for hash values to be absolutely unique, but performance may be very bad if too many instances share the same hash code. Note also that what's important for performance is not the total number of distinct hash values, but rather the extent to which they're "clumped". Searching for an object in a collection of a million things in which half the items share one hash value and each remaining item has a different value will require examining on average about 250,000 items. By contrast, if there were 100,000 different hash values, each shared by ten items, searching for an object would require examining about five.
You can define a customized class extending HashMap and override its methods (get, put, remove, containsKey, containsValue) to compare keys and values only with the equals method, then add some constructors. Overriding the hashCode method correctly is very hard.
I hope this helps anybody who wants an easy way to use a HashMap.

Why are immutable objects in hashmaps so effective?

So I read about HashMap. At one point it was noted:
"Immutability also allows caching the hashcode of different keys which makes the overall retrieval process very fast and suggest that String and various wrapper classes (e.g., Integer) provided by Java Collection API are very good HashMap keys."
I don't quite understand... why?
String#hashCode:
private int hash;
...

public int hashCode() {
    int h = hash;
    if (h == 0 && count > 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31 * h + val[off++];
        }
        hash = h;
    }
    return h;
}
Since the contents of a String never change, the makers of the class chose to cache the hash after it had been calculated once. This way, time is not wasted recalculating the same value.
Quoting the linked blog entry:
final object with proper equals () and hashcode () implementation would act as perfect Java HashMap keys and improve performance of Java hashMap by reducing collision.
I fail to see how both final and equals() have anything to do with hash collisions. This sentence raises my suspicion about the credibility of the article. It seems to be a collection of dogmatic Java "wisdoms".
Immutability also allows caching the hashcode of different keys which makes the overall retrieval process very fast and suggest that String and various wrapper classes (e.g., Integer) provided by Java Collection API are very good HashMap keys.
I see two possible interpretations of this sentence, both of which are wrong:
HashMap caches hash codes of immutable objects. This is not correct. The map doesn't have the possibility to find out if an object is "immutable".
Immutability is required for an object to cache its own hash code. Ideally, an object's hash value should always just rely on non-mutating state of the object, otherwise the object couldn't be sensibly used as a key. So in this case, too, the author fails to make a point: If we assume that our object is not changing its state, we also don't have to recompute the hash value every time, even if our object is mutable!
Example
So if we are really crazy and actually decide to use a List as a key for a HashMap and make the hash value dependent on the contents, rather than the identity of the list, we could just decide to invalidate the cached hash value on every modification, thus limiting the number of hash computations to the number of modifications to the list.
It's very simple. Since an immutable object doesn't change over time, it only needs to perform the calculation of the hash code once. Calculating it again will yield the same value. Therefore it is common to calculate the hash code in the constructor (or lazily) and store it in a field. The hashcode function then returns just the value of the field, which is indeed very fast.
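Here is a minimal sketch of that pattern for a made-up immutable class, mirroring the lazy caching that String uses:

// Hypothetical immutable class that caches its hash code lazily,
// in the same style as java.lang.String.
final class ImmutablePoint {
    private final int x;
    private final int y;
    private int hash; // 0 means "not yet computed"

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImmutablePoint)) return false;
        ImmutablePoint p = (ImmutablePoint) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {          // compute at most once
            h = 31 * x + y;
            hash = h;
        }
        // Like String, an instance whose real hash happens to be 0
        // is recomputed on every call; harmless, just slightly wasteful.
        return h;
    }
}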
Basically, immutability is achieved in Java by making the class non-extendable and having all operations on the object ideally not change its state. If you look at an operation of String like replace(), it does not change the state of the current object you are manipulating; rather, it gives you a new String object with the replaced content. So ideally, if you use such objects as keys, the state doesn't change and hence the hash code also remains unchanged. Caching the hash code is therefore effective for performance during retrievals.
Think of the hashmap as a big array of numbered boxes, where the number is the hash code and the boxes are ordered by number.
Now if the object can't change, its hash function will always reproduce the same value. Therefore the object will always stay in its box.
Now suppose a changeable object. It is changed after being added to the hash, so now it is sitting in the wrong box, like a Mrs. Jones who happened to marry Mister Doe and is now named Doe too, but is still listed as Jones in many registers.
Immutable classes are unmodifiable; that's why they are used as keys in a Map.
For example:
StringBuilder key1 = new StringBuilder("K1");
StringBuilder key2 = new StringBuilder("K2");
Map<StringBuilder, String> map = new HashMap<>();
map.put(key1, "Hello");
map.put(key2, "World");
key1.append("00");
System.out.println(map); // This line prints - {K100=Hello, K2=World}
You see that the key K1 (an object of the mutable class StringBuilder) inserted into the map is lost due to an inadvertent change to it. This won't happen if you use immutable classes as keys for the Map family members.
Hash tables will only work if the hash code of an object can never change while it is stored in the table. This implies that the hash code cannot take into account any aspect of the object which could change while it's in the table. If the most interesting aspects of an object are mutable, that implies that either:
The hash code will have to ignore most of the interesting aspects of the object, thus causing many hash collisions, or...
The code which owns the hash table will have to ensure that the objects therein are not exposed to anything that might change them while they are stored in the hash table.
If Java hash tables allowed clients to supply an EqualityComparer (the way .NET dictionaries do), code which knows that certain aspects of the objects in a hash table won't unexpectedly change could use a hash code that takes those aspects into account, but the only way to accomplish that in Java would be to wrap each item stored in the hash table in a wrapper. Such wrapping may not be the most evil thing in the world, however, since the wrapper would be able to cache hash values in a way which an EqualityComparer could not, and could also cache further equality-related information [e.g. if the things being stored were nested collections, it might be worthwhile to compute multiple hash codes, and confirm that all hash codes match before doing any detailed inspection of the elements]. A sketch of such a wrapper follows.
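A hedged sketch of such a wrapper; the class name and its constructor parameters are inventions, and it assumes both wrappers being compared were built with the same hasher and equality functions:

import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Hypothetical wrapper imposing a caller-chosen equality, in the spirit
// of .NET's EqualityComparer; the hash value is computed once and cached.
final class KeyWrapper<T> {
    private final T value;
    private final BiPredicate<T, T> equality;
    private final int cachedHash;

    KeyWrapper(T value, ToIntFunction<T> hasher, BiPredicate<T, T> equality) {
        this.value = value;
        this.equality = equality;
        this.cachedHash = hasher.applyAsInt(value); // cached at wrap time
    }

    public T unwrap() { return value; }

    @Override
    @SuppressWarnings("unchecked")
    public boolean equals(Object o) {
        // Assumes the other wrapper was built with the same
        // hasher/equality pair.
        return o instanceof KeyWrapper
                && equality.test(value, ((KeyWrapper<T>) o).value);
    }

    @Override
    public int hashCode() {
        return cachedHash;
    }
}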

Iterating through the union of several Java Map key sets efficiently

In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to a minimum.
Putting all the keys in the HashSet beforehand using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
Can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code; a sketch combining them follows below.
Consider initializing the hash set with a capacity hint (based on the sum of the sizes of all maps). This avoids or reduces resizing of the set during add operations.
Consider iterating over the entrySet() instead of the keySet() when you also need the values: each entry gives you key and value together, saving a get() call per key.
Have a look at the implementations of equals() and hashCode(): if they are "expensive", they have a negative impact on the add method.
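A sketch combining these suggestions with the question's loop; the size heuristic and the placeholder comment are assumptions:

// Pre-size the set to (roughly) twice the total number of entries to
// avoid rehashing, and iterate entries so each value comes for free.
int sizeHint = 0;
for (Map<Object, Object> map : input) {
    sizeHint += map.size();
}
Set<Object> seen = new HashSet<Object>(2 * sizeHint);
for (Map<Object, Object> map : input) {
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (seen.add(entry.getKey())) {
            // process entry.getKey() / entry.getValue() here
        }
    }
}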
How you avoid using a HashSet depends on what you are doing.
I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update:
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input) {
    union.putAll(map);
}

// on a lookup:
Value value = union.get(key);

// process each key once:
for (Map.Entry<Key, Value> entry : union.entrySet()) {
    // do something.
}
Option A is to use the .values() method and iterate through it. But I suppose you have already thought of that.
If the code is called so often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap: every key that occurs in any of your input maps is a key in this one, and its value is the list of input maps in which that key appears.
This will help if the data is somewhat static (relative to the frequency of queries), so that the overhead of managing the structure stays small, and if the key space is not very dense (keys do not repeat a lot across the different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures, it is better to encapsulate all of this in your own data structure. A sketch of such an index follows.
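A minimal sketch of that index, assuming the input maps change rarely; the variable names are illustrative:

// Hypothetical index: for each key, the list of input maps that contain it.
// Rebuild it whenever the input maps change (assumed to be rare).
Map<Object, List<Map<Object, Object>>> index =
        new HashMap<Object, List<Map<Object, Object>>>();
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        List<Map<Object, Object>> owners = index.get(key);
        if (owners == null) {
            owners = new ArrayList<Map<Object, Object>>();
            index.put(key, owners);
        }
        owners.add(map);
    }
}

// Iterating the union then needs no HashSet bookkeeping:
for (Map.Entry<Object, List<Map<Object, Object>>> e : index.entrySet()) {
    for (Map<Object, Object> m : e.getValue()) {
        Object v = m.get(e.getKey()); // this key's value in this map
        // Do something with v
    }
}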
You could take a look at Guava's Sets.union() http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)
