This question already has answers here:
How does a Java HashMap handle different objects with the same hash code?
(15 answers)
Closed 9 years ago.
I am just trying to understand the algorithm behind the HashMap.get method in Java.
How is the search for a specific object carried out? How is the hashMap implemented in Java and what type of searching algorithm it uses?
Extract from How does a Java HashMap handle different objects with the same hash code?
A hashmap works like this (this is a little bit simplified, but it illustrates the basic mechanism):
It has a number of "buckets" which it uses to store key-value pairs in. Each bucket has a unique number - that's what identifies the bucket. When you put a key-value pair into the map, the hashmap will look at the hash code of the key, and store the pair in the bucket of which the identifier is the hash code of the key. For example: The hash code of the key is 235 -> the pair is stored in bucket number 235. (Note that one bucket can store more then one key-value pair).
When you lookup a value in the hashmap, by giving it a key, it will first look at the hash code of the key that you gave. The hashmap will then look into the corresponding bucket, and then it will compare the key that you gave with the keys of all pairs in the bucket, by comparing them with equals().
Now you can see how this is very efficient for looking up key-value pairs in a map: by the hash code of the key the hashmap immediately knows in which bucket to look, so that it only has to test against what's in that bucket.
Looking at the above mechanism, you can also see what requirements are necessary on the hashCode() and equals() methods of keys:
If two keys are the same (equals() returns true when you compare them), their hashCode() method must return the same number. If keys violate this, then keys that are equal might be stored in different buckets, and the hashmap would not be able to find key-value pairs (because it's going to look in the same bucket).
If two keys are different, then it doesn't matter if their hash codes are the same or not. They will be stored in the same bucket if their hash codes are the same, and in this case, the hashmap will use equals() to tell them apart.
Related
I am currently preparing for interviews, Java in particular.
A common question is to explain hash maps.
Every explanation says that in case there is more than a single value per key, the values are being linked listed to the bucket.
Now, in HashMap class, when we use put(), and the key is already in the map, the value is not being linked to the existing one (at list as I understand) but replacing it:
Map<String, Integer> map = new HashMap();
map.put("a", 1);
//map now have the pair ["a", 1]
map.put("a", 2);
//map now have the pair ["a", 2]
//And according to all hash maps tutorials, it should have been like: ["a", 1->2]
From the docs:
If the map previously contained a mapping for the key, the old value
is replaced.
What am I missing here? I am a little confused...
Thanks
You're confusing the behaviour of a Map with the implementation of a HashMap.
In a Map, there is only one value for a key -- if you put a new value for the same key, the old value will be replaced.
HashMaps are implemented using 'buckets' -- an array of cells of finite size, indexed by the hashCode of the key.
It's possible for two different keys to hash to the same bucket, a 'hash collision'. In case of a collision, one solution is to put the (key, value) pairs into a list, which is searched when getting a value from that bucket. This list is part of the internal implementation of the HashMap and is not visible to the user of the HashMap.
This is probably what you are thinking of.
Your basic understanding is correct: maps in general and hashmaps in particular only support one value per key. That value could be a list but that's something different. put("a", 2) will replace any value for key "a" that's already in the list.
So what are you missing?
Every explanation says that in case there is more than a single value per key, the values are being linked listed to the bucket.
Yes, that's basically the case (unless the list is replaced by a tree for efficiency reasons but let's ignore that here).
This is not about putting values for the same key into a list, however, but for the same bucket. Since keys could be mapped to the same bucket due to their hash code you'd need to handle that "collision".
Example:
Key "A" has a hash code of 65, key "P" has a hash code of 81 (assuming hashCode() just returns the ascii codes).
Let's assume our hashmap currently has 16 buckets. So when I put "A" into the map, which bucket does it go to? We calculate bucketIndex = hashCode % numBuckets (so 65 % 16) and we get the index 1 for the 2nd bucket.
Now we want to put "P" into the map. bucketIndex = hashCode % numBuckets also yields 1 (81 % 16) so the value for a different key goes to the same bucket at index 1.
To solve that a simple implementation is to use a linked list, i.e. the entry for "A" points to the next entry for "P" in the same bucket.
Any get("P") will then look for the bucket index first using the same calculation, so it gets 1. Then it iterates the list and calls equals() on each entry's key until it hits the one that matches (or none if none match).
in case there is more than a single value per key, the values are being linked listed to the bucket.
Maybe you mistake that with: Multiple keys can have the same hashCode value. (Collision)
For example let's consider 2 keys(key1, key2). Key1 references value1 and Key2 references value2.
If
hashcode(key1) = 1
hashcode(key2) = 1
The objects might have the same hashCode, but at the same time not be equal (a collision). In that situation both values will be put as List according to hashCode. Values will be retrieved by hashCode and than you'll get your value among that values by equals operation.
So I'm a bit confused about this one.
If Hashtables use separate chaining (or linear probing), why won't the following print out both values?
Hashtable<Character, Integer> map = new Hashtable<>();
map.put('h', 0);
map.put('h', 1);
System.out.println(map.remove('h')); // outputs 1
System.out.println(map.get('h')); // outputs null
I'm trying to understand why, given 2 identical keys, the hashtable won't use separate chaining in order to store both values. Did I understand this somewhat incorrectly or has Java just not implemented collision handling in their hashtable class?
Another question that might tie together would be, how does a hashtable using linear probing, given a key, know which value is the one we are looking for?
Thanks in advance!
I'm trying to understand why, given 2 identical keys, the hashtable won't use separate chaining in order to store both values.
The specification for Map (i.e. the javadoc) says that only one value is stored for each key. So that's what HashTable and HashMap implementations do.
Certainly the separate chaining doesn't stop someone from implementing a hash table with that property. The pseudo-code for put(key, value) on a hash table with separate chaining is broadly as follows:
Compute the hash value for the key.
Compute an index in the array of hash chains from the hash value. (The computation is index = hash % array.length or something similar.)
Search the hash chain at the computed index for an entry that matches the key.
If you found the entry for the key on the chain, update the value in the entry.
If you didn't find the entry, create an entry and add it to the chain.
If you repeat that for the same key, you will compute the same hash value, search the same chain, and find the same entry. You then update it, and there is still only one entry for that key ... as required by the specification.
In short, the above algorithm has no problem meeting the Map.put API requirements.
I think you are mis-understanding how hash tables work. Imagine I am looking for someone with an id of 227828. Say I have 1000 such people. I can search all 1000 and eventually find that ID and the person to whom it belongs.
But if their ids are used as keys in a hash table it is easier. Using the id as the key, say the hash function returns 0 for an even id and 1 for an odd id. Then all I have to do is find the box that contains even ids. Ideally I would then only have to search thru 500 entries to find the key - i.e. the id, and return the value associated with it.
But hash functions are more sophisticated and there are many such boxes or buckets. And the appropriate box or bucket can be identified and then be searched for the proper key. And then return its associated value.
I am currently implementing a Hashed Array-Mapped Table (HAMT) in Java and I've run into the following issue. When inserting new key-value pairs, there will obviously be collisions. In the original paper, the author suggests:
The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to diļ¬erentiate the two keys.
... and also:
The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.
So i tried this in Java, with strings. It is known that:
"Ea".hashCode() == "FB".hashCode(); // true
... because of the String#hashCode() algorithm. Following the suggestion in the paper and extending the string with the tree depth to yield another, non-colliding hash code sadly doesn't work:
"Ea1".hashCode() == "FB1".hashCode(); // :( still the same!!
The above holds true for any integer you might concatenate the strings with, their hash codes will always collide.
My question is: how do you solve this situation? There has been this answer to a very similar question, but there has not been a real solution in the discussion. So how do we do this...?
You have to implements equals() method to compare if the values are equals.
Hashcode is just to sort data into data collections and it is usefull to binarySearch works. But hashcode() is nothing without equals().
Consider an int array variable x[]. The variable X will have starting address reference. When array is accessed with index 2 that is x[2].then its memory location is calculated as
address of x[2] is starting addr + index * size of int.
ie. x[2]=x + 2*4.
But in case of hashmap how the memory address will be mapped internally.
By reading many previous posts I observed that HashMap uses a linked list to store the key value list. But if that is the case, then to find a key, it generates a hashcode then it will checks for equal hash code in list and retrieves the value..
This takes O(n) complexity.
If i am wrong in above observation Please correct me... I am a beginner.
Thank you
The traditional implementation of a HashMap is to use a function to generate a key, then use that key to access a value directly. Think of it as generating something that will translate into an array index. It does not look through the hashmap comparing element hashes to the generated hash; it generates the hash, and uses the hash to access an element directly.
What I think you're talking about is the case where two values in the HashMap generate the SAME key. THEN it uses a list of those, and has to look through them to determine which one it wants. But that's not O(n) where n is the number of elements in the HashMap, just O(m) where m is the number of elements with the same hash. Clearly the name of the game is to find a hash function where the generated hash is unique for all the elements, as much as is feasible.
--- edit to expand on the explanation ---
In your post, you state:
By reading many previous posts I observed that HashMap uses a linked
list to store the key value list.
This is wrong for the overall HashMap. For a HashMap to work reasonably, there must be a way to use the key to calculate a way to access the corresponding element directly, not by searching through all the values in the HashMap.
A "perfect" hash calculation would translate each possible key into hash value that was not calculated for any other key. This is usually not feasible, and so it is usually possible that two different keys will result in the same result from the hash calculation. In this case, the HashMap implementation could use a linked list of values, and would need to look through all such values to find the one that it was looking for. But this number is FAR less than the number of values in the overall HashMap.
You can make a hash where strings are the keys, and in which the first character of the string is converted to a number which is then used as an array index. As long as your strings all have different first characters, then accessing the value is a simple calc plus an array access -- O(1). Or you could add all the character values of the string indices together and take the last two (or three) digits, and THAT would be your hash calc. As long as the addition produced unique values for each index string, you don't ever have to look through a list; again, O(1).
And, in fact, as long as the hash calculation is approximately perfect, the lookup is still O(1) overall, because the limited number of times you have to look through a short list does not alter the overall efficiency.
Suppose we wish to repeatedly search a linked list of length N elements, each of which contains a very long string key. How might we take advantage of the hash value when searching the list for an element with a given key?
Insert the keys into a hash table. Then you can search in (theoretically) constant time.
You need to have the hashes prepared before searching the list and you need to be able to access hash of a string in constant time. Then you can first compare the hashes and only compare strings when hashes match. You can use a hashtable instead of a linked list.
The hash value of a String (hashCode) is a bit like an id for the string. Not completely unique, but usually pretty unique. You can a HashMap to store the String keys and their values. HashMap, as its name suggests, uses the Strings' hash values to efficiently store and retrieve values.
Not sure what constraints you're working under here (i.e. this might take way too much memory depending on how big the strings are), but if you have to maintain the linked list you could create a HashMap that maps the strings to their position in the list which would allow you to retrieve any string from the list with 2 constant time operations.
Put it in HashSet. The search algorithms will use the hash for each String value inserted.