I am trying to understand the HashMap internal implementation and I need some help with it. HashMap stores data using a linked list, and since Java 8 it can also use a tree data structure.
Below is the Node class constructor. Could you please help me understand why a node holds the hash value of the key?
Node(int hash, K key, V value, Node<K,V> next) {
    this.hash = hash;
    this.key = key;
    this.value = value;
    this.next = next;
}
You use the hashes to compute the placement of an element within the HashMap. It would be supremely inefficient if adding a key to your HashMap required re-computation of the hashes every time a collision needed to be resolved. Not only that, but some objects may have expensive hash functions (for example, String hashCode goes through every character in the String). It would always be desirable to compute such functions once and cache them for later (since you should not change hashCode/equals for an object placed within a Map).
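For illustration, here is a minimal sketch (not the actual JDK code; the method name find is made up) of how that cached hash field pays off during a bucket lookup, assuming the Node class shown above:

static <K, V> Node<K, V> find(Node<K, V> bucketHead, int hash, Object key) {
    for (Node<K, V> e = bucketHead; e != null; e = e.next) {
        // Compare the cached hash first (a cheap int comparison), and only then
        // fall back to equals(), which may be expensive for some key types.
        if (e.hash == hash && (e.key == key || (key != null && key.equals(e.key)))) {
            return e;
        }
    }
    return null;
}

The same cached hash is also reused when the table is resized, so hashCode() never has to be recomputed for entries that are moved to new buckets.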
Let's consider a HashMap that is half full (n/2 entries), with independent elements being placed into the entry set. There is (at minimum) a 1/2 probability of collision for any given element being added. The number of entries the HashMap can hold is n, but the default load factor is 0.75, which means we have 3n/4 - n/2 = n/4 entries left to fill before expansion. For caching to yield no benefit, all of these entries would have to be inserted without a hash collision, since we save time on every collision (thanks to the cached hash). Taking the maximum possible probability of 1/2 for not having a collision, we get a probability of at most (1/2)^(n/4) that no collision occurs before the HashMap expands. This means that for any sizeable HashMap (n = 29+, out of which only 0.5 * 29 ≈ 15 keys have to be filled), there is a greater than 99% chance that you get time savings from caching hash values.
tl;dr It saves time, especially when collisions become frequent for addition/lookups.
I know that insertion into a HashMap takes O(1) time complexity, so for inserting n elements the complexity should be O(n). I have a small doubt about the method below.
Code:
private static Map<Character, Character> mapCharacters(String message) {
    Map<Character, Character> map = new HashMap<>();
    char idx = 'a';
    for (char ch : message.toCharArray()) { // a loop - O(n)
        if (!map.containsKey(ch)) { // containsKey - takes O(n) to check the whole map
            map.put(ch, idx);
            ++idx;
        }
    }
    return map;
}
What I am doing is mapping the characters of the decrypted message onto the sequential "abcd" alphabet.
So, my confusion is whether the complexity of the above method is O(n^2) or O(n). Please help me clear up my doubt.
It's amortised O(n). containsKey and put are both amortised O(1); see "What is the time complexity of HashMap.containsKey() in java?" in particular.
The time complexity of both containsKey() and put() will depend on the implementation of the equals/hashCode contract of the object that is used as a key.
Under the hood, HashMap maintains an array of buckets. And each bucket corresponds to a range of hashes. Each non-empty bucket contains nodes (each node contains information related to a particular map-entry) that form either a linked list or a tree (since Java 8).
The process of determining the right bucket based on the hash of the provided key is almost instant, unless computing the hash itself is heavy, but in any case it's considered to have constant time complexity, O(1). Accessing the bucket (i.e. accessing an array element) is also O(1), but then things can become tricky. Assuming that the bucket contains only a few elements, checking them will not significantly increase the total cost (while performing both containsKey() and put() we should check whether the provided key exists in the map) and the overall time complexity will be O(1).
But in the case when all the map entries, for some reason, end up in the same bucket, iterating over the linked list will cost O(n); and if the nodes are stored as a red-black tree, the worst-case cost of both containsKey() and put() will be O(log n).
In the case of Character as a key, we have a proper implementation of equals/hashCode. A proper hash function is important to ensure that objects are spread evenly among the buckets. But if you were instead using a custom object whose hashCode() is broken, for instance always returning the same number, then all entries would end up in the same bucket and performance would be poor.
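For example, a (hypothetical) key class like the following would funnel every entry into a single bucket:

// Hypothetical key with a deliberately broken hashCode(): every instance collides.
final class BrokenKey {
    private final String name;

    BrokenKey(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 42; // constant hash -> all entries end up in the same bucket
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BrokenKey && ((BrokenKey) o).name.equals(this.name);
    }
}

With such a key, containsKey() and put() degrade towards O(n), or O(log n) once the bucket is treeified and the key type is also Comparable.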
Collisions, situations when different objects produce hashes that map to the same bucket, are inevitable, and we need a good hash function to keep the number of collisions as small as possible, so that buckets are occupied "evenly" and the map gets expanded when the number of entries exceeds the capacity multiplied by the load factor.
The time complexity of this particular method is linear, O(n). But it might not be the case if you changed the type of the key.
Given your original code below:
private static Map<Character, Character> mapCharacters(String message) {
    Map<Character, Character> map = new HashMap<>();
    char idx = 'a';
    for (char ch : message.toCharArray()) {
        if (!map.containsKey(ch)) {
            map.put(ch, idx);
            ++idx;
        }
    }
    return map;
}
Here are some observations about the time complexity:
Iterating each character in message will be O(n) – pretty clearly, if there are 10 characters it takes 10 steps, and 10,000 characters will take 10,000 steps. The time complexity is directly proportional to n.
Inside the loop, you're checking if the map contains that character already – using containsKey(), this is a constant-time operation which is defined as O(1) time complexity. The idea is that the actual time to evaluate the result of containsKey() should be the same whether your map has 10 keys or 10,000 keys.
Inside the if, you have map.put(), which is another constant-time, O(1), operation.
Combined together: for each of the n characters, you spend time iterating over it in the loop, checking whether the map has it already, and adding it; all of this is roughly 3n (three times n), which is just O(n).
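If you want to convince yourself empirically, a rough check (admittedly unscientific, given JIT warm-up and GC, and assuming this main lives in the same class as mapCharacters; String.repeat needs Java 11+) is to time the method for growing inputs and observe that doubling n roughly doubles the elapsed time:

public static void main(String[] args) {
    for (int n = 1_000_000; n <= 8_000_000; n *= 2) {
        String message = "abcdefghij".repeat(n / 10); // build an n-character input
        long start = System.nanoTime();
        mapCharacters(message);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("n=" + n + " took " + elapsedMs + " ms");
    }
}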
Separately, you could make a small improvement to your code by replacing this block:
if (!map.containsKey(ch)) {
    map.put(ch, idx);
    ++idx;
}
with something like the code below, which retains the same behavior of running ++idx only if you've actually put something - specifically, if the key was previously absent or had been mapped to null (though the "mapped to null" case doesn't arise in your code sample):
Character previousValue = map.putIfAbsent(ch, idx);
if (previousValue == null) {
    ++idx;
}
Though the time complexity wouldn't change, using putIfAbsent() makes the code clearer, and also takes advantage of well-tested, performant code that's available for free.
Under the covers, Map.putIfAbsent() has a default implementation like the one below (as of OpenJDK 17), showing an initial call to get() followed by put() if the key isn't present.
default V putIfAbsent(K key, V value) {
    V v = get(key);
    if (v == null) {
        v = put(key, value);
    }
    return v;
}
I have a TreeMap, and want to get a set of K smallest keys (entries) larger than a given value.
I know we can use higherKey(givenValue) to get that one key, but then how should I iterate from there?
One possible way is to get a tailMap starting from the smallest key larger than the given value, but that's overkill if K is small compared with the map size.
Is there a better way with O(logn + K) time complexity?
The error here is thinking that tailMap makes a new map. It doesn't. It gives you a lightweight view object that basically just contains a field pointing back to the original map plus the pivot point - that's all it has. Any changes you make to the tailMap will therefore also affect the underlying map, and any changes made to the underlying map will also affect your tailMap.
for (KeyType key : treeMap.tailMap(pivot).keySet()) {
}
The above IS O(log n) for the tailMap operation, and given that you loop, that adds K to the total, for a grand total of O(log n + K) time complexity and O(1) extra space complexity, where n is the size of the map and K is the number of keys that end up on the selected side of the pivot point (so, worst case, O(n)).
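So, to answer the original question directly, a small sketch (assuming Integer keys, java.util imports, and a made-up helper name) that collects the K smallest keys strictly greater than the pivot and stops early would look like this:

static List<Integer> kSmallestAbove(TreeMap<Integer, String> map, int pivot, int k) {
    List<Integer> result = new ArrayList<>(k);
    // tailMap(pivot, false) is the view of keys strictly greater than pivot: O(log n) to position
    for (Integer key : map.tailMap(pivot, false).keySet()) {
        if (result.size() == k) {
            break; // stop after k keys, so the iteration only costs O(K)
        }
        result.add(key);
    }
    return result;
}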
If you want an actual copied-over map, you'd do something like:
TreeMap<KeyType, ValueType> tm = new TreeMap<>(original.tailMap(pivot));
Note that this also copies over the comparator used, in case you specified a custom one. That one really is O(log n + K) in time complexity (and O(K) in space complexity). Of course, if you then loop through this copy, it's still O(log n + K), because 2K just boils down to K when talking about big-O notation.
As per the following linked document: Java HashMap Implementation
I'm confused with the implementation of HashMap (or rather, an enhancement in HashMap). My queries are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I want some clear examples for this.
How do they achieve a performance gain with this?
Secondly
If you see the source code of HashMap in JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;
    boolean red;

    TreeNode(int arg0, K arg1, V arg2, HashMap.Node<K, V> arg3) {
        super(arg0, arg1, arg2, arg3);
    }

    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode arg0 = this;
        while (true) {
            HashMap.TreeNode arg1 = arg0.parent;
            if (arg0.parent == null) {
                return arg0;
            }
            arg0 = arg1;
        }
    }

    //...
}
How is it used? I just want an explanation of the algorithm.
HashMap contains a certain number of buckets. It uses the key's hashCode to determine which bucket to put an entry into. For simplicity's sake, imagine it as a modulus.
If our hash code is 123456 and we have 4 buckets, then 123456 % 4 = 0, so the item goes into the first bucket (index 0).
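As a toy illustration of that arithmetic (the real HashMap uses a bit mask rather than %, as discussed further down):

int hashCode = 123456;
int buckets = 4;
int index = hashCode % buckets; // 123456 % 4 == 0, so this entry lands in the first bucket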
If our hashCode function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.
This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
If entries are removed from the map, the number of entries in the bucket might reduce such that this tree structure is no longer necessary. That's what the UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops to six or fewer, we might as well go back to using a linked list.
Finally, there is the MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
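Paraphrased (this is a simplified sketch, not a verbatim quote of the JDK source), the check inside HashMap.treeifyBin looks roughly like this:

final void treeifyBin(Node<K,V>[] tab, int hash) {
    // If the table is still small, grow it instead of treeifying the bucket;
    // resizing spreads the entries over more buckets.
    if (tab == null || tab.length < MIN_TREEIFY_CAPACITY)
        resize();
    else {
        // ... otherwise the nodes in the bucket for this hash are converted to TreeNodes
    }
}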
To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.
It is designed to protect against bad hashCode implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.
To put it more simply (as simply as I could), plus some more details.
These properties depend on a lot of internal things that would be very useful to understand before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets is at least MIN_TREEIFY_CAPACITY), it is transformed into a balanced red-black tree node. Why? Because of search speed. Think about it in a different way:
it would take at most around 32 steps to search for an Entry within a bucket/bin holding Integer.MAX_VALUE entries, since log2(Integer.MAX_VALUE) is about 31.
Some intro for the next topic: why is the number of bins/buckets always a power of two? At least two reasons: a bit mask is faster than the modulo operation, and modulo on negative numbers would be negative. You can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % numberOfBuckets; // may be negative when hashCode is negative
buckets[arrayIndex] = entry;                 // a negative index will obviously fail
So instead of modulo there is a nice trick used:
(n - 1) & hash // n is the number of bins, hash is the hash of the key
That is semantically the same as the modulo operation when n is a power of two: it keeps only the lower bits. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above, the decision of where an entry goes is made based on the last 4 bits only of your hash code, because a newly created HashMap has a default capacity of 16 buckets.
This is where doubling the number of buckets comes into play. Under certain conditions (it would take a lot of time to explain the exact details), the bucket array is doubled in size. Why? Because when the buckets are doubled, there is one more bit coming into play.
So you have 16 buckets - the last 4 bits of the hash code decide where an entry goes. You double the buckets: 32 buckets - the last 5 bits decide where an entry will go.
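A small illustration of that extra bit (the hash value here is just made up for the example):

int hash = 0b1011_0110;         // example hash value, 182 in decimal
int index16 = (16 - 1) & hash;  // keeps the low 4 bits -> 0b0110  = bucket 6
int index32 = (32 - 1) & hash;  // keeps the low 5 bits -> 0b10110 = bucket 22 (6 + 16)

So after doubling, an entry either stays in its old bucket (the new bit is 0) or moves to old index + old capacity (the new bit is 1); that is how entries get redistributed during re-hashing.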
This process is called re-hashing, and it might get slow. That is why (for people who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from these bins to others (one more bit is added to the (n-1) & hash computation, and as such entries might move to other buckets) and a bin might shrink to UNTREEIFY_THRESHOLD entries or fewer. At this point it does not pay off to keep the bin as a red-black tree node; it is turned back into a LinkedList instead, like
entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a Tree.
TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passes a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.
From the implementation:
/*
 * Implementation notes.
 *
 * This map usually acts as a binned (bucketed) hash table, but
 * when bins get too large, they are transformed into bins of
 * TreeNodes, each structured similarly to those in
 * java.util.TreeMap. Most methods try to use normal bins, but
 * relay to TreeNode methods when applicable (simply by checking
 * instanceof a node). Bins of TreeNodes may be traversed and
 * used like any others, but additionally support faster lookup
 * when overpopulated. However, since the vast majority of bins in
 * normal use are not overpopulated, checking for existence of
 * tree bins may be delayed in the course of table methods.
 * ...
 */
You would need to visualize it. Say there is a class Key with only the hashCode() method overridden to always return the same value:
public class Key implements Comparable<Key> {

    private String name;

    public Key(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 1;
    }

    public String keyName() {
        return this.name;
    }

    public int compareTo(Key key) {
        return this.name.compareTo(key.name); // e.g. compare by name: returns a +ve, -ve or zero integer
    }
}
and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class. e.g.
Map<Key, String> map = new HashMap<>();
Key key1 = new Key("key1");
map.put(key1, "one");
Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since hashCode is the same, all entries will land in the same bucket, let's call it bucket 1.
// Up to here, all entries in bucket 1 are arranged in a linked-list structure, e.g. key1 -> key2 -> key3 -> ... and so on.
// But when I insert one more entry:
Key key9 = new Key("key9");
map.put(key9, "nine");
the threshold value of 8 will be reached and it will rearrange the bucket 1 entries into a tree (red-black) structure, replacing the old linked list, e.g.
          key1
         /    \
     key2      key3
    /    \    /    \
Tree traversal is faster {O(log n)} than LinkedList {O(n)} and as n grows, the difference becomes more significant.
The change in the HashMap implementation was added with JEP-180. The purpose was to:
Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class
However, pure performance is not the only gain. It also mitigates the HashDoS attack, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has O(log n) worst-case insertion complexity. The tree is used after certain criteria are met - see Eugene's answer.
To understand the internal implementation of HashMap, you need to understand hashing.
Hashing, in its simplest form, is a way of assigning a unique code to any variable/object by applying a formula/algorithm to its properties.
A true hash function must follow this rule –
“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
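For example, here is the contract in action with two distinct but equal String objects (a small sketch):

String a = new String("hash");
String b = new String("hash");
// Two equal objects must consistently produce the same hash code.
System.out.println(a.equals(b));                  // true
System.out.println(a.hashCode() == b.hashCode()); // true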
How does HashMap identify that a bucket is full and that it needs rehashing? It stores values in a linked list if two hash codes are the same, and as per my understanding this linked list does not have any fixed size; it can store as many elements as needed. So the bucket will never be "full" - then how does the map identify that it needs rehashing?
In a ConcurrentHashMap, a red-black tree (for a large number of elements) or a linked list (for a small number of elements) is actually used when there is a collision (i.e. two different keys have the same hash code). But you are right when you say that the linked list (or red-black tree) can grow indefinitely (assuming you have infinite memory and heap size).
The basic idea of a HashMap or ConcurrentHashMap is that you want to retrieve a value based on its key in O(1) time complexity. But in reality collisions do happen, and when they do we put the nodes in a list or tree linked to the bucket (the array cell). So Java could create a HashMap where the array size remained fixed and rehashing never happened, but if it did that, all your key-value pairs would need to be accommodated within the fixed-size array (along with their linked lists/trees).
Let's say you have that kind of HashMap where the array size is fixed to 16 and you push 1000 key-value pairs into it. In that case, you can have at most 16 distinct bucket indexes. This in turn means that you would have collisions in (1000 - 16) of the puts, and those new nodes end up in the lists/trees and can no longer be fetched in O(1) time. In a tree you'd need O(log n) time to search for a key.
To make sure this doesn't happen, the HashMap uses a load-factor calculation to determine how much of the array is filled with key-value pairs. If it is 75% full (the default setting), any new put creates a new, larger array, copies the existing content into it, and thus provides more buckets, i.e. a larger hash-code space. This ensures that in most cases collisions won't happen, the tree won't be required, and you will fetch most keys in O(1) time.
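With the default settings this works out as a simple calculation (a sketch of the arithmetic, not the JDK code):

int capacity = 16;                              // default initial capacity
float loadFactor = 0.75f;                       // default load factor
int threshold = (int) (capacity * loadFactor);  // 12
// Adding the 13th entry pushes the size past the threshold, so the table is doubled
// to 32 buckets and the existing entries are redistributed over the new buckets.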
HashMap maintains a complexity of O(1) while inserting data into and getting data from the map, but for the 13th key-value pair the put request will no longer be just O(1), because as soon as the map realizes that the 13th element has come in, 75% of the map is filled.
It will first double the bucket (array) capacity and then it will go for a rehash. Rehashing re-computes the bucket index of the 12 already placed key-value pairs (from their cached hashes) and puts them at their new indexes, which requires time.
Kindly refer to this link; it will be helpful for you.
I have read in a book that if a hash function returns a unique hash value for each distinct object, it's most efficient. If the hashCode() method in a class gives a unique hash value for each distinct object and I want to store n distinct instances of that class in a HashMap, then there will be n buckets for storing the n instances, so the time complexity will be O(n). Then how does a single entry (instance) per hash value yield better performance? Is it related to the data structure of the bucket?
You seem to think that with n buckets for n elements the time complexity will be O(n), which is wrong.
How about a different example: suppose you have an ArrayList with n elements. How much time will it take to perform a get(index)? O(1), right?
Now think about the HashMap: the index in the ArrayList example is, loosely, what the hash code is for the map. When we insert into a HashMap, to find the location of where that element goes (the bucket), we use the hash code (the index). If there is one entry per bucket, the search time for a value from the map is O(1).
But even if there are multiple values in a single bucket, the general search complexity for a HashMap is still O(1).
The data structure of the bucket is important too, for the worst-case scenarios for example. Under the current implementation a HashMap uses two node types: Node (a linked-list node) and TreeNode, depending on a few things such as how many entries there are in a bucket at that point in time. The linked version is easy:
next.next.next...
TreeNode is
         node
        /    \
    left      right
It is a red-black tree. In such a data structure the search complexity is O(log n), which is much better than O(n).
A Java HashMap associates a Key k with a Value v. Every Java Object has a method hashCode() that produces an integer which is not necessarily unique.
I have read in a book that if a hash function returns a unique hash value for each distinct object, it's most efficient.
Another definition would be that the best Hash Function is one that produces the least collisions.
If the hashCode() method in a class gives a unique hash value for each distinct object and I want to store n distinct instances of that class in a HashMap, then there will be n buckets for storing the n instances, so the time complexity will be O(n).
In our case HashMap contains a table of buckets of a certain size, let's say >= n for our purposes. It uses the hashCode of the object as a key and through a Hash Function returns an index to the table. If we have n objects and the Hash Function returns n different indexes we have zero collisions. This is the optimal case and the complexity to find and get any object is O(1).
Now, if the hash function returns the same index for 2 different keys (objects), then we have a collision, and the table bucket at that index already contains a value. In that case, the new entry is chained to the existing one, so a list is created at the index where the collision happened. So the worst-case complexity will be O(m), where m is the size of the longest list.
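To make the chaining idea concrete, here is a toy sketch of such a table (all names are made up; this is nothing like the real, far more sophisticated HashMap, and it skips null keys, resizing and treeification):

final class ToyHashMap<K, V> {

    private static final int CAPACITY = 16; // fixed number of buckets for the sketch

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next; // next entry in the same bucket's chain
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[CAPACITY];

    void put(K key, V value) {
        int index = (CAPACITY - 1) & key.hashCode(); // pick the bucket
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                e.value = value; // key already present: overwrite
                return;
            }
        }
        table[index] = new Entry<>(key, value, table[index]); // chain the new entry at the head
    }

    V get(K key) {
        int index = (CAPACITY - 1) & key.hashCode();
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value; // walking the chain is the O(m) part
            }
        }
        return null;
    }
}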
In conclusion the performance of the HashMap depends on the number of collisions. The fewer the better.
I believe this video will help you.