What is the order of finding the correct bucket in a Java HashMap?
In a HashMap, the bucket is located first using the hashCode method, and then we iterate over it using the equals method. My question is about the first part: what is the complexity of finding the bucket in which the desired key is present?
Looking up the bucket is O(1). HashMap just computes the hash code and uses it to index into the array of bucket slots.
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
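For intuition, here is a minimal sketch (not the real JDK source; the class and method names are made up) of how a HashMap-style table can turn a key's hash code into a bucket index, assuming a power-of-two table length. OpenJDK's HashMap does something very similar, spreading the high bits and then masking.

```java
// Minimal sketch (not the real JDK source) of mapping a key's hash code to a
// bucket index, assuming a power-of-two table length as java.util.HashMap uses.
public class BucketIndexSketch {

    static int bucketIndex(Object key, int tableLength) {
        int h = key.hashCode();
        h ^= (h >>> 16);              // spread high bits into the low bits
        return h & (tableLength - 1); // mask down to an index in [0, tableLength - 1]
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("someKey", 16)); // always a value between 0 and 15
    }
}
```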
How does a HashMap identify that a bucket is full and that it needs rehashing? It stores values in a linked list when two hash codes are the same, and as per my understanding this linked list does not have any fixed size and can store as many elements as needed. So a bucket will never be "full"; how then does the map identify that it needs rehashing?
In a ConcurrentHashMap (and likewise in a plain HashMap), a linked list (for a small number of elements) or a red-black tree (for a large number of elements) is actually used when there is a collision, i.e. two different keys have the same hash code. But you are right: the linked list (or red-black tree) can grow without bound (assuming you have unlimited memory and heap size).
The basic idea of a HashMap or ConcurrentHashMap is that you want to retrieve a value based on its key in O(1) time. But in reality collisions do happen, and when that happens we chain the nodes in a list or tree hanging off the bucket (the array cell). So Java could create a HashMap where the array size remained fixed and rehashing never happened, but then all your key-value pairs would have to be accommodated within that fixed-size array (along with their chains).
Let's say you have that kind of HashMap where the array size is fixed at 16 and you push 1000 key-value pairs into it. In that case you can have at most 16 distinct buckets, which means at least (1000 - 16) of the puts collide; those nodes end up in the lists or trees and can no longer be fetched in O(1) time. In a tree you'd need O(log n) time to search for a key.
To make sure that this doesn't happen, the HashMap uses a load factor calculation to determine how much of the array is filled with key-value pairs. If it is 75% full (the default setting), the next put creates a new, larger array, copies the existing content into it, and thus provides more buckets (more hash-code space). This ensures that in most cases collisions won't happen, the tree won't be required, and you will fetch most keys in O(1) time.
A HashMap maintains O(1) complexity while inserting data into and getting data from the map, but for the 13th key-value pair the put request will no longer be O(1): as soon as the map realizes that the 13th element has come in, 75% of the map is filled. It will first double the bucket (array) capacity and then rehash. Rehashing requires recomputing the bucket index of the 12 already-placed key-value pairs and putting them at their new positions, which takes time.
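To make those numbers concrete, here is a small demo (the class name is invented for illustration) of the default threshold arithmetic: with capacity 16 and load factor 0.75 the threshold is 12, so the 13th put is the one that triggers the resize.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the threshold arithmetic described above. The class name is made up;
// the capacity and load-factor values are HashMap's documented defaults.
public class ResizeThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;        // default initial capacity
        float loadFactor = 0.75f; // default load factor
        int threshold = (int) (capacity * loadFactor);
        System.out.println("Resize is triggered once size exceeds " + threshold); // 12

        Map<String, Integer> map = new HashMap<>(capacity, loadFactor);
        for (int i = 1; i <= threshold + 1; i++) {
            map.put("key" + i, i); // the 13th put pushes size past the threshold,
                                   // doubling the bucket array and redistributing entries
        }
    }
}
```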
Kindly refer to this link; it will be helpful for you.
From what I understand, the time complexity to iterate through a Hash table with capacity "m" and number of entries "n" is O(n+m). I was wondering, intuitively, why this is the case? For instance, why isn't it n*m?
Thanks in advance!
You are absolutely correct. Iterating a HashMap is an O(n + m) operation, with n being the number of elements contained in the HashMap and m being its capacity. Actually, this is clearly stated in the docs:
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings).
Intuitively (and conceptually), this is because a HashMap consists of an array of buckets, with each element of the array pointing to either nothing (i.e. to null), or to a list of entries.
So if the array of buckets has size m, and if there are n entries in the map in total (I mean, n entries scattered throughout all the lists hanging from some bucket), then, iterating the HashMap is done by visiting each bucket, and, for buckets that have a list with entries, visiting each entry in the list. As there are m buckets and n elements in total, iteration is O(m + n).
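As a purely conceptual sketch (this is not HashMap's real internals, just an illustration of the counting argument), iteration has to visit every bucket and every entry hanging off some bucket:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: iteration visits every bucket (m of them) and every
// entry hanging off some bucket (n in total), which is where O(m + n) comes from.
public class BucketIterationSketch {
    public static void main(String[] args) {
        // m = 4 buckets, n = 3 entries; the two null buckets still cost a visit each.
        List<List<String>> buckets = Arrays.asList(
                Arrays.asList("a", "b"), null, Arrays.asList("c"), null);

        for (List<String> bucket : buckets) {   // m bucket visits
            if (bucket == null) continue;       // empty bucket: one step, nothing to yield
            for (String entry : bucket) {       // n entry visits across all buckets
                System.out.println(entry);
            }
        }
    }
}
```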
Note that this is not the case for all hash table implementations. For example, LinkedHashMap is like a HashMap, except that it also has all its entries connected in a doubly-linked list (to preserve either insertion or access order). If you iterate a LinkedHashMap, there's no need to visit each bucket: it is enough to visit the very first entry, follow its link to the next entry, and so on until the last entry. Thus, iterating a LinkedHashMap is just O(n), with n being the total number of entries.
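For comparison, a tiny demo (class name invented) of that LinkedHashMap behaviour; it simply follows the entry links and the output comes back in insertion order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LinkedHashMap threads its entries on a doubly-linked list, so iteration just
// follows the links (O(n)) and comes back in insertion order.
public class LinkedIterationDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("first", 1);
        map.put("second", 2);
        map.put("third", 3);
        map.forEach((k, v) -> System.out.println(k + " = " + v)); // first, second, third
    }
}
```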
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings)
n = the number of buckets
m = the number of key-value mappings
Iterating a HashMap is O(n+m) because in the worst case a single array element contains the whole linked list of all m mappings, which could occur due to a flawed hashCode implementation in the key class.
Visualise the worst-case scenario:
To iterate in this scenario, Java first needs to walk the complete array, which is O(n), and then walk the linked list, which is O(m); combining these gives O(n+m).
I have read in a book that a hash function is most efficient if it returns a unique hash value for each distinct object. If the hashCode() method in a class gives a unique hash value for each distinct object and I want to store n distinct instances of that class in a HashMap, then there will be n buckets for storing the n instances, so the time complexity will be O(n). How then does a single entry (instance) per hash value yield better performance? Is it related to the data structure of the bucket?
You seem to think that with n buckets for n elements the time complexity will be O(n), which is wrong.
How about a different example: suppose you have an ArrayList with n elements; how much time does a get(index) take? O(1), right?
Now think about the HashMap: the index in the ArrayList example plays the role of the hash code in the map. When we insert into a HashMap, we use the hash code (as an index) to find the location where that element goes (the bucket). If there is one entry per bucket, the search time for a value in the map is O(1).
But even if there are multiple values in a single bucket, the general search complexity for a HashMap is still O(1).
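A small demo of the analogy (class name made up): both lookups jump straight to a slot, one via an index, the other via a hash code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Both lookups jump straight to a slot: ArrayList via an index, HashMap via the
// bucket index computed from the key's hash code.
public class ConstantTimeLookupDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(List.of("a", "b", "c"));
        System.out.println(list.get(1));  // O(1): direct index into the backing array

        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        System.out.println(map.get("b")); // O(1) on average: hash -> bucket -> equals
    }
}
```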
The data structure of the bucket is important too, for the worst-case scenarios. Under the current implementation a HashMap uses two node types, Node (a plain linked-list node) and TreeNode, depending on, among other things, how many entries are sitting in a bucket at that point in time. The linked form is traversed simply:

next.next.next...

A TreeNode lives in a red-black tree:

        node
       /    \
    left    right

In such a data structure the search complexity is O(log n), which is much better than O(n).
A Java HashMap associates a Key k with a Value v. Every Java Object has a method hashCode() that produces an integer which is not necessarily unique.
I have read in a book that a hash function is most efficient if it returns a unique hash value for each distinct object.
Another definition would be that the best hash function is the one that produces the fewest collisions.
If the hashCode() method in a class gives a unique hash value for each distinct object and I want to store n distinct instances of that class in a HashMap, then there will be n buckets for storing the n instances. The time complexity will be O(n).
In our case the HashMap contains a table of buckets of a certain size, let's say >= n for our purposes. It uses the hashCode of the object as the key and, through a hash function, maps it to an index into the table. If we have n objects and the hash function returns n different indices, we have zero collisions. This is the optimal case, and the complexity to find and get any object is O(1).
Now, if the hash function returns the same index for two different keys (objects), we have a collision: the table bucket at that index already contains a value. In that case the bucket will point to another newly allocated node, and in this way a list is built at the index where the collision happened. So the worst-case complexity will be O(m), where m is the size of the biggest list.
In conclusion the performance of the HashMap depends on the number of collisions. The fewer the better.
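To see this in code, here is a contrived example (BadKey is a hypothetical class with a deliberately constant hashCode) where every key collides, so lookups degrade toward O(m) on the chain, or O(log m) once the bucket is treeified, yet equals still keeps the results correct:

```java
import java.util.HashMap;
import java.util.Map;

// BadKey is a hypothetical class whose hashCode is deliberately constant, so every
// key collides into the same bucket. Lookups still work (equals tells the keys
// apart) but cost O(m) on the chain, or O(log m) once the bucket is treeified.
public class CollisionDemo {
    static final class BadKey {
        private final String name;
        BadKey(String name) { this.name = name; }
        @Override public int hashCode() { return 42; } // every instance collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(new BadKey("key" + i), i);
        }
        System.out.println(map.get(new BadKey("key500"))); // 500, just found more slowly
    }
}
```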
I believe this video will help you.
I'm looking at this website that lists Big O complexities for various operations. For Dynamic Arrays, the removal complexity is O(n), while for Hash Tables it's O(1).
For removal from Dynamic Arrays like ArrayLists to be O(n), it must mean removing some value from the middle and then shifting each subsequent element over by one to keep the block of data contiguous, because if we were just deleting the value stored at index k and not shifting, it would be O(1).
But in Hash Tables with linear probing, deletion is the same thing: you just run your value through the hash function, go to the dynamic array holding your data, and delete the value stored there.
So why do Hash Tables get O(1) credit while Dynamic Arrays get O(n)?
This is explained here. The key is that the number of values per Dynamic Array is kept under a constant value.
Edit: As Dukeling pointed out, my answer explains why a Hash Table with separate chaining has O(1) removal complexity. I should add that, on the website you were looking at, Hash Tables are credited with O(1) removal complexity because they analyse a Hash Table with separate chaining and not linear probing.
The point of hash tables is that they keep close to the best case, where the best case means a single entry per bucket. Clearly, you have no trouble accepting that to remove the sole entry from a bucket takes O(1) time.
When there are many hash conflicts, you certainly need to do a lot of shifting when using linear probing.
But the stated complexities for hash tables are under the assumption of simple uniform hashing, meaning that a minimal number of hash conflicts is assumed.
When that assumption holds, we only need to delete some value and shift either no values or a small (essentially constant) number of values.
When you talk about the complexity of an algorithm, you actually need to discuss a concrete implementation.
There is no Java class called a "Hash Table" (obviously!) or "HashTable".
There are Java classes called HashMap and Hashtable, and these do indeed have O(1) deletion.
But they don't work the way that you seem to think (all?) hash tables work. Specifically, HashMap and Hashtable are organized as an array of pointers to "chains".
This means that deletion consists of finding the appropriate chain, and then traversing the chain to find the entry to remove. The first step is constant time (including the time to calculate the hash code). The second step is proportional to the length of the hash chain. But assuming that the hash function is good, the average length of a hash chain is a small constant. Hence the total time for deletion is O(1) on average.
The reason that the hash chains are short on average is that the HashMap and Hashtable classes automatically resize the main hash array when the "load factor" (the ratio of the number of entries to the array size) exceeds a predetermined value. Assuming that the hash function distributes the (actual) keys fairly evenly, you will find that the chains are roughly the same length. Assuming that the array size is proportional to the total number of entries, the actual load factor will be roughly the average hash chain length.
This reasoning breaks down if the hash function does not distribute the keys evenly. This leads to a situation where you get a lot of hash collisions. Indeed, the worst-case behaviour is when all keys have the same hash value, and they all end up on a single hash chain with all N entries. In that case, deletion involves searching a chain with N entries ... and that makes it O(N).
It turns out that the same reasoning can be applied to other forms of hash table, including those where the entries are stored in the hash array itself and collisions are handled by probing (scanning for another slot). (Once again, the "trick" is to expand the hash table when the load factor gets too high.)
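A rough sketch of the chained case under those assumptions (this is illustrative code, not HashMap's or Hashtable's actual implementation):

```java
import java.util.LinkedList;

// Illustrative code only: deletion in a chained hash table finds the bucket in
// constant time and then walks that bucket's chain, which stays short on average
// when the hash function spreads keys well and the table is resized as it fills.
public class ChainedDeleteSketch {
    static <K> boolean remove(LinkedList<K>[] table, K key) {
        int index = (key.hashCode() & 0x7fffffff) % table.length; // O(1) bucket lookup
        LinkedList<K> chain = table[index];
        return chain != null && chain.remove(key);                // walk the (short) chain
    }

    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        LinkedList<String>[] table = new LinkedList[8];
        int i = ("hello".hashCode() & 0x7fffffff) % table.length;
        table[i] = new LinkedList<>();
        table[i].add("hello");
        System.out.println(remove(table, "hello")); // true
    }
}
```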
I am trying to implement a plane sweep algorithm and for this I need to know the time complexity of java.util.HashMap class' keySet() method. I suspect that it is O(n log n). Am I correct?
Point of clarification: I am talking about the time complexity of the keySet() method; iterating through the returned Set will take obviously O(n) time.
Getting the key set is O(1) and cheap. This is because HashMap.keySet() returns the actual KeySet view object associated with the HashMap.
The returned Set is not a copy of the keys, but a wrapper for the actual HashMap's state. Indeed, if you update the set you can actually change the HashMap's state; e.g. calling clear() on the set will clear the HashMap!
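A quick demo of that view behaviour (class name invented):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// keySet() hands back a view over the map's own state, so mutating the set
// mutates the map (and vice versa).
public class KeySetViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        Set<String> keys = map.keySet();   // O(1): no copying, just a wrapper
        keys.remove("a");                  // removes the mapping from the map too
        System.out.println(map);           // {b=2}

        keys.clear();                      // clears the map as well
        System.out.println(map.isEmpty()); // true
    }
}
```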
... iterating through the returned Set will take obviously O(n) time.
Actually that is not always true:
It is true for a HashMap created using new HashMap<>(). The worst case is that all N keys land in the same hash chain. However, if the map has grown naturally, there will still be N entries and O(N) slots in the hash array, so iterating the entry set involves O(N) operations.
It is false if the HashMap is created with new HashMap<>(capacity) and a singularly bad (too large) capacity estimate. Then it will take O(Cap) + O(N) operations to iterate the entry set. If we treat Cap as a variable, that is O(max(Cap, N)), which could be worse than O(N).
There is an escape clause though. Since capacity is an int in the current HashMap API, the upper bound for Cap is 2^31. So for really large values of Cap and N, the complexity is O(N).
On the other hand, N is limited by the amount of memory available and in practice you need a heap in the order of 2^38 bytes (256 GBytes) for N to exceed the largest possible Cap value. And for a map that size, you would be better off using a hashtable implementation tuned for huge maps. Or not using an excessively large capacity estimate!
Surely it would be O(1). All that it is doing is returning a wrapper object on the HashMap.
If you are talking about walking over the keyset, then this is O(n), since each next() call is O(1), and this needs to be performed n times.
This should be doable in O(n) time... A hash map is usually implemented as a large bucket array whose size is (usually) directly proportional to the size of the hash map. In order to retrieve the key set, the bucket array must be iterated through, and for each set item the key must be retrieved (either through an intermediate collection or via an iterator with direct access to the buckets)...
**EDIT: As others have pointed out, the actual keySet() method will run in O(1) time; however, iterating over the key set or transferring it to a dedicated collection will be an O(n) operation. Not quite sure which one you are looking for.**
Java collections trade extra space for time, so operations like this don't take long. That method is, I believe, O(1): the key-set view is just sitting there, waiting to be returned.
To address the "iterating through the returned Set will take obviously O(n) time" comment, this is not actually correct per the doc comments of HashMap:
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
So in other words, iterating over the returned Set will take O(n + c) where n is the size of the map and c is its capacity, not O(n). If an inappropriately sized initial capacity or load factor were chosen, the value of c could outweigh the actual size of the map in terms of iteration time.
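For illustration, a small sketch (the class name and the oversized capacity are arbitrary choices) of why the capacity term matters for iteration:

```java
import java.util.HashMap;
import java.util.Map;

// Both maps hold only three entries (n = 3), but the second is created with an
// oversized capacity, so iterating it still has to walk its mostly empty buckets:
// O(n + c) rather than O(n).
public class CapacityIterationDemo {
    static void fillAndIterate(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.forEach((k, v) -> System.out.println(k + "=" + v)); // visits buckets + entries
    }

    public static void main(String[] args) {
        fillAndIterate(new HashMap<>());        // default: 16 buckets
        fillAndIterate(new HashMap<>(1 << 20)); // ~1M buckets for the same 3 entries
    }
}
```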