How does a HashMap identify that a bucket is full and that it needs rehashing? It stores values in a linked list when two hash codes are the same, and as per my understanding this linked list does not have any fixed size and can store as many elements as needed. So the bucket will never be "full"; how, then, does the map identify that it needs rehashing?
In a ConcurrentHashMap, a red-black tree (for a large number of elements) or a linked list (for a small number of elements) is actually used when there is a collision (i.e. two different keys end up in the same bucket). But you are right when you say that the linked list (or red-black tree) can grow indefinitely (assuming you have infinite memory and heap size).
The basic idea of a HashMap or ConcurrentHashMap is that you want to retrieve a value by its key in O(1) time complexity. But in reality collisions do happen, and when they do we put the nodes in a list or tree linked to the bucket (the array cell). So Java could have created a HashMap where the array size remained fixed and rehashing never happened, but then all your key-value pairs would have to be accommodated within that fixed-size array (along with their linked lists or trees).
Let's say you have that kind of HashMap, with the array size fixed at 16, and you push 1000 key-value pairs into it. In that case you can have at most 16 distinct bucket indexes, which means you would have collisions on (1000 - 16) of the puts. Those new nodes end up in the trees and can no longer be fetched in O(1) time; in a tree you need O(log n) time to search for a key.
To make sure that this doesn't happen, the HashMap uses a load factor calculation to determine how much of the array is filled with key-value pairs. If it is 75% full (the default setting), the next put creates a new, larger array, copies the existing content into it, and thus provides more buckets (more hash-code space). This ensures that in most cases collisions won't happen, the tree won't be required, and you will fetch most keys in O(1) time.
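As a rough, illustrative sketch (not the actual java.util.HashMap source; the names here are made up), the decision boils down to comparing the current number of mappings against a threshold derived from the capacity and the load factor:

    class ResizeSketch {
        int capacity = 16;                             // number of buckets
        final float loadFactor = 0.75f;                // default load factor
        int threshold = (int) (capacity * loadFactor); // 12 with the defaults
        int size = 0;                                  // key-value mappings stored so far

        void put(Object key, Object value) {
            // ...hash the key and place the entry into its bucket (omitted)...
            size++;
            if (size > threshold) {                    // the table is "75% full"
                capacity = capacity * 2;               // double the number of buckets
                threshold = (int) (capacity * loadFactor);
                // ...redistribute (rehash) all existing entries (omitted)...
            }
        }
    }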
A HashMap maintains O(1) complexity while inserting data into and getting data from it, but for the 13th key-value pair (with the default capacity of 16 and load factor of 0.75) the put request will no longer be O(1), because as soon as the map realizes that the 13th element has come in, 75% of the map is filled.
It will first double the bucket (array) capacity and then go for a rehash. Rehashing requires re-computing the bucket indexes of the 12 already placed key-value pairs and putting them at their new indexes, which requires time.
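A sketch of what that rehash step looks like, assuming (as real implementations do) that each node caches its key's hash and that the capacity is a power of two; the Node type and method below are made up purely for illustration:

    class RehashSketch {
        static class Node {
            final int hash;                    // cached hash of the key
            final Object key, value;
            Node next;
            Node(int hash, Object key, Object value, Node next) {
                this.hash = hash; this.key = key; this.value = value; this.next = next;
            }
        }

        static Node[] rehash(Node[] oldTable) {
            Node[] newTable = new Node[oldTable.length * 2];  // double the bucket count
            for (Node bucket : oldTable) {
                Node e = bucket;
                while (e != null) {
                    Node next = e.next;
                    int i = e.hash & (newTable.length - 1);   // recompute the index for the larger array
                    e.next = newTable[i];                     // prepend into its new bucket
                    newTable[i] = e;
                    e = next;
                }
            }
            return newTable;
        }
    }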
Kindly refer to this link; it will be helpful for you.
I am making a custom implementation of a HashMap, without using the HashMap data structure, as part of an assignment. Currently I have the choice of working with two 1D arrays or using a 2D array to store my keys and values. I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (an assignment requirement), but I am assuming it has to be done without the use of containsKey().
Also, when inserting key-value pairs into my arrays, I am confused because it should not be O(1) logically, since there would occasionally be cases where there is a collision and I have to recalculate the index. So why is the assignment requirement for insertion O(1)?
A lot of questions in there, let me give it a try.
I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (an assignment requirement), but I am assuming it has to be done without the use of containsKey().
That actually doesn't make a difference. O(1) means the execution time is independent of the input; it does not mean that only a single operation is used. If your containsKey() and put() implementations are both O(1), then so is a solution that uses each of them exactly once.
Also, when inserting key-value pairs into my arrays, I am confused because it should not be O(1) logically, since there would occasionally be cases where there is a collision and I have to recalculate the index. So why is the assignment requirement for insertion O(1)?
O(1) is the best case, which assumes that there are no hash collisions. The worst case is O(n) if every key generates the same hash code. So when a hash map's lookup or insertion performance is calculated as O(1), that assumes a perfect hashCode implementation.
Finally, when it comes to data structures, the usual approach is to use a single array whose items are linked-list nodes. The array offset corresponds to hashCode() % array size (there are much more advanced formulas than this, but it is a good starting point). In case of a hash collision, you navigate the linked-list nodes until you find the correct entry.
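A minimal sketch of that layout, assuming separate chaining with a fixed number of buckets (no resizing, no null keys); all names here are illustrative and not part of any assignment API:

    class SimpleHashMap<K, V> {
        private static class Entry<K, V> {
            final K key;
            V value;
            Entry<K, V> next;
            Entry(K key, V value, Entry<K, V> next) {
                this.key = key; this.value = value; this.next = next;
            }
        }

        private final Entry<K, V>[] buckets;

        @SuppressWarnings("unchecked")
        SimpleHashMap(int capacity) {
            buckets = (Entry<K, V>[]) new Entry[capacity];
        }

        private int indexFor(K key) {
            // hashCode() % array size, kept non-negative
            return (key.hashCode() & 0x7fffffff) % buckets.length;
        }

        public void put(K key, V value) {
            int i = indexFor(key);
            for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
                if (e.key.equals(key)) { e.value = value; return; } // key already present: overwrite
            }
            buckets[i] = new Entry<>(key, value, buckets[i]);       // otherwise prepend a new node
        }

        public V get(K key) {
            int i = indexFor(key);
            for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
                if (e.key.equals(key)) return e.value;              // found it in the chain
            }
            return null;                                            // no such key
        }
    }

With a reasonable spread of hash codes the chains stay short, which is what keeps get and put close to O(1).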
You're correct in that hash table insertion is not guaranteed to be O(1), due to collisions. If you use the open addressing strategy to deal with collisions, the process of inserting an item takes time proportional to 1/(1-a), where a is the proportion of the table's capacity that has been used. As the table fills up, a approaches 1 and the time to insert grows without bound.
The secret to keeping the time complexity at O(1) is making sure that there's always room in the table. That way a never grows too big. That's why you have to resize the table when it starts to run out of capacity.
Problem: resizing a table with N items takes O(N) time.
Solution: increase the capacity exponentially, for example double it every time you need to resize. This way the table has to be resized very rarely. The cost of the occasional resize operations is "amortized" over a large number of insertions, and that's why people say that hash table insertion has "amortized O(1)" time complexity.
TLDR: Make sure you increase the table capacity when it's getting full, maybe 70-80% utilization. When you increase the capacity, make sure that it's by a constant factor, for example doubling it.
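A sketch of that strategy with linear probing (illustrative only; deletions and null keys are ignored, and the names are made up):

    class ProbingTable<K, V> {
        private Object[] keys = new Object[8];
        private Object[] values = new Object[8];
        private int size = 0;

        public void put(K key, V value) {
            if (size + 1 > keys.length * 3 / 4) resize();  // keep utilization below ~75%
            int i = (key.hashCode() & 0x7fffffff) % keys.length;
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;                 // collision: probe the next slot
            }
            if (keys[i] == null) size++;                   // new key, not an overwrite
            keys[i] = key;
            values[i] = value;
        }

        @SuppressWarnings("unchecked")
        private void resize() {
            Object[] oldKeys = keys, oldValues = values;
            keys = new Object[oldKeys.length * 2];         // double the capacity...
            values = new Object[oldValues.length * 2];
            size = 0;
            for (int k = 0; k < oldKeys.length; k++) {     // ...and re-insert everything: O(N),
                if (oldKeys[k] != null) {                  // but amortized over many O(1) puts
                    put((K) oldKeys[k], (V) oldValues[k]);
                }
            }
        }
    }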
From what I understand, the time complexity to iterate through a hash table with capacity "m" and number of entries "n" is O(n+m). I was wondering, intuitively, why this is the case. For instance, why isn't it n*m?
Thanks in advance!
You are absolutely correct. Iterating a HashMap is an O(n + m) operation, with n being the number of elements contained in the HashMap and m being its capacity. Actually, this is clearly stated in the docs:
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings).
Intuitively (and conceptually), this is because a HashMap consists of an array of buckets, with each element of the array pointing to either nothing (i.e. to null), or to a list of entries.
So if the array of buckets has size m, and if there are n entries in the map in total (I mean, n entries scattered throughout all the lists hanging from some bucket), then, iterating the HashMap is done by visiting each bucket, and, for buckets that have a list with entries, visiting each entry in the list. As there are m buckets and n elements in total, iteration is O(m + n).
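In code, the iteration looks roughly like this (Node being a hypothetical chained-bucket node type, just for illustration):

    class IterationSketch {
        static class Node {
            Object key, value;
            Node next;
        }

        static void iterate(Node[] table) {
            for (Node bucket : table) {                         // m steps, one per bucket
                for (Node e = bucket; e != null; e = e.next) {  // n steps in total across all buckets
                    System.out.println(e.key + " -> " + e.value);
                }
            }
        }
    }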
Note that this is not the case for all hash table implementations. For example, LinkedHashMap is like a HashMap, except that it also has all its entries connected in a doubly-linked list (to preserve either insertion or access order). If you are to iterate a LinkedHashMap, there's no need to visit each bucket. It would be enough to visit the very first entry and then follow its link to the next entry, and so on until the last entry. Thus, iterating a LinkedHashMap is just O(n), with n being the total number of entries.
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings)
n = the number of buckets
m = the number of key-value mappings
The complexity of iterating a HashMap is O(n+m) because the worst-case scenario is that one array element contains the whole linked list, which could occur due to a flawed implementation of the hashCode function of the class being used as a key.
Visualise the worst-case scenario:
To iterate this scenario, Java first needs to iterate the complete array, O(n), and then iterate the linked list, O(m); combining these gives us O(n+m).
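For illustration, here is a hypothetical key class with such a flawed hashCode: every instance hashes to the same value, so all entries pile up in a single bucket.

    class BadKey {
        private final String name;

        BadKey(String name) { this.name = name; }

        @Override
        public int hashCode() { return 42; }        // flawed: the same hash code for every key

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(this.name);
        }
    }

    // Every put below lands in the same bucket of a HashMap:
    // Map<BadKey, Integer> map = new HashMap<>();
    // map.put(new BadKey("a"), 1);
    // map.put(new BadKey("b"), 2);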
To solve a dynamic programming problem I used two approaches to store table entries: one using a multi-dimensional array, e.g. tb[m][n][p][q], and the other using a HashMap, with the indexes from the first approach joined into a string to be used as the key, as in "m,n,p,q". But on one input the first approach completes in 2 minutes while the other takes more than 3 minutes.
If the access time of both a HashMap and an array is asymptotically equal, then why such a big difference in performance?
Like mentioned here:
HashMap uses an array underneath so it can never be faster than using an array correctly.
You are right, an array's and a HashMap's access time is O(1), but this just says it is independent of the input size or the current size of your collection. It doesn't say anything about the actual work that has to be done for each access.
To access an entry of an array, you have to calculate the memory address of your entry. This is easy: the array's base address + (index * size of an element).
To access an entry of a HashMap, you first have to hash the given key (which costs many CPU cycles), then access the slot of the HashMap's internal array using that hash, which holds a list (this depends on implementation details of the HashMap), and finally you have to linearly search the list for the correct entry (those lists are very short most of the time, so the search is treated as O(1)).
So you see, it is more like O(10) for arrays and O(5000) for hash maps; or, more precisely, T(array access) for arrays and T(hashing) + T(array access) + T(linear search) for HashMaps, with T(x) being the actual time of action x.
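A rough illustration of that difference for the dynamic-programming question above (the names and sizes are made up): both lookups are O(1), but the array access is pure address arithmetic, whereas the map access also builds a key string, hashes it and compares keys in the bucket.

    import java.util.HashMap;
    import java.util.Map;

    class MemoComparison {
        static final int M = 10, N = 10, P = 10, Q = 10;
        static final int[][][][] table = new int[M][N][P][Q];
        static final Map<String, Integer> memo = new HashMap<>();

        static int fromArray(int m, int n, int p, int q) {
            return table[m][n][p][q];                     // a few multiplications and additions
        }

        static Integer fromMap(int m, int n, int p, int q) {
            String key = m + "," + n + "," + p + "," + q; // allocate and build the key string
            return memo.get(key);                         // hash the string, then scan the bucket
        }
    }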
When one creates a Map or a List in Java, they both have the same default initial capacity of 10. Their capacity grows as they get new elements. However, the List only grows when the 11th element is added and the Map grows already when the 8th element is added. It happens because the Map has a loadFactor field, which regulates how "saturated" it can be. When the saturation is higher than the loadFactor, the Map grows. The loadFactor is 0.75 by default.
I wonder: why do Lists and Maps have different mechanisms for deciding when to increase their capacity?
Maps have a load factor because the load factor determines their performance. When you lower the load factor, you get better performance (but you pay for it with increased memory usage).
Take HashMap for example. The capacity is the size of the backing array; however, each position of the array may contain multiple entries. The load factor controls how many entries, on average, are stored in a single array position.
On the other hand, in ArrayList each index of the backing array holds a single element, so there's no meaning to a load factor.
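For what it's worth, HashMap lets you pick that trade-off yourself via its (initialCapacity, loadFactor) constructor; the numbers below are arbitrary examples:

    import java.util.HashMap;
    import java.util.Map;

    class LoadFactorExample {
        // resizes only when ~90% "full": smaller backing array, more collisions
        Map<String, Integer> denser = new HashMap<>(16, 0.90f);
        // resizes already when ~50% "full": larger backing array, fewer collisions
        Map<String, Integer> sparser = new HashMap<>(16, 0.50f);
    }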
Maps as such don't have a loadFactor - only implementations based on some kind of hash table have it (e.g. there's no loadFactor on a TreeMap).
Why is that?
A HashMap contains a number of "buckets", and when adding or retrieving an entry you take the key's hash code and calculate which bucket to put it in or retrieve it from. Depending on the quality of the hash implementation, two distinct objects might end up in the same bucket. When this happens, the HashMap starts a linked list that you have to walk through when retrieving the entry.
The HashMap and List differ in some important points:
The capacity of the HashMap does not say how many elements can be stored in it; it is the number of buckets. In theory you could store more than capacity entries in a HashMap.
Too many items ending up in the same bucket is bad for the HashMap's performance. The fewer buckets you have in relation to the number of entries, the more such collisions you will get. Enter the loadFactor: if things get "too tight" and you fear you will get too many collisions, you start to grow the number of buckets - even if there are still some empty ones left (see the sketch below).
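A small sketch of why distinct objects can share a bucket: the bucket index keeps only part of the hash code (here the low bits, assuming a power-of-two capacity; the real HashMap also mixes in the high bits, but the idea is the same).

    class BucketIndexExample {
        static int bucketIndex(Object key, int capacity) {
            return key.hashCode() & (capacity - 1);  // hash % capacity for power-of-two capacities
        }

        public static void main(String[] args) {
            System.out.println(bucketIndex(1, 16));  // hashCode 1  -> bucket 1
            System.out.println(bucketIndex(17, 16)); // hashCode 17 -> bucket 1 as well: a collision
        }
    }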
I'm looking at this website that lists Big O complexities for various operations. For Dynamic Arrays, the removal complexity is O(n), while for Hash Tables it's O(1).
For Dynamic Arrays like ArrayLists to be O(n), the operation must mean removing some value from the middle and then shifting each subsequent element over by one to keep the block of data contiguous. Because if we're just deleting the value stored at index k and not shifting, it's O(1).
But in Hash Tables with linear probing, deletion is the same thing: you just run your value through the hash function, go to the Dynamic Array holding your data, and delete the value stored in it.
So why do Hash Tables get O(1) credit while Dynamic Arrays get O(n)?
This is explained here. The key is that the number of values per Dynamic Array is kept under a constant value.
Edit: As Dukeling pointed out, my answer explains why a Hash Table with separate chaining has O(1) removal complexity. I should add that, on the website you were looking at, Hash Tables are credited with O(1) removal complexity because they analyse a Hash Table with separate chaining and not linear probing.
The point of hash tables is that they keep close to the best case, where the best case means a single entry per bucket. Clearly, you have no trouble accepting that to remove the sole entry from a bucket takes O(1) time.
When there are many hash conflicts, you certainly need to do a lot of shifting when using linear probing.
But the complexity of hash tables is analysed under the assumption of simple uniform hashing, meaning it is assumed that there will be a minimal number of hash conflicts.
When that holds, we only need to delete some value and shift either no values or a small (essentially constant) number of values.
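As an aside, a common way for linear-probing tables to avoid shifting entirely is "tombstone" deletion. This is a general technique, not something the answer above prescribes, and the sketch below is purely illustrative (values are omitted, insertions never reuse tombstone slots, and a real implementation would reclaim them on resize):

    class ProbingDeleteSketch {
        private static final Object TOMBSTONE = new Object();
        private final Object[] keys = new Object[16];   // assumed never completely full

        public void put(Object key) {
            int i = (key.hashCode() & 0x7fffffff) % keys.length;
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;              // probe past occupied slots and tombstones
            }
            keys[i] = key;
        }

        public boolean contains(Object key) {
            int i = (key.hashCode() & 0x7fffffff) % keys.length;
            while (keys[i] != null) {                   // stop only at a truly empty slot
                if (keys[i] != TOMBSTONE && keys[i].equals(key)) return true;
                i = (i + 1) % keys.length;
            }
            return false;
        }

        public void remove(Object key) {
            int i = (key.hashCode() & 0x7fffffff) % keys.length;
            while (keys[i] != null) {
                if (keys[i] != TOMBSTONE && keys[i].equals(key)) {
                    keys[i] = TOMBSTONE;                // O(1): mark the slot instead of shifting
                    return;
                }
                i = (i + 1) % keys.length;
            }
        }
    }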
When you talk about the complexity of an algorithm, you actually need to discuss a concrete implementation.
There is no Java class called a "Hash Table" (obviously!) or "HashTable".
There are Java classes called HashMap and Hashtable, and these do indeed have O(1) deletion.
But they don't work the way that you seem to think (all?) hash tables work. Specifically, HashMap and Hashtable are organized as an array of pointers to "chains".
This means that deletion consists of finding the appropriate chain, and then traversing the chain to find the entry to remove. The first step is constant time (including the time to calculate the hash code). The second step is proportional to the length of the hash chain. But assuming that the hash function is good, the average length of the hash chains is a small constant. Hence the total time for deletion is O(1) on average.
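A sketch of those two steps for a chained table (Node is a hypothetical singly-linked bucket node, purely for illustration):

    class ChainedRemoveSketch {
        static class Node {
            Object key, value;
            Node next;
        }

        static boolean remove(Node[] table, Object key) {
            int i = (key.hashCode() & 0x7fffffff) % table.length;       // step 1: constant time
            Node prev = null;
            for (Node e = table[i]; e != null; prev = e, e = e.next) {  // step 2: walk the (short) chain
                if (e.key.equals(key)) {
                    if (prev == null) table[i] = e.next;                // unlink the head of the chain
                    else prev.next = e.next;                            // unlink a middle/last node
                    return true;
                }
            }
            return false;                                               // key not present
        }
    }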
The reason that the hash chains are short on average is that the HashMap and Hashtable classes automatically resize the main hash array when the "load factor" (the ratio of the number of entries to the array size) exceeds a predetermined value. Assuming that the hash function distributes the (actual) keys fairly evenly, you will find that the chains are roughly the same length. Assuming that the array size is proportional to the total number of entries, the actual load factor will be the average hash chain length.
This reasoning breaks down if the hash function does not distribute the keys evenly. This leads to a situation where you get a lot of hash collisions. Indeed, the worst-case behaviour is when all keys have the same hash value, and they all end up on a single hash chain with all N entries. In that case, deletion involves searching a chain with N entries ... and that makes it O(N).
It turns out that the same reasoning can be applied to other forms of hash table, including those where the entries are stored in the hash array itself and collisions are handled by rehashing or scanning for another slot. (Once again, the "trick" is to expand the hash table when the load factor gets too high.)