I am implementing a hash table for educational purposes. The hash table is implemented with an array, and collisions are handled by chaining with linked lists. The instructions say that I can insert the same item multiple times without checking, to increase the speed of insertion. But when a chain reaches the maximum allowed length, the hash table needs to be resized. However, I found that resizing is not going to help at all, because the same items still go to the same bucket even when the array length is increased. Did I miss something here? Thank you very much.
Let's take an example: three objects with hashcodes 7, 23 and 47.
If the hashtable is of size 8, then by modular arithmetic, all of those objects would go into hash bucket 7.
On the other hand, if the hashtable is of size 16, then the first two would go into hash bucket 7 while the other would go into bucket 15.
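A quick way to see this (a sketch in Java, not the asker's code) is to compute the bucket index directly from those hash codes:

int[] hashes = {7, 23, 47};
for (int h : hashes) {
    // chained hash tables typically reduce the hash code modulo the table size
    System.out.printf("hash %d -> bucket %d of 8, bucket %d of 16%n",
            h, Math.floorMod(h, 8), Math.floorMod(h, 16));
}
// prints buckets 7/7, 7/7 and 7/15, matching the example above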
The instructions say that I can insert the same item multiple times without checking, to increase the speed of insertion.
You can't skip checking completely, because you would end up with duplicates on the same chain.
However, I found that resizing is not going to help at all, because the same items still go to the same bucket even when the array length is increased.
This would happen only for hash values below the table size. For hash values above the table size, the % operator will often place the item in a different bucket, assuming that you avoid the aliasing problem.
In order to avoid aliasing, use table sizes corresponding to prime numbers. See this Q&A for additional information on this.
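As a small illustration (my own sketch, not from the linked Q&A): hash codes that are multiples of 8 all collide in a table of size 8, but spread out in a prime-sized table of 11.

for (int hash = 0; hash <= 64; hash += 8) {
    System.out.printf("hash %2d -> bucket %d (size 8), bucket %d (size 11)%n",
            hash, Math.floorMod(hash, 8), Math.floorMod(hash, 11));
}
// size 8 puts every one of these hashes in bucket 0; size 11 spreads them out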
I can tell you how the JDK handles that. Your entries (keys) override hashCode, which is an int (made of 32 bits). When you have 16 buckets (the internal array has a length of 16), the operation that is performed internally to find out where the entry will go is:
hash_code & (array.length - 1) // this is the same as a modulo operation
                               // if array.length is a power of two
That means that when you put an entry into the map, only the last 4 bits of the hash code of your entries are taken into account.
Now when you fill those 16 buckets (or hit your load factor, if you implement one), the internal array is made bigger (let's double it), so now it has 32 slots.
This means that deciding where the entry will go is computed:
hash_code & (32 - 1) // now there are 5 bits taken into consideration
All your entries are now re-hashed (because there is one more bit to consider), and this time they might end up in different buckets.
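For example (a sketch using the masking described above; the variable names are mine):

int hashCode = 0b1_0111;              // decimal 23
int indexIn16 = hashCode & (16 - 1);  // only the low 4 bits count: 0111 -> bucket 7
int indexIn32 = hashCode & (32 - 1);  // the 5th bit now counts: 1_0111 -> bucket 23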
I was going through Rehashing process in hashmap or hashtable, which explains the beautiful concept of HashMap rehashing. I have a question about it.
Consider a case where the initial capacity of a HashMap is 16 and the load factor is 0.75. Now, 12 elements are added to the HashMap, but they all end up in the same bucket due to a poor hashCode implementation. The other 15 buckets do not contain any elements. Will the HashMap rehash?
Actually, at least with the current implementation of HashMap, collisions don't affect rehashing. See, for example, lines 661-662 at the end of HashMap.putVal, which simply check
if (++size > threshold)
resize();
where threshold is (capacity * load factor), cached in a field for speed (avoiding a floating-point operation).
HashMap deals with overfull buckets by converting the buckets from linked lists to balanced trees.
I tried to create a HashMap with the following details:-
HashMap<Integer,String> test = new HashMap<Integer,String>();
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
and I saw that every entry was put in a different bucket, which means a different bucket index was calculated for each key.
Now, if I modify my code as follows:
HashMap<Integer,String> test = new HashMap<Integer,String>(16,2.0f);
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
I find that some of the values which previously went into different buckets are now placed in a bucket that already contains other values, even though their hash values are different. Can anyone please help me understand this?
Thanks
So, if you initialize a HashMap without specifying an initial size and a load factor, it gets initialized with a size of 16 and a load factor of 0.75. This means that once the HashMap holds at least (initial size * load factor) elements, i.e. 12 elements, it will be rehashed: it grows to about twice the size and all elements are added anew.
You now set the load factor to 2, which means the Map will only get rehashed when it is filled with at least 32 elements.
What happens now is that elements with the same hash mod bucket count are put into the same bucket. Each bucket with more than one element contains a list that holds all of those elements. Now when you try to look up one of the elements, the map first finds the bucket using the hash. Then it has to iterate over the whole list in that bucket to find the entry with the exact match. This is quite costly.
So in the end there is a trade-off: rehashing is pretty expensive, so you should try to avoid it. On the other hand, if you have multiple elements in a bucket, the lookup gets pretty expensive, so you should really try to avoid that as well. So you need a balance between those two. One other way to go is to set the initial size quite high, but that takes up more memory that may go unused.
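If you know roughly how many entries you will store, one option (a hedged sketch, not part of the answer above) is to pick an initial capacity large enough that the threshold is never crossed, which matches the javadoc advice quoted further down:

int expectedEntries = 1000;
// choose a capacity so that expectedEntries <= capacity * 0.75, avoiding any rehash
HashMap<Integer, String> presized = new HashMap<Integer, String>((int) (expectedEntries / 0.75f) + 1);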
In your second test, the initial capacity is 16 and the load factor is 2. This means the HashMap will use an array of 16 elements to store the entries (i.e. there are 16 buckets), and this array will be resized only when the number of entries in the Map reaches 32 (16 * 2).
This means that some keys having different hashCodes must be stored in the same bucket, since the number of buckets (16) is smaller than the total number of entries (20 in your case).
The assignment of a key to a bucket is calculated in 3 steps:
First the hashCode method is called.
Then an additional function is applied on the hashCode to reduce the damage that may be caused by bad hashCode implementations.
Finally a modulus operation is applied on the result of the previous step to get a number between 0 and capacity - 1.
The 3rd step is where keys having different hashCodes may end up in the same bucket.
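A sketch of those three steps, roughly as recent OpenJDK versions do it (the exact spreading function has varied between versions):

int h = key.hashCode();               // step 1: the key's own hashCode
int hash = h ^ (h >>> 16);            // step 2: fold the high bits into the low bits
int bucket = (capacity - 1) & hash;   // step 3: the modulo, done as a mask because
                                      //         capacity is a power of two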
Let's check it with examples:
i) In the first case, the load factor is 0.75f and the initial capacity is 16, which means an array resize will occur when the number of entries in the HashMap reaches 16 * 0.75 = 12.
Now, every key has a different hashCode, and hashCode modulo 16 is unique for the first 12 entries, which causes them to go to different buckets; after that the resize occurs, and when new entries are put they also end up in different buckets (hashCode modulo 32 being unique).
ii) In the second case, the load factor is 2.0f, which means a resize will happen only when the number of entries reaches 16 * 2 = 32.
You keep putting entries into the map and it never resizes (for the 20 entries), so multiple entries collide.
So, in a nutshell: in the first example, hashCode modulo 16 is unique for the first 12 entries and hashCode modulo 32 is unique for all entries, while in the second case it is always hashCode modulo 16 for all entries, which cannot be unique, since all 20 entries have to be accommodated in 16 buckets.
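You can check that arithmetic without looking inside the map, because Integer.hashCode() is just the int value itself (a quick verification, not part of the original answer):

for (int key = 1; key <= 20; key++) {
    System.out.printf("key %2d -> bucket %2d of 16, bucket %2d of 32%n",
            key, key & 15, key & 31);
}
// with 16 buckets, keys 17..20 land in buckets 1..4 and collide with keys 1..4;
// with 32 buckets, each of the 20 keys gets its own bucket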
The javadoc explanation:
An instance of HashMap has two parameters that affect its performance:
initial capacity and load factor. The capacity is the number of
buckets in the hash table, and the initial capacity is simply the
capacity at the time the hash table is created. The load factor is a
measure of how full the hash table is allowed to get before its
capacity is automatically increased. When the number of entries in the
hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data
structures are rebuilt) so that the hash table has approximately twice
the number of buckets.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
By default, the initial capacity is 16 and the load factor is 0.75.
So when the number of entries goes beyond 12 (16 * 0.75), the capacity is increased to 32 and the hash table is rehashed. That is why, in your first case, every element ends up in its own bucket.
In your second case, the hash table will be resized only when the number of entries crosses 32 (16 * 2). Even if the elements have different hash code values, collisions can occur when hashCode % bucketCount is calculated. That is why you are seeing more than one element in the same bucket.
Is the retrieval of entries from a HashMap really randomized? Or does it depend on the buckets into which the entries are hashed, with those buckets then accessed in some predefined order? Or is there some other kind of ordering happening internally?
Answer b: it depends on the buckets into which the entries are hashed, and those buckets are accessed in some predefined order.
What happens when you call the get method on a HashMap for a non-null key:
The hashCode() method for your key is called.
The hash bucket for that hash is obtained by simply AND'ing the hash with the number of buckets - 1, which works because the implementation ensures that the number of buckets is always a power of 2. Because the buckets are simply a Java array of HashMap.Entry objects, the resulting integer can be used as a plain array index. This is where the O(1) complexity comes from.
The entries in the bucket (a linked list) are iterated until one is found where the hash and the key match the one you want. This is where the O(1) complexity starts to deteriorate towards a worst case of O(n).
So you can see that an efficient use of the HashMap will try to minimise the number of entries that end up together in a bucket. All that matters in that respect is the distribution of the bits in your hashcode from bit 0 to bit N-1 where N is the power of 2 used to calculate the number of buckets. All bits in your hashcode above that limit are masked away and only get used again during the iteration of the bucket list.
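A simplified sketch of that lookup path (the field and helper names are illustrative, not the JDK's actual internals; it assumes a Node<K, V> chain type and a spread() helper like the JDK's):

V get(K key) {
    int h = spread(key.hashCode());                   // hashCode plus the bit-spreading step
    Node<K, V> node = table[h & (table.length - 1)];  // O(1) jump to the bucket
    while (node != null) {                            // walk the bucket's chain
        if (node.hash == h && key.equals(node.key)) {
            return node.value;
        }
        node = node.next;
    }
    return null;                                      // key not present
}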
I know that HashMap operations are O(1) amortized due to possible collisions. But in Java, Integer.hashCode is just the int's own value. So if you were to put m distinct integers into a HashMap where m = the HashMap's INITIAL_CAPACITY (16, let's say), does that mean that there will be 16 different buckets with 1 integer each? Then would this guarantee O(1) lookup, deletion, and insertion in the worst case?
No, because HashMap will rehash the hash for its own internal purposes.
No, because HashMap is going to have to take the very large number of possible values returned by hashCode and map them to the very small number of buckets, and there's no guarantee that that mapping will map different integers' hashcodes to different buckets.
You should look at the way HashMap decides which bucket a key/object will belong to, and you will see that it does NOT use the object's hashCode() directly as the bucket number but manipulates it a little (bit shifting) to limit the number of buckets to less than Integer.MAX_VALUE.
if you were to put m distinct integers into a HashMap where m = the HashMap's INITIAL_CAPACITY (16, let's say), does that mean that there will be 16 different buckets with 1 integer each?
That depends on the values; the HashMap will probably create new buckets (increase the capacity) to keep the load below the threshold (a Java HashMap increases its size when the load goes over 75% by default).
What is load factor
would this guarantee O(1) lookup, deletion, and insertion in the worst case?
No; in particularly bad cases the lookup time can be up to O(n) (depending on the number of elements and their values). In the case of integers, all possible int values are mapped onto the HashMap's small number of buckets, so there are likely to be a lot of collisions for a small-sized map.
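For example (an illustrative fragment, not from the answer above), distinct integer keys that differ only in bits above the bucket mask all land in the same bucket, so lookups on them have to walk one chain:

HashMap<Integer, String> clustered = new HashMap<Integer, String>();  // 16 buckets by default
for (int i = 0; i < 6; i++) {
    clustered.put(i * 16, "v" + i);   // 0, 16, 32, 48, 64, 80 all map to bucket 0
}
// all six distinct keys share bucket 0, so get() on any of them scans that chain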