Map load factor, how map grows - java

As per my understanding, and what I have read
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased
So, when the load factor is 0.8 (80%), with a map size of 10, the map will grow by 10 when 8 elements are put into it.
So now the map has size 20. My doubt is: when will the next 10 element slots be added to the map?
When the map is again 80% full, that is, when 16 elements are put into the map,
or
when 18 elements are put into the map?

That will be at 16. If you look at the Java code for HashMap:
threshold = (int)(newCapacity * loadFactor);
where newCapacity is the new table size. Therefore the limit in your example will be 16.

Load factor of 80%, so 16 elements. It calculates when to resize based on the total number of elements currently in the map and the capacity at that time.
It doesn't keep track of the last resizing.

A HashMap has a size() and a capacity, and these are two different things. Capacity is the internal size of the hash table and is always a power of two, so a HashMap can't have capacity 20. Size is the number of entries the user has put into the map.
When you declare a HashMap with
Map map = new HashMap(20)
its actual capacity is 32 and the threshold is 24 (using the default load factor of 0.75). Its size is zero.
Map map = new HashMap()
For this case the map has size 0 and the default capacity 16.
Threshold:
threshold = (int)(newCapacity * loadFactor) = (int)(32 * 0.8) = 25;
which is 25 for your load factor of 0.8. So as soon as your map exceeds 25 entries, it will be resized to capacity 64, still holding the same entries.

Every time a resize of the map happens, the threshold is recalculated as:
threshold = (int)(newCapacity * loadFactor);
So in your example, it will be 16.
Please refer to the source of HashMap here.
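Capacity and threshold are not exposed through HashMap's public API, but if you want to verify the numbers above yourself, one (unsupported) way is to read the private table and threshold fields with reflection. This is only a sketch against the OpenJDK implementation: the field names are internal details, and on newer JDKs you may need --add-opens java.base/java.util=ALL-UNNAMED for the reflective access to work.

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class CapacityProbe {
    public static void main(String[] args) throws Exception {
        // A requested capacity of 20 is rounded up to 32; with load factor 0.8
        // the threshold should come out as (int)(32 * 0.8) = 25.
        Map<Integer, String> map = new HashMap<>(20, 0.8f);
        for (int i = 0; i < 26; i++) {
            map.put(i, "v" + i);
            System.out.printf("size=%d capacity=%d threshold=%d%n",
                    map.size(), capacityOf(map), thresholdOf(map));
        }
    }

    // Length of the internal bucket array (an OpenJDK implementation detail).
    static int capacityOf(Map<?, ?> map) throws Exception {
        Field table = HashMap.class.getDeclaredField("table");
        table.setAccessible(true);
        Object[] buckets = (Object[]) table.get(map);
        return buckets == null ? 0 : buckets.length;
    }

    // Value of the internal threshold field (capacity * loadFactor once the
    // table has been allocated).
    static int thresholdOf(Map<?, ?> map) throws Exception {
        Field threshold = HashMap.class.getDeclaredField("threshold");
        threshold.setAccessible(true);
        return threshold.getInt(map);
    }
}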

Related

HASHMAP - threshold and loadfactor & capacity

I've always been told that a HashMap will resize once the size of the map > loadFactor * capacity, which is what the JDK comment for the threshold field says. But after reading the source code of HashMap in JDK 8 (for example the put method), it looks different:
The map resizes when the next size > threshold, and for the first put operation the threshold is the power-of-two capacity rather than capacity * loadFactor. Even during resizing, the threshold is simply doubled to twice the old threshold, not recomputed as new capacity * loadFactor.
Is there a mismatch with the JDK doc, or am I misunderstanding something completely? Any suggestions would help.
Because the new capacity is double the old capacity, doubling the old threshold amounts to the same thing as recomputing newCapacity * loadFactor.
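To see why that shortcut is safe with the default load factor, here is a small self-contained check (my own sketch, not JDK code): for a power-of-two capacity, capacity * 0.75 is an exact integer, so doubling the old threshold and recomputing newCapacity * loadFactor give the same number. With a load factor whose product is fractional (0.8, say), the two can drift apart slightly by rounding, which JDK 8 simply accepts.

public class ThresholdDoubling {
    public static void main(String[] args) {
        float loadFactor = 0.75f;
        for (int cap = 16; cap <= (1 << 20); cap <<= 1) {
            int oldThreshold = (int) (cap * loadFactor);
            int doubled = oldThreshold << 1;
            int recomputed = (int) ((cap << 1) * loadFactor);
            // With loadFactor 0.75 these are always equal.
            System.out.printf("cap=%d doubled=%d recomputed=%d%n",
                    cap, doubled, recomputed);
        }
    }
}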

A HashMap with default capacity 16 can contain more than 11/16 objects without rehashing - Is this right?

This is a followup question to What is the initial size of Array in HashMap Architecture?.
From that question I understand the initial capacity of HashMap is 16 by default which allows up to 11 entries before resizing since the default load factor is 0.75.
When does the rehashing take place? After 11 entries in the array or 11 entries in one of the linked lists? I think after 11 entries in the array.
Since the array size is 16 a HashMap could contain many objects (maybe more than 16) in a linked list as long as the array size is less than or equal to 11. Hence, a HashMap with default capacity 16 can contain more than 11/16 objects without rehashing - is this right?
Hence, a HashMap with default capacity 16 can contain more than 11/16 objects(K,V) without rehashing
This is an obvious flaw in using the number of buckets occupied as a measure. Another problem is that you would need to maintain both a size and a number of buckets used.
Instead, it's the size() that is used, so the size() alone determines when rehashing occurs, no matter how the entries are distributed across buckets.
From the source for Java 8
final void putMapEntries(Map<? extends K, ? extends V> m, boolean evict) {
    int s = m.size();
    if (s > 0) {
        if (table == null) { // pre-size
            float ft = ((float)s / loadFactor) + 1.0F;
            int t = ((ft < (float)MAXIMUM_CAPACITY) ?
                     (int)ft : MAXIMUM_CAPACITY);
            if (t > threshold)
                threshold = tableSizeFor(t);
        }
        else if (s > threshold)
            resize();
I think you're fixating a bit too much on the implementation of HashMap, which can and does change over time. Think in terms of the map itself, rather than the internal data structures.
When does the rehashing take place? After 11 entries in the array or 11 entries in one of the linked lists? I think after 11 entries in the array.
Neither; the map is resized once the map contains 11 entries. Those entries could all be in their own buckets or all chained 11-deep in a single bucket.
Since the array size is 16 a HashMap could contain many objects (maybe more than 16) in a linked list as long as the array size is less than or equal to 11. Hence, a HashMap with default capacity 16 can contain more than 11/16 objects without rehashing - is this right?
No. While you could create your own hash table implementation that stores more elements than you have buckets, you'd do so at the cost of efficiency. The JDK's HashMap implementation will resize the backing array as soon as the number of elements in the map exceeds the threshold (load factor * capacity). Again, it doesn't matter whether the elements are all in the same bucket or distributed among them. From the docs:
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
For example if you have a HashMap (with default load and capacity) that currently contains 11 entries and you call .put() to insert a 12th entry, the map will be resized.

Set minimum size of a Map in Java

In my program key-value pairs are frequently added to a Map until 1G of pairs are added. Map resizing slows down the process. How can I set minimum Map size to, for example 1000000007 (which is a prime)?
The constructor of a HashMap takes the initial size of the map (and the load factor, if desired).
Map<K,V> map = new HashMap<>(1_000_000_007);
How can I set minimum Map size to, for example 1000000007 (which is a prime)?
Using the HashMap(int) or HashMap(int, float) constructor. The int parameter is the capacity.
HashMap should have a size that is prime to minimize clustering.
Past and current implementations of the HashMap constructor will all choose a capacity that is the smallest power of 2 (up to 2^30) which is greater than or equal to the supplied capacity. So using a prime number has no effect.
Will the constructor prevent map from resizing down?
HashMaps don't resize down.
(Note that size and capacity are different things. The size() method returns the number of currently entries in the Map. You can't "set" the size.)
A couple of things you should note. The number of buckets in a HashMap is a power of 2 (this might not be the case in future), and the largest possible is 2^30. The load factor determines at what size the Map should grow. Typically this is 0.75.
If you set the capacity to be the expected size, it will:
be rounded up to the next power of 2
possibly still resize when capacity * 0.75 is reached
be limited to 2^30 anyway, as that is the largest power of 2 you can have for the size of an array.
Will the constructor prevent map from resizing down?
The only way to do this is to copy all the elements into a new Map. This is not done automatically.
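To put the advice above into code, here is a minimal sketch of pre-sizing so the threshold is never crossed while filling the map. The 1_000_000 figure is only illustrative (not from the question); for the question's ~1G pairs the capacity is capped at 2^30 buckets anyway, as noted above.

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;  // illustrative figure
        float loadFactor = 0.75f;         // HashMap's default

        // Request at least expectedEntries / loadFactor buckets so that the
        // threshold (capacity * loadFactor) is never crossed while filling;
        // HashMap rounds the request up to the next power of two internally.
        int initialCapacity = (int) Math.ceil(expectedEntries / (double) loadFactor);

        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value" + i);      // no rehashing along the way
        }
        System.out.println("entries: " + map.size());
    }
}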

How is a hashMap in java populated when load factor is more than 1?

I tried to create a HashMap with the following details:-
HashMap<Integer,String> test = new HashMap<Integer,String>();
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
and I saw that every input was put in a different bucket, which means a different hash code was calculated for each key.
Now,
if I modify my code as follows :-
HashMap<Integer,String> test = new HashMap<Integer,String>(16,2.0f);
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
I find that some of the values which were put in different buckets are now placed in a bucket that already contains some values, even though their hash values are different. Can anyone please help me understand this?
Thanks
So, if you initialize a HashMap without specifying an initial size and a load factor, it gets initialized with a capacity of 16 and a load factor of 0.75. This means that once the HashMap holds at least (initial capacity * load factor) entries, so 12 entries, it will be rehashed, which means it grows to about twice the size and all elements are inserted anew.
You now set the load factor to 2, which means the Map will only get rehashed once it holds at least 32 elements.
What happens now is that elements with the same hash mod bucket count are put into the same bucket. Each bucket with more than one element holds a list of all its elements. When you look up one of the elements, the map first finds the bucket using the hash; then it has to iterate over the whole list in that bucket to find the entry whose key matches exactly. This is quite costly.
So in the end there is a trade-off: rehashing is pretty expensive, so you should try to avoid it. On the other hand, having multiple elements in a bucket makes lookups expensive, so you should really try to avoid that as well. So you need a balance between those two. Another way to go is to set the initial size quite high, but that takes up more memory that may never be used.
In your second test, the initial capacity is 16 and the load factor is 2. This means the HashMap will use an array of 16 elements to store the entries (i.e. there are 16 buckets), and this array will be resized only when the number of entries in the Map reaches 32 (16 * 2).
This means that some keys having different hashCodes must be stored in the same bucket, since the number of buckets (16) is smaller than the total number of entries (20 in your case).
The assignment of a key to a bucket is calculated in 3 steps:
First the hashCode method is called.
Then an additional function is applied on the hashCode to reduce the damage that may be caused by bad hashCode implementations.
Finally a modulus operation is applied on the result of the previous step to get a number between 0 and capacity - 1.
The 3rd step is where keys having different hashCodes may end up in the same bucket.
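As an illustration of steps 2 and 3, here is a standalone sketch that mirrors what JDK 8 does (it is not the actual HashMap code): the upper bits of the hashCode are XORed into the lower bits, and the result is masked with capacity - 1, which is equivalent to a modulus because the capacity is a power of two.

public class BucketIndex {
    // Step 2: spread the hashCode so high-order bits influence the low bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Step 3: map the spread hash to a bucket index; since capacity is a
    // power of two, (capacity - 1) & hash equals hash mod capacity.
    static int bucketIndex(Object key, int capacity) {
        return (capacity - 1) & spread(key);
    }

    public static void main(String[] args) {
        // Two keys with different hashCodes can still share a bucket:
        System.out.println(bucketIndex(1, 16));   // 1
        System.out.println(bucketIndex(17, 16));  // also 1 -> same bucket
    }
}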
Let's check it with examples:
i) In the first case, the load factor is 0.75f and the initial capacity is 16, which means an array resize will occur when the number of entries in the HashMap reaches 16 * 0.75 = 12.
Now, every key has a different hashCode, and hashCode modulo 16 is unique for them, so the first 12 entries all go to different buckets; after that a resize occurs, and the entries put afterwards also end up in different buckets (hashCode modulo 32 being unique).
ii) In the second case, the load factor is 2.0f, which means a resize will happen when the number of entries reaches 16 * 2 = 32.
You keep putting entries in the map and it never resizes (for the 20 entries), so multiple entries collide.
So, in a nutshell: in the first example the bucket is hashCode modulo 16 for the first 12 entries and hashCode modulo 32 afterwards, both of which are unique here, while in the second case it is always hashCode modulo 16 for all entries, which is not unique (it cannot be, as all 20 entries have to be accommodated in 16 buckets).
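If you want to verify that walkthrough, a few lines are enough. Integer.hashCode(i) is simply i, and the JDK 8 bit-spreading step leaves such small values unchanged, so the bucket is just the key masked with capacity - 1:

public class ModuloCheck {
    public static void main(String[] args) {
        // With 16 buckets, keys 17..20 collide with keys 1..4;
        // with 32 buckets, all 20 keys land in distinct buckets.
        for (int key = 1; key <= 20; key++) {
            System.out.printf("key=%2d  bucket@16=%2d  bucket@32=%2d%n",
                    key, key & (16 - 1), key & (32 - 1));
        }
    }
}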
The javadoc explanation:
An instance of HashMap has two parameters that affect its performance:
initial capacity and load factor. The capacity is the number of
buckets in the hash table, and the initial capacity is simply the
capacity at the time the hash table is created. The load factor is a
measure of how full the hash table is allowed to get before its
capacity is automatically increased. When the number of entries in the
hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data
structures are rebuilt) so that the hash table has approximately twice
the number of buckets.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
By default, the initial capacity is 16 and the load factor is 0.75.
So when the number of entries goes beyond 12 (16 * 0.75), the capacity is increased to 32 and the hash table is rehashed. That is why, in your first case, every element ends up in its own bucket.
In your second case, the hash table will only be resized when the number of entries crosses 32 (16 * 2). Even if the elements have different hash code values, when hashCode % bucketCount is calculated they may collide. That is the reason you are seeing more than one element in the same bucket.

number of hash buckets

In the HashMap documentation, it is mentioned that:
the initial capacity is simply the capacity at the time the hash table is created
the capacity is the number of buckets in the hash table.
Now suppose we have an initial capacity of 16 (default), and we keep adding elements up to 100; the capacity of the HashMap would then be 100 * loadfactor.
Will the number of hash buckets be 100 or 16?
Edit:
From the answers I read: there are more buckets than elements added.
Taking this as the viewpoint: if we add Strings as keys, we will get one element per bucket, resulting in a lot of space consumption/complexity. Is my understanding right?
Neither 100 nor 16 buckets. Most likely there will be 256 buckets, but this isn't guaranteed by the documentation.
From the updated documentation link:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
(emphasis mine)
So, if we ignore the word "approximately" above, we determine that whenever the hash table becomes 75% full (or whichever load factor you specify in the constructor), the number of hash buckets doubles. That means that the number of buckets doubles whenever you insert the 12th, 24th, 48th, and 96th elements, leaving a total of 256 buckets.
However, as I emphasized in the documentation snippet, the number is approximately twice the previous size, so it may not be exactly 256. In fact, if the second-to-last doubling were replaced with a slightly larger increase, the last doubling might never happen, so the final hash table may be as small as 134 buckets, or may be larger than 256 buckets.
N.B. I arrived at the 134 number because it's the smallest integer N such that 0.75 * N > 100.
Looking at the source code of HashMap we see the following:
threshold = capacity * loadfactor
size = number of elements in the map
if (size >= threshold) {
    double capacity
}
Thus, if the initial capacity is 16 and your load factor is 0.75 (the default), the initial threshold will be 12. If you add the 12th element, the capacity rises to 32 with a threshold of 24. The next step would be capacity 64 and threshold 48 etc. etc.
So with 100 elements, you should have a capacity of 256 and a threshold of 192.
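The pseudocode above can also be turned into a tiny simulation to check those numbers. This is only a model of the bookkeeping, not the real HashMap (the exact >= versus > trigger differs between JDK versions):

public class GrowthSimulation {
    public static void main(String[] args) {
        float loadFactor = 0.75f;
        int capacity = 16;
        int threshold = (int) (capacity * loadFactor);   // 12

        for (int size = 1; size <= 100; size++) {
            if (size >= threshold) {                     // as in the pseudocode
                capacity *= 2;
                threshold = (int) (capacity * loadFactor);
                System.out.printf("at size %d: capacity=%d threshold=%d%n",
                        size, capacity, threshold);
            }
        }
        // Ends with capacity 256 and threshold 192 after 100 elements.
    }
}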
Note that this applies only to the standard values. If you know the approximate number of elements your map will contain you should create it with a high enough initial capacity in order to prevent the copying around when the capacity is increased.
Update:
A word on the capacity: it will always be a power of two, even if you define a different initial capacity. The hashmap will then set the capacity to the smallest power of 2 that is greater than or equal to the provided initial capacity.
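A small sketch of that rounding (the real implementation does it with bit tricks in a private helper and caps the result at 2^30; this equivalent version just shows the effect):

public class PowerOfTwoRounding {
    // Smallest power of two >= cap, which is effectively what the HashMap
    // constructor does with the requested initial capacity.
    static int roundUpToPowerOfTwo(int cap) {
        if (cap <= 1) return 1;
        return Integer.highestOneBit(cap - 1) << 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(16));   // 16
        System.out.println(roundUpToPowerOfTwo(17));   // 32
        System.out.println(roundUpToPowerOfTwo(20));   // 32
        System.out.println(roundUpToPowerOfTwo(100));  // 128
    }
}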
From the doc:
When the number of entries in the hash table exceeds the product of the
load factor and the current capacity, the capacity is roughly doubled by
calling the rehash method.
threshold = product of the load factor and the current capacity
Let's try it: the initial capacity of a HashMap is 16
and the default load factor is 0.75, so the 1st threshold is 12. After adding the 12th element, the next capacity will be (16 * 2) = 32.
The 2nd threshold is then 24, so after adding the 24th element the next capacity will be (32 * 2) = 64,
and so on.
From your link:
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled by calling the rehash method.
That means that if we have an initial capacity of 16, then each time the threshold is exceeded the capacity is doubled: to 32, then to 64, and so on.
In your case, you are adding 100 entries. With the default load factor of 0.75, the thresholds are 12, 24, 48 and 96, so the capacity doubles four times: 16 -> 32 -> 64 -> 128 -> 256. Thus, in your case, the total number of buckets is 256.
You are going to have at least one bucket per actual item. If you add more items than capacity * load factor allows, the table must be resized and rehashed.
Now suppose we have intial capacity of 16 (default), and if we keep
adding elements to 100 nos, the capacity of hashmap is 100 *
loadfactor.
Actually it says:
If the initial capacity is greater than the maximum number of entries
divided by the load factor, no rehash operations will ever occur.
I.e., if there is a maximum of 100 items and the capacity is at least 100 / 0.75 ≈ 134, then no rehashing should ever occur. Notice that this implies that even if the table is not full, it may have to be rehashed when it gets close to full. So the ideal initial capacity to set, using the default load factor and expecting <= 100 items, is ~135+.
