In the HashMap documentation, it is mentioned that:
the initial capacity is simply the capacity at the time the hash table is created
the capacity is the number of buckets in the hash table.
Now suppose we have intial capacity of 16 (default), and if we keep adding elements to 100 nos, the capacity of hashmap is 100 * loadfactor.
Will the number of hash buckets is 100 or 16?
Edit:
From the solution I read: buckets are more than the elements added.
Taking this as view point: so if we add Strings as key, we will get one element/bucket resulting in a lot of space consumption/complexity, is my understanding right ?
Neither 100 nor 16 buckets. Most likely there will be 256 buckets, but this isn't guaranteed by the documentation.
From the updated documentation link:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
(emphasis mine)
So, if we ignore the word "approximately" above, we determine that whenever the hash table becomes 75% full (or whichever load factor you specify in the constructor), the number of hash buckets doubles. That means that the number of buckets doubles whenever you insert the 12th, 24th, 48th, and 96th elements, leaving a total of 256 buckets.
However, as I emphasized in the documentation snippet, the number is approximately twice the previous size, so it may not be exactly 256. In fact, if the second-to-last doubling is replaced with a slightly larger increase, the last doubling may never happen, so the final hash table may be as small as 134 buckets, or may be larger than 256 elements.
N.B. I arrived at the 134 number because it's the smallest integer N such that 0.75 * N > 100.
Looking at the source code of HashMap we see the following:
threshold = capacity * loadfactor
size = number of elements in the map
if( size >= threshold ) {
double capacity
}
Thus, if the initial capacity is 16 and your load factor is 0.75 (the default), the initial threshold will be 12. If you add the 12th element, the capacity rises to 32 with a threshold of 24. The next step would be capacity 64 and threshold 48 etc. etc.
So with 100 elements, you should have a capacity of 256 and a threshold of 192.
Note that this applies only to the standard values. If you know the approximate number of elements your map will contain you should create it with a high enough initial capacity in order to prevent the copying around when the capacity is increased.
Update:
A word on the capacity: it will always be a power of two, even if you define a different initial capacity. The hashmap will then set the capacity to the smallest power of 2 that is greater than or equal to the provided initial capacity.
In doc
When the number of entries in the hash table exceeds the product of the
load factor and the current capacity, the capacity is roughly doubled by
calling the rehash method.
threshold=product of the load factor and the current capacity
Lets try.. initial size of hashmap is 16
and default load factor is 0.75 so 1st threshold is 12 so adding 12 element next capacity will be..
(16*2) =32
2st threshold is 24 so after adding 24th element next capacity will be (32*2)=64
and so on..
From your link :
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled by calling the rehash method.
That means if we have initial capacity 16 and when it exceeds, capacity will be increased by 32, next time by 64 and so on.
In your case, you are adding 100 nos. So when you come to 16th number, size will be added by 32 so now total size 48. Again you keep adding till 48th number now size will be increased by 64. Thus, in your case, total size of bucket is 112.
You are going to have at least one bucket per actual item. If you add items beyond 16, the table must be resized and rehashed.
Now suppose we have intial capacity of 16 (default), and if we keep
adding elements to 100 nos, the capacity of hashmap is 100 *
loadfactor.
Actually it says:
If the initial capacity is greater than the maximum number of entries
divided by the load factor, no rehash operations will ever occur.
Ie, if there is a maximum of 100 items and the capacity is 100/0.75 = 133, then no re-hashing should ever occur. Notice that this implies that even if the table is not full, it may have to be rehashed when close to full. So the ideal initial capacity to set using the default load factor, if you expect <=100 items, is ~135+.
Related
This one is simple question, for those who know internal implementation of HashMap :)
Initial size is 16 buckets, load factor is 0.75. Meaning when it gets (watch that word) 12, it resizes to 32 buckets.
My question is does it resize from 16 to 32 buckets when it gets 12 key-value pairs or when it gets 12 'filled' buckets? I am asking that because it could happen that from those 16 buckets all 12 key-value pairs get inserted to same bucket. In that case it would be weird to resize as other 15 are totally empty.
Thanks, any opinion on this would be appreciated :)
As mentioned in this link.
It represents that 12th key-value pair of hashmap will keep its size to 16. As soon as 13th element (key-value pair) will come into the Hashmap, it will increase its size from default 2^4 = 16 buckets to 2^5 = 32 buckets.
Independent where each key was inserted, when the product of load factor and current capacity exceed, the table will be resized.
The HashMap doesn't care about how many buckets were used until the load factor has been reached, it knows that the probability of having collisions is becoming too big, and the map should be resized. Even though many collisions already happened.
From JavaDoc
https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html#put-K-V-
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
So, HashMap resize when it has 12 key-value pairs. It isn't weird, because after resizing entries will change their bucket.
HashMap has two important properties: size and load factor. I went through the Java documentation and it says 0.75f is the initial load factor. But I can't find the actual use of it.
Can someone describe what are the different scenarios where we need to set load factor and what are some sample ideal values for different cases?
The documentation explains it pretty well:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
As with all performance optimizations, it is a good idea to avoid optimizing things prematurely (i.e. without hard data on where the bottlenecks are).
Default initial capacity of the HashMap takes is 16 and load factor is 0.75f (i.e 75% of current map size). The load factor represents at what level the HashMap capacity should be doubled.
For example product of capacity and load factor as 16 * 0.75 = 12. This represents that after storing the 12th key – value pair into the HashMap , its capacity becomes 32.
Actually, from my calculations, the "perfect" load factor is closer to log 2 (~ 0.7). Although any load factor less than this will yield better performance. I think that .75 was probably pulled out of a hat.
Proof:
Chaining can be avoided and branch prediction exploited by predicting if a
bucket is empty or not. A bucket is probably empty if the probability of it
being empty exceeds .5.
Let s represent the size and n the number of keys added. Using the binomial
theorem, the probability of a bucket being empty is:
P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0)
Thus, a bucket is probably empty if there are less than
log(2)/log(s/(s - 1)) keys
As s reaches infinity and if the number of keys added is such that
P(0) = .5, then n/s approaches log(2) rapidly:
lim (log(2)/log(s/(s - 1)))/s as s -> infinity = log(2) ~ 0.693...
What is load factor ?
The amount of capacity which is to be exhausted for the HashMap to increase its capacity.
Why load factor ?
Load factor is by default 0.75 of the initial capacity (16) therefore 25% of the buckets will be free before there is an increase in the capacity & this makes many new buckets with new hashcodes pointing to them to exist just after the increase in the number of buckets.
Why should you keep many free buckets & what is the impact of keeping free buckets on the performance ?
If you set the loading factor to say 1.0 then something very interesting might happen.
Say you are adding an object x to your hashmap whose hashCode is 888 & in your hashmap the bucket representing the hashcode is free , so the object x gets added to the bucket, but now again say if you are adding another object y whose hashCode is also 888 then your object y will get added for sure BUT at the end of the bucket (because the buckets are nothing but linkedList implementation storing key,value & next) now this has a performance impact ! Since your object y is no longer present in the head of the bucket if you perform a lookup the time taken is not going to be O(1) this time it depends on how many items are there in the same bucket. This is called hash collision by the way & this even happens when your loading factor is less than 1.
Correlation between performance, hash collision & loading factor
Lower load factor = more free buckets = less chances of collision = high performance = high space requirement.
Higher load factor = fewer free buckets = higher chance of collision = lower performance = lower space requirement.
From the documentation:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased
It really depends on your particular requirements, there's no "rule of thumb" for specifying an initial load factor.
For HashMap DEFAULT_INITIAL_CAPACITY = 16 and DEFAULT_LOAD_FACTOR = 0.75f
it means that MAX number of ALL Entries in the HashMap = 16 * 0.75 = 12. When the thirteenth element will be added capacity (array size) of HashMap will be doubled!
Perfect illustration answered this question:
image is taken from here:
https://javabypatel.blogspot.com/2015/10/what-is-load-factor-and-rehashing-in-hashmap.html
If the buckets get too full, then we have to look through
a very long linked list.
And that's kind of defeating the point.
So here's an example where I have four buckets.
I have elephant and badger in my HashSet so far.
This is a pretty good situation, right?
Each element has zero or one elements.
Now we put two more elements into our HashSet.
buckets elements
------- -------
0 elephant
1 otter
2 badger
3 cat
This isn't too bad either.
Every bucket only has one element
.
So if I wanna know, does this contain panda?
I can very quickly look at bucket number 1 and it's not
there and
I known it's not in our collection.
If I wanna know if it contains cat, I look at bucket
number 3,
I find cat, I very quickly know if it's in our
collection.
What if I add koala, well that's not so bad.
buckets elements
------- -------
0 elephant
1 otter -> koala
2 badger
3 cat
Maybe now instead of in bucket number 1 only looking at
one element,
I need to look at two.
But at least I don't have to look at elephant, badger and
cat.
If I'm again looking for panda, it can only be in bucket
number 1 and
I don't have to look at anything other then otter and
koala.
But now I put alligator in bucket number 1 and you can
see maybe where this is going.
That if bucket number 1 keeps getting bigger and bigger
and
bigger, then I'm basically having to look through all of
those elements to find
something that should be in bucket number 1.
buckets elements
------- -------
0 elephant
1 otter -> koala ->alligator
2 badger
3 cat
If I start adding strings to other buckets,
right, the problem just gets bigger and bigger in every
single bucket.
How do we stop our buckets from getting too full?
The solution here is that
"the HashSet can automatically
resize the number of buckets."
There's the HashSet realizes that the buckets are getting
too full.
It's losing this advantage of this all of one lookup for
elements.
And it'll just create more buckets(generally twice as before) and
then place the elements into the correct bucket.
So here's our basic HashSet implementation with separate
chaining.
Now I'm going to create a "self-resizing HashSet".
This HashSet is going to realize that the buckets are
getting too full and
it needs more buckets.
loadFactor is another field in our HashSet class.
loadFactor represents the average number of elements per
bucket,
above which we want to resize.
loadFactor is a balance between space and time.
If the buckets get too full then we'll resize.
That takes time, of course, but
it may save us time down the road if the buckets are a
little more empty.
Let's see an example.
Here's a HashSet, we've added four elements so far.
Elephant, dog, cat and fish.
buckets elements
------- -------
0
1 elephant
2 cat ->dog
3 fish
4
5
At this point, I've decided that the loadFactor, the
threshold,
the average number of elements per bucket that I'm okay
with, is 0.75.
The number of buckets is buckets.length, which is 6, and
at this point our HashSet has four elements, so the
current size is 4.
We'll resize our HashSet, that is we'll add more buckets,
when the average number of elements per bucket exceeds
the loadFactor.
That is when current size divided by buckets.length is
greater than loadFactor.
At this point, the average number of elements per bucket
is 4 divided by 6.
4 elements, 6 buckets, that's 0.67.
That's less than the threshold I set of 0.75 so we're
okay.
We don't need to resize.
But now let's say we add woodchuck.
buckets elements
------- -------
0
1 elephant
2 woodchuck-> cat ->dog
3 fish
4
5
Woodchuck would end up in bucket number 3.
At this point, the currentSize is 5.
And now the average number of elements per bucket
is the currentSize divided by buckets.length.
That's 5 elements divided by 6 buckets is 0.83.
And this exceeds the loadFactor which was 0.75.
In order to address this problem, in order to make the
buckets perhaps a little
more empty so that operations like determining whether a
bucket contains
an element will be a little less complex, I wanna resize
my HashSet.
Resizing the HashSet takes two steps.
First I'll double the number of buckets, I had 6 buckets,
now I'm going to have 12 buckets.
Note here that the loadFactor which I set to 0.75 stays the same.
But the number of buckets changed is 12,
the number of elements stayed the same, is 5.
5 divided by 12 is around 0.42, that's well under our
loadFactor,
so we're okay now.
But we're not done because some of these elements are in
the wrong bucket now.
For instance, elephant.
Elephant was in bucket number 2 because the number of
characters in elephant
was 8.
We have 6 buckets, 8 minus 6 is 2.
That's why it ended up in number 2.
But now that we have 12 buckets, 8 mod 12 is 8, so
elephant does not belong in bucket number 2 anymore.
Elephant belongs in bucket number 8.
What about woodchuck?
Woodchuck was the one that started this whole problem.
Woodchuck ended up in bucket number 3.
Because 9 mod 6 is 3.
But now we do 9 mod 12.
9 mod 12 is 9, woodchuck goes to bucket number 9.
And you see the advantage of all this.
Now bucket number 3 only has two elements whereas before
it had 3.
So here's our code,
where we had our HashSet with separate chaining that
didn't do any resizing.
Now, here's a new implementation where we use resizing.
Most of this code is the same,
we're still going to determine whether it contains the
value already.
If it doesn't, then we'll figure it out which bucket it
should go into and
then add it to that bucket, add it to that LinkedList.
But now we increment the currentSize field.
currentSize was the field that kept track of the number
of elements in our HashSet.
We're going to increment it and then we're going to look
at the average load,
the average number of elements per bucket.
We'll do that division down here.
We have to do a little bit of casting here to make sure
that we get a double.
And then, we'll compare that average load to the field
that I've set as
0.75 when I created this HashSet, for instance, which was
the loadFactor.
If the average load is greater than the loadFactor,
that means there's too many elements per bucket on
average, and I need to reinsert.
So here's our implementation of the method to reinsert
all the elements.
First, I'll create a local variable called oldBuckets.
Which is referring to the buckets as they currently stand
before I start resizing everything.
Note I'm not creating a new array of linked lists just yet.
I'm just renaming buckets as oldBuckets.
Now remember buckets was a field in our class, I'm going
to now create a new array
of linked lists but this will have twice as many elements
as it did the first time.
Now I need to actually do the reinserting,
I'm going to iterate through all of the old buckets.
Each element in oldBuckets is a LinkedList of strings
that is a bucket.
I'll go through that bucket and get each element in that
bucket.
And now I'm gonna reinsert it into the newBuckets.
I will get its hashCode.
I will figure out which index it is.
And now I get the new bucket, the new LinkedList of
strings and
I'll add it to that new bucket.
So to recap, HashSets as we've seen are arrays of Linked
Lists, or buckets.
A self resizing HashSet can realize using some ratio or
I would pick a table size of n * 1.5 or n + (n >> 1), this would give a load factor of .66666~ without division, which is slow on most systems, especially on portable systems where there is no division in the hardware.
In my program key-value pairs are frequently added to a Map until 1G of pairs are added. Map resizing slows down the process. How can I set minimum Map size to, for example 1000000007 (which is a prime)?
The constructor of a HashMap takes the initial size of the map (and the load factor, if desired).
Map<K,V> map = new HashMap<>(1_000_000_007);
How can I set minimum Map size to, for example 1000000007 (which is a prime)?
Using the HashMap(int) or HashMap(int, float) constructor. The int parameter is the capacity.
HashMap should have a size that is prime to minimize clustering.
Past and current implementations of the HashMap constructor will all choose a capacity that is the smallest power of 2 (up to 230) which is greater or equal to the supplied capacity. So using a prime number has no effect.
Will the constructor prevent map from resizing down?
HashMaps don't resize down.
(Note that size and capacity are different things. The size() method returns the number of currently entries in the Map. You can't "set" the size.)
A could of things you should note. The number of buckets in a HashMap is a power of 2 (might not be in future), the next power of 2 is 2^30. The load factor determines at what size it should grow the Map. Typically this is 0.75.
If you set the capacity to be the expected size, it will;
round up to the next power of 2
might still resize when the capacity * 0.75 is reached.
is limited to 2^30 anyway as it is the largest power of 2 you can have for the size of an array.
Will the constructor prevent map from resizing down?
The only way to do this is to copy all the elements into a new Map. This is not done automatically.
I tried to create a HashMap with the following details:-
HashMap<Integer,String> test = new HashMap<Integer,String>();
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
and I saw that every input was put in a different bucket. Which means a different hash code was calculated for each key.
Now,
if I modify my code as follows :-
HashMap<Integer,String> test = new HashMap<Integer,String>(16,2.0f);
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
I find that some of the values which were put in different buckets are now put in a bucket which already contains some values even though their hash value is different. Can anyone please help me understand the same ?
Thanks
So, if you initialize a HashMap without specifying an initial size and a load factor it will get initialized with a size of 16 and a load factor of 0.75. This means, once the HashMap is at least (initial size * load factor) big, so 12 elements big, it will be rehashed, which means, it will grow to about twice the size and all elements will be added anew.
You now set the load factor to 2, which means, now the Map will only get rehashed, when it is filled with at least 32 elements.
What happens now is that elements with the same hash mod bucketcount will be put into the same bucket. Each bucket with more then one element contains a list, where all the elements are put into. Now when you try to lookup one of the elements it first finds the bucket using the hash. Then it has to iterate over the whole list in that bucket to find the hash with the exact match. This is quite costly.
So in the end there is a trade-off: rehashing is pretty expensive, so you should try to avoid it. On the other hand, if you have multiple elements in a bucket, the lookup gets pretty expensive, so you should really try to avoid that as well. So you need a balance between those two. One other way to go is to set the initial size quite high, but that takes up more memory that is not used.
In your second test, the initial capacity is 16 and the load factor is 2. This means the HashMap will use an array of 16 elements to store the entries (i.e. there are 16 buckets), and this array will be resized only when the number of entries in the Map reaches 32 (16 * 2).
This means that some keys having different hashCodes must be stored in the same bucket, since the number of buckets (16) is smaller than the total number of entries (20 in your case).
The assignment of a key to a bucket is calculated in 3 steps :
First the hashCode method is called.
Then an additional function is applied on the hashCode to reduce the damage that may be caused by bad hashCode implementations.
Finally a modulus operation is applied on the result of the previous step to get a number between 0 and capacity - 1.
The 3rd step is where keys having different hashCodes may end up in the same bucket.
Lets check it with examples -
i) In first case, load factor is 0.75f and initialCapacity is 16 which means array resize will occur when number of buckets in HashMap reaches 16*0.75 = 12.
Now, every key has different HashCode so that HashCode modulo 16 is unique which causes all first 12 entries to go to different buckets after which resize occur and when new entries are put they also end up in different buckets (HashCode modulo 32 being unique.)
ii) In second case, load factor is 2.0f which means resize will happen when no. of buckets reaches 16*2 = 32.
You keep on putting entries in map and it never resizes (for the 20 entries) making multiple entries collide.
So, in nutshell in first example - HashCode modulo 16 for first 12 entries and HashCode modulo 32 for all entries is unique while in second case it is always HashCode modulo 16 for all entries which is not unique (cannot be as all 20 entries have to be accommodated in 16 buckets)
The javadoc explanation:
An instance of HashMap has two parameters that affect its performance:
initial capacity and load factor. The capacity is the number of
buckets in the hash table, and the initial capacity is simply the
capacity at the time the hash table is created. The load factor is a
measure of how full the hash table is allowed to get before its
capacity is automatically increased. When the number of entries in the
hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data
structures are rebuilt) so that the hash table has approximately twice
the number of buckets.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
By default,initial capacity is 16 and load factor is 0.75.
So when number of entries goes beyond 12 (16 * 0.75),its capacity is increased to 32 and hashtable is rehashed. That is why in your first case, every different element is having its own bucket.
In your second case,only when the number of entries crossess 32(16*2), hash table will be resized. Even if the elements are having different hash code values, when hashcode%bucketsize is calculated, it may collide. That is the reason you are seeing more than one element in same bucket
As per my understanding, and what I have read
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased
So, when loadfactor is .8(80%), with map size of 10, Map will grow by size 10, when 8 elements are put in Map.
So, now Map has size 20. My doubt is when next 10 element space will be added to Map.
when Map is again 80% full, that is when 16 elements are put in Map.
or
When 18 elements are put in Map.
That will be at 16. If you look at the java code for HashMap:
threshold = (int)(newCapacity * loadFactor);
where new capacity is the new size. Therefore the limit in your example will be 16.
Loadfactor of 80%, so 16 elements. It will calculate the resizing depending on the total amount of elements that are in there and the max capacity at that time.
It doesn't keep track of the last resizing.
A HashMap has a size() and a capacity, and these are two different things. Capacity is an internal size of a hash table and is always a power of two, so HashMap can't have capacity 20. Size is the number of hash entries which were put by user into this map.
When you declare a HashMap
Map map = new HashMap(20)
It's actual capacity is 32 and a threshold is 24. It's size is zero.
Map map = new HashMap()
For this case map has size 0 and the default capacity 16.
Threshold:
threshold = (int)(newCapacity * loadFactor) = 32 * 0.8 = 25;
Which is 25 for load factor 0.8. So as soon as your map reaches a size of 25 entries, it will be resized to capacity 64 containing same 25 entries.
Every time a resize of the map happens, threshold is recalculated as;
threshold = (int)(newCapacity * loadFactor);
So in your example, it will be 16.
Please refer the source of HashMap here.