HashMap - threshold, load factor & capacity - Java

I've always been told that a HashMap will resize once its size exceeds loadFactor * capacity, which is what the JDK comment for threshold says.
But after reading the source code of HashMap in JDK 8 (the put method, for example), it looks like the map resizes when the next size exceeds threshold, and before the first put operation threshold holds the power-of-two capacity rather than capacity * loadFactor. Even during resizing, the new threshold is simply double the old threshold, not newCapacity * loadFactor.
Is there a mismatch with the JDK doc, or am I misunderstanding something? Any suggestions would be appreciated.

Because the new capacity is double the old capacity, doubling the old threshold gives exactly the same value: 2 * (oldCapacity * loadFactor) = newCapacity * loadFactor. As for the first put: the threshold field temporarily holds the requested (power-of-two) table size until the table is actually allocated, and once the first put triggers that allocation the threshold is recomputed as capacity * loadFactor. So there is no mismatch with the doc.
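A small sketch of how to observe this yourself, assuming JDK 8 internals and a hypothetical class name; it peeks at the threshold field via reflection (on JDK 9+ you may need --add-opens java.base/java.util=ALL-UNNAMED):
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class ThresholdDemo {
    public static void main(String[] args) throws Exception {
        Field threshold = HashMap.class.getDeclaredField("threshold");
        threshold.setAccessible(true);

        Map<String, String> map = new HashMap<>(20);    // requested capacity 20
        System.out.println(threshold.getInt(map));      // 32: tableSizeFor(20), table not allocated yet

        map.put("first", "entry");                      // first put allocates the table
        System.out.println(threshold.getInt(map));      // 24: capacity * loadFactor = 32 * 0.75
    }
}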

Related

Resizing of ArrayDeque

Quote: Default initial capacity of ArrayDeque is 16. It will increase at a power of 2 (2^4, 2^5, 2^6 and so on) when size exceeds capacity.
Does this mean it behaves similarly to ArrayList? Each time size exceeds capacity there is a new array to which the older elements are copied? Can I say the internal implementation of ArrayDeque and ArrayList is an array (as their names say), and just the resizing differs?
Yes, ArrayDeque behaves similarly to ArrayList: internally it uses an Object array. If the capacity is not enough, it creates a new, larger array and copies the items from the old array to the new one.
The Java API specification does not require any particular resizing behavior. In fact, the current implementation in OpenJDK doubles the size of the array if it's small (less than 64 elements), otherwise it grows by 50%:
// Double capacity if small; else grow by 50%
int jump = (oldCapacity < 64) ? (oldCapacity + 2) : (oldCapacity >> 1);
It seems that the "doubling" behavior is approximate: thanks to the "+2", after the first resize the capacity is 16 + 16 + 2 = 34. After the second resize it's 34 + 34 + 2 = 70. After that the array grows by 50% on every resize.
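To make the progression concrete, here is a minimal sketch (a hypothetical standalone class, not OpenJDK code) that replays the quoted grow logic starting from the default capacity of 16:
public class DequeGrowthSketch {
    public static void main(String[] args) {
        int capacity = 16;                  // ArrayDeque's default capacity
        for (int resize = 1; resize <= 5; resize++) {
            // Double capacity if small; else grow by 50% (mirrors the quoted OpenJDK logic)
            int jump = (capacity < 64) ? (capacity + 2) : (capacity >> 1);
            capacity += jump;
            System.out.println("after resize " + resize + ": " + capacity);
        }
        // prints 34, 70, 105, 157, 235
    }
}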

Set minimum size of a Map in Java

In my program, key-value pairs are frequently added to a Map until 1G of pairs have been added. Map resizing slows the process down. How can I set a minimum Map size of, for example, 1000000007 (which is a prime)?
The HashMap constructor takes the initial capacity of the map (and the load factor, if desired).
Map<K,V> map = new HashMap<>(1_000_000_007);
How can I set minimum Map size to, for example 1000000007 (which is a prime)?
Using the HashMap(int) or HashMap(int, float) constructor. The int parameter is the capacity.
HashMap should have a size that is prime to minimize clustering.
Past and current implementations of the HashMap constructor will all choose a capacity that is the smallest power of 2 (up to 2^30) which is greater than or equal to the supplied capacity. So using a prime number has no effect.
Will the constructor prevent map from resizing down?
HashMaps don't resize down.
(Note that size and capacity are different things. The size() method returns the number of entries currently in the Map. You can't "set" the size.)
A couple of things you should note. The number of buckets in a HashMap is a power of 2 (this might not be the case in future), and the next power of 2 above 1000000007 is 2^30. The load factor determines at what size the Map should grow; typically this is 0.75.
If you set the capacity to the expected size, it will:
round up to the next power of 2;
possibly still resize when capacity * 0.75 is reached (see the sizing sketch below);
be limited to 2^30 in any case, as that is the largest power of 2 you can have for the size of an array.
Will the constructor prevent map from resizing down?
The only way to do this is to copy all the elements into a new Map. This is not done automatically.
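As a sketch of the pre-sizing idea (presizedMap is a hypothetical helper, not a JDK method): pick an initial capacity such that capacity * loadFactor is at least the expected number of entries, keeping in mind the 2^30 cap.
import java.util.HashMap;
import java.util.Map;

public class PresizedMapSketch {
    static <K, V> Map<K, V> presizedMap(long expectedEntries, float loadFactor) {
        long needed = (long) Math.ceil(expectedEntries / (double) loadFactor);
        // HashMap never uses more than 2^30 buckets, so requests above that are effectively clamped
        int capacity = (int) Math.min(needed, 1L << 30);
        return new HashMap<>(capacity, loadFactor);
    }

    public static void main(String[] args) {
        Map<Long, Long> map = presizedMap(1_000_000_007L, 0.75f);
        // fill the map; no resize happens until the threshold (capacity * loadFactor) is exceeded
    }
}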

Map load factor, how map grows

As per my understanding, and what I have read
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased
So, when the load factor is .8 (80%) and the Map has a capacity of 10, the Map will grow by 10 when 8 elements have been put into it.
So now the Map has capacity 20. My doubt is when the next 10-element space will be added to the Map:
when the Map is again 80% full, that is when 16 elements are put in the Map,
or
when 18 elements are put in the Map.
That will be at 16. If you look at the Java code for HashMap:
threshold = (int)(newCapacity * loadFactor);
where newCapacity is the new capacity (20 in your example), so the threshold will be 16.
A load factor of 80%, so at 16 elements. It calculates when to resize based on the total number of elements in the map and the capacity at that time.
It doesn't keep track of the last resizing.
A HashMap has a size() and a capacity, and these are two different things. Capacity is the internal size of the hash table and is always a power of two, so a HashMap can't have a capacity of 20. Size is the number of entries the user has put into the map.
When you declare a HashMap:
Map map = new HashMap(20)
its actual capacity is 32 and its threshold is 24 (with the default load factor of 0.75). Its size is zero.
Map map = new HashMap()
For this case, the map has size 0 and the default capacity of 16.
Threshold:
threshold = (int)(newCapacity * loadFactor) = (int)(32 * 0.8) = 25;
which is 25 for a load factor of 0.8. So once your map grows past that threshold of 25 entries, it will be resized to capacity 64, still containing the same entries.
Every time a resize of the map happens, the threshold is recalculated as:
threshold = (int)(newCapacity * loadFactor);
So in your example, it will be 16.
Please refer to the source of HashMap here.
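A tiny sketch (hypothetical class, plain arithmetic only) that replays the capacity/threshold progression discussed above for new HashMap(20) with a load factor of 0.8:
public class ThresholdProgression {
    public static void main(String[] args) {
        float loadFactor = 0.8f;
        int capacity = 32;                              // tableSizeFor(20) rounds up to 32
        for (int i = 0; i < 3; i++) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity=" + capacity + " threshold=" + threshold);
            capacity <<= 1;                             // each resize doubles the capacity
        }
        // prints capacity=32 threshold=25, capacity=64 threshold=51, capacity=128 threshold=102
    }
}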

ArrayList VS Vector JVM Memory usage

Does using an ArrayList take less memory when compared to a Vector? I have read that a Vector doubles its internal array size when it reaches the maximum size, whereas an ArrayList grows by only half. Is this a true statement? I need the answer for the case where I do not construct the Vector with values for initialCapacity and capacityIncrement.
Yes you are correct in terms of memory allocation of internal arrays:
Internally, both the ArrayList and Vector hold onto their contents using an Array. When an element is inserted into an ArrayList or a Vector, the object will need to expand its internal array if it runs out of room. A Vector defaults to doubling the size of its array, while the ArrayList increases its array size by 50 percent.
Correction
A Vector will not always double its capacity. It may instead grow only by the increment given in the constructor:
public Vector(int initialCapacity, int capacityIncrement)
The logic in the grow method is to double the capacity if no increment was given, otherwise to grow by capacityIncrement. Here is the code of Vector's grow method:
private void grow(int minCapacity) {
    // overflow-conscious code
    int oldCapacity = elementData.length;
    int newCapacity = oldCapacity + ((capacityIncrement > 0) ?
                                     capacityIncrement : oldCapacity);
    if (newCapacity - minCapacity < 0)
        newCapacity = minCapacity;
    if (newCapacity - MAX_ARRAY_SIZE > 0)
        newCapacity = hugeCapacity(minCapacity);
    elementData = Arrays.copyOf(elementData, newCapacity);
}
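A quick runnable check of that behaviour (a hypothetical demo class), using Vector's public capacity() method:
import java.util.Vector;

public class VectorGrowthDemo {
    public static void main(String[] args) {
        Vector<Integer> doubling = new Vector<>(10);        // no capacityIncrement: capacity doubles
        Vector<Integer> stepped  = new Vector<>(10, 5);     // grows by 5 each time it is full

        for (int i = 0; i < 11; i++) {                      // the 11th add forces both to grow
            doubling.add(i);
            stepped.add(i);
        }
        System.out.println(doubling.capacity());            // 20 (10 doubled)
        System.out.println(stepped.capacity());             // 15 (10 + increment of 5)
    }
}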
There is no comparison between Vector and ArrayList as they fit different purposes. Vector was supposed to be a concurrency safe List implementation. However, the design of the class was severely flawed and did not provide concurrency guarantees for the most common use case of iteration.
Vector itself is easily replaced with Collections.synchronizedList(new ArrayList()). The result of course contains the same flaw as Vector. Vector should be considered deprecated.
The use of Vector is now a mark of naivety in understanding Java and concurrent programming. Don't use it.
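For reference, a minimal sketch of the Collections.synchronizedList replacement mentioned above; note that, exactly like Vector, the wrapper does not make iteration atomic, so the caller must hold the list's lock for the whole traversal:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SynchronizedListSketch {
    public static void main(String[] args) {
        List<String> list = Collections.synchronizedList(new ArrayList<>());
        list.add("a");
        list.add("b");

        synchronized (list) {            // required for safe iteration, per the Collections javadoc
            for (String s : list) {
                System.out.println(s);
            }
        }
    }
}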
To answer the original question:
ArrayList by default will grow the capacity by half of the current capacity. However, at any time, the program may call ensureCapacity to set the capacity to an appropriately large value.
Vector by default will grow the capacity by doubling. However, there is a constructor that allows setting the grow amount. Using a small grow value will have a negative impact on performance. Additionally, you could actually get less capacity since each grow requires a duplicate array to exist in memory for a short period of time.
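A small sketch of the ensureCapacity option mentioned above (the class name and the 10_000 figure are just illustrative):
import java.util.ArrayList;

public class EnsureCapacitySketch {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        list.ensureCapacity(10_000);     // one allocation up front instead of repeated grows

        for (int i = 0; i < 10_000; i++) {
            list.add("item-" + i);       // no internal array copies during this loop
        }
    }
}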
In a comment, the OP has stated:
The application pulls huge data set and we are currently facing out of memory due to maxing out the heap size
First, both Vector and ArrayList will throw an OutOfMemoryError if the program tries to grow the capacity beyond a set limit. You need to be sure that the OOME does not originate from the hugeCapacity method of the Vector class. If this is the case, then perhaps you could try a LinkedList.
Second, what is your current heap size? The default JVM heap size is rather small. The intent is to avoid pauses or choppy behavior from a full GC becoming apparent to the user of an applet. However, the heap size is also often far too small for a reasonably sophisticated application or a fairly dumb service. The -Xmx JVM arg could be used to increase the heap size.
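A quick way to see the heap ceiling your JVM is actually running with (a hypothetical one-off class):
public class HeapSizeCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();   // roughly the -Xmx limit
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MiB");
    }
}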

Choosing the initial capacity of a HashSet with an expected number of unique values and insertions

Ok, here's my situation:
I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.
However when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?
From googling, I have come up with:
String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);
The problem with this is that allStates can contain anywhere between 1 and 5000 states. So the Set could have a capacity of over 5000, but contain at most 50.
So alternatively, the initial capacity of the Set could be set to the maximum number of states, and the load factor to 1.
I guess my questions really are:
What should you set the initial capacity to be when you don't know how many items are to be in the Set?
Does it really matter what it gets set to when the most it could contain is 50?
Should I even be worrying about it?
Assuming that you know there won't be more than 50 states (do you mean US States?), the
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);
quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.
I also feel the need to point out that you're probably overthinking this intensely. Resizing the set twice from 16 up to 64 isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.
So the best answer is probably to use:
new HashSet<String>();
That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.
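As a usage sketch of that advice (the sample data is made up), the whole de-duplication is just:
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class DedupSketch {
    public static void main(String[] args) {
        String[] allStates = {"CA", "NY", "CA", "TX", "NY"};   // stand-in for getAllStates()
        Set<String> uniqueStates = new HashSet<String>();      // default capacity and load factor
        Collections.addAll(uniqueStates, allStates);
        System.out.println(uniqueStates.size());               // 3
    }
}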
Use the constructor where you don't need to specify these values, then reasonable defaults are chosen.
First, I'm going to say that in your case you're definitely overthinking it. However, there are probably situations where one would want to get it right. So here's what I understand:
1) The number of items you can hold in your HashSet = initial capacity x load factor. So if you want to be able to hold n items, you need to do what Zarkonnen did and divide n by the load factor.
2) Under the covers, the initial capacity is rounded up to a power of two per Oracle tutorial.
3) Load factor should be no more than .80 to prevent excessive collisions, as noted by Tom Hawtin - tackline.
If you just accept the default values (initial capacity = 16, load factor = .75), you'll end up doubling your set in size 3 times. (Initial max size = 12, first increase makes capacity 32 and max size 24 (32 * .75), second increase makes capacity 64 and max size 48 (64 * .75), third increase makes capacity 128 and max size 96 (128 * .75).)
To get your max size closer to 50, but keep the set as small as possible, consider an initial capacity of 64 (a power of two) and a load factor of .79 or more. 64 * .79 = 50.56, so you can get all 50 states in there. Specifying 32 < initial capacity < 64 will result in initial capacity being rounded up to 64, so that's the same as specifying 64 up front. Specifying initial capacity <= 32 will result in a size increase. Using a load factor < .79 will also result in a size increase unless your initial capacity > 64.
So my recommendation is to specify initial capacity = 64 and load factor = .79.
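That recommendation as code (a sketch; 64 and 0.79f come from the reasoning above):
import java.util.HashSet;
import java.util.Set;

public class SizedSetSketch {
    public static void main(String[] args) {
        // 64 * 0.79 = 50.56, so the threshold is 50 and all 50 states fit without a resize
        Set<String> uniqueStates = new HashSet<String>(64, 0.79f);
    }
}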
The safe bet is to go for a size that is too small.
Because resizing is amortized by an exponential growth algorithm (see the stackoverflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversized.
Load factor is a tricky one. I suggest leaving it at the default. My understanding is: below about 0.70f you are making the array too large and therefore slower; above 0.80f you'll start getting too many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.
Also note that the "initial capacity" means something slightly different than most people appear to think. It refers to the number of slots in the internal array, not the number of elements you can add. To get the capacity needed for a given number of elements, divide by the desired load factor (and round up appropriately).
Make a good guess. There is no hard rule. If you know there are likely to be, say, 10-20 states, I'd start off with that number (20).
I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hotspot of your application, it might be worth the effort to look at it and try to optimise; otherwise, CPU cycles are cheaper than burning up your own neurons.
If you were to optimize this -- and it may be appropriate to do that -- some of your decision will depend on how many duplicates you expect the array to have.
If there are very many duplicates, you will want a smaller initial capacity. Large, sparse hash tables are bad when iterating.
If there are not expected to be very many duplicates, you will want an initial capacity such that the entire array could fit without resizing.
My guess is that you want the latter, but this is something worth considering if you pursue this.
