I tried to create a HashMap with the following entries:
HashMap<Integer,String> test = new HashMap<Integer,String>();
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
and I saw that every entry was put in a different bucket, which means a different hash code was calculated for each key.
Now, if I modify my code as follows:
HashMap<Integer,String> test = new HashMap<Integer,String>(16,2.0f);
test.put(1, "Value1");
test.put(2, "Value2");
test.put(3, "Value3");
test.put(4, "Value4");
test.put(5, "Value5");
test.put(6, "Value6");
test.put(7, "Value7");
test.put(8, "Value8");
test.put(9, "Value9");
test.put(10, "Value10");
test.put(11, "Value11");
test.put(12, "Value12");
test.put(13, "Value13");
test.put(14, "Value14");
test.put(15, "Value15");
test.put(16, "Value16");
test.put(17, "Value17");
test.put(18, "Value18");
test.put(19, "Value19");
test.put(20, "Value20");
I find that some of the values that previously went into different buckets are now put into a bucket that already contains other values, even though their hash values are different. Can anyone please help me understand why?
Thanks
If you initialize a HashMap without specifying an initial capacity and a load factor, it gets created with a capacity of 16 and a load factor of 0.75. This means that once the HashMap holds at least (initial capacity * load factor) entries, i.e. 12 entries, it is rehashed, meaning it grows to about twice its size and all elements are added anew.
You now set the load factor to 2, which means the map will only get rehashed once it holds at least 32 entries.
What happens now is that elements with the same hash modulo the bucket count are put into the same bucket. Each bucket with more than one element contains a list holding all of those elements. When you look one of them up, the map first finds the bucket using the hash, then it has to iterate over the whole list in that bucket to find the entry whose key matches exactly. This is quite costly.
So in the end there is a trade-off: rehashing is pretty expensive, so you should try to avoid it. On the other hand, having multiple elements in a bucket makes lookups expensive, so you should really try to avoid that as well. You need a balance between the two. Another option is to set the initial capacity quite high, but that takes up more memory, much of which never gets used.
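A quick way to see this trade-off in your two snippets is to compute the resize threshold (capacity * load factor) for each construction. The little class below is only a sketch of that arithmetic under the documented defaults (names like ThresholdSketch are mine, not part of any API):

public class ThresholdSketch {
    public static void main(String[] args) {
        // Default construction: new HashMap<>() uses capacity 16 and load factor 0.75.
        int defaultCapacity = 16;
        float defaultLoadFactor = 0.75f;
        // A rehash is triggered once the entry count exceeds capacity * loadFactor.
        System.out.println("default threshold = " + (int) (defaultCapacity * defaultLoadFactor)); // 12

        // Your second construction: new HashMap<>(16, 2.0f).
        int customCapacity = 16;
        float customLoadFactor = 2.0f;
        System.out.println("custom threshold  = " + (int) (customCapacity * customLoadFactor)); // 32

        // With 20 entries, the first map has already doubled to 32 buckets,
        // while the second still has only 16 buckets that the 20 keys must share.
    }
}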
In your second test, the initial capacity is 16 and the load factor is 2. This means the HashMap will use an array of 16 elements to store the entries (i.e. there are 16 buckets), and this array will be resized only when the number of entries in the Map reaches 32 (16 * 2).
This means that some keys having different hashCodes must be stored in the same bucket, since the number of buckets (16) is smaller than the total number of entries (20 in your case).
The assignment of a key to a bucket is calculated in 3 steps:
First the hashCode method is called.
Then an additional function is applied to the hashCode to reduce the damage that may be caused by bad hashCode implementations.
Finally a modulus operation is applied to the result of the previous step to get a number between 0 and capacity - 1.
The 3rd step is where keys having different hashCodes may end up in the same bucket.
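To illustrate, here is roughly what those three steps look like in code. The spread function shown is the one used by OpenJDK 8's HashMap (h ^ (h >>> 16)), but treat the whole class (BucketIndexSketch is my own name) as a sketch of the idea rather than the actual implementation:

public class BucketIndexSketch {

    // Step 2: mix the high bits into the low bits so that poor hashCode
    // implementations still spread across buckets (as in OpenJDK 8's HashMap.hash).
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Step 3: reduce the spread hash to a bucket index. Because the table
    // length is always a power of two, (length - 1) & hash behaves like a modulus.
    static int bucketIndex(Object key, int tableLength) {
        int h = spread(key.hashCode());   // Steps 1 + 2
        return (tableLength - 1) & h;     // Step 3
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(1, 16));   // 1
        System.out.println(bucketIndex(17, 16));  // 1  -> same bucket, different hashCode
        System.out.println(bucketIndex(17, 32));  // 17 -> no longer collides
    }
}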
Let's check it with examples:
i) In the first case, the load factor is 0.75f and the initial capacity is 16, which means an array resize will occur when the number of entries in the HashMap reaches 16 * 0.75 = 12.
Now, every key has a different hashCode, and hashCode modulo 16 is unique for them, so the first 12 entries all go to different buckets; after that the resize occurs, and the entries put afterwards also end up in different buckets (hashCode modulo 32 being unique as well).
ii) In the second case, the load factor is 2.0f, which means the resize will only happen when the number of entries reaches 16 * 2 = 32.
You keep putting entries into the map and it never resizes (for the 20 entries), so multiple entries collide.
So, in a nutshell: in the first example, hashCode modulo 16 is unique for the first 12 entries and hashCode modulo 32 is unique for all entries, while in the second case it is always hashCode modulo 16 for all entries, which cannot be unique since all 20 entries have to be accommodated in 16 buckets.
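A small sketch (the class name CollisionDemo is mine) makes this visible; the bit masks stand in for the modulo step, assuming table lengths of 16 and 32 as described above:

public class CollisionDemo {
    public static void main(String[] args) {
        // For small Integer keys the spread hash equals the key itself,
        // so the bucket index is simply key & (tableLength - 1).
        for (int key = 1; key <= 20; key++) {
            int bucketIn16 = key & 15;  // second case: table never grows past 16
            int bucketIn32 = key & 31;  // first case: table has doubled to 32
            System.out.printf("key=%2d  bucket(16)=%2d  bucket(32)=%2d%n",
                    key, bucketIn16, bucketIn32);
        }
        // Keys 17..20 share buckets 1..4 with keys 1..4 in the 16-bucket table,
        // while in the 32-bucket table each of the 20 keys gets its own bucket.
    }
}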
The javadoc explanation:
An instance of HashMap has two parameters that affect its performance:
initial capacity and load factor. The capacity is the number of
buckets in the hash table, and the initial capacity is simply the
capacity at the time the hash table is created. The load factor is a
measure of how full the hash table is allowed to get before its
capacity is automatically increased. When the number of entries in the
hash table exceeds the product of the load factor and the current
capacity, the hash table is rehashed (that is, internal data
structures are rebuilt) so that the hash table has approximately twice
the number of buckets.
As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs. Higher values decrease the
space overhead but increase the lookup cost (reflected in most of the
operations of the HashMap class, including get and put). The expected
number of entries in the map and its load factor should be taken into
account when setting its initial capacity, so as to minimize the
number of rehash operations. If the initial capacity is greater than
the maximum number of entries divided by the load factor, no rehash
operations will ever occur.
By default, the initial capacity is 16 and the load factor is 0.75.
So when the number of entries goes beyond 12 (16 * 0.75), the capacity is increased to 32 and the hash table is rehashed. That is why, in your first case, every element gets its own bucket.
In your second case, the hash table will only be resized when the number of entries crosses 32 (16 * 2). Even if the elements have different hash code values, when hashcode % bucketsize is calculated they may collide. That is the reason you are seeing more than one element in the same bucket.
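If you want to follow the javadoc's last sentence and avoid rehashing entirely, you can size the map up front. The helper below is only a sketch of that rule (initial capacity >= expected entries / load factor); newPresizedMap is a hypothetical name, not a JDK method:

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {

    // Hypothetical helper: pick an initial capacity large enough that
    // expectedEntries never exceeds capacity * loadFactor, so no rehash occurs.
    static <K, V> Map<K, V> newPresizedMap(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<>(initialCapacity, loadFactor);
    }

    public static void main(String[] args) {
        // 20 / 0.75 = 26.7 -> ask for 27; HashMap rounds this up to the next
        // power of two (32), so all 20 entries fit without a single rehash.
        Map<Integer, String> test = newPresizedMap(20, 0.75f);
        for (int i = 1; i <= 20; i++) {
            test.put(i, "Value" + i);
        }
        System.out.println(test.size()); // 20
    }
}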
The performance of a HashMap depends on the load factor (l) and the capacity (c). If the number of entries in the map is greater than or equal to (l * c), it changes its internal data structures, i.e. it increases the capacity (the size of the bucket array). My question is: how does it calculate the number of entries in a HashMap to check this condition? Is it the total number of (key, value) pairs in the map, or the number of occupied slots in the bucket array? If it is the number of occupied slots, how are those kept track of? I'm assuming chaining is being used to resolve collisions.
The load factor is the ratio between the number of elements the map holds and its capacity (i.e. how many buckets you have).
So with a simple array of 10 slots and a load factor of .75, the moment the number of elements divided by the capacity reaches 75% or more (i.e. once the 8th element is added), the data structure must grow in order to lower that ratio.
The HashMap keeps track of the total number of (key, value) mappings it holds, not the number of occupied buckets; it updates that counter on every add/remove operation and checks the resulting ratio against the load factor.
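In other words, the map maintains a plain counter of mappings (what size() returns) and compares it against a precomputed threshold. The class below is a rough sketch of that bookkeeping, not the real OpenJDK code:

class ResizeCheckSketch {
    int size;                      // number of key-value mappings, i.e. what size() returns
    int capacity = 16;             // current number of buckets
    float loadFactor = 0.75f;
    int threshold = (int) (capacity * loadFactor);   // 12 for the defaults

    void afterInsertOfNewKey() {
        if (++size > threshold) {
            capacity *= 2;                           // double the bucket array...
            threshold = (int) (capacity * loadFactor);
            // ...and redistribute the existing entries over the new buckets.
        }
    }

    public static void main(String[] args) {
        ResizeCheckSketch sketch = new ResizeCheckSketch();
        for (int i = 1; i <= 13; i++) {
            sketch.afterInsertOfNewKey();
        }
        // After the 13th mapping the counter has passed the threshold of 12.
        System.out.println(sketch.capacity + " buckets, threshold " + sketch.threshold); // 32 buckets, threshold 24
    }
}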
Suppose I need to store 1000 objects in a HashSet. Is it better to have 1000 buckets, each containing one object (by generating a unique hashCode value for each object), or to have 10 buckets containing roughly 100 objects each?
One advantage of having a unique bucket per object is that I can save execution cycles on calling the equals() method?
Why is it important to have a set number of buckets and distribute the objects among them as evenly as possible?
What should be the ideal object to bucket ratio?
Why is it important to have a set number of buckets and distribute the objects among them as evenly as possible?
A HashSet should be able to determine membership in O(1) time on average. From the documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
The algorithm a HashSet uses to achieve this is to retrieve the hash code of the object and use it to find the correct bucket. Then it iterates over all the items in that bucket until it finds one that is equal. If the number of items in the bucket is more than some constant, lookup will take longer than O(1) time.
In the worst case - if all items hash to the same bucket - it will take O(n) time to determine if an object is in the set.
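You can provoke that worst case yourself with a key type whose hashCode is deliberately constant (BadKey below is a made-up class, not anything from the JDK); every element lands in the same bucket, so contains has to examine all of them:

import java.util.HashSet;
import java.util.Set;

public class WorstCaseDemo {

    // Hypothetical key whose hashCode sends every instance to the same bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }          // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            set.add(new BadKey(i));
        }
        // All 1000 elements share one bucket, so this membership test has to walk
        // (or, on newer JDKs, search a tree of) them instead of finishing in O(1).
        System.out.println(set.contains(new BadKey(999)));
    }
}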
What should be the ideal object to bucket ratio?
There is a space-time tradeoff here. Increasing the number of buckets decreases the chance of collisions. However, it also increases memory requirements. The hash set has two parameters, initialCapacity and loadFactor, that allow you to adjust how many buckets the HashSet should create. The default load factor is 0.75, and this is fine for most purposes, but if you have special requirements you can choose another value.
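For example, if you expect roughly 1000 objects you could pass an initial capacity and load factor to the constructor up front; the numbers below are only illustrative:

import java.util.HashSet;
import java.util.Set;

public class PresizedSet {
    public static void main(String[] args) {
        // Roughly 1000 expected elements: 1000 / 0.75 ~ 1334 buckets requested,
        // which HashSet (backed by a HashMap) rounds up to a power of two (2048),
        // so no rehashing should happen while the 1000 objects are added.
        Set<String> objects = new HashSet<>(1334, 0.75f);
        for (int i = 0; i < 1000; i++) {
            objects.add("object-" + i);
        }
        System.out.println(objects.size()); // 1000
    }
}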
More information about these parameters can be found in the documentation for HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled by calling the rehash method.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
Roughly one bucket per element is better for the processor; too many buckets is bad for the memory. Java will start with a small number of buckets and automatically increase the capacity of your HashSet once it starts filling up, so you don't really need to care unless your application has performance issues and you've identified a HashSet as the cause.
If you have several elements in each bucket, lookups start taking longer. If you have lots of empty buckets, you're using more memory than you need and iterating over the elements takes longer.
This seems like a premature optimization waiting to happen though - the default constructor is fine in most cases.
Object.hashCode() values are of type int, so there can only be 2^32 different values; that's why you create buckets and distribute objects among them.
Edit: If you were using 2^32 buckets to store 2^32 objects, then get operations would indeed have constant complexity. But when you insert those 2^32 objects one by one, rehashing has to take place: if the buckets are backed by an Object[] array, then each time the number of elements exceeds the array's capacity a new, larger array is created and the elements are copied into it, and this process increases the cost. That's why equals() and hashCode() are used together with a sensible capacity-to-size ratio, which the HashSet manages itself by providing a good hashing scheme.
Is the retrieval of entries from a HashMap really randomized? Or does it depend on the buckets into which the entries are hashed, with those buckets then being accessed in some predefined order? Or is there some other kind of ordering happening internally?
Answer (b): it depends on the buckets into which the entries are hashed, and those buckets are then accessed in a predefined order.
What happens when you call the get method on a HashMap for a non-null key:
The hashCode() method for your key is called.
The hash bucket for that hash is obtained by simply and'ing the hash with the number of buckets - 1 which works because the implementation ensures that the number of buckets is always a power of 2. Because the buckets are simply a java array of HashMap.Entry objects the resulting integer can be used as a plain array index. This is where the O(1) complexity comes from.
The entries in the bucket (a linked list) are iterated until one is found where the hash and the key match the one you want. This is where the O(1) complexity starts to deteriorate towards a worst case of O(n).
So you can see that an efficient use of the HashMap will try to minimise the number of entries that end up together in a bucket. All that matters in that respect is the distribution of the bits in your hashcode from bit 0 to bit N-1 where N is the power of 2 used to calculate the number of buckets. All bits in your hashcode above that limit are masked away and only get used again during the iteration of the bucket list.
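A quick experiment shows the "bucket order, not random" behaviour. The exact output depends on the JDK's internal hashing, so take the comment about the expected order as what a typical OpenJDK run is likely to print, not a guarantee:

import java.util.HashMap;
import java.util.Map;

public class IterationOrderDemo {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        // Insert in a deliberately scrambled order; all keys are below the
        // default capacity of 16, so no resize happens and no buckets collide.
        int[] keys = {13, 2, 7, 1, 5, 11, 3, 9};
        for (int key : keys) {
            map.put(key, "Value" + key);
        }
        // Iteration walks the bucket array from index 0 upward. For small Integer
        // keys the bucket index is just the key, so this typically prints
        // [1, 2, 3, 5, 7, 9, 11, 13] -- not the insertion order, and not random.
        System.out.println(map.keySet());
    }
}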
I know that HashMap operations are O(1) amortized due to possible collisions. But in Java, Integer.hashCode() is just the integer's value. So if you were to put m distinct integers into a HashMap where m = the HashMap's initial capacity (16, let's say), does that mean that there will be 16 different buckets with 1 integer each? And would this guarantee O(1) lookup, deletion and insertion in the worst case?
No, because HashMap will re-hash the key's hash code for its own internal purposes.
No, because HashMap is going to have to take the very large number of possible values returned by hashCode and map them to the very small number of buckets, and there's no guarantee that that mapping will map different integers' hashcodes to different buckets.
You should look at the way HashMap decides which bucket the key/object will belong to, and you will see that it does NOT use the object's hashCode() directly as the bucket number but manipulates it a little (with bit shifting) to keep the number of buckets well below Integer.MAX_VALUE.
if you were to put m distinct integers into a HashMap where m = the HashMap's initial capacity (16, let's say), does that mean that there will be 16 different buckets with 1 integer each?
It depends on the values. The HashMap will probably create new buckets (increase the capacity) to keep the load factor under its limit (by default, a Java HashMap increases its size once the load goes over 75%).
See: What is load factor?
would this guarantee O(1) lookup, deletion and insertion in the worst case?
No; in particularly bad cases the lookup time can be up to O(n) (depending on the number of elements and their values). In the case of integers, all possible int values are mapped down to the HashMap's table size, so there are likely to be a lot of collisions in a small map.
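For instance, distinct Integer keys that differ by a multiple of the table length end up in the same bucket. A small sketch assuming the default 16-bucket table:

import java.util.HashMap;
import java.util.Map;

public class IntegerCollisionDemo {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>(); // default: 16 buckets
        // Four distinct integers with four distinct hashCodes (the values themselves),
        // yet 0, 16, 32 and 48 all reduce to bucket index 0 in a 16-bucket table,
        // so looking any of them up means walking the same bucket's chain.
        for (int key : new int[] {0, 16, 32, 48}) {
            map.put(key, "Value" + key);
        }
        System.out.println(map.get(32)); // still correct, just not a single-step lookup
    }
}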