HashMap implementation in Java. How does the bucket index calculation work?

I am looking at the implementation of HashMap in Java and am stuck at one point.
How is the indexFor function calculated?
static int indexFor(int h, int length) {
    return h & (length-1);
}
Thanks

The hash itself is calculated by the hashCode() method of the object you're trying to store.
What you see here is the calculation of the "bucket" in which to store the object, based on the hash h. Ideally, to avoid collisions, you would have as many buckets as the maximum achievable value of h, but that could demand too much memory. Therefore, you usually have a smaller number of buckets, with a risk of collisions.
If h is, say, 1000, but you only have 512 buckets in your underlying array, you need to know where to put the object. Usually, a mod operation on h would be enough, but mod is slow. Given the internal property of HashMap that the underlying array always has 2^n buckets, Sun's engineers could instead use h & (length-1): it does a bitwise AND with a number whose n lowest bits are all 1s, effectively reading only the n lowest bits of the hash (which is the same as computing h mod 2^n, only much faster).
Example:
hash h: 11 1110 1000 -- (1000 in decimal)
length l: 10 0000 0000 -- ( 512 in decimal)
(l-1): 01 1111 1111 -- ( 511 in decimal - it will always be all ONEs)
h AND (l-1): 01 1110 1000 -- ( 488 in decimal which is a result of 1000 mod 512)
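If you want to convince yourself, here is a small runnable sketch (class name is mine) showing that the AND trick matches the modulo for a power-of-two length:
public class IndexForDemo {
    public static void main(String[] args) {
        int length = 512; // must be a power of two for the trick to work
        for (int h : new int[] {1000, 511, 512, 123456}) {
            // h & (length-1) keeps only the low 9 bits, same as h % 512 for non-negative h
            System.out.println(h + ": " + (h & (length - 1)) + " == " + (h % length));
        }
    }
}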

It's not calculating the hash, it's calculating the bucket.
The expression h & (length-1) does a bit-wise AND on h using length-1, which is like a bit-mask, to return only the low-order bits of h, thereby making for a super-fast variant of h % length.

The answer above is very good, but I want to explain more about why Java can use indexFor to create the bucket index.
For example, I have a HashMap like this (this test is on Java 7; Java 8 changed HashMap a lot, but this logic is still useful):
// Default length of the bucket table (table.length) after creation is 16 (HashMap#DEFAULT_INITIAL_CAPACITY)
HashMap<String, Integer> hashMap = new HashMap<>();
hashMap.put("A",1); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("B",2); // hash("B")=70, indexFor(hash,table.length)=70&(16-1) = 6
hashMap.put("P",3); // hash("P")=85, indexFor(hash,table.length)=85&(16-1) = 5
hashMap.put("A",4); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("r", 4);// hash("r")=117, indexFor(hash,table.length)=117&(16-1) = 5
You can see that the entries with keys "A", "P" and "r" all get the same index (5). You can confirm this by debugging the code above and inspecting the map's internal table field, which is declared like this:
public class HashMap<K, V> extends AbstractMap<K, V> implements Map<K, V>, Cloneable, Serializable {
    transient HashMap.Entry<K, V>[] table;
    ...
}
From this I see:
If the indexes are different, the new entry is added to the table.
If the index is the same and the key is equal, the new value overwrites the old one.
If the index is the same but the key is different, the new entry points to the old entry (like a linked list). That is why Map.Entry has the field next:
static class Entry<K, V> implements java.util.Map.Entry<K, V> {
    ...
    HashMap.Entry<K, V> next;
}
You can verify this by reading the code in HashMap.
At this point, you might think that HashMap never needs to change the table size (16), because indexFor() always returns a value <= 15, but that is not correct.
If you look at the HashMap code:
if (this.size >= this.threshold ...) {
    this.resize(2 * this.table.length);
    ...
}
HashMap will resize the table (doubling the table length) when size >= threshold.
What is the threshold? It is calculated as below:
static final int DEFAULT_INITIAL_CAPACITY = 16;
static final float DEFAULT_LOAD_FACTOR = 0.75F;
...
this.threshold = (int)Math.min((float)capacity * this.loadFactor, 1.07374182E9F); // if capacity (table.length) = 16 => threshold = 12
What is size? Of course, size here is not table.length; it is the number of entries in the map.
Every time you put a new entry into the HashMap and the HashMap actually needs to create a new entry (note that HashMap doesn't create a new entry when the key already exists; it just overwrites the value of the existing entry), size is incremented:
void createEntry(int hash, K key, V value, int bucketIndex) {
    ...
    ++this.size;
}
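If you want to watch the resize happen, here is a small sketch (mine, not from the JDK) that peeks at the private table field with reflection; on JDK 16 and newer you may need --add-opens java.base/java.util=ALL-UNNAMED:
import java.lang.reflect.Field;
import java.util.HashMap;

public class ResizeDemo {
    public static void main(String[] args) throws Exception {
        HashMap<Integer, Integer> map = new HashMap<>();
        // table is private, so peek at it with reflection
        Field tableField = HashMap.class.getDeclaredField("table");
        tableField.setAccessible(true);
        for (int i = 1; i <= 14; i++) {
            map.put(i, i);
            Object[] table = (Object[]) tableField.get(map);
            // with the default load factor 0.75, table.length jumps
            // from 16 to 32 once size exceeds the threshold of 12
            System.out.println("size=" + map.size() + ", table.length=" + table.length);
        }
    }
}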
Hope it helps.

It is calculating the bucket of the hash map where the entry (key-value pair) will be stored. The bucket index is the hash value mod the number of buckets (which, for a power-of-two table, is exactly what h & (length-1) computes).
A hash map consists of buckets; objects are placed into these buckets based on the bucket index.
Any number of objects can fall into the same bucket based on their (hash code mod bucket count) value. This is called a 'collision'.
If many objects fall into the same bucket, their equals() method is called while searching, to disambiguate.
The number of collisions is inversely proportional to the number of buckets.

bucket_index = (i.hashCode() & 0x7FFFFFFF) % hashmap_size does the trick; the & 0x7FFFFFFF masks off the sign bit so the modulo can never be negative.
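A small illustration of why the mask matters (class name is mine; "polygenelubricants" is a well-known string whose hashCode() is Integer.MIN_VALUE):
public class MaskDemo {
    public static void main(String[] args) {
        int h = "polygenelubricants".hashCode(); // == Integer.MIN_VALUE
        int size = 10;
        System.out.println(h % size);                // prints -8: a plain % can go negative
        System.out.println((h & 0x7FFFFFFF) % size); // prints 0: the mask clears the sign bit
    }
}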

Related

Word frequency hash table

Ok, I have a project that requires me to have a dynamic hash table that counts the frequency of words in a file. I must use java, however, we are not allowed to use any built in data types or built in classes at all except standard arrays. Also, I am not allowed to use any hash functions off the internet that are known to be fast. I have to make my own hash functions. Lastly, my instructor also wants my table to start as size "1" and double in size every time a new key is added.
My first idea was to sum the ASCII values of the letters composing a word and use that to make a hash function, but different words with the same letters will equal the same value.
How can I get started? Is the ASCII idea on the right track?
In general, a hash table isn't expected to have a one-to-one mapping between a value and a hash; a hash table is expected to have collisions. That is, the domain of the hash function is expected to be larger than its range (the set of possible hash values). However, the general idea is that you come up with a hash function where the probability of collision is drastically small. If your hash function is uniform, i.e., designed so that each possible hash value has the same probability of being generated, then you minimize collisions this way.
Getting a collision isn't the end of the world. That just means that you have to search the list of values for that hash. If your hashing function is good, overall your performance for lookup should still be O(1).
Generating hashing functions is a subject of its own, and there is no one answer. But a good place for you to start could be to work with the bitwise representations of the characters in the string, and perform some sort of convolution operations on them (rotate, shift, XOR) in series. You could perform these in some way based on some initial seed-value, and then use the output of the first step of hashing as a seed for the next step. This way you can end up magnifying the effects of your convolution.
For example, let's say you get the character A, which is 41 in hex, or 0100 0001 in binary. You could designate each bit to mean some operation (maybe bit 0 is a ROR when it is 0 and a ROL when it is 1; bit 1 is an OR when it is 0 and a XOR when it is 1, etc.). You could even decide how much convolution to do based on the value itself. For example, you could say that the lower nibble specifies how much right-rotation you will do, and the upper nibble specifies how much left-rotation. Then once you have the final value, you use that as the seed for the next character. These are just some ideas; use your imagination and see what you get!
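To make this concrete, here is a minimal sketch of one such rotate-and-XOR scheme; the seed and the choice of the low nibble as the rotation amount are arbitrary picks of mine, not a vetted hash function:
static int convolutionHash(String s) {
    int h = 0x9E3779B9;                      // arbitrary non-zero seed
    for (char c : s.toCharArray()) {
        h = Integer.rotateLeft(h, c & 0x0F); // low nibble of the char picks the rotation
        h ^= c;                              // mix the character into the running hash
    }
    return h;
}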
It does not matter how good your hash function is; you will always have collisions that you need to resolve.
If you want to keep your approach of using the ASCII values of the characters, you shouldn't just add the values; that would lead to a lot of collisions. You should weight them with powers of a base instead. For example, for the word "Help" you would compute: 'H' * 256⁰ + 'e' * 256¹ + 'l' * 256² + 'p' * 256³. Or as a Java method (using shifts for the powers of 256, and cycling the exponent so the shift stays in range):
static int hash(String word, int hashSize) {
    long res = 0;
    int count = 0;
    for (char c : word.toCharArray()) {
        res += (long) c << (8 * count); // c * 256^count, done with a shift
        count = (count + 1) % 5;        // cycle the exponent so the shift stays in range
    }
    return (int) Math.floorMod(res, (long) hashSize); // floorMod keeps the index non-negative
}
Now you just have to write your own Hashtable:
class WordCounterMap {
    // assumes the hash(String, int) method above is placed in this class
    Entry[] entries = new Entry[1]; // per the assignment, start at size 1

    void add(String s) {
        int hash = hash(s, entries.length);
        // Empty home slot means a new key: double the table first, then re-insert
        // every existing entry into the bigger table (same scheme as the original sketch).
        if (entries[hash] == null) {
            Entry[] temp = new Entry[entries.length * 2];
            for (Entry e : entries) {
                if (e != null) {
                    int h = hash(e.word, temp.length);
                    while (temp[h] != null)      // linear probing on collision
                        h = (h + 1) % temp.length;
                    temp[h] = e;
                }
            }
            entries = temp;
            hash = hash(s, entries.length);
        }
        // Probe from the home slot: bump the count if the word exists, insert otherwise.
        while (true) {
            if (entries[hash] == null) {
                entries[hash] = new Entry(s);
                break;
            }
            if (entries[hash].word.equals(s)) {
                entries[hash].count++;
                break;
            }
            hash = (hash + 1) % entries.length;
        }
    }

    int getCount(String s) {
        int hash = hash(s, entries.length);
        while (entries[hash] != null) {      // an empty slot means the word is absent
            if (entries[hash].word.equals(s))
                return entries[hash].count;
            hash = (hash + 1) % entries.length;
        }
        return 0;
    }

    static class Entry {
        int count;
        String word;

        Entry(String s) {
            this.word = s;
            this.count = 1;
        }
    }
}
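A quick usage check of the sketch above:
public class WordCountDemo {
    public static void main(String[] args) {
        WordCounterMap map = new WordCounterMap();
        for (String w : "the quick the lazy the".split(" "))
            map.add(w);
        System.out.println(map.getCount("the"));   // 3
        System.out.println(map.getCount("brown")); // 0
    }
}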

Why does the class HashSet<T> have its values already sorted when I use the iterator?

I have the following code in my main method, and when I iterate through the Set and print the values, they come out sorted. What's the reason?
Set<Integer> set = new HashSet<Integer>();
set.add(2);
set.add(7);
set.add(3);
set.add(9);
set.add(6);
for (int i : set) {
    System.out.println(i);
}
Output:
2
3
6
7
9
That's just coincidence. A HashSet does not preserve or guarantee any ordering.
It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.
I'm not sure calling it a coincidence is the right answer. There is no chance involved. It results from the hash function being used, the small values you put in the HashSet, and the small number of elements you put in the Set.
For Integer, hashCode() is the int value of the Integer.
HashMap (and HashSet) apply an additional hashing step to the value returned by hashCode, but this additional hashing doesn't change the value for numbers as small as the ones you added to the HashSet.
Finally, the bucket that each integer is put into is the modified hash code modulo the capacity of the HashSet. The initial capacity of a HashSet/HashMap is 16.
Therefore 2 is added to bucket 2, 7 is added to bucket 7, etc...
When you iterate over the elements of the HashSet, the buckets are visited in order, and since each bucket has at most a single element, you get the numbers sorted.
Here is how the bucket is computed:
int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);

static int hash(int h) { // for the small integers you put in the set, all the values being
                         // XORed with h are 0, so hash(h) returns h
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length-1); // table.length is the initial capacity (16),
                           // so for h <= 15, indexFor(h, table.length) == h
}
Therefore, the buckets of 2,7,3,9,6 are 2,7,3,9,6 respectively.
The enhanced for loop you use to iterate over the HashSet visits the buckets in order, and for each bucket iterates over its entries (which are stored in a linked list). Therefore, for your input, 2 is visited first, followed by 3, 6, 7 and 9.
If you add numbers higher than 15, both the hash method and the indexFor method (assuming you didn't change the default capacity of the HashSet) would prevent the numbers from being sorted when iterated by the HashSet iterator.
This is just an accident. I tried:
final Set<Integer> set = new HashSet<Integer>();
set.add(2);
set.add(17);
set.add(32);
set.add(92);
set.add(63);
and I got 17 32 2 92 63. It was not in sorted order, as HashSet preserves neither sorted order nor the order in which the elements were added.

Why are numbers like 4, 20, 12, 7 used in the hash function in the `HashMap` class?

I was reading about how exactly HashMap works in Java. I found that in the hash method of the HashMap class, the hash code is shifted with the unsigned right shift operator by the constants 20, 12, 7 and 4, and some more processing is done on the result. My question is: why were these four numbers chosen for the hash function that is ultimately used to calculate the position in the bucket array?
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
It's not that "only these four numbers are chosen for calculating the value in the hash function"; the hash code returned by the hashCode method of the key object is the (very important) input. This method in the HashMap implementation just tries to improve that hash code, given the knowledge of how the HashMap will use the value afterwards.
Typical implementations will only use the lower bits of a hash code as the internal table has a size which is a power of two. Therefore the improvement shall ensure that the likelihood of having different values in the lower bits is the same even if the original hash codes for different keys differ in upper bits only.
Take for example Integer instances used as keys: their hash code is identical to their value as this will spread the hash codes over the entire 2³² int range. But if you put the values 0xa0000000, 0xb0000000, 0xc0000000, 0xd0000000 into the map, a map using only the lower bits would have poor results. This improvement fixes this.
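To make that concrete, here is a small sketch (class name is mine) that compares the bucket index with and without the supplemental hash for exactly those keys:
public class SpreadDemo {
    // the Java 7 supplemental hash from the question
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int length = 16; // default table size
        for (int k : new int[] {0xa0000000, 0xb0000000, 0xc0000000, 0xd0000000}) {
            // without spreading, all four keys land in bucket 0;
            // with spreading, they land in four different buckets
            System.out.printf("0x%08x: raw bucket=%d, spread bucket=%d%n",
                    k, k & (length - 1), hash(k) & (length - 1));
        }
    }
}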
The numbers chosen for this bit manipulation, and the algorithm in general, are a subject of continuous investigation. You will see changes between JVM implementations, as the development never stops.

How does HashMap make sure the index calculated using hashcode of key is within the available range?

I went through the source code of HashMap and have a few questions. The put method takes the key and value and does the following:
applies the hashing function to the hashCode of the key;
calculates the bucket location for this pair using the hash obtained in the previous step.
public V put(K key, V value) {
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    .....
}

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length-1);
}
Example:
Creating a HashMap with size 10.
call put(k, v) three times and assume these three occupy bucket locations 7, 8 and 9
call put for a 4th key-value pair, and the following happens:
hash() is called with key.hashCode() and the hash is calculated
indexFor is calculated based on the hash
Question:
What if the calculated bucket location for the 4th key-value pair is out of the existing bounds, say location 11?
Thanks in advance
Akh
For your first question: the map always uses a power of two for the size (if you give it a capacity of 10, it will actually use 16), which means hash & (length - 1) is always in the range [0, length), so the index is always in bounds.
It's not clear what your second and third questions relate to. I don't think HashMap reallocates everything unless it needs to.
HashMaps will generally use the hash code mod the number of buckets. What happens when there is a collision depends on the implementation. There are two basic strategies: keeping a list of items that fall in the bucket, or skipping forward to other buckets if your bucket is full. Java's HashMap uses the list approach; you can see this in the put code above, which walks a chain via e.next.
Let's go into more detail: how does HashMap initialize the bucket size?
The following code is from HashMap.java:
int i = 1;
while (i < paramInt)
    i <<= 1;
If you pass an initial capacity of 10, the code above rounds it up to the next power of two.
So HashMap initializes the bucket array with size 16.
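As a standalone sketch of that rounding (method name is mine):
static int roundUpToPowerOfTwo(int requested) {
    int capacity = 1;
    while (capacity < requested)
        capacity <<= 1; // keep doubling until we reach or pass the request
    return capacity;
}
// roundUpToPowerOfTwo(10) == 16, roundUpToPowerOfTwo(16) == 16, roundUpToPowerOfTwo(17) == 32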
And the code below is used to calculate the bucket index:
static int indexFor(int h, int length) {
    return h & (length - 1);
}

Regarding collision in Map

I was going through HashMap and read the following analysis:
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor.
The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
The default initial capacity is 16, the default load factor is 0.75. You can supply other values in the map's constructor.
Now suppose I have a map..
HashMap map=new HashMap();//HashMap key random order.
System.out.println("Amit".hashCode());
map.put("Amit","Java");
map.put("mAit","J2EE");
map.put("Saral","J2rrrEE");
I want a collision to occur; please advise how a collision could be made to occur.
I believe the exact HashMap behavior is implementation-dependent. Just look at how your class library does the hashing and construct a collision. It's pretty simple.
If you want collisions on arbitrary objects instead of strings, it's a lot easier. Just create a class with a custom hashCode() that always returns 0.
If you really want a collision to occur, it's best to write your own custom hash code. For example, if you want a collision for "Amit" and "mAit", you can simply use the sum of the ASCII values of the characters as the hash code; any two anagrams will then collide, as sketched below.
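Here is a minimal sketch of that idea (class names are mine): a key wrapper whose hashCode() sums the characters, so anagrams like "Amit" and "mAit" are guaranteed to collide:
import java.util.HashMap;
import java.util.Map;

// A key whose hashCode() sums the characters, so any two anagrams collide.
final class SumKey {
    final String s;
    SumKey(String s) { this.s = s; }

    @Override public int hashCode() {
        int sum = 0;
        for (char c : s.toCharArray()) sum += c;
        return sum;
    }

    @Override public boolean equals(Object o) {
        return o instanceof SumKey && ((SumKey) o).s.equals(s);
    }
}

class CollisionMapDemo {
    public static void main(String[] args) {
        Map<SumKey, String> map = new HashMap<>();
        map.put(new SumKey("Amit"), "Java");
        map.put(new SumKey("mAit"), "J2EE"); // same hash, different key: chained in one bucket
        System.out.println(new SumKey("Amit").hashCode() == new SumKey("mAit").hashCode()); // true
        System.out.println(map.get(new SumKey("Amit"))); // Java: equals() disambiguates
    }
}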
A collision happens when two keys have the same hash. I didn't calculate the hashes of your keys, but I don't think they are equal, so a collision will not occur here. (If you put in the same string as a key, you get the same hash, but equal keys simply replace the existing entry rather than colliding.)
Collision here is definitely possible and not tied to hash table implementation.
HashMap works internally by using Object.hashCode to map objects to buckets, and then uses a collision resolution mechanism (the OpenJDK implementation uses separate-chaining) with Object.equals.
To answer your question, String.hashCode is well-defined for compatibility...
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
Or, in code (from OpenJDK)
public int hashCode() {
int h = hash;
if (h == 0 && count > 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
As with any hash function, collisions are possible. For example, as the Wikipedia article notes, "FB" and "Ea" result in the same value.
If you want more, it is a trivial brute-force problem to find further strings with the same hash value; a quick check is shown below.
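A minimal check of the "FB"/"Ea" claim, plus one more colliding pair: since 31*(c+1) + (d-31) == 31*c + d, bumping the first character up by one and the second down by 31 preserves the hash:
public class StringCollisions {
    public static void main(String[] args) {
        System.out.println("FB".hashCode() + " " + "Ea".hashCode()); // 2236 2236
        System.out.println("Aa".hashCode() + " " + "BB".hashCode()); // 2112 2112
    }
}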
As a side note, I thought I'd point out that this is very similar to the hash function in the second edition of The C Programming Language:
#define HASHSIZE 100
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}
