Regarding collision in Map

Regarding collision in Map - java

I was going through HashMap and read the following analysis ..
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor.
The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
The default initial capacity is 16, the default load factor is 0.75. You can supply other values in the map's constructor.
Now suppose I have a map..
HashMap map=new HashMap();//HashMap key random order.
System.out.println("Amit".hashCode());
map.put("Amit","Java");
map.put("mAit","J2EE");
map.put("Saral","J2rrrEE");
I want collision to occur please advise how the collision would occur..!!

I believe the exact hashmap behavior is implementation dependent. Just look at however your class library is doing the hashing and construct a collision. It's pretty simple.
If you want collisions on arbitrary objects instead of strings, it's a lot easier. Just create a class with a custom hashCode() that always returns 0.

If you want really collision to be occured then it's better to write your own custom hash code. Say for example, if you want collision for Amit and mAit, you can do one thing, just use addition of ascii values of the chars as the hash code. You will get collision for different keys.

Collision will happend when 2 keys has the same hash key .
I didn't calc your keys hash keys , but i don't think they have the same hash key, so collision will not occurred if they don't have the same hash key.
If you will put the Same string as key than you will haves collision

Collision here is definitely possible and not tied to hash table implementation.
HashMap works internally by using Object.hashCode to map objects to buckets, and then uses a collision resolution mechanism (the OpenJDK implementation uses separate-chaining) with Object.equals.
To answer your question, String.hashCode is well-defined for compatibility...
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
Or, in code (from OpenJDK)
public int hashCode() {
int h = hash;
if (h == 0 && count > 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
As with any hash function, collisions are possible. According to the Wikipedia article, it states that, for example, "FB" and "Ea" result in the same value.
If you want more, it should be a trivial bruteforce problem to find collisions which have the same hash value here.
As a side note, I'd thought I'd point out how this is very similar to the function as in the second edition of the The C Programming Language:
#define HASHSIZE 100
unsigned hash(char *s)
{
unsigned hashval;
for(hashval = 0; *s != '\0'; s++)
hashval = *s + 31 * hashval;
return hashval % HASHSIZE;
}

Related

Too many hashing function collisions

I'm trying to make a hashing function using the polynomial accumulation method (which is supposed to give you 5 collisions per 55k words or something) but when I run it with 1,000 words, I get ~190 collisions. Am I doing something wrong?
public int hashCode(String str) {
double hash_value = 0; // used for float
for (int i = 0; i < str.length(); i++){
hash_value = 33*hash_value + str.charAt(i);
}
return (int) (hash_value % array_size);
}

Generally, prime numbers are favoured for hash code generation. I suggest trying 109 or 251. 33 is a multiple of 3 which means you are more likely to have issues based on your inputs.
Also you should use an int for the calculations and call Math.abs on the result.

Either your data set is extremely "unlucky", or (which is more probable) the array_size is too small (hash function params are usually quoted without consideration of finite bucket array size).

You are generating a large number which is different for different word in the input. But there is still a chance of collisions, as for example
"bA" = 98+(33x65)=2243
"AB" = 65+(33x66)=2243
If you go for a large number greater then 57, there will be less chance of collision. 109 or 251 will be a good choice.

Word frequency hash table

Ok, I have a project that requires me to have a dynamic hash table that counts the frequency of words in a file. I must use java, however, we are not allowed to use any built in data types or built in classes at all except standard arrays. Also, I am not allowed to use any hash functions off the internet that are known to be fast. I have to make my own hash functions. Lastly, my instructor also wants my table to start as size "1" and double in size every time a new key is added.
My first idea was to sum the ASCII values of the letters composing a word and use that to make a hash function, but different words with the same letters will equal the same value.
How can I get started? Is the ASCII idea on the right track?

A hash table isn't expected to have in general a one-to-one mapping between a value and a hash. A hash table is expected to have collisions. That is, the domain of the hash-function is expected to be larger than the range (i.e., the hash value). However, the general idea is that you come up with a hash function where the probability of collision is drastically small. If your hash-function is uniform, i.e., if you have it designed such that each possible hash-value has the same probability of being generated, then you can minimize collisions this way.
Getting a collision isn't the end of the world. That just means that you have to search the list of values for that hash. If your hashing function is good, overall your performance for lookup should still be O(1).
Generating hashing functions is a subject of its own, and there is no one answer. But a good place for you to start could be to work with the bitwise representations of the characters in the string, and perform some sort of convolution operations on them (rotate, shift, XOR) in series. You could perform these in some way based on some initial seed-value, and then use the output of the first step of hashing as a seed for the next step. This way you can end up magnifying the effects of your convolution.
For example, let's say you get the character A, which is 41 in hex, or 0100 0001 in binary. You could designate each bit to mean some operation (maybe bit 0 is a ROR when it is 0, and a ROL when it is 1; bit 1 is an OR when it is 0, and a XOR when it is 1, etc.). You could even decide how much convolution you want to do based on the value itself. For example, you could say that the lower nibble specifies how much right-rotation you will do, and the upper nibble specifies how much left rotation you will do. Then once you have the final value, you will use that as the seed for the next character. These are just some ideas. Use your imagination as see what you get!

It does not matter how good your hash function is, you will always have collisions you need to resolve.
If you want to keep your approach by using the ASCII values of the you shouldn't just add the values this would lead to a lot collisions. You should work with the power of the values, for example for the word "Help" you just go like: 'H' * 256 + 'e' * 256 + 'l' * 256² + 'p' * 256³. Or in pseudocode:
int hash(String word, int hashSize)
int res = 0
int count = 0;
for char c in word
res += 'c' * 256^count
count++
count = count mod 5
return res mod hashSize
Now you just have to write your own Hashtable:
class WordCounterMap
Entry[] entrys = new Entry[1]
void add(String s)
int hash = hash(s, entrys.length)
if(entrys[hash] == null{
Entry[] temp = new Entry[entry.length * 2]
for(Entry e : entrys){
if(e != null)
int hash = hash(e.word, temp.length)
temp[hash] = e;
entrys = temp;
hash = hash(s, entrys.length)
while(true)
if(entrys[hash] != null)
if(entrys[hash].word.equals(s))
entrys[hash].count++
break
else
entrys[hash] = new Entry(s)
hash++
hash = hash mod entrys.length
int getCount(String s)
int hash = hash(s, length)
if(entrys[hash] == null)
return 0
while(true)
if(entrys[hash].word.equals(s))
entrys[hash].count++
break
hash++
hash = hash mod entrys.length
class Entry
int count
String word
Entry(String s)
this.word = s
count = 1

How to decide hashcode value?

How to decide hashcode value? Recently I faced an interview question that "Is 17 a valid hashcode?". Is there any mechanism to define hashcode value? or we can give any number for hashcode value?

Hashcodes should have good dispersion, so that different objects will be saved in different positions of the hash table (otherwise performance could be degraded).
From that point, while 17 us a "valid" hash code (in the sense that it is a 32-bit signed integer), it is suspicious how the hash function was defined.
For instance, a naive approach for hashing a string is just adding the value of each character. This result in similar hash values for simple strings (such as "tar" and "rat" that sum up to the same value).
A common trick is multiplying each value by a small prime, so that simple inputs will return different values, e.g.;
int result = 1;
result = 31 * result + a;
result = 31 * result + b;
or
int h=0;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
(the latter, from the JRE implementation of String.hashCode)

Yes, 17 is a perfectly valid hashcode.
Whatever method you select to derive the hashcode, it should always return the same integer for the object (as long as its state remains the same).

Ya 17 is valid. Usually prime numbers like shown in the link are used, you can implement hashcode using the id of that entity which is a primary key
public int hashCode()
{
int result = 17;
result = 37 * result + (getId() == null ? 0 : this.getId().hashCode());
return result;
}
this provides different ways for implementing the hascode

HashMap implementation in Java. How does the bucket index calculation work?

I am looking at the implementation of HashMap in Java and am stuck at one point.
How is the indexFor function calculated?
static int indexFor(int h, int length) {
return h & (length-1);
}
Thanks

The hash itself is calculated by the hashCode() method of the object you're trying to store.
What you see here is calculating the "bucket" to store the object based on the hash h. Ideally, to evade collisions, you would have the same number of buckets as is the maximum achievable value of h - but that could be too memory demanding. Therefore, you usually have a lower number of buckets with a danger of collisions.
If h is, say, 1000, but you only have 512 buckets in your underlying array, you need to know where to put the object. Usually, a mod operation on h would be enough, but that's too slow. Given the internal property of HashMap that the underlying array always has number of buckets equal to 2^n, the Sun's engineers could use the idea of h & (length-1), it does a bitwise AND with a number consisting of all 1's, practically reading only the n lowest bits of the hash (which is the same as doing h mod 2^n, only much faster).
Example:
hash h: 11 1110 1000 -- (1000 in decimal)
length l: 10 0000 0000 -- ( 512 in decimal)
(l-1): 01 1111 1111 -- ( 511 in decimal - it will always be all ONEs)
h AND (l-1): 01 1110 1000 -- ( 488 in decimal which is a result of 1000 mod 512)

It's not calculating the hash, it's calculating the bucket.
The expression h & (length-1) does a bit-wise AND on h using length-1, which is like a bit-mask, to return only the low-order bits of h, thereby making for a super-fast variant of h % length.

The above answer is very good but I want to explain more why Java can use indexFor for create index
Example, I have a HashMap like this (this test is on Java7, I see Java8 change HashMap a lot but I think this logic still very good)
// Default length of "budget" (table.length) after create is 16 (HashMap#DEFAULT_INITIAL_CAPACITY)
HashMap<String, Integer> hashMap = new HashMap<>();
hashMap.put("A",1); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("B",2); // hash("B")=70, indexFor(hash,table.length)=70&(16-1) = 6
hashMap.put("P",3); // hash("P")=85, indexFor(hash,table.length)=85&(16-1) = 5
hashMap.put("A",4); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("r", 4);// hash("r")=117, indexFor(hash,table.length)=117&(16-1) = 5
You can see the index of entry with key "A" and object with key "P" and object with key "r" have same index (= 5). And here is the debug result after I execute code above
Table in the image is here
public class HashMap<K, V> extends AbstractMap<K, V> implements Map<K, V>, Cloneable, Serializable {
transient HashMap.Entry<K, V>[] table;
...
}
=> I see
If index are different, new entry will add to table
If index is same and hash is same, new value will update
If index is same and hash is different, new entry will point to old entry (like a LinkedList). Then you know why Map.Entry have field next
static class Entry<K, V> implements java.util.Map.Entry<K, V> {
...
HashMap.Entry<K, V> next;
}
You can verify it again by read the code in HashMap.
As now, you can think that HashMap will never need to change the size (16) because indexFor() always return value <= 15 but it not correct.
If you look at HashMap code
if (this.size >= this.threshold ...) {
this.resize(2 * this.table.length);
HashMap will resize table (double table length) when size >= threadhold
What is threadhold? threadhold is calculated below
static final int DEFAULT_INITIAL_CAPACITY = 16;
static final float DEFAULT_LOAD_FACTOR = 0.75F;
...
this.threshold = (int)Math.min((float)capacity * this.loadFactor, 1.07374182E9F); // if capacity(table.length) = 16 => threadhold = 12
What is the size? size is calculated below.
Of course, size here is not table.length .
Any time you put new entry to HashMap and HashMap need to create new entry (note that HashMap don't create new entry when the key is same, it just override new value for existed entry) then size++
void createEntry(int hash, K key, V value, int bucketIndex) {
...
++this.size;
}
Hope it help

It is calculating the bucket of the hash map where the entry (key-value pair) will be stored. The bucket id is hashvalue/buckets length.
A hash map consists of buckets; objects will be placed in these buckets based on the bucket id.
Any number of objects can actually fall into the same bucket based on their hash code / buckets length value. This is called a 'collision'.
If many objects fall into the same bucket, while searching their equals() method will be called to disambiguate.
The number of collisions is indirectly proportional to the bucket's length.

bucket_index = (i.hashCode() && 0x7FFFFFFFF) % hashmap_size does the trick

Simple hash function techniques

I'm pretty new to hashing in Java and I've been getting stuck on a few parts. I have a list of 400 items (and stored in a list of 1.5x = 600), which the item id's range from 1-10k. I've been looking at a few hash functions and I initially copied the examples in the packet, which just used folding. I noticed that I've been getting about 50-60% null nodes, which is apparently too many. I also noticed that just modding the id by 600 tends to reduce it to a solid 50% nulls.
My current hash function looks something like, and for being as ugly as it is, it's only a 1% decrease in nulls from a simple modding, with an avg list length of 1.32...
public int getHash( int id )
{
int hash = id;
hash <<= id % 3;
hash += id << hash % 5;
/* let's go digit by digit! */
int digit;
for( digit = id % 10;
id != 0;
digit = id % 10, id /= 10 )
{
if ( digit == 0 ) /* prevent division by zero */
continue;
hash += digit * 2;
}
hash >>= 5;
return (hash % 600);
}
What are some good techniques for creating simple hash functions?

I would keep it simple. Return the id of your element as your hashcode, and let the hashtable worry about rehashing it if it feels it needs to. Your goal should be to make a hash code unique to your object.
The Java HashMap uses the following rehashing method:
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}

There's a nice review article here. Also, the Wikipedia article on hash functions is a good overview. It suggests using a chi-squared test to assess the quality of your hash function.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regarding collision in Map - java

If you want really collision to be occured then it's better to write your own custom hash code. Say for example, if you want collision for Amit and mAit, you can do one thing, just use addition of ascii values of the chars as the hash code. You will get collision for different keys.

Collision will happend when 2 keys has the same hash key . I didn't calc your keys hash keys , but i don't think they have the same hash key, so collision will not occurred if they don't have the same hash key. If you will put the Same string as key than you will haves collision

Related

Too many hashing function collisions

Word frequency hash table

How to decide hashcode value?

HashMap implementation in Java. How does the bucket index calculation work?

Simple hash function techniques

Categories

Resources