A faster hash function - java

I'm trying to implement my own hash function in Java: I add up the ASCII values of the characters in each string, then take the sum modulo the size of the hash table (sum % size) to get the index. Is there a way to keep the same process but reduce collisions when searching for a string?
Thanks in advance.

I would look at the code for String and HashMap, as these have a low collision rate, don't use %, and handle negative numbers.
From the source for String:
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
From the source for HashMap
/**
 * Retrieve object hash code and applies a supplemental hash function to the
 * result hash, which defends against poor quality hash functions. This is
 * critical because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }
    h ^= k.hashCode();
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
As the HashMap table size is always a power of 2, you can use
hash = (null != key) ? hash(key) : 0;
bucketIndex = indexFor(hash, table.length);
and
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
Using & is much faster than % and only returns non-negative numbers, because the mask length-1 is non-negative.
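To make the sign behaviour concrete, here is a minimal sketch comparing % with the mask for a negative hash. The class name IndexForDemo is made up for illustration, and the mask only works under the assumption that length is a power of two.

```java
class IndexForDemo {
    // Same expression as the indexFor quoted above; assumes length is a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int h = -17;
        int length = 16;
        System.out.println(h % length);          // -1 (remainder keeps the sign of h)
        System.out.println(indexFor(h, length)); // 15 (always in [0, length))
    }
}
```

With a table size that is not a power of two, the mask would skip some buckets entirely, so this trick is tied to that sizing policy.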

The Java String.hashCode() makes a tradeoff between being a really good hash function and being as efficient as possible. Simply adding up the character values in a string is not a reliable hash function.
For example, consider the two strings dog and god. Since they both contain a 'd', 'g', and an 'o', no method involving only addition will ever result in a different hash code.
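The dog/god case can be checked directly. The sketch below compares an additive hash with the order-sensitive recurrence h = 31 * h + c used by String.hashCode(); the class and method names (AnagramHashDemo, sumHash, polyHash) are invented for this example.

```java
class AnagramHashDemo {
    // Additive hash as described in the question: the sum of the character codes.
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Order-sensitive polynomial hash, same recurrence as String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("dog") == sumHash("god"));   // true: anagrams always collide
        System.out.println(polyHash("dog") == polyHash("god")); // false: position matters
    }
}
```

Multiplying by 31 before adding each character is what makes the position of a character affect the result.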
Joshua Bloch, who implemented a good part of Java, discusses the String.hashCode() method in his book Effective Java and talks about how, in versions of Java prior to 1.2, the String.hashCode() function used to consider at most 16 characters of a given String. This ran somewhat faster than the current implementation, but resulted in shockingly poor performance in certain situations.
In general, if your specific data set is very well-defined and you could exploit some uniqueness in it, you could probably make a better hash function. For general purpose Strings, good luck.

Related

I don't understand what 0x7fffffff means. Is there any other way to code the getHashValue method?

public int getHashValue(K key) {
    return (key.hashCode() & 0x7fffffff) % size;
}
The constant 0x7FFFFFFF is a 32-bit integer in hexadecimal with all but the highest bit set.
Despite the name, this method isn't getting the hashCode, rather looking for which bucket the key should appear in for a hash set or map.
When you use % on a negative value, you get a negative value (or zero). There are no negative buckets, so to avoid this you can remove the sign bit (the highest bit). One way of doing this is to use a mask, e.g. x & 0x7FFFFFFF, which keeps all the bits except the top one. Another way is to shift the value, x >>> 1, which drops the lowest bit instead.
A slightly better approach is to take the modulus first and then apply Math.abs. This uses all the bits of the hashCode, which might be better.
e.g.
public int getBucket(K key) {
    return Math.abs(key.hashCode() % size);
}
Even this is not ideal, as some hashCode() implementations have a poor distribution, resulting in a higher collision rate. You might want to agitate the hash code before taking the modulus:
public int getBucket(K key) {
    return Math.abs(hash(key) % size);
}
HashMap in Java 8 uses this:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
The function can be this simple because the Java 8 HashMap handles collisions efficiently (buckets with many entries are converted to balanced trees). In Java 7 it used this function:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
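As for another way to code the bucket lookup, one option is Math.floorMod (available since Java 8), which returns a non-negative result for a positive divisor. This is only a sketch: BucketDemo, bucket and maskedBucket are hypothetical names, and the hash code and size are passed in as plain ints instead of using a generic key.

```java
class BucketDemo {
    // Hypothetical helper: maps any int hash code to a bucket in [0, size), assuming size > 0.
    static int bucket(int hashCode, int size) {
        return Math.floorMod(hashCode, size);
    }

    // The masking variant from the question, for comparison.
    static int maskedBucket(int hashCode, int size) {
        return (hashCode & 0x7fffffff) % size;
    }

    public static void main(String[] args) {
        System.out.println(bucket(-7, 10));       // 3
        System.out.println(maskedBucket(-7, 10)); // 1 (2147483641 % 10)
    }
}
```

Both variants stay in range for every int, including Integer.MIN_VALUE. Note that the order matters for the Math.abs version: Math.abs(hashCode) % size can go negative, because Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.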
That's the hexadecimal representation of the maximum int value (Integer.MAX_VALUE).
Masking with 0x7fffffff just clears the sign bit of the number.
Using a Java REPL you can see the results of the operation:
> -7 & 0x7fffffff
java.lang.Integer res1 = 2147483641
> 2147483641 & 0x7fffffff
java.lang.Integer res2 = 2147483641
> 2147483641 + 6
java.lang.Integer res3 = 2147483647
> 7 + 2147483641
java.lang.Integer res4 = -2147483648
In the binary representation, the highest bit is the sign. If you set it to zero, a negative number becomes its value plus 2^31 (a non-negative number), and a non-negative number stays the same.

Very fast universal hash function for 128 bit keys

I need a very fast universal hash function for a 128-bit key. The returned value needs to be about 32 bit (well, 16 bit would be sufficient; in most cases I only need 1-4 bits actually).
Universal hash means, there are two parameters: key (128 bit) and index (64 bit). For two keys, the universal hash function needs to return different result eventually, if called with different indexes. So with a different index, the universal hash should behave like a different hash function. For x = universalHash(k, i) and y = universalHash(k, i + 1), it would be best if on average 50% of all bits are different between x and y (randomly). The same for the case if the method is called with different keys. In practise, 5% off is OK for me.
It needs to be very fast (one or two multiplications at most). It is called millions of times. Please don't say: no, you won't need it to be fast. It also needs to return different values eventually.
What I have so far (Java code; due to the lack of a 128-bit data type, the key is the composite of a and b, which are 64 bits each):
int universalHash(long a, long b, long index) {
    long x = a ^ Long.rotateLeft(b, (int) index) ^ index;
    int y = (int) ((x >>> 32) ^ x);
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = (y >>> 16) ^ y;
    return y;
}
int universalHash2(long a, long b, long index) {
    long x = Long.rotateLeft(a, (int) index) ^
            Long.rotateRight(b, (int) index) ^ index;
    x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
    return (int) ((x >>> 32) ^ x);
}
(The second method is actually broken for some values.)
I would like to have a hash function that is faster than those above, and is guaranteed to work in all cases (if possible provably correct, even though that's not a strict requirement; it doesn't need to be cryptographically secure, however).
I will call the universalHash method with incrementing index (first index 0, then index 1, and so on) for the same keys. It would be best if the next result could be calculated faster (e.g. without multiplication) from the previous result. But I also need to have a fast "direct access" if the index is some value (as in the example code).
Background
The problem I'm trying to solve is finding an MPHF (minimal perfect hash function) for a relatively small set of keys (up to 16 keys by directly mapping, and up to about 1024 keys by splitting into smaller subsets). For details on the algorithm, see my MinPerf project, especially the RecSplit algorithm. To support sets of size 10^12 (like BBHash), I'm trying to internally use 128-bit signatures, which would simplify the algorithm.
You need a hash function that outputs 32 bits for 128 bits of inputs.
A simple way would be to just return "some" 32 bits out of the original 128 bits. There are many ways of choosing 32 bits and every choice will have collisions. But the index can decide which 32 bits to choose.
128/32 = 4, so 4 indices are enough to find at least one different bit.
For index 0 you choose the lowest 32 bits
For index 1 you choose the next 32 bits
and so on.
The C implementation would be
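#include <stdint.h>

/* For lack of a portable 128-bit data type, the key is passed as two 64-bit halves. */
uint32_t universal_hash(uint64_t key_higher, uint64_t key_lower, int index) {
    index &= 3; /* only four 32-bit windows exist; also avoids shifting by 64 bits or more */
    uint64_t half = (index >= 2) ? key_higher : key_lower;
    return (uint32_t)(half >> ((index & 1) * 32));
}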
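Since the question's code is in Java, the same window-selection idea might look like the sketch below. It assumes the index simply wraps modulo 4, and WindowHashDemo and windowHash are names invented here. It only guarantees that two different keys differ in at least one of the four windows; it does no mixing, so it does not by itself give the roughly 50% bit-difference behaviour the question asks for.

```java
class WindowHashDemo {
    // Returns one of the four 32-bit windows of the 128-bit key (hi:lo), selected by index mod 4.
    static int windowHash(long hi, long lo, long index) {
        int window = (int) (index & 3);           // 0..3, also valid for negative indexes
        long half = (window >= 2) ? hi : lo;      // windows 0-1 come from lo, 2-3 from hi
        return (int) (half >>> ((window & 1) * 32));
    }

    public static void main(String[] args) {
        long hi = 0x1122334455667788L;
        long lo = 0x99aabbccddeeff00L;
        for (long index = 0; index < 4; index++) {
            System.out.println(Integer.toHexString(windowHash(hi, lo, index)));
        }
        // prints ddeeff00, 99aabbcc, 55667788, 11223344
    }
}
```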

Why are numbers like 4, 20, 12, 7 used in the hash function in the `HashMap` class?

I was reading about how exactly HashMap works in Java. In the hash method of the HashMap class, the hash code is one operand of the unsigned right shift operator (>>>), and the other operands are numbers like 12, 7, 4 and 20. Some more processing is then done on the result. My question is: why were these four numbers chosen for the hash function, which is ultimately used to calculate the bucket position?
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
It’s not that “only these four numbers are chosen for calculating the value in the hash function”; the hash code returned by the hashCode method of the key object is the (very important) input. This method in the HashMap implementation just tries to improve it, given the knowledge about how the HashMap will use that value afterwards.
Typical implementations will only use the lower bits of a hash code as the internal table has a size which is a power of two. Therefore the improvement shall ensure that the likelihood of having different values in the lower bits is the same even if the original hash codes for different keys differ in upper bits only.
Take for example Integer instances used as keys: their hash code is identical to their value as this will spread the hash codes over the entire 2³² int range. But if you put the values 0xa0000000, 0xb0000000, 0xc0000000, 0xd0000000 into the map, a map using only the lower bits would have poor results. This improvement fixes this.
The numbers chosen for this bit manipulation, and the algorithm in general, are a subject of ongoing investigation. You will see changes between JVM implementations, as the development never stops.
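The effect on the four example keys can be checked with a small sketch. It reuses the hash and indexFor functions quoted in the question and assumes a table of 16 buckets; SpreadDemo and distinctBuckets are names made up for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

class SpreadDemo {
    // Supplemental hash quoted in the question (Java 7 HashMap).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Counts how many different buckets the keys land in, with or without the supplemental hash.
    static int distinctBuckets(int[] keys, int length, boolean spread) {
        Set<Integer> buckets = new HashSet<>();
        for (int k : keys) {
            buckets.add(indexFor(spread ? hash(k) : k, length));
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        int[] keys = {0xa0000000, 0xb0000000, 0xc0000000, 0xd0000000};
        System.out.println(distinctBuckets(keys, 16, false)); // 1: all lower bits are zero
        System.out.println(distinctBuckets(keys, 16, true));  // 4: upper bits now reach the index
    }
}
```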

How does HashMap make sure the index calculated using hashcode of key is within the available range?

I went through the source code of HashMap and have a few questions. The put method takes the key and value and:
1. applies the hashing function to the hash code of the key;
2. calculates the bucket location for this pair using the hash obtained from the previous step.
public V put(K key, V value) {
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    .....
}
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
static int indexFor(int h, int length) {
    return h & (length-1);
}
Example:
1. Create a HashMap with size 10.
2. Call put(k, v) three times and assume these three entries occupy bucket locations 7, 8 and 9.
3. Call put with a 4th k,v pair, and the following happens:
   - hash() is called with key.hashCode() and the hash is calculated
   - indexFor is calculated based on the hash
Question:
What if the calculated bucket location for the 4th k,v pair is out of the existing bounds, say location 11?
Thanks in advance
Akh
For your first question: the map always uses a power of two for the size (if you give it a capacity of 10, it will actually use 16), which means index & (length - 1) will always be in the range [0, length) so it's always in range.
It's not clear what your second and third question relate to. I don't think HashMap reallocates everything unless it needs to.
HashMaps will generally use the hash code mod the number of buckets. What happens when there is a collision depends on the implementation (not sure for Java's HashMap). There are two basic strategies: keeping a list of items that fall in the bucket, or skipping forward to other buckets if your bucket is full. My guess would be that HashMap uses the list bucket approach.
Let's go into more detail. How does HashMap initialize the bucket count?
The following code is from HashMap.java:
while (i < paramInt)
    i <<= 1;
If you pass an initial capacity of 10, the code above rounds it up to a power of 2, so HashMap initializes the table with 16 buckets.
The code below is then used to calculate the bucket index:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
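Putting the two pieces together, a minimal sketch of why location 11 is never out of bounds for a requested size of 10. CapacityDemo and roundUpToPowerOfTwo are hypothetical names, and the loop assumes the requested capacity is at most 2^30 (the real HashMap caps the capacity before rounding).

```java
class CapacityDemo {
    // Rounds a requested capacity up to the next power of two, like the loop quoted above.
    // Assumes requested <= 2^30; larger values would overflow the shift.
    static int roundUpToPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested) {
            capacity <<= 1;
        }
        return capacity;
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = roundUpToPowerOfTwo(10);
        System.out.println(length);               // 16
        System.out.println(indexFor(11, length)); // 11, a valid slot in a 16-bucket table
        System.out.println(indexFor(27, length)); // 11, higher bits are masked off
        System.out.println(indexFor(-1, length)); // 15, negative hashes stay in range too
    }
}
```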

Simple hash function techniques

I'm pretty new to hashing in Java and I've been getting stuck on a few parts. I have a list of 400 items (stored in a table of 1.5x = 600 slots), whose item ids range from 1 to 10k. I've been looking at a few hash functions and I initially copied the examples in the packet, which just used folding. I noticed that I've been getting about 50-60% null nodes, which is apparently too many. I also noticed that just modding the id by 600 tends to reduce it to a solid 50% nulls.
My current hash function looks something like the following. For being as ugly as it is, it gives only a 1% decrease in nulls compared with simple modding, with an average list length of 1.32...
public int getHash( int id )
{
    int hash = id;
    hash <<= id % 3;
    hash += id << hash % 5;
    /* let's go digit by digit! */
    int digit;
    for( digit = id % 10;
         id != 0;
         digit = id % 10, id /= 10 )
    {
        if ( digit == 0 ) /* prevent division by zero */
            continue;
        hash += digit * 2;
    }
    hash >>= 5;
    return (hash % 600);
}
What are some good techniques for creating simple hash functions?
I would keep it simple. Return the id of your element as your hashcode, and let the hashtable worry about rehashing it if it feels it needs to. Your goal should be to make a hash code unique to your object.
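For context on the 50% figure in the question, it may help to estimate what an ideal hash would give. If n items are spread uniformly at random over m buckets, the expected fraction of empty buckets is (1 - 1/m)^n. The sketch below evaluates this for n = 400 and m = 600; the class and method names are made up, and uniform random placement is an idealising assumption.

```java
class EmptyBucketEstimate {
    // Expected fraction of empty buckets when n items are placed
    // uniformly at random into m buckets: (1 - 1/m)^n.
    static double expectedEmptyFraction(int n, int m) {
        return Math.pow(1.0 - 1.0 / m, n);
    }

    public static void main(String[] args) {
        // Numbers from the question: 400 items, 600 buckets.
        System.out.printf("%.3f%n", expectedEmptyFraction(400, 600));
    }
}
```

The result is about 0.51, so roughly half the buckets staying empty is what even a well-distributed hash would produce at this load, and a more elaborate function is unlikely to improve on it much.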
The Java HashMap uses the following rehashing method:
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
There's a nice review article here. Also, the Wikipedia article on hash functions is a good overview. It suggests using a chi-squared test to assess the quality of your hash function.
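A chi-squared check along those lines could be sketched as follows. This is one possible reading of the suggestion, not a prescribed procedure: it computes the Pearson statistic of the bucket counts against a uniform expectation, the ids are generated with a fixed-seed java.util.Random purely as stand-in data, and all names are hypothetical.

```java
import java.util.Random;

class ChiSquaredDemo {
    // Pearson chi-squared statistic of bucket counts against a uniform expectation.
    // Assumes at least one item has been counted.
    static double chiSquared(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        double expected = (double) total / counts.length;
        double chi2 = 0.0;
        for (int c : counts) {
            double d = c - expected;
            chi2 += d * d / expected;
        }
        return chi2;
    }

    // Counts how many ids fall into each bucket under plain modding.
    static int[] bucketCounts(int[] ids, int buckets) {
        int[] counts = new int[buckets];
        for (int id : ids) {
            counts[Math.floorMod(id, buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Random random = new Random(42);        // stand-in data, not the asker's ids
        int[] ids = new int[400];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = 1 + random.nextInt(10_000);
        }
        System.out.println(chiSquared(bucketCounts(ids, 600)));
    }
}
```

For a hash that behaves like a uniform random assignment, the statistic should land near the number of buckets minus one; values far above that suggest clustering.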
