Simple hash function techniques - java

I'm pretty new to hashing in Java and I've been getting stuck on a few parts. I have a list of 400 items (stored in a table of 1.5x = 600 slots), with item IDs ranging from 1 to 10,000. I've been looking at a few hash functions and I initially copied the examples in the packet, which just used folding. I noticed that I've been getting about 50-60% null nodes, which is apparently too many. I also noticed that just modding the ID by 600 tends to reduce it to a solid 50% nulls.
My current hash function looks something like the code below, and for being as ugly as it is, it only achieves a 1% decrease in nulls over a simple mod, with an average list length of 1.32...
public int getHash( int id )
{
    int hash = id;
    hash <<= id % 3;
    hash += id << hash % 5;
    /* let's go digit by digit! */
    int digit;
    for( digit = id % 10;
         id != 0;
         digit = id % 10, id /= 10 )
    {
        if ( digit == 0 ) /* prevent division by zero */
            continue;
        hash += digit * 2;
    }
    hash >>= 5;
    return (hash % 600);
}
What are some good techniques for creating simple hash functions?

I would keep it simple. Return the id of your element as your hashcode, and let the hashtable worry about rehashing it if it feels it needs to. Your goal should be to make a hash code unique to your object.
The Java HashMap uses the following rehashing method:
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
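If you do want to roll your own for the 600-slot table in the question, a minimal sketch (my own, not JDK code) would be to spread the id's bits the same way and then reduce the result to the table size; Math.floorMod keeps the index non-negative:
public int getHash(int id) {
    int h = id;
    // spread the high bits down so ids that differ only in upper bits separate
    h ^= (h >>> 20) ^ (h >>> 12);
    h = h ^ (h >>> 7) ^ (h >>> 4);
    // reduce to the 600-slot table; floorMod avoids a negative index
    return Math.floorMod(h, 600);
}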

There's a nice review article here. Also, the Wikipedia article on hash functions is a good overview. It suggests using a chi-squared test to assess the quality of your hash function.
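As a rough illustration of that idea, here is a sketch of the chi-squared check (my own code; getHash and the ids array stand in for whatever function and key set you are testing). The statistic should come out close to the number of buckets for a well-spread function; values far above that indicate clustering:
double chiSquared(int[] ids, int tableSize) {
    int[] counts = new int[tableSize];
    for (int id : ids) {
        counts[Math.floorMod(getHash(id), tableSize)]++;   // tally keys per bucket
    }
    double expected = (double) ids.length / tableSize;      // ideal keys per bucket
    double chi2 = 0;
    for (int c : counts) {
        double diff = c - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}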


Why doesn't the calculation of hash in HashMap (JDK 1.8) need to consider a negative hashCode as ConcurrentHashMap does?

In HashMap: (h = key.hashCode()) ^ (h >>> 16);
In ConcurrentHashMap: (h ^ (h >>> 16)) & HASH_BITS;
where HASH_BITS is 0x7fffffff; ANDing with HASH_BITS makes the result always non-negative.
Why doesn't the calculation of hash in HashMap (JDK 1.8) need to consider a negative hashCode as ConcurrentHashMap does?
Ultimately, the case where the hash is negative (after spreading) does need to be considered in the HashMap case as well. It is just that this happens later in the code.
For example, in getNode (Java 8) you find this:
Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
if ((tab = table) != null && (n = tab.length) > 0 &&
    (first = tab[(n - 1) & hash]) != null) {
Since tab.length is a power of 2, tab.length - 1 is a suitable bitmask for reducing hash to a subscript for the array.
You can rest assured that in every implementation of HashMap or ConcurrentHashMap there is some code that reduces the hash code to a number that is suitable for use as a subscript. It will be there ... somewhere.
But also ... don't expect the code of these classes to be easy to read. All of the collection classes have been reworked / tuned multiple times to get the best possible (average) performance over a wide range of test cases.
Actually it does handle negative index calculations. It's not evident at first glance, but there are calculations in a few places when the elements (key or value) are accessed:
int index = (n - 1) & hash, in which n is the length of the table.
This simply handles negative indexing.
AFAIK, HashMap always uses arrays sized to a power of 2 (e.g. 16, 32, 64, etc.).
Let's assume we have a capacity of 256 (0x100), which is 2^8.
After subtracting 1, we get 256 - 1 = 255, which is 0x100 - 0x1 = 0xFF.
The subtraction yields exactly the bit mask needed to bitwise-AND with the hash, producing a bucket index between 0 and length - 1.
256 - 1 = 255
0x100 - 0x1 = 0xFF
A hash of 260 (0x104) gets bitwise-anded with 0xFF to yield a bucket number of 4.
A hash of 257 (0x101) gets bitwise-anded with 0xFF to yield a bucket number of 1.
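A quick way to convince yourself that the mask also takes care of negative hashes (my own illustration, not JDK code):
int n = 16;                    // table length, always a power of two
int hash = -123456789;         // a negative (spread) hash code
int index = (n - 1) & hash;    // keeps only the low 4 bits of the two's-complement value
System.out.println(index);     // 11 -- always in [0, n - 1], never negative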

Very fast universal hash function for 128 bit keys

I need a very fast universal hash function for a 128-bit key. The returned value needs to be about 32 bits (well, 16 bits would be sufficient; in most cases I only need 1-4 bits actually).
Universal hash means there are two parameters: key (128 bits) and index (64 bits). For two distinct keys, the universal hash function needs to eventually return different results when called with different indexes. So with a different index, the universal hash should behave like a different hash function. For x = universalHash(k, i) and y = universalHash(k, i + 1), it would be best if on average 50% of all bits differed between x and y (randomly). The same goes for calls with different keys. In practice, being 5% off is OK for me.
It needs to be very fast (one or two multiplications at most). It is called millions of times. Please don't say: no, you won't need it to be fast. It also needs to return different values eventually.
What I have so far is below (Java code; C would look similar, but due to the lack of a 128-bit data type, the key is the composite of a and b, which are 64 bits each):
int universalHash(long a, long b, long index) {
    long x = a ^ Long.rotateLeft(b, (int) index) ^ index;
    int y = (int) ((x >>> 32) ^ x);
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = (y >>> 16) ^ y;
    return y;
}

int universalHash2(long a, long b, long index) {
    long x = Long.rotateLeft(a, (int) index) ^
             Long.rotateRight(b, (int) index) ^ index;
    x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
    return (int) ((x >>> 32) ^ x);
}
(The second method is actually broken for some values.)
I would like to have a hash function that is faster than those above and is guaranteed to work in all cases (if possible provably correct, even though that's not a strict requirement; it doesn't need to be cryptographically secure, however).
I will call the universalHash method with incrementing index (first index 0, then index 1, and so on) for the same keys. It would be best if the next result could be calculated faster (e.g. without multiplication) from the previous result. But I also need to have a fast "direct access" if the index is some value (as in the example code).
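As a rough way to measure the 50%-bits-differ requirement (just a test sketch using the universalHash above, not part of the requirement itself):
// average number of bits that flip between index i and i + 1; ideally close to 16 of 32
double avgBitFlips(long a, long b, int rounds) {
    long flips = 0;
    for (int i = 0; i < rounds; i++) {
        flips += Integer.bitCount(universalHash(a, b, i) ^ universalHash(a, b, i + 1));
    }
    return (double) flips / rounds;
}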
Background
The problem I'm trying to solve is finding an MPHF (minimal perfect hash function) for a relatively small set of keys (up to 16 keys by direct mapping, and up to about 1024 keys by splitting into smaller subsets). For details on the algorithm, see my MinPerf project, especially the RecSplit algorithm. To support sets of size 10^12 (like BBHash), I'm trying to use 128-bit signatures internally, which would simplify the algorithm.
You need a hash function that outputs 32 bits for 128 bits of inputs.
A simple way would be to just return "some" 32 bits out of the original 128 bits. There are many ways of choosing 32 bits and every choice will have collisions. But the index can decide which 32 bits to choose.
128/32 = 4, so 4 indices are enough to find at least one different bit.
For index 0 you choose the lowest 32 bits
For index 1 you choose the next 32 bits
and so on...
The C implementation would be
uint32_t universal_hash(uint64_t key_higher, uint64_t key_lower, int index) {
    // For lack of a portable 128 bit datatype we take the key in two parts.
    return 0xFFFFFFFF & (index >= 2 ? key_higher >> ((index - 2) * 32)
                                    : key_lower >> (index * 32));
}
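Since the rest of this question is in Java, roughly the same slice-selection idea in Java (my own transcription of the C above, not from the answer) would be:
// pick one of the four 32-bit slices of the 128-bit key, for index values 0..3
static int sliceHash(long keyHigher, long keyLower, int index) {
    return (int) (index >= 2 ? keyHigher >>> ((index - 2) * 32)
                             : keyLower  >>> (index * 32));
}
For indexes beyond 3 you would need to fall back to some other mixing step, but as noted above, four indexes are already enough to find at least one differing bit.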

A faster hash function

I'm trying to implement my own hash function in Java: I add up the ASCII values of each character in the string, then find the hash code by taking the sum modulo the size of the hash table (sum % size). I was wondering if there is a way to use the same process but reduce collisions when searching for a string?
Thanks in advance.
I would look at the code for String and HashMap, as these have a low collision rate, don't use %, and handle negative numbers.
From the source for String
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
From the source for HashMap
/**
 * Retrieve object hash code and applies a supplemental hash function to the
 * result hash, which defends against poor quality hash functions. This is
 * critical because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
As the HashMap table size is always a power of 2, you can use
hash = (null != key) ? hash(key) : 0;
bucketIndex = indexFor(hash, table.length);
and
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length - 1);
}
Using & is much faster than %, and it only returns non-negative numbers because length is positive.
The Java String.hashCode() makes a tradeoff between being a really good hash function and being as efficient as possible. Simply adding up the character values in a string is not a reliable hash function.
For example, consider the two strings dog and god. Since they both contain a 'd', 'g', and an 'o', no method involving only addition will ever result in a different hash code.
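To see the difference concretely, compare a plain character sum with String.hashCode() (my own illustration):
String s1 = "dog", s2 = "god";
int sum1 = 0, sum2 = 0;
for (char c : s1.toCharArray()) sum1 += c;
for (char c : s2.toCharArray()) sum2 += c;
System.out.println(sum1 + " " + sum2);                    // 314 314 -- guaranteed collision
System.out.println(s1.hashCode() + " " + s2.hashCode());  // 99644 102524 -- distinct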
Joshua Bloch, who implemented a good part of Java, discusses the String.hashCode() method in his book Effective Java and talks about how, in versions of Java prior to 1.3, the String.hashCode() function used to consider only 16 characters in a given String. This ran somewhat faster than the current implementation, but resulted in shockingly poor performance in certain situations.
In general, if your specific data set is very well-defined and you could exploit some uniqueness in it, you could probably make a better hash function. For general purpose Strings, good luck.

How does HashMap make sure the index calculated using hashcode of key is within the available range?

I went through the source code of HashMap and have a few questions. The put method takes the key and value and does the following:
applies the hashing function to the hash code of the key,
calculates the bucket location for this pair using the hash obtained from the previous step.
public V put(K key, V value) {
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    .....
}

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length - 1);
}
Example:
Creating a HashMap with size 10.
Call put(k,v) three times and assume these three occupy bucket locations 7, 8 and 9.
Call put with a 4th key/value pair and the following happens:
hash() is called with key.hashCode() and the hash is calculated,
indexFor is calculated based on the hash.
Question:
What if the calculated bucket location for the 4th key/value pair is out of the existing bounds, say location 11?
Thanks in advance
Akh
For your first question: the map always uses a power of two for the size (if you give it a capacity of 10, it will actually use 16), which means hash & (length - 1) will always be in the range [0, length), so it's always in range.
It's not clear what your second and third questions relate to. I don't think HashMap reallocates everything unless it needs to.
HashMaps will generally use the hash code mod the number of buckets. What happens when there is a collision depends on the implementation (not sure for Java's HashMap). There are two basic strategies: keeping a list of items that fall in the bucket, or skipping forward to other buckets if your bucket is full. My guess would be that HashMap uses the list bucket approach.
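For the record, Java's HashMap does use the first strategy (separate chaining); a minimal sketch of that idea (my own simplification, not the JDK code, assuming the static hash(int) spread function shown above) looks like:
static class Node<K, V> {
    final K key;
    V value;
    Node<K, V> next;   // next entry in the same bucket's chain
    Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
}

static <K, V> V chainedGet(Node<K, V>[] table, K key) {
    int index = (table.length - 1) & hash(key.hashCode());  // table.length is a power of two
    for (Node<K, V> e = table[index]; e != null; e = e.next) {
        if (key.equals(e.key)) {
            return e.value;                                  // found it along the bucket's chain
        }
    }
    return null;                                             // key not present in this bucket
}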
Let's go into more detail: how does HashMap initialize the bucket size?
The following code is from HashMap.java:
while (i < paramInt)
    i <<= 1;
If you pass an initial capacity of 10, the above code rounds it up to a power of 2.
So using the above code, HashMap initializes the bucket size to 16.
And the code below is used to calculate the bucket index:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
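Putting the two pieces together, a small sketch (mine, not the JDK source) of how a requested capacity of 10 ends up as 16, after which every index computed with & lands in 0..15:
static int roundUpToPowerOfTwo(int requestedCapacity) {
    int capacity = 1;
    while (capacity < requestedCapacity) {
        capacity <<= 1;   // 1, 2, 4, 8, 16 ... first power of two >= the request
    }
    return capacity;
}

// roundUpToPowerOfTwo(10) == 16, so indexFor(hash, 16) == hash & 15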

Order of items in a HashMap differs when the same program is run on JVM 5 vs JVM 6

I have an application which displays a collection of objects in rows, one object = one row. The objects are stored in a HashMap. The order of the rows does not affect the functionality of the application (that is why a HashMap was used instead of a sortable collection).
However I have noticed that the same application runs differently when run using two different versions of the Java Virtual Machine. The application is compiled using JDK 5, and can be run using either Java 5 or Java 6 runtimes, without any functional difference.
The object in question overrides java.lang.Object#hashCode() and obviously care has been taken to follow the contract specified in the Java API. This is evidenced by the fact that they always appear in the same order in every run of the application (in the same Java runtime).
For curiosity's sake, why does the choice of Java runtime affect the order?
The implementation details of HashMap can and do change. Most likely this package-private method did (this is from JDK 1.6.0_16):
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
For reference, the analogue in JDK 1.5.0_06 is:
/**
 * Returns a hash value for the specified object. In addition to
 * the object's own hashCode, this method applies a "supplemental
 * hash function," which defends against poor quality hash functions.
 * This is critical because HashMap uses power-of two length
 * hash tables.<p>
 *
 * The shift distances in this function were chosen as the result
 * of an automated search over the entire four-dimensional search space.
 */
static int hash(Object x) {
    int h = x.hashCode();

    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}
Probably because a Map is not defined to have any particular iteration order; the order in which the elements come back is likely to be an artifact of its internal implementation and does not need to stay consistent.
If the implementation gets updated between Java 5 and 6 (especially for performance reasons), there's no benefit or obligation of Sun to make sure the iteration order stays consistent between the two.
EDIT: I just found an interesting snippet in one of the early Java 6 releases (unfortunately I'm not sure of the exact version but it's apparently HashMap 1.68 from June 2006):
/**
 * Whether to prefer the old supplemental hash function, for
 * compatibility with broken applications that rely on the
 * internal hashing order.
 *
 * Set to true only by hotspot when invoked via
 * -XX:+UseNewHashFunction or -XX:+AggressiveOpts
 */
private static final boolean useNewHash;
static { useNewHash = false; }

private static int oldHash(int h) {
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}

private static int newHash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
So it seems that despite my above assertions, Sun did in fact consider the consistency of iteration order - at some later point this code was presumably dropped and the new order made the definitive one.
HashMap is not married to any particular ordering, but the LinkedHashMap implementation of Map preserves insertion order.
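A small sketch of the difference (the HashMap output may vary between JDK versions, which is exactly the point):
Map<String, Integer> hashMap = new HashMap<>();
Map<String, Integer> linkedMap = new LinkedHashMap<>();
for (String k : new String[] { "one", "two", "three", "four" }) {
    hashMap.put(k, k.length());
    linkedMap.put(k, k.length());
}
System.out.println(hashMap.keySet());    // some implementation-defined order
System.out.println(linkedMap.keySet());  // always [one, two, three, four]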
