Why does the hash function XOR the hashCode? - java

I read the explanation, but I could not understand what we achieve by XORing the hashCode. Can anyone give an example?
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
This code was taken from the HashMap source code. I just wanted to know why they used XOR. Marko has answered it properly: the HashMap implementation uses the lower-end bits. I think it is not only HashMap; other collections will be doing the same, which is why I did not mention any collection name. I don't understand why people downvoted this question.

This is a typical maneuver to protect against "bad" hash codes: ones whose lower-end bits are not variable enough. Java's HashMap implementation relies only on the lower-end bits of the hash code to select the bucket.
However, the motivation for applying this kind of spreading yourself expired long ago, because HashMap already does its own bit spreading internally. It would make sense if used with Hashtable, but of course no code written since the year 2000 should ever use that class.
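To make the "lower-end bits" point concrete, here is a minimal, self-contained sketch (the helpers bucketIndex and spread are my own names, not HashMap's API; they merely mirror what the quoted code does). Two hash codes that differ only in their upper 16 bits land in the same slot of a small power-of-two table, but separate once the high bits are XORed down:

public class LowBitsDemo {
    // Mirrors how HashMap picks a bucket: with a power-of-two table size,
    // only the low bits of the hash take part in the index.
    static int bucketIndex(int hash, int tableSize) {
        return hash & (tableSize - 1); // tableSize is assumed to be a power of two
    }

    // The spreading step from the question, copied here for illustration.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h1 = 0x0001ABCD; // two hash codes that differ
        int h2 = 0xFFFFABCD; // only in their upper 16 bits
        int n = 16;          // small table: only the low 4 bits are used

        System.out.println(bucketIndex(h1, n) + " " + bucketIndex(h2, n));                 // 13 13 -> collision
        System.out.println(bucketIndex(spread(h1), n) + " " + bucketIndex(spread(h2), n)); // 12 2  -> no collision
    }
}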

The code below is from the OpenJDK HashMap source code: HashMap.java
/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
The XOR is there so that the hash result is more evenly distributed.
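The Float example mentioned in that comment is easy to reproduce. The following sketch (the spread helper simply copies the shift-and-XOR from the code above) uses the fact that consecutive whole-number Float keys have hash codes whose low 16 bits are all zero, so without spreading they all mask down to bucket 0:

import java.util.HashSet;
import java.util.Set;

public class FloatKeysDemo {
    // The spreading step from the HashMap source above, reproduced for illustration.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int mask = 255; // a table with 256 buckets uses only the low 8 bits of the hash
        Set<Integer> rawBuckets = new HashSet<>();
        Set<Integer> spreadBuckets = new HashSet<>();
        for (float f = 1.0f; f <= 32.0f; f++) {
            int h = Float.hashCode(f); // the raw float bits; the low 16 bits are all zero here
            rawBuckets.add(h & mask);
            spreadBuckets.add(spread(h) & mask);
        }
        System.out.println("distinct buckets without spreading: " + rawBuckets.size()); // 1 (everything in bucket 0)
        System.out.println("distinct buckets with spreading:    " + spreadBuckets.size()); // many
    }
}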

Related

hashCode values not the same in debugger and output in NetBeans 8.2

Map<String, Integer> map = new HashMap<>();
map.put("Naveen", 100);
System.out.println("Naveen".hashCode());
/* output: -1968696341, so index = (-1968696341 & 15) = 11,
   but in the NetBeans 8.2 / JDK 1.8 debugger the hash code is -1968662205,
   so the index = (-1968662205 & 15) = 3 */
Where is the problem? My environment is NetBeans 8.2 and JDK 1.8.
The actual hashCode of the string "Naveen" is indeed -1968696341, and it must always be so by specification (despite comments to the contrary).
The HashMap implementation doesn't use the key's hashCode value directly. Instead, it "spreads" the bits using the formula h ^ (h >>> 16) in order to use the high-order bits to help reduce collisions. If you apply this formula to the string's hashCode, the result is -1968662205 which matches what you see in the debugger.
The JDK 8 code for this is here, along with an explanation in a comment, quoted here for convenience.
/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
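You can verify both numbers from the question in a few lines; the code below just re-applies the formula quoted above to "Naveen".hashCode():

public class NaveenHashDemo {
    public static void main(String[] args) {
        int h = "Naveen".hashCode();
        int spread = h ^ (h >>> 16);     // the same formula HashMap.hash() applies

        System.out.println(h);           // -1968696341 (the value the program prints)
        System.out.println(spread);      // -1968662205 (the value shown in the debugger)
        System.out.println(h & 15);      // 11 -> index if the raw hashCode were used
        System.out.println(spread & 15); // 3  -> index actually used for a table of size 16
    }
}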

Why and how does HashMap have its own internal implementation of hashCode() called hash()?

According to this blog entry, HashMap reinvokes its own implementation of hashCode() (called hash()) on a hashcode it already retrieved.
If the key is not null, it will call the hash function on the key object (see line 4 in the method above, i.e. key.hashCode()), so after key.hashCode() returns hashValue, line 4 looks like
int hash = hash(hashValue)
and now it applies the returned hashValue to its own hashing function.
We might wonder why we are calculating the hash value again using hash(hashValue). The answer is: it defends against poor-quality hash functions.
Can HashMap accurately reassign hash codes? HashMap can store objects, but it doesn't have access to the logic that assigns a hashCode to its objects. For example, hash() couldn't possibly integrate the logic behind the following hashCode() implementation:
public class Employee {
    protected long employeeId;
    protected String firstName;
    protected String lastName;

    public int hashCode() {
        return (int) employeeId;
    }
}
The hash() method derives the "improved" hash code from the actual hash code, so equal input will always produce equal output (from jdk1.8.0_51):
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
As to why the hash code needs improvement, read the javadoc of the method:
Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.
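Here is a small sketch of what "equal input will always produce equal output" means for the Employee class above, and of its flip side: since hash() is a pure function of the hash code, it cannot separate keys whose hashCode() values are already identical (the spread helper just copies the JDK 8 formula):

public class SpreadDeterminismDemo {
    // A copy of the JDK 8 spreading step, for illustration only.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        // Two Employee-style hash codes computed the same way (hashCode = (int) employeeId).
        int a = (int) 42L;
        int b = (int) 42L;
        System.out.println(spread(a) == spread(b)); // true: equal input, equal output

        // A different employeeId that truncates to the same int still collides after spreading:
        int c = (int) (42L + (1L << 32)); // also 42 after the (int) cast
        System.out.println(spread(a) == spread(c)); // true: the collision is not repaired
    }
}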

Why hash the hashCode in Java HashMap? [duplicate]

I am reading the code of the HashMap class provided by the Java 1.6 API and am unable to fully understand the need for the following operation (found in the body of the put and get methods):
int hash = hash(key.hashCode());
where the method hash() has the following body:
private static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
This effectively recalculates the hash by executing bit operations on the supplied hashcode. I'm unable to understand the need to do so even though the API states it as follows:
This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.
I do understand that the key-value pairs are stored in an array of data structures, and that the index location of an item in this array is determined by its hash.
What I fail to understand is how would this function add any value to the hash distribution.
As Helper wrote, it is there just in case the existing hash function for the key objects is faulty and does not do a good-enough job of mixing the lower bits. According to the source quoted by pgras,
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
The hash is being ANDed with length - 1, where length is a power of two (therefore, length - 1 is guaranteed to be a sequence of 1s). Due to this ANDing, only the lower bits of h are being used; the rest of h is ignored. Imagine that, for whatever reason, the original hash only returns numbers divisible by 2. If you used it directly, the odd-numbered positions of the hashmap would never be used, leading to a 2x increase in the number of collisions. In a truly pathological case, a bad hash function can make a hashmap behave more like a list than like an O(1) container.
Sun engineers must have run tests that show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits. Under these circumstances, the bit operations in HashMap's hash(int h) can provide a net improvement over most expected use-cases (due to lower collision rates), even though extra computation is required.
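To see the "only returns numbers divisible by 2" scenario in action, here is a sketch with an artificially bad hash that only produces multiples of 8; the hash(int) method is the JDK 1.6 code from the question:

import java.util.HashSet;
import java.util.Set;

public class EvenHashDemo {
    // A copy of the JDK 1.6 supplemental hash, for illustration.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int mask = 15; // table of length 16, as in indexFor(h, 16)
        Set<Integer> rawBuckets = new HashSet<>();
        Set<Integer> mixedBuckets = new HashSet<>();

        // A "bad" hashCode that only produces multiples of 8: the low three bits never vary.
        for (int i = 0; i < 100; i++) {
            int badHash = i * 8;
            rawBuckets.add(badHash & mask);        // only buckets 0 and 8 are ever used
            mixedBuckets.add(hash(badHash) & mask);
        }
        System.out.println("buckets used without supplemental hash: " + rawBuckets);
        System.out.println("buckets used with supplemental hash:    " + mixedBuckets.size() + " distinct");
    }
}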
I read somewhere that this is done to ensure a good distribution even if your hashCode implementation, well, err, sucks.
As you know, with HashMap the underlying implementation is a hash table, specifically a closed-bucket (separate chaining) hash table. The load factor determines the appropriate ratio of the number of objects in the collection to the total number of buckets.
Let's say you keep adding more elements. Each time you do so, and it's not an update, it runs the object's hashCode method and uses the number of buckets, via the index calculation, to decide which bucket the object should go in.
As n (the number of elements in the collection) / m (the number of buckets) gets larger, your performance for reads and writes gets worse and worse.
Assuming your hashCode algorithm is amazing, performance is still contingent upon this ratio n/m.
Rehashing is also used to change the number of buckets while still keeping the same load factor with which the collection was constructed.
Remember, the main benefit of any hash implementation is the ideal O(1) performance for reads and writes.
As you know, Object.hashCode() can be overridden by users, so a really bad implementation could produce non-random lower-order bits. That would tend to crowd some of the buckets and leave many buckets unfilled.
I just created a visual map of what they are trying to do in hash(). It seems that the hash(int h) method is just scrambling the bits through bit-level manipulation so that the resulting numbers are more randomly (and hence more uniformly across buckets) distributed.
Each bit is remapped to a different bit as follows:
h1 = h1 ^ h13 ^ h21 ^ h9 ^ h6
h2 = h2 ^ h14 ^ h22 ^ h10 ^ h7
h3 = h3 ^ h15 ^ h23 ^ h11 ^ h8
h4 = h4 ^ h16 ^ h24 ^ h12 ^ h9
h5 = h5 ^ h17 ^ h25 ^ h13 ^ h10
. . . .
till h12.
As you can see, each bit of h is mixed with bits far away from itself, so the result is going to be pretty much random and not going to crowd any particular bucket. Hope this helps. Send me an email if you need the full visual.

What hash function is better?

I am writing my own implementation of HashMap in Java. I use open addressing for collision resolution. For better key distribution I want to use a nice hash function for the int hash code of the key. I don't know which hash function is best for this.
public int getIndex(K key) { return hash(key.hashCode()) % capacity; }
I need a hash function for the hash code of the key.
Any hash that distributes the values you're expecting to receive evenly is a good hash function.
Your goal is to maximize performance (well, maximize performance while maintaining correctness). The primary concern there is to minimize bucket collisions. This means that the ideal hash is tailored to your input data - if you know what you'll receive, you can choose the hash the produces a minimal number of collisions and maybe even a cache-optimal access pattern.
However, that's not usually a realistic option, so you just choose a hash whose output is unbiased and unpredictable (one that behaves like a pseudorandom number generator, but deterministic). Some such functions are the "murmur" hash family.
The main problem with using % capacity is that it can return both negative and positive values (the result is negative whenever the hash code is negative).
HashMap avoids this issue by using a power-of-two capacity and the following approach:
public int getIndex(K key) { return hash(key.hashCode()) & (capacity-1); }
If the capacity is not a power of 2, you can ignore the high (sign) bit, which is often not so random anyway:
public int getIndex(K key) { return (hash(key.hashCode()) & 0x7FFFFFFF) % capacity; }
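A tiny demo of the negative-result problem and of the two workarounds above, reusing the "Naveen" hash code from the earlier question (the exact values depend only on String.hashCode, which is specified):

public class NegativeModuloDemo {
    public static void main(String[] args) {
        int h = "Naveen".hashCode();      // -1968696341, a negative hash code
        int capacity = 16;

        System.out.println(h % capacity);                // -5: Java's % keeps the sign of the dividend
        System.out.println(h & (capacity - 1));          // 11: power-of-two masking is never negative
        System.out.println((h & 0x7FFFFFFF) % capacity); // 11: clearing the sign bit also works
    }
}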
The hash function actually used can matter. HashMap uses the following
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
I would use this, unless you have a good reason not to. E.g. for security reasons: if you have a service which could be the subject of a denial-of-service attack, you will want to use a different hash to avoid a malicious user turning your HashMap into a LinkedList. Unfortunately you then have to use a different hashCode() as well, because an attacker can create a long list of Strings with the same underlying hash code, so mutating the hash afterwards is too late.
Here is a list of strings which all have a hashCode() of 0; there is nothing a hash() function can do about that.
Why doesn't String's hashCode() cache 0?
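The linked question is about strings whose hashCode() is exactly 0, but the underlying attack only needs many distinct strings sharing one hash code, and those are easy to build because String hashing is compositional: "Aa" and "BB" both hash to 2112, so any concatenation of those two blocks collides as well. A sketch:

import java.util.ArrayList;
import java.util.List;

public class CollidingStringsDemo {
    public static void main(String[] args) {
        // Build all concatenations of the blocks "Aa" and "BB"; every string of a given
        // length has the same hashCode, and no post-hoc hash(h) mixing can separate them.
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int round = 0; round < 3; round++) {
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        // 8 distinct six-character strings, all with the same hash code.
        keys.forEach(k -> System.out.println(k + " -> " + k.hashCode()));
    }
}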

Explanation of HashMap#hash(int) method

Can someone please explain to me the static HashMap#hash(int) method?
What's the justification behind it for generating uniformly distributed hashes?
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
An example would make it easier to digest.
Clarification
I'm aware of the operators, truth tables and bitwise operations. I just can't really decode the implementation or the comment, or even the reasoning behind it.
>>> is the logical right shift (no sign-extension) (JLS 15.19 Shift Operators), and ^ is the bitwise exclusive-or (JLS 15.22.1 Integer Bitwise Operators).
As to why this is done, the documentation offers a hint: HashMap uses power-of-two length tables, and hashes keys by masking away the higher bits and taking only the lower bits of their hash code.
// HashMap.java -- edited for conciseness
static int indexFor(int h, int length) {
    return h & (length-1);
}

public V put(K key, V value) {
    int hash = hash(key.hashCode());
    int index = indexFor(hash, table.length);
    // ...
}
So hash() attempts to bring relevancy to the higher bits, which otherwise would get masked away (indexFor basically discards the higher bits of h and takes only the lower k bits where length == (1 << k)).
Contrast this with the way Hashtable (which is expected NOT to have a power-of-two length table) uses a key's hash code.
// Hashtable.java -- edited for conciseness
public synchronized V get(Object key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    // ...
}
By doing the more expensive % operation (instead of simple bit masking), the performance of Hashtable is less sensitive to hash codes with poor distribution in the lower bits (especially if table.length is a prime number).
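A two-line comparison of the two indexing strategies (the hash values are arbitrary, chosen so that they agree in the low 4 bits):

public class MaskVsModDemo {
    public static void main(String[] args) {
        // Two hash codes that differ only above the low 4 bits.
        int h1 = 0x1230;
        int h2 = 0x4560;

        // HashMap-style: power-of-two table of 16, index = h & 15 -> both land in slot 0.
        System.out.println((h1 & 15) + " " + (h2 & 15));

        // Hashtable-style: prime table length 11, index = (h & 0x7FFFFFFF) % 11
        // -> the higher bits influence the result, so the keys separate (3 and 6).
        System.out.println(((h1 & 0x7FFFFFFF) % 11) + " " + ((h2 & 0x7FFFFFFF) % 11));
    }
}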
I don't know how all the shifting works, but the motivation is laid out in the comments:
The way the HashMap is implemented relies on the hashCode function being sufficiently well implemented. In particular, the lower bits of the hash value should be distributed evenly. If you have many collisions on the lower bits, the HashMap will not perform well.
Because the implementation of hashCode is outside of the control of HashMap (every object can implement their own), they supply an additional hash function that shifts the object's hashCode around a little to ensure that the lower bits are distributed more randomly. Again, I have no idea how this works exactly (or how effective it is), but I assume it depends on at least the higher bits being distributed equally (it seems to mesh the higher bits into the lower bits).
So what this does is to try to minimize collisions (and thus improve performance) in the presence of poorly implemented hashCode methods.
