Can someone please explain to me the static HashMap#hash(int) method?
What is the justification behind it, and how does it help generate uniformly distributed hashes?
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
An example would make it easier to digest.
Clarification
I'm aware of the operators, truth tables and bitwise operations. I just can't really decode the implementation or the comment, or the reasoning behind it.
>>> is the logical right shift (no sign-extension) (JLS 15.19 Shift Operators), and ^ is the bitwise exclusive-or (JLS 15.22.1 Integer Bitwise Operators).
As to why this is done, the documentation offers a hint: HashMap uses power-of-two length tables, and hashes keys by masking away the higher bits and taking only the lower bits of their hash code.
// HashMap.java -- edited for conciseness
static int indexFor(int h, int length) {
    return h & (length-1);
}

public V put(K key, V value) {
    int hash = hash(key.hashCode());
    int index = indexFor(hash, table.length);
    // ...
}
So hash() attempts to make the higher bits matter, since they would otherwise be masked away (indexFor essentially discards the higher bits of h and keeps only the lower k bits, where length == (1 << k)).
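To make that concrete, here is a minimal, self-contained sketch (not JDK code; it just copies the two methods quoted above into a demo class) that feeds in hash codes which differ only in their high bits. Masked directly, they all land in bucket 0 of a 16-bucket table; run through hash() first, they spread out:
public class MaskingDemo {
    // The JDK 6 supplemental hash, copied from the question.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int i = 1; i <= 4; i++) {
            int raw = i << 16;   // 0x10000, 0x20000, ... : identical in the low 16 bits
            System.out.printf("raw=%#x rawIndex=%d spreadIndex=%d%n",
                    raw,
                    indexFor(raw, length),        // always 0 -> every key collides
                    indexFor(hash(raw), length)); // different buckets
        }
    }
}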
Contrast this with the way Hashtable (whose table length should NOT be a power of two) uses a key's hash code.
// Hashtable.java -- edited for conciseness
public synchronized V get(Object key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    // ...
}
By doing the more expensive % operation (instead of simple bit masking), the performance of Hashtable is less sensitive to hash codes with poor distribution in the lower bits (especially if table.length is a prime number).
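As a side note on that % line: the & 0x7FFFFFFF is needed because hashCode() can be negative and Java's % keeps the sign of the dividend. A tiny sketch (not JDK code) of the difference:
public class ModuloDemo {
    public static void main(String[] args) {
        int hash = -50;          // e.g. Integer.valueOf(-50).hashCode()
        int tableLength = 11;    // Hashtable's default capacity
        System.out.println(hash % tableLength);                // -6: unusable as an array index
        System.out.println((hash & 0x7FFFFFFF) % tableLength); // a valid index in [0, 10]
    }
}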
I don't know how all the shifting works, but the motivation is laid out in the comments:
The way the HashMap is implemented relies on the hashCode function being sufficiently well implemented. In particular, the lower bits of the hash value should be distributed evenly. If you have many collisions on the lower bits, the HashMap will not perform well.
Because the implementation of hashCode is outside of the control of HashMap (every class can supply its own), they added an additional hash function that shifts the object's hashCode around a little to ensure that the lower bits are distributed more randomly. Again, I have no idea exactly how this works (or how effective it is), but I assume it relies on at least the higher bits being distributed evenly (it seems to mix the higher bits into the lower bits).
So what this does is to try to minimize collisions (and thus improve performance) in the presence of poorly implemented hashCode methods.
Related
Map<String, Integer> map = new HashMap<>();
map.put("Naveen", 100);
System.out.println("Naveen".hashCode());
/* output: -1968696341, so index = (-1968696341 & 15) = 11,
   but in the NetBeans 8.2 / JDK 1.8 debugger the hash code is -1968662205,
   so the index = (-1968662205 & 15) = 3
*/
Where is the problem? My environment is NetBeans 8.2 with JDK 1.8.
The actual hashCode of the string "Naveen" is indeed -1968696341, and it must always be so by specification (despite comments to the contrary).
The HashMap implementation doesn't use the key's hashCode value directly. Instead, it "spreads" the bits using the formula h ^ (h >>> 16) in order to use the high-order bits to help reduce collisions. If you apply this formula to the string's hashCode, the result is -1968662205 which matches what you see in the debugger.
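You can verify both numbers (and both bucket indices) from the question with a few lines:
public class SpreadCheck {
    public static void main(String[] args) {
        int h = "Naveen".hashCode();
        System.out.println(h);               // -1968696341
        int spread = h ^ (h >>> 16);         // what HashMap.hash() computes in JDK 8
        System.out.println(spread);          // -1968662205, the value seen in the debugger
        System.out.println(h & 15);          // 11: the bucket if the raw hashCode were used
        System.out.println(spread & 15);     // 3: the bucket actually used (default table size 16)
    }
}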
The JDK 8 code for this is here, along with an explanation in a comment, quoted here for convenience.
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower. Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.) So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
I read the explanation, but I could not understand what we achieve by XORing the hashCode. Can anyone give an example?
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
This code was taken from the HashMap source code. I just wanted to know why they used XOR; Marko has properly replied that the HashMap implementation uses the lower-end bits. I think it is not only HashMap: other collections will be doing the same, which is why I did not mention any collection by name. I don't understand why people vote this question down.
This is a typical maneuver to protect against "bad" hash codes: those whose lower-end bits are not variable enough. Java's HashMap implementation relies only on the lower-end bits of the hash code to select the bucket.
However, this code's motivation has expired long ago: HashMap already does its own bit spreading, so there is no point in applying it yourself. It would make sense when used with Hashtable, but of course no code written since the year 2000 should ever use Hashtable.
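To picture what "lower-end bits not variable enough" means, here is a hypothetical key class (illustration only, not from any library) whose hash codes vary only in their top 16 bits:
// With these keys, hashCode() & 15 is 0 for every instance, so without any
// spreading all entries would pile into bucket 0 of a 16-bucket table.
// After h ^ (h >>> 16), the varying high bits reach the low bits and the
// keys spread across buckets.
final class TopBitsOnlyKey {
    private final int id;

    TopBitsOnlyKey(int id) { this.id = id; }

    @Override
    public int hashCode() { return id << 16; }   // only the top 16 bits vary

    @Override
    public boolean equals(Object o) {
        return o instanceof TopBitsOnlyKey && ((TopBitsOnlyKey) o).id == id;
    }
}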
The code below is from the OpenJDK Java HashMap source code (HashMap.java):
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower. Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.) So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
The XOR is there to make the hash result more evenly distributed: it folds the higher bits of the hash code into the lower bits, which are the ones used to pick a bucket.
I am reading the code of the HashMap class provided by the Java 1.6 API and am unable to fully understand the need for the following operation (found in the body of the put and get methods):
int hash = hash(key.hashCode());
where the method hash() has the following body:
private static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
This effectively recalculates the hash by executing bit operations on the supplied hashcode. I'm unable to understand the need to do so even though the API states it as follows:
This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.
I do understand that the key-value pairs are stored in an array of data structures, and that the index location of an item in this array is determined by its hash.
What I fail to understand is how this function would add any value to the hash distribution.
As Helper wrote, it is there just in case the existing hash function for the key objects is faulty and does not do a good-enough job of mixing the lower bits. According to the source quoted by pgras,
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
The hash is being ANDed with length - 1, where length is a power of two (therefore, length - 1 is guaranteed to be a sequence of 1s). Because of this ANDing, only the lower bits of h are used; the rest of h is ignored. Imagine that, for whatever reason, the original hash only returns numbers divisible by 2. If you used it directly, the odd-numbered buckets of the hashmap would never be used, leading to a 2x increase in the number of collisions. In a truly pathological case, a bad hash function can make a hashmap behave more like a list than like an O(1) container.
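A quick sketch of that "only even hash codes" scenario (my own illustration, not JDK code):
public class EvenHashDemo {
    public static void main(String[] args) {
        int buckets = 16;
        boolean[] used = new boolean[buckets];
        for (int h = 0; h < 1000; h += 2) {       // hash codes that are all divisible by 2
            used[h & (buckets - 1)] = true;
        }
        for (int i = 0; i < buckets; i++) {
            System.out.println("bucket " + i + " used: " + used[i]);  // every odd bucket prints false
        }
    }
}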
Sun engineers must have run tests that show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits. Under these circumstances, the bit operations in HashMap's hash(int h) can provide a net improvement over most expected use-cases (due to lower collision rates), even though extra computation is required.
I read somewhere that this is done to ensure a good distribution even if your hashCode implementation, well, err, sucks.
As you know, the underlying implementation of HashMap is a hash table, specifically one that resolves collisions by chaining entries within buckets. The load factor determines the maximum ratio of objects in the collection to the total number of buckets before the table is resized.
Let's say you keep adding more elements. Each time you do so, and it's not an update, the map runs the object's hashCode method and reduces the result by the number of buckets (a modulo-style operation; in HashMap's case a bit mask, since the table length is a power of two) to decide which bucket the object should go in.
As n (the number of elements in the collection) / m (the number of buckets) gets larger, your performance for reads and writes gets worse and worse.
Even assuming your hashCode algorithm is amazing, performance is still contingent on this ratio n/m.
Rehashing is also used to change the number of buckets while still keeping the same load factor with which the collection was constructed.
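As a rough numeric sketch of how capacity, load factor and resizing interact (using HashMap's documented defaults; the exact trigger point differs slightly between JDK versions):
public class LoadFactorDemo {
    public static void main(String[] args) {
        int capacity = 16;           // HashMap's default initial capacity
        float loadFactor = 0.75f;    // HashMap's default load factor
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold);   // 12: around the 13th entry the table
                                         // doubles to 32 buckets, keeping n/m bounded
    }
}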
Remember, the main benefit of any hash implementation is the ideal O(1) performance for reads and writes.
As you know, Object.hashCode() can be overridden by users, so a really bad implementation can produce non-random low-order bits. That would tend to crowd some of the buckets and would leave many buckets unfilled.
I just created a visual map of what they are trying to do in hash(). The hash(int h) method scrambles the supplied hash code through bit-level manipulation so that the resulting values are more uniformly distributed (and hence land in buckets more uniformly).
Expanding the two XOR stages, each bit of the result ends up being the XOR of up to nine bits of the original hash code. Writing bit(i) for bit i of the input (bit 0 being the least significant), bit i of the result is:
bit(i) ^ bit(i+4) ^ bit(i+7) ^ bit(i+12) ^ bit(i+16) ^ bit(i+19) ^ bit(i+20) ^ bit(i+24) ^ bit(i+27)
where any term beyond bit 31 simply drops out. For example, bit 0 of the result mixes together input bits 0, 4, 7, 12, 16, 19, 20, 24 and 27.
As you can see, each low-order bit of the result depends on a wide spread of higher-order bits, so hash codes that differ only in their upper bits are no longer going to crowd any particular bucket. Hope this helps. Send me an email if you need the full visual.
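If you want to double-check that expansion yourself, the function is linear over XOR, so feeding it values with a single bit set shows exactly which input bits reach bit 0 of the result (my own check, not JDK code):
public class BitInfluence {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (int bit = 0; bit < 32; bit++) {
            if ((hash(1 << bit) & 1) != 0) {     // does input bit 'bit' flip result bit 0?
                System.out.print(bit + " ");     // prints: 0 4 7 12 16 19 20 24 27
            }
        }
    }
}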
I am writing my own implementation of HashMap in Java. I use open addressing for collision resolution. For better key distribution I want to use a good hash function for the key's int hash code. I don't know which hash function is best for this purpose.
public int getIndex(K key) { return hash(key.hashCode()) % capacity; }
I need a hash function for the key's hash code.
Any hash that distributes the values you're expecting to receive evenly is a good hash function.
Your goal is to maximize performance (well, maximize performance while maintaining correctness). The primary concern there is to minimize bucket collisions. This means that the ideal hash is tailored to your input data - if you know what you'll receive, you can choose the hash that produces a minimal number of collisions and maybe even a cache-optimal access pattern.
However, that's not usually a realistic option, so you just choose a hash whose output is unbiased and unpredictable (one that behaves like a pseudorandom number generator, but deterministic). Functions in the "murmur" hash family are examples of this.
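For reference, this is the 32-bit finalizer ("fmix32") from MurmurHash3, shown as an illustration of that kind of mixing; the constants are MurmurHash3's published ones:
public final class Mixers {
    // MurmurHash3's fmix32 finalizer: a stronger mix than a couple of
    // shift-XORs, at the cost of two integer multiplications.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}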
The main problem with using % capacity is that it can return negative as well as positive values, since hashCode() itself can be negative and Java's % keeps the sign of the dividend.
HashMap avoids this issue by using a power-of-two capacity and the following approach:
public int getIndex(K key) { return hash(key.hashCode()) & (capacity-1); }
If the capacity is not a power of 2, you can ignore the sign bit (which is often not so random anyway):
public int getIndex(K key) { return (hash(key.hashCode()) & 0x7FFFFFFF) % capacity; }
The hash function actually used can matter. HashMap uses the following
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
I would use this unless you have a good reason not to, e.g. for security reasons. If you have a service which could be the subject of a denial-of-service attack, you will want to use a different hash to prevent a malicious user from turning your HashMap into a LinkedList. Unfortunately you still have to use a different hashCode() as well, since an attacker can create a long list of Strings that already share the same underlying hash code, so scrambling it afterwards is too late.
Here is a list of strings which all have a hashCode() of 0; there is nothing a hash() function can do about that.
Why doesn't String's hashCode() cache 0?
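Those zero-hash strings aside, it is easy to see how colliding String keys can be mass-produced: "Aa" and "BB" share a hash code, and so does every string built by concatenating such blocks (a quick illustration of my own, not tied to any library):
public class StringCollisions {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        // 2^3 = 8 strings of length 6, all with the same hash code:
        String[] blocks = {"Aa", "BB"};
        for (String a : blocks)
            for (String b : blocks)
                for (String c : blocks)
                    System.out.println(a + b + c + " -> " + (a + b + c).hashCode());
    }
}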
Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where a lot of collisions can take place? Of course, if you override the hashCode for an object and simply return a constant value, collisions are sure to occur. I'm not talking about that. I want to know in what situations, other than the one previously mentioned, a huge number of collisions can occur without modifying the default hashCode implementation.
I have created a project to benchmark these sorts of things: http://code.google.com/p/hashingbench/ (for hashtables with chaining, open addressing and Bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following: scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or, when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & (N - 1), which is faster. ("compress" is the name Goodrich and Tamassia used to describe this step, in Data Structures and Algorithms in Java.) Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects having the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits, which were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, HashMap's smearing function tries its best to have all bits affect the low-order bits.)
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and their usage as keys with the values 1.0, 2.0, 3.0, 4.0, ..., all of which have the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
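You can see how ugly that case is with a few lines (assuming a small table of 16 buckets; Float.hashCode() is just the raw IEEE 754 bit pattern):
public class FloatKeys {
    public static void main(String[] args) {
        for (float f = 1.0f; f <= 8.0f; f += 1.0f) {
            int h = Float.valueOf(f).hashCode();   // same as Float.floatToIntBits(f)
            // The low-order bits are all zero, so without smearing every key
            // would land in bucket (h & 15) == 0 of a 16-bucket table.
            System.out.printf("%.1f hash=%08x bucket=%d%n", f, h, h & 15);
        }
    }
}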
Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
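A quick check of the values listed above:
public class LongCollisions {
    public static void main(String[] args) {
        long[] values = {0L, 1L | (1L << 32), 2L | (2L << 32), 3L | (3L << 32)};
        for (long v : values) {
            // Long.hashCode() XORs the two 32-bit halves, which cancel out here.
            System.out.println(v + " -> " + Long.valueOf(v).hashCode());   // 0 every time
        }
    }
}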
Obviously any hash where there are more than 2^32 possible input values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
The other two answers I see are good IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a large number of objects of your class, put them into a particular HashMap implementation as keys, and test CPU and memory load. 1 or 2 million entries are a good number to measure with, but you get the best results if you test with your anticipated Map sizes.
I recently looked at a class whose hashing function I doubted, so I decided to fill a HashMap with random objects of that type and test the number of collisions. I tested two hashCode() implementations of the class under investigation. To do so I wrote, in Groovy, the class you see at the bottom, which extends the OpenJDK implementation of HashMap and counts the number of collisions in the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash, but collisions in the array holding the entries. The array index is calculated as hash & (length - 1), which means that the smaller this array is, the more collisions you get. And the size of this array depends on the initialCapacity and loadFactor of the HashMap (it can increase when you put() more data).
In the end, though, I decided that looking at these numbers makes little sense. The fact that HashMap is slower with a bad hashCode() method means that simply benchmarking insertion and retrieval of data from the Map effectively tells you which hashCode() implementation is better.
public class TestHashMap extends HashMap {

    public TestHashMap(int size) {
        super(size);
    }

    public TestHashMap() {
        super();
    }

    public int countCollidingEntries() {
        // Grab HashMap's internal "table" field (the bucket array) via reflection.
        def fs = this.getClass().getSuperclass().getDeclaredFields();
        def table;
        def count = 0;
        for (java.lang.reflect.Field field : fs) {
            if (field.getName() == "table") {
                field.setAccessible(true);
                table = field.get(this);   // read the superclass field from this instance
                break;
            }
        }
        // Walk each bucket's chain; every entry after the first one is a collision.
        for (Object e : table) {
            if (e != null) {
                while (e.next != null) {
                    count++;
                    e = e.next;
                }
            }
        }
        return count;
    }
}