Change to HashMap hash function in Java 8 - java

In java 8 java.util.Hashmap I noticed a change from:
static int hash(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
to:
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
It appears from the code that the new function is a simpler XOR of the lower 16 bits with the upper 16 leaving the upper 16 bits unchanged, as opposed to several different shifts in the previous implementation, and from the comments that this is less effective at allocating the results of hash functions with a high number of collisions in lower bits to different buckets, but saves CPU cycles by having to do less operations.
The only thing I saw in the release notes was the change from linked lists to balanced trees to store colliding keys (which I thought might have changed the amount of time it made sense to spend calculating a good hash), I was specifically interested in seeing if there was any expected performance impact from this change on large hash maps. Is there any information about this change, or does anyone with a better knowledge of hash functions have an idea of what the implications of this change might be (if any, perhaps I just misunderstood the code) and if there was any need to generate hash codes in a different way to maintain performance when moving to Java 8?

As you noted: there is a significant performance improvement in HashMap in Java 8 as described in JEP-180. Basically, if a hash chain goes over a certain size, the HashMap will (where possible) replace it with a balanced binary tree. This makes the "worst case" behaviour of various operations O(log N) instead of O(N).
This doesn't directly explain the change to hash. However, I would hypothesize that the optimization in JEP-180 means that the performance hit due to a poorly distributed hash function is less important, and that the cost-benefit analysis for the hash method changes; i.e. the more complex version is less beneficial on average. (Bear in mind that when the key type's hashcode method generates high quality codes, then gymnastics in the complex version of the hash method are a waste of time.)
But this is only a theory. The real rationale for the hash change is most likely Oracle confidential.

When I ran hash implementation diffences I see time difference in nano seconds as below (not great but can have some effect when the size is huge ~1million+)-
7473 ns – java 7
3981 ns– java 8
If we are talking about well formed keys and hashmap of big size (~million), this might have some impact and this is because of simplified logic.

As Java documentation says that idea is to handle the situation where old implementation of Linked list performs O(n) instead of O(1). This happens when same hash code is generated for large set of data being inserted in HashMap.
This is not normal scenario though.
To handle a situation that once the number of items in a hash bucket grows beyond a certain threshold, that bucket will switch from using a linked list of entries to a binary tree. In the case of high hash collisions, this will improve search performance from O(n) to O(log n) which is much better and solves the problem of performance.

Related

Why is time complexity of get() in HashMap O(1) [duplicate]

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was bought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that unlike, say, balanced trees, its behavior is probabilistic. In these cases its usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring would be. For a hash map, that of course is the case of a collision with respect to how full the map happens to be. A collision is pretty easy to estimate.
pcollision = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k.
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
pcollision x 2 = (n / capacity)2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalzie this to
pcollision x k = (n / capacity)k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash-map has O(1) access with high probability
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how HashMap works?
Using hashCode to locate the corresponding bucket [inside buckets container model].
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that o(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having o(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as TreeMaps once they exceed a threshold, which makes the actual time O(log n).
O(1+n/k) where k is the number of buckets.
If implementation sets k = n/alpha then it is O(1+alpha) = O(1) since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked list (node), each linked list in the array represents a bucket for unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses same way to determine the location of bucket for the key. Under the best case each key has unique hashcode and results in a unique bucket for each key, in this case the get method spends time only to determine the bucket location and retrieving the value which is constant O(1).
Under the worst case, all the keys have same hashcode and stored in same bucket, this results in traversing through the entire list which leads to O(n).
In the case of java 8, the Linked List bucket is replaced with a TreeMap if the size grows to more than 8, this reduces the worst case search efficiency to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst case scenario happens where every data element is hashed to the same value (poor choice of the hash function for example). In that case, data lookup is no different from a linear search on a linked list i.e. O(n). However, the probability of that happening is negligible and lookups best and average cases remain constant i.e. O(1).
Only in theoretical case, when hashcodes are always different and bucket for every hash code is also different, the O(1) will exist. Otherwise, it is of constant order i.e. on increment of hashmap, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend based on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have a very good performance (this is not strictly O(1) in every possible case but it is in most cases).
For example the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change - but it also disables biased locking, but that's an other discussion) so the chance of collisions is very low.

what are the options for obtaining k pair-wise independent hash functions that are fast

I ran into the needs of k pair-wise independent hash functions, each takes as input an integer, and produces a hash value in the range of 0-N. Need this for count-min sketch, which is similar to Bloom filter.
Formally, I need h_1,h_2, ..., h_k hash functions, pair-wise independent.
(h_i(n) mod N ) will give the hash value of n in the range of 0-N. The hashing needs to be time-efficient as I am working with a large set of data. At the same time, they should be as pair-wise independent as possible.
What I have tried so far:
1) xxhash: It is efficient, but it is not good in terms of pair-wise independent, meaning there are hash collisions between hash functions (meaning h1(n1)=h1(n2) then some h_k(n1) also = h_k(n2)) and the result i got was bad due to this.
2) Similarly, the famous integer hashing method ((a*n+b) mod p) mod N also has the same problem as xxhash. I believe this is called Universal hashing
3) The other one introduced in count-min-sketch produces quite good results, but takes too much time for a large input.
4) Also tried Murmur3, sha1 with similar problems in collisions.
Any idea would be greatly appreciated. C/C++ preferred, but Java would also be fine, or simply algorithm.
Thanks
I suspect that your problem with method 2 is that you tossed a_i and b_i that were correlated.
Work in large field (somewhere near 2^64) and for starters make sure all a_i and b_i are different (i.e., you get 2*k different numbers). If they are uniformly distributed inside the field this also wouldn't hurt :)
You might encountered the same problem in method 4 with SHA. Most cryptographic hash functions (including even the broken and older ones) are much more than enough for data structures needs, be it k-wise independence for any reasonable k, or almost any other property.
I would recheck - how you used it?

Java HashMap detect collision

Is there a way to detect collision in Java Hash-map ? Can any one point out some situation's where lot of collision's can take place. Of-course if you override the hashcode for an object and simply return a constant value collision is sure to occur.I'm not talking about that.I want to know in what all situations other that the previously mentioned do huge number of collisions occur without modifying the default hashcode implementation.
I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & N (faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in the Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits which were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024) *, etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously any hash where there are more than 232 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
The other two answers I see a good IMO but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a big number of objects from your class, put them in the particular HashMap implementation as the key and test CPU and memory load. 1 or 2 million entries are a good number to measure but you get best results if you test with your anticipated Map sizes.
I just looked at a class that I doubted its hashing function. So I decided to fill in a HashMap with random objects of that type and test number of collisions. I tested two hashCode() implementations of the class under investigation. So I wrote in groovy the class you see at the bottom extending openjdk implementation of HashMap to count number of collisions into the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash but collisions in the array holding the entries. Array index is calculated as hash & (length-1) which means that as short the size of this array is, the more collisions you get. And size of this array depends on initialCapacity and loadFactor of the HashMap (it can increase when put() more data).
At the end though I considered that looking at these numbers does little sense. The fact that HashMap is slower with bad hashCode() method means that by just benchmarking insertion and retrieval of data from the Map you effectively know which hashCode() implementation is better.
public class TestHashMap extends HashMap {
public TestHashMap(int size) {
super(size);
}
public TestHashMap() {
super();
}
public int countCollidingEntries() {
def fs = this.getClass().getSuperclass().getDeclaredFields();
def table;
def count =0 ;
for ( java.lang.reflect.Field field: fs ) {
if (field.getName() == "table") {
field.setAccessible(true);
table = field.get(super);
break;
}
}
for(Object e: table) {
if (e != null) {
while (e.next != null) {
count++
e = e.next;
}
}
}
return count;
}
}

understanding of hash code

hash function is important in implementing hash table. I know that in java
Object has its hash code, which might be generated from weak hash function.
Following is one snippet that is "supplement hash function"
static int hash(Object x) {
int h = x.hashCode();
h += ~(h << 9);
h ^= (h >>> 14);
h += (h << 4);
h ^= (h >>> 10);
return h;
}
Can anybody help to explain what is the fundamental idea of a hash algorithm
? to generate non-duplicate integer? If so, how does these bitwise
operations make it?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
Using more "human" language object hash is a short and compact value based on object's properties. That is if you have two objects that vary somehow - you can expect their hash values to be different. Good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you started with a poor hashcode implementation but then improve hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
It could be anything you want as long as you adhere to the general contract described in the doc, which in my own words are:
If you call 100 ( N ) times hashCode on an object, all the times must return the same value, at least during that program execution( subsequent program execution may return a different one )
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may be true, but it helps it is not.
And that's it.
Depending on the nature of your class, the hashCode() e may be very complex or very simple. For instance the String class which may have millions of instances needs a very goo hashCode implementation, and use prime numbers to reduce the poisibility of collisions.
If for your class it does make sense to have a consecutive number, that's ok too, there is no reason why you should complicate it every time.

Performance of HashMap with different initial capacity and load factor

Here is my situation. I am using two java.util.HashMap to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries into each Hashmap. The keys will be strings, and ints respectively.
My question is, what is the best way to set the initial capacity and loadfactor?
Should I set the capacity equal to the number of elements it will have and the load capacity to 1.0? I would like the absolute best performance without using too much memory. I am afraid however, that the table would not fill optimally. With a table of the exact size needed, won't there be key collision, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better datastructure than a hashmap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
Assume the default load capacity (.75) used by HashMap is a good value in most situations. That being the case, you can use it, and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold - set it so that initial-capacity x .75 = number of items (round up).
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to recalculate the hash values for every key in the table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming'. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case; I would use a profiler (Say Netbeans Profiler) to measure the problem first.
We routinely store maps with 10000s of elements and if you have a good equals and hashcode implementation (and strings and Integers do!) this will be better than any load changes you may make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is also always a power of 2, no matter what size you ask for.
Entries are allocated to buckets in a random-like way. So even if you as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets means spreading out in memory and therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.

Categories

Resources