Here is my situation. I am using two java.util.HashMaps to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries that will go into each HashMap. The keys will be Strings and ints, respectively.
My question is: what is the best way to set the initial capacity and load factor?
Should I set the capacity equal to the number of elements it will hold and the load factor to 1.0? I would like the absolute best performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of the exact size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better data structure than a HashMap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is not just a micro-optimization of something that doesn't really matter, I would try the following:
The default load factor (0.75) used by HashMap is a good value in most situations. That being the case, you can keep it and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity × 0.75 = number of items (round up).
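For example (a minimal sketch; the 1000-entry figure is just an illustration), sizing the map up front so it never needs to resize looks something like this:

import java.util.HashMap;
import java.util.Map;

public class PresizedMapExample {
    public static void main(String[] args) {
        // Hypothetical figure: we expect to store 1000 entries.
        int expectedEntries = 1000;

        // Size the map so expectedEntries fits under the default 0.75 load
        // factor without triggering a resize: capacity * 0.75 >= entries.
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);

        Map<String, Integer> cache = new HashMap<>(initialCapacity);
        // ... populate and use the map as usual ...
        cache.put("example", 42);
        System.out.println(cache.get("example"));
    }
}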
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest some sort of trie rather than a hash map. For long string keys in large maps, you can save space, and some time, by using a more string-oriented data structure such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to recalculate the hash values for every key in the table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case, I would use a profiler (say, the NetBeans Profiler) to measure the problem first.
We routinely store maps with tens of thousands of elements, and if you have a good equals and hashCode implementation (and Strings and Integers do!) this will matter more than any load-factor changes you might make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is always a power of 2, no matter what size you ask for.
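As a rough illustration of that rounding (this is a sketch of the idea, not the actual JDK code, and the method name is made up):

public class NextPowerOfTwoSketch {
    // Round a requested capacity up to the next power of two,
    // which is what HashMap effectively does internally.
    static int nextPowerOfTwo(int requested) {
        int n = requested - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : n + 1;
    }

    public static void main(String[] args) {
        // Asking for 100 buckets actually gives you 128.
        System.out.println(nextPowerOfTwo(100)); // prints 128
    }
}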
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets means the table is more spread out in memory and therefore slower. Generally a load factor in the range 0.7–0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.
Related
I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate.
p(collision) = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
p(collision × 2) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to
p(collision × k) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
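To put rough numbers on that (a small sketch with made-up values for n and capacity):

public class CollisionProbabilitySketch {
    public static void main(String[] args) {
        // Illustrative values only: a fairly full table.
        double n = 75.0;
        double capacity = 100.0;

        // (n / capacity)^k, the rough estimate used above for the chance
        // of having to step past k colliding entries.
        for (int k = 1; k <= 5; k++) {
            System.out.printf("k=%d  p ~= %.4f%n", k, Math.pow(n / capacity, k));
        }
        // Prints roughly 0.75, 0.56, 0.42, 0.32, 0.24 -- each extra
        // collision we tolerate makes the remaining "bad" case less likely.
    }
}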
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does a HashMap work?
Using hashCode to locate the corresponding bucket (within the internal array of buckets).
Each bucket is a linked list (or a balanced red-black tree under some conditions, starting from Java 8) of the items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
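If it helps to see those steps in code, here is a stripped-down sketch of a chained hash map (illustrative only, not the real java.util.HashMap source):

import java.util.LinkedList;
import java.util.List;

// Simplified chained hash map illustrating the lookup steps above.
public class SimpleChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final List<Entry<K, V>>[] buckets = new List[16];

    private int bucketIndex(K key) {
        // hashCode locates the bucket; masking works because 16 is a power of two.
        return key.hashCode() & (buckets.length - 1);
    }

    public V get(K key) {
        List<Entry<K, V>> bucket = buckets[bucketIndex(key)];
        if (bucket == null) return null;
        // Scan the (usually very short) chain using equals.
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public void put(K key, V value) {
        int i = bucketIndex(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry<>(key, value));
        // A real implementation would also resize once a load threshold is hit.
    }
}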
Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n), because there's no rule that says the buckets have to be implemented as a linear list.
In fact, Java 8 converts large buckets into balanced trees once they exceed a threshold, which makes the actual worst-case time O(log n).
Lookup is O(1 + n/k), where k is the number of buckets.
If the implementation keeps k = n/α, then it is O(1 + α) = O(1), since α is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry to the HashMap, the hashCode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
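A quick sketch of that calculation (keep in mind that a real HashMap also mixes the hash code first, and its array length is always a power of two, so the mask would be something like 127 rather than 100):

public class BucketLocationSketch {
    public static void main(String[] args) {
        int hash = "ABC".hashCode();          // 64578
        int location = 100 & hash;            // bitwise AND, as in the example above
        System.out.println(location);         // prints 64

        // With a power-of-two table of length 128, the mask is length - 1:
        System.out.println((128 - 1) & hash); // prints 66, the bucket index in a 128-slot table
    }
}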
During the get operation it uses the same approach to determine the location of the bucket for the key. In the best case each key has a unique hashCode and lands in a unique bucket; in this case the get method spends time only on determining the bucket location and retrieving the value, which is constant, O(1).
In the worst case, all the keys have the same hashCode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a balanced tree if its size grows to more than 8 entries; this improves the worst-case search time from O(n) to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, so the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives the performance down towards O(n).
It depends on the algorithm you choose to resolve collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element hashes to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the best and average cases for lookup remain constant, i.e. O(1).
Strictly speaking, O(1) holds only in the theoretical case where hash codes are always different and every hash code gets its own bucket. Otherwise lookup is still of constant order in practice: as the HashMap grows, the order of the search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise).
Of course the performance of the HashMap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).
For example, the default implementation in the Oracle JRE uses a random number (which is stored in the object instance so that it doesn't change, though it also disables biased locking, but that's another discussion), so the chance of collisions is very low.
I am implementing a hashtable for educational purposes. The hashtable is implemented with an array, and collisions are handled with linked lists. The instructions say that I can insert identical items without checking, to increase the speed of insertion. But when a chain's length reaches the maximum allowed, the hashtable needs to be resized. I found that resizing is not going to help at all, because the same items still go to the same bucket even when the array length is increased. Did I miss something here? Thank you very much.
Let's take an example: three objects with hashcodes 7, 23 and 47.
If the hashtable is of size 8, then by modular arithmetic, all of those objects would go into hash bucket 7.
On the other hand, if the hashtable is of size 16, then the first two would go into hash bucket 7 while the other would go into bucket 15.
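A tiny sketch that reproduces those numbers:

public class ResizeDistributionSketch {
    public static void main(String[] args) {
        int[] hashes = {7, 23, 47};

        // Table of size 8: all three land in bucket 7.
        for (int h : hashes) {
            System.out.println("size 8  -> bucket " + (h % 8));
        }

        // Table of size 16: 7 and 23 stay in bucket 7, 47 moves to bucket 15.
        for (int h : hashes) {
            System.out.println("size 16 -> bucket " + (h % 16));
        }
    }
}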
The instructions say that I can insert identical items without checking, to increase the speed of insertion.
You can't skip checking completely, because you would end up with duplicates on the same chain.
But I found that resizing is not going to help at all, because the same items still go to the same bucket even when the array length is increased.
This would happen only for hash values below table size. For values above table size the % operator will often place the item in a different bucket, assuming that you avoid the aliasing problem.
In order to avoid aliasing, use table sizes corresponding to prime numbers. See this Q&A for additional information on this.
I can tell you how the JDK handles that. Your entries (keys) override hashCode, which is an int (32 bits). When you have 16 buckets (the internal array has a length of 16), the operation that is performed internally to find out where the entry will go is:
hash_code & (array.length - 1) // this is the same as a modulo operation
                               // when array.length is a power of two
That means that when you put an entry into the map, only the last 4 bits of its key's hash code are taken into account.
Now when you fill those 16 entries (or, if you implement a load factor, reach that threshold), the internal array is made bigger (let's double it), so now it has 32 slots.
This means that deciding where the entry will go is computed:
hash_code & (32 - 1) // now 5 bits are taken into consideration
All your entries are now re-hashed (because one more bit is considered now), and they might end up in different buckets this time.
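A small sketch of that effect (the hash code here is made up):

public class RehashOnResizeSketch {
    public static void main(String[] args) {
        // A made-up hash code whose fifth-lowest bit is set.
        int hash = 0b10110;   // 22

        // 16 buckets: only the last 4 bits matter.
        System.out.println(hash & (16 - 1)); // prints 6

        // 32 buckets after resizing: one more bit is considered,
        // so the same entry may land in a different bucket.
        System.out.println(hash & (32 - 1)); // prints 22
    }
}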
There are many UUIDs, each 128 bits long. I want to map every UUID to an integer and flag it at that position in a BitSet. But it seems 128 bits is too long.
How can I implement this mapping so that there are no collisions?
By using a BitSet indexed directly by UUID, you would need 2^128 bits, which is around 3.4 × 10^38.
Something that costs you less memory is possible. But if you want absolutely no collisions, it is impossible, simply by the pigeonhole principle.
But why do you want it to be "no collision"? For example, if you are going to use a HashMap, a relatively normal hash function plus pre-initializing the HashMap to the expected size is going to save you a lot of collisions. And even if there are some collisions, I don't think they will have a big impact on performance (unless the hashing method is really poorly done).
A workaround if your "add to bitset" is explicit (hence you do not need to determine if a UUID is already in the bit set):
Assuming you need to store status of 100,000,000 devices, you will need at least 100,000,000 bits.
By using a reasonable hashing algorithm, make up a 27-bit hash, and the hash will determine which bit to use to store the status. Hence you will need a bitmap of 2^27 = 134,217,728 bits ≈ 17 MB.
Have 2 BitSets of that size (costing you around 34 MB): one for keeping the status, one for keeping the "availability of bit".
Have an extra Map<UUID, Integer> as the "exceptional device bit" map.
For a new UUID, calculate that 27-bit hash. If the resulting bit is not occupied in "bitAvailabilityBitSet", turn it on.
For a new UUID, if the hash result is occupied in "bitAvailabilityBitSet", find the index of next unoccupied bit, turn that on in "bitAvailabilityBitSet", and add the UUID + index pair in "exceptional device bit" map.
Do the reverse on lookup/update: first check whether the UUID is in the "exceptional device bit" map; if so, use the index stored in the map for the lookup. If not, simply calculate the 27-bit hash and use it as the index.
Given a relatively good hashing algorithm, collisions should not be frequent, and hence the extra overhead of that "exceptional device bit" map should not be big. You may further adjust the size of the bitset to trade size off against the collision rate. A rough sketch of this scheme is shown below.
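Here is a rough sketch of that scheme in Java. The class, field, and method names are made up, the 27-bit hash is just one arbitrary way to mix a UUID down to 27 bits, and re-registering the same UUID twice is not handled (matching the "explicit add" assumption above):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class DeviceStatusBitmap {
    private static final int BITS = 1 << 27;           // ~134M slots (~17 MB per BitSet)

    private final BitSet status = new BitSet(BITS);     // the actual status flags
    private final BitSet occupied = new BitSet(BITS);   // "availability of bit"
    private final Map<UUID, Integer> exceptions = new HashMap<>(); // colliding UUIDs

    private int hash27(UUID id) {
        // Any reasonable 27-bit hash will do; this one just mixes the two halves.
        long mixed = id.getMostSignificantBits() ^ id.getLeastSignificantBits();
        return (int) (mixed & (BITS - 1));
    }

    /** Register a new UUID (assumed not seen before) and return its bit index. */
    public int register(UUID id) {
        int idx = hash27(id);
        if (!occupied.get(idx)) {
            occupied.set(idx);
            return idx;
        }
        // Collision: grab the next free bit and remember the mapping explicitly.
        int free = occupied.nextClearBit(idx);
        occupied.set(free);
        exceptions.put(id, free);
        return free;
    }

    private int indexOf(UUID id) {
        Integer special = exceptions.get(id);
        return (special != null) ? special : hash27(id);
    }

    public void setStatus(UUID id, boolean on) { status.set(indexOf(id), on); }

    public boolean getStatus(UUID id) { return status.get(indexOf(id)); }
}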
In Java 8's java.util.HashMap I noticed a change from:
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
to:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
It appears from the code that the new function is a simpler XOR of the lower 16 bits with the upper 16, leaving the upper 16 bits unchanged, as opposed to the several different shifts in the previous implementation. From the comments, it seems this is less effective at spreading hash codes that collide heavily in the lower bits across different buckets, but it saves CPU cycles by doing fewer operations.
The only thing I saw in the release notes was the change from linked lists to balanced trees for storing colliding keys (which I thought might have changed how much time it makes sense to spend calculating a good hash). I was specifically interested in seeing whether there is any expected performance impact from this change on large hash maps. Is there any information about this change? Does anyone with better knowledge of hash functions have an idea of what the implications might be (if any; perhaps I just misunderstood the code), and whether there is any need to generate hash codes differently to maintain performance when moving to Java 8?
As you noted: there is a significant performance improvement in HashMap in Java 8 as described in JEP-180. Basically, if a hash chain goes over a certain size, the HashMap will (where possible) replace it with a balanced binary tree. This makes the "worst case" behaviour of various operations O(log N) instead of O(N).
This doesn't directly explain the change to hash. However, I would hypothesize that the optimization in JEP-180 means that the performance hit due to a poorly distributed hash function is less important, and that the cost-benefit analysis for the hash method changes; i.e. the more complex version is less beneficial on average. (Bear in mind that when the key type's hashCode method generates high-quality codes, then the gymnastics in the complex version of the hash method are a waste of time.)
But this is only a theory. The real rationale for the hash change is most likely Oracle confidential.
When I benchmarked the two hash implementations, I saw the following time difference in nanoseconds (not dramatic, but it can have some effect when the map is huge, ~1 million+ entries):
7473 ns – Java 7
3981 ns – Java 8
If we are talking about well-formed keys and a HashMap of big size (~1 million entries), this might have some impact, and that is because of the simplified logic.
As the Java documentation says, the idea is to handle the situation where the old linked-list implementation performs at O(n) instead of O(1). This happens when the same hash code is generated for a large set of data being inserted into the HashMap.
This is not normal scenario though.
Once the number of items in a hash bucket grows beyond a certain threshold, that bucket switches from using a linked list of entries to a binary tree. In the case of heavy hash collisions, this improves search performance from O(n) to O(log n), which is much better and solves the performance problem.
A hash function is important in implementing a hash table. I know that in Java an Object has its own hash code, which might be generated by a weak hash function. The following snippet is a "supplemental hash function":
static int hash(Object x) {
    int h = x.hashCode();

    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}
Can anybody help explain the fundamental idea of a hash algorithm? Is it to generate a non-duplicate integer? If so, how do these bitwise operations accomplish that?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
In more "human" language, an object's hash is a short, compact value based on the object's properties. That is, if you have two objects that vary somehow, you can expect their hash values to be different. A good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
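A small sketch of that effect, using the simpler Java 8 style spread shown earlier on this page (the hash codes are made up so that they differ only in their upper bits):

public class HashSpreadSketch {
    public static void main(String[] args) {
        // Two made-up hash codes that differ only in their upper bits.
        int h1 = 0x10000; // 65536
        int h2 = 0x20000; // 131072

        int mask = 16 - 1; // a 16-bucket table looks only at the low 4 bits

        // Without spreading, both collapse into bucket 0.
        System.out.println((h1 & mask) + " " + (h2 & mask)); // prints: 0 0

        // With the Java 8 style spread, the high bits are folded into
        // the low bits first, so the two keys land in different buckets.
        System.out.println(((h1 ^ (h1 >>> 16)) & mask) + " "
                         + ((h2 ^ (h2 >>> 16)) & mask)); // prints: 1 2
    }
}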
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you start with a poor hashCode implementation but then improve the hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
It could be anything you want, as long as you adhere to the general contract described in the doc, which in my own words is:
If you call hashCode on an object 100 (N) times, every call must return the same value, at least during that program execution (a subsequent program execution may return a different one).
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may still be true, but it helps if it is not.
And that's it.
Depending on the nature of your class, the hashCode() may be very complex or very simple. For instance, the String class, which may have millions of instances, needs a very good hashCode implementation and uses prime numbers to reduce the possibility of collisions.
If for your class it makes sense to use a consecutive number, that's OK too; there is no reason to complicate it every time.
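For example, a minimal class that follows the contract above (the class and its fields are made up for illustration):

import java.util.Objects;

public class Ticket {
    private final int id;        // e.g. a consecutive number
    private final String owner;

    public Ticket(int id, String owner) {
        this.id = id;
        this.owner = owner;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Ticket)) return false;
        Ticket other = (Ticket) o;
        return id == other.id && Objects.equals(owner, other.owner);
    }

    @Override
    public int hashCode() {
        // Equal objects must produce equal hash codes; Objects.hash is
        // a simple, adequate choice for most classes.
        return Objects.hash(id, owner);
    }
}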