hash function is important in implementing hash table. I know that in java
Object has its hash code, which might be generated from weak hash function.
Following is one snippet that is "supplement hash function"
static int hash(Object x) {
int h = x.hashCode();
h += ~(h << 9);
h ^= (h >>> 14);
h += (h << 4);
h ^= (h >>> 10);
return h;
}
Can anybody help to explain what is the fundamental idea of a hash algorithm
? to generate non-duplicate integer? If so, how does these bitwise
operations make it?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
Using more "human" language object hash is a short and compact value based on object's properties. That is if you have two objects that vary somehow - you can expect their hash values to be different. Good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you started with a poor hashcode implementation but then improve hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
It could be anything you want as long as you adhere to the general contract described in the doc, which in my own words are:
If you call 100 ( N ) times hashCode on an object, all the times must return the same value, at least during that program execution( subsequent program execution may return a different one )
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may be true, but it helps it is not.
And that's it.
Depending on the nature of your class, the hashCode() e may be very complex or very simple. For instance the String class which may have millions of instances needs a very goo hashCode implementation, and use prime numbers to reduce the poisibility of collisions.
If for your class it does make sense to have a consecutive number, that's ok too, there is no reason why you should complicate it every time.
Related
From a given string I am generating 32 digit unique hash code using MD5
MessageDigest.getInstance("MD5")
.digest("SOME-BIG-STRING").map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?
Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability?
Security is not important, uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case a injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?
Yes.
You increase the probability of a collision by a factor of 28.
Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability ?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
Security is not important, uniqueness is required ?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason that MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to find a second input for a given MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto strength hash function that generates N bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2N. For large enough values of N, the probability is very small. But it is never zero.
I am looking for ways to compute a unique hash for a given String in Java. Looks like I cannot use MD5 or SHA1 because folks claim that they are broken and do not always guarantee uniqueness.
I should get the same hash (preferably a 32 character string like the MD5 Sum) for two String objects which are equal by the equals() method. And no other String should generate this hash - that's the tricky part.
Is there a way to achieve this in Java?
If guaranteed unique hash code is required then it is not possible (possible theoretically but not practically). Hashes and hash codes are non-unique.
A Java String of length N has 65536 ^ N possible states, and requires
an integer with 16 * N bits to represent all possible values. If you
write a hash function that produces integer with a smaller range (e.g.
less than 16 * N bits), you will eventually find cases where more than
one String hashes to the same integer; i.e. the hash codes cannot be
unique. This is called the Pigeonhole Principle, and there is a
straight forward mathematical proof. (You can't fight math and win!)
But if "probably unique" with a very small chance of non-uniqueness is
acceptable, then crypto hashes are a good answer. The math will tell
you how big (i.e. how many bits) the hash has to be to achieve a given
(low enough) probability of non-uniqueness.
Updated : check this another good answer : What is a good 64bit hash function in Java for textual strings?
I ran into the needs of k pair-wise independent hash functions, each takes as input an integer, and produces a hash value in the range of 0-N. Need this for count-min sketch, which is similar to Bloom filter.
Formally, I need h_1,h_2, ..., h_k hash functions, pair-wise independent.
(h_i(n) mod N ) will give the hash value of n in the range of 0-N. The hashing needs to be time-efficient as I am working with a large set of data. At the same time, they should be as pair-wise independent as possible.
What I have tried so far:
1) xxhash: It is efficient, but it is not good in terms of pair-wise independent, meaning there are hash collisions between hash functions (meaning h1(n1)=h1(n2) then some h_k(n1) also = h_k(n2)) and the result i got was bad due to this.
2) Similarly, the famous integer hashing method ((a*n+b) mod p) mod N also has the same problem as xxhash. I believe this is called Universal hashing
3) The other one introduced in count-min-sketch produces quite good results, but takes too much time for a large input.
4) Also tried Murmur3, sha1 with similar problems in collisions.
Any idea would be greatly appreciated. C/C++ preferred, but Java would also be fine, or simply algorithm.
Thanks
I suspect that your problem with method 2 is that you tossed a_i and b_i that were correlated.
Work in large field (somewhere near 2^64) and for starters make sure all a_i and b_i are different (i.e., you get 2*k different numbers). If they are uniformly distributed inside the field this also wouldn't hurt :)
You might encountered the same problem in method 4 with SHA. Most cryptographic hash functions (including even the broken and older ones) are much more than enough for data structures needs, be it k-wise independence for any reasonable k, or almost any other property.
I would recheck - how you used it?
My question is not about double hashing technique http://en.wikipedia.org/wiki/Double_hashing , which is a way to resolve collisions. It is about handling existing collisions in hash table of strings. Say, we have a collision: several strings in the same bucket, so now we must go through the bucket checking the strings. It seems it would make sense to calculate another hash function for fast string comparison (compare hash values for quick rejection). The hash key could be lazily computed and saved with the string. Did you use such technique? Could you provide a reference? If not, do you think it's not worth doing since perfomance gain is questionable? Some notes:
I put tag "Java" since I did measurements in Java: String.hashCode() in most cases outperforms String.equals() (and BTW greatly outperforms manual hash code calculation: hashCode = 31 * hashCode + strInTable.charAt(i));
Of course, the same could be asked about any string comparison, not necessarily strings in a hash table. But I am considering a specific situation with huge amount of strings which are kept in hash table.
This probably makes sense if the strings in the bucket are somewhat similar (like in Rabin-Karp algorithm). Looking for your opinion in general situation.
Many hash-based collections store the hash value of each item in the collection, on the premise that since every item's hash will have been computed when it was added to the collection, and code which is looking for an item in a hashed collection will have to know its hash, comparing hash values will be a quick and easy way of reducing the cost of false hits. For example, if one has a 16-bucket hash-table that contains four strings of 1,000 characters each, and will be searching for a lot of 1,000-character strings which match one of the table entries in all but the last few characters, more than 6% of of searches will hit on a bucket that contains a near-match string, but a much smaller fraction will hit a bucket that contains a string whose 32-bit hashCode matches that of the string being sought. Since comparisons of nearly-identical strings are expensive, comparing full 32-bit hash codes is helpful.
If one has large immutable collections which may need to be stored in hash tables and matched against other such collections, there may be some value in having such collections compute and cache longer hash functions, and having their equals methods compare the results of those longer hash functions before proceeding further. In such cases, computing a longer hash function will often be almost as fast as computing a shorter one. Further, not only will comparisons on the longer hash code greatly reduce the risks that false positives will cause needless "deep" comparisons, but computing longer hash functions and combining them into the reported hashCode() may greatly reduce the dangers of strongly-correlated hash collisions.
Comparing a hash only makes sense if the number of comparisons (lookups) is large compared to the number of entries. You would need a large hash (32 bits are not enough; you'd want at least 128 bits), and that would be expensive to calculate. You would want to amortize the cost of hashing each string over a large number of probes into the buckets.
As to whether it's worth it or not, it's highly context dependent. The only way to find out is to actually do it with your data and compare the performance of both methods.
Here is my situation. I am using two java.util.HashMap to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries into each Hashmap. The keys will be strings, and ints respectively.
My question is, what is the best way to set the initial capacity and loadfactor?
Should I set the capacity equal to the number of elements it will have and the load capacity to 1.0? I would like the absolute best performance without using too much memory. I am afraid however, that the table would not fill optimally. With a table of the exact size needed, won't there be key collision, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better datastructure than a hashmap for this, I am completely open to that as well.
In the absence of a perfect hashing function for your data, and assuming that this is really not a micro-optimization of something that really doesn't matter, I would try the following:
Assume the default load capacity (.75) used by HashMap is a good value in most situations. That being the case, you can use it, and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold - set it so that initial-capacity x .75 = number of items (round up).
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long strings, in large maps, you can save space, and some time, by using a more string-oriented data structure, such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to recalculate the hash values for every key in the table.
Leave the load factor at 0.75. The value of 0.75 has been chosen empirically as a good compromise between hash lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly.
If you want to dig into the mathematics of hash table behaviour: Donald Knuth (1998). The Art of Computer Programming'. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
I find it best not to fiddle around with default settings unless I really really need to.
Hotspot does a great job of doing optimizations for you.
In any case; I would use a profiler (Say Netbeans Profiler) to measure the problem first.
We routinely store maps with 10000s of elements and if you have a good equals and hashcode implementation (and strings and Integers do!) this will be better than any load changes you may make.
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys
It's not. From HashMap.java:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
I'm not even going to pretend I understand that, but it looks like that's designed to handle just that situation.
Note also that the number of buckets is also always a power of 2, no matter what size you ask for.
Entries are allocated to buckets in a random-like way. So even if you as many buckets as entries, some of the buckets will have collisions.
If you have more buckets, you'll have fewer collisions. However, more buckets means spreading out in memory and therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.
As ever, it's probably worth profiling before you get hung up on microtuning these things.