Bad idea to use String key in HashMap? - java

I understand that the String class's hashCode() method is not guaranteed to generate unique hash codes for distinct Strings. I see a lot of usage of String keys in HashMaps (using the default String hashCode() method). A lot of this usage could result in significant application issues if a map put displaced a HashMap entry that was previously put onto the map with a truly distinct String key.
What are the odds that you will run into the scenario where String.hashCode() returns the same value for distinct Strings? How do developers work around this issue when the key is a String?

Developers do not have to work around the issue of hash collisions in HashMap in order to achieve program correctness.
There are a couple of key things to understand here:
Collisions are an inherent feature of hashing, and they have to be. The number of possible values (Strings in your case, but it applies to other types as well) is vastly bigger than the range of integers.
Every usage of hashing has a way to handle collisions, and the Java Collections (including HashMap) is no exception.
Hashing is not involved in equality testing. It is true that equal objects must have equal hashcodes, but the reverse is not true: many values will have the same hashcode. So don't try using a hashcode comparison as a substitute for equality. Collections don't. They use hashing to select a sub-collection (called a bucket in the Java Collections world), but they use .equals() to actually check for equality.
Not only do you not have to worry about collisions causing incorrect results in a collection, but for most applications, you also *usually* don't have to worry about performance - Java hashed Collections do a pretty good job of managing hashcodes.
Better yet, for the case you asked about (Strings as keys), you don't even have to worry about the hashcodes themselves, because Java's String class generates a pretty good hashcode. So do most of the supplied Java classes.
Some more detail, if you want it:
The way hashing works (in particular, in the case of hashed collections like Java's HashMap, which is what you asked about) is this:
The HashMap stores the values you give it in a collection of sub-collections, called buckets. These are actually implemented as linked lists. There are a limited number of these: iirc, 16 to start by default, and the number increases as you put more items into the map. There should always be more buckets than values. To provide one example, using the defaults, if you add 100 entries to a HashMap, there will be 256 buckets.
Every value which can be used as a key in a map must be able to generate an integer value, called the hashcode.
The HashMap uses this hashcode to select a bucket. Ultimately, this means taking the integer value modulo the number of buckets, but before that, Java's HashMap has an internal method (called hash()), which tweaks the hashcode to reduce some known sources of clumping.
When looking up a value, the HashMap selects the bucket, and then searches for the individual element by a linear search of the linked list, using .equals().
So: you don't have to work around collisions for correctness, and you usually don't have to worry about them for performance, and if you're using native Java classes (like String), you don't have to worry about generating the hashcode values either.
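To make this concrete, here is a minimal sketch you can run yourself: "Aa" and "BB" are a well-known pair of distinct Strings that share the hashCode 2112, yet a HashMap keeps both entries, because equality inside a bucket is decided by equals(), not by the hash. (The class name CollisionDemo is just for illustration.)
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // Distinct Strings, same hashCode: 'A'*31 + 'a' == 'B'*31 + 'B' == 2112.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);

        // Both entries survive the collision; lookups use equals() inside the bucket.
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.size());    // 2
    }
}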
In the case where you do have to write your own hashcode method (which means you've written a class with a compound value, like a first name/last name pair), things get slightly more complicated. It's quite possible to get it wrong here, but it's not rocket science. First, know this: the only thing you must do in order to assure correctness is to assure that equal objects yield equal hashcodes. So if you write a hashcode() method for your class, you must also write an equals() method, and you must examine the same values in each.
It is possible to write a hashcode() method which is bad but correct, by which I mean that it would satisfy the "equal objects must yield equal hashcodes" constraint, but still perform very poorly, by having a lot of collisions.
The canonical degenerate worst case of this would be to write a method which simply returns a constant value (e.g., 3) for all cases. This would mean that every value would be hashed into the same bucket.
It would still work, but performance would degrade to that of a linked list.
Obviously, you won't write such a terrible hashcode() method. If you're using a decent IDE, it's capable of generating one for you. Since StackOverflow loves code, here's the code for the firstname/lastname class above.
public class SimpleName {

    private String firstName;
    private String lastName;

    public SimpleName(String firstName, String lastName) {
        super();
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((firstName == null) ? 0 : firstName.hashCode());
        result = prime * result
                + ((lastName == null) ? 0 : lastName.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SimpleName other = (SimpleName) obj;
        if (firstName == null) {
            if (other.firstName != null)
                return false;
        } else if (!firstName.equals(other.firstName))
            return false;
        if (lastName == null) {
            if (other.lastName != null)
                return false;
        } else if (!lastName.equals(other.lastName))
            return false;
        return true;
    }
}
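A quick usage sketch (hypothetical data, assuming the SimpleName class above is on the classpath): because equals() and hashCode() examine the same fields, two distinct but equal instances behave as the same HashMap key.
import java.util.HashMap;
import java.util.Map;

public class SimpleNameDemo {
    public static void main(String[] args) {
        Map<SimpleName, String> phoneBook = new HashMap<>();
        phoneBook.put(new SimpleName("Ada", "Lovelace"), "555-0100");

        // A different instance with the same field values finds the same entry.
        System.out.println(phoneBook.get(new SimpleName("Ada", "Lovelace"))); // 555-0100
        System.out.println(phoneBook.size()); // 1
    }
}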

I direct you to the answer here. While it is not a bad idea to use Strings (@CPerkins explained why, perfectly), storing the values in a HashMap with Integer keys is better, since it is generally quicker (although unnoticeably so) and has a lower chance (actually, no chance) of collisions.
See this chart of collisions using 216553 keys in each case, (stolen from this post, reformatted for our discussion)
Hash               Lowercase      Random UUID    Numbers
=============      =============  =============  ==============
Murmur             145 ns         259 ns          92 ns
                     6 collis       5 collis       0 collis
FNV-1a             152 ns         504 ns          86 ns
                     4 collis       4 collis       0 collis
FNV-1              184 ns         730 ns          92 ns
                     1 collis       5 collis       0 collis*
DBJ2a              158 ns         443 ns          91 ns
                     5 collis       6 collis       0 collis***
DJB2               156 ns         437 ns          93 ns
                     7 collis       6 collis       0 collis***
SDBM               148 ns         484 ns          90 ns
                     4 collis       6 collis       0 collis**
CRC32              250 ns         946 ns         130 ns
                     2 collis       0 collis       0 collis

Avg Time per key   0.8 ps         2.5 ps         0.44 ps
Collisions (%)     0.002%         0.002%         0%
Of course, the number of integers is limited to 2^32, whereas there is no limit to the number of strings (and there is no theoretical limit to the number of keys that can be stored in a HashMap). If you use a long (or even a float) key, hashCode collisions are inevitable, and therefore it is no "better" than a String. However, even despite hash collisions, put() and get() will always put/get the correct key-value pair (see edit below).
In the end, it really doesn't matter, so use whatever is more convenient. But if convenience makes no difference, and you do not intend to have more than 2^32 entries, I suggest you use ints as keys.
EDIT
While the above is definitely true, NEVER use "StringKey".hashCode() to generate a key in place of the original String key for performance reasons: two different Strings can have the same hashCode, which would cause one put() to silently overwrite another entry. Java's implementation of HashMap is smart enough to handle keys (of any type, actually) with the same hashCode automatically, so it is wise to let Java handle these things for you.
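A minimal sketch of the pitfall described in this edit (the class name is made up for illustration): keying a map by the String's hashCode silently loses an entry as soon as two keys collide, whereas keying by the String itself does not.
import java.util.HashMap;
import java.util.Map;

public class HashCodeAsKeyPitfall {
    public static void main(String[] args) {
        // Anti-pattern: using the String's hashCode as the key.
        Map<Integer, String> byHash = new HashMap<>();
        byHash.put("Aa".hashCode(), "value for Aa");
        byHash.put("BB".hashCode(), "value for BB"); // same hashCode (2112): overwrites the first entry
        System.out.println(byHash.size()); // 1 -- one value was silently lost

        // Correct: use the String itself as the key.
        Map<String, String> byString = new HashMap<>();
        byString.put("Aa", "value for Aa");
        byString.put("BB", "value for BB");
        System.out.println(byString.size()); // 2
    }
}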

I strongly suspect that the HashMap.put method does not determine whether the key is the same by just looking at String.hashCode().
There is definitely a chance of a hash collision, so one would expect that the String.equals method is also called to make sure the Strings are truly equal, when two Strings do have the same value returned from hashCode.
Therefore, a new key String is judged to be the same as a key String already in the HashMap only if the value returned by hashCode is equal and the equals method returns true.
This also holds for classes other than String, since the Object class itself already defines the hashCode and equals methods.
Edit
So, to answer the question, no, it would not be a bad idea to use a String for a key to a HashMap.

This is not an issue, it's just how hashtables work. It's provably impossible to have distinct hashcodes for all distinct strings, because there are far more distinct strings than integers.
As others have written, hash collisions are resolved via the equals() method. The only problem this can cause is degeneration of the hashtable, leading to bad performance. That's why Java's HashMap has a load factor, a ratio between buckets and inserted elements which, when exceeded, will cause rehashing of the table with twice the number of buckets.
This generally works very well, but only if the hash function is good, i.e. does not result in more than the statistically expected number of collisions for your particular input set. String.hashCode() is good in this regard, but this was not always so. Allegedly, prior to Java 1.2 it only included every n-th character. This was faster, but caused predictable collisions for all Strings sharing each n-th character - very bad if you're unlucky enough to have such regular input, or if someone wants to mount a DoS attack on your app.
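A small sketch of the resizing rule described above, assuming the default initial capacity of 16 and load factor 0.75 (the printed capacities mirror what HashMap does internally rather than reading its private state): the table grows to 32 buckets when the 13th entry goes in, to 64 at the 25th, and reaches 256 by the 100th.
import java.util.HashMap;
import java.util.Map;

public class LoadFactorSketch {
    public static void main(String[] args) {
        int capacity = 16;          // default initial capacity
        double loadFactor = 0.75;   // default load factor
        Map<String, Integer> map = new HashMap<>();
        for (int i = 1; i <= 100; i++) {
            map.put("key-" + i, i);
            if (i > capacity * loadFactor) {
                capacity *= 2;      // mirrors the internal doubling on resize
                System.out.println("after entry " + i + ": ~" + capacity + " buckets");
            }
        }
    }
}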

You are talking about hash collisions. Hash collisions are an issue regardless of the type being hashCode'd. All classes that use hashCode (e.g. HashMap) handle hash collisions just fine. For example, HashMap can store multiple objects per bucket.
Don't worry about it unless you are calling hashCode yourself. Hash collisions, though rare, don't break anything.

Related

Java hashcode() collision for objects containing different but similar Strings

While verifying output data of my program, I identified cases for which hash codes of two different objects were identical. To get these codes, I used the following function:
int getHash( long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate ) {
    int result = 17;
    result = 31 * result + (int) (lID ^ (lID >>> 32));
    long temp;
    temp = Double.doubleToLongBits(dGR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    temp = Double.doubleToLongBits(dSR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
    result = 31 * result + (sCO != null ? sCO.hashCode() : 0);
    result = 31 * result + (sSearchDate != null ? sSearchDate.hashCode() : 0);
    return result;
}
These are two example cases:
getHash( 50122,"03/25/2015","03/26/2015",4.0,8.0,"03/24/15 06:01" )
getHash( 51114,"03/24/2015","03/25/2015",4.0,8.0,"03/24/15 06:01" )
I suppose this issue arises because I have three very similar strings present in my data, and the hashcode differences between String A and B and between B and C are of the same size, leading to an identical returned hashcode.
The hashcode() implementation proposed by IntelliJ uses 31 as a multiplier for each variable that contributes to the final hashcode. I was wondering why one does not use different values for each variable (like 33, 37, 41, which I have seen mentioned in other posts dealing with hashcodes)? In my case, this would lead to a differentiation between my two objects.
But I'm wondering whether this could then lead to issues in other cases?
Any ideas or hints on this? Thank you very much!
The hashCode() contract allows different objects to have the same hash code. From the documentation:
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
But, since you've got a bunch of parameters for your hash, you may consider using Objects.hash() instead of doing your own implementation:
int getHash(long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate) {
    return Objects.hash(lID, sCI, sCO, dSR, dGR, sSearchDate); // java.util.Objects
}
For example:
Objects.hash(50122, "03/25/2015", "03/26/2015", 4.0, 8.0, "03/24/15 06:01")
Objects.hash(51114, "03/24/2015", "03/25/2015", 4.0, 8.0, "03/24/15 06:01")
Results in:
-733895022
-394580334
The code shown by you may add a zero term, for example via
result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
When several of these terms are zero, the computation may degenerate to a multiplication of
31 * 31 * 31 ...
which could destroy uniqueness.
However the hashCode method is not intended to return unique values. It simply should provide a uniform distribution of values and it should be easy to compute (or cache hashCode as the String class does).
From a more theoretical point of view a hashCode maps from a large set A into a smaller set B. Hence collisions (different elements from A map to the same value in B) are unavoidable. You could choose a set B which is bigger than A but this would violate the purpose of hashCode: performance optimization. Really, you could achieve anything with a linked list and some additional logic what you achieve with hashCode.
Prime numbers are chosen as they result in a better distribution. For example, if using non-primes, 4*3 = 12 = 2*6 results in the same hashCode. 31 is sometimes chosen as it is a Mersenne prime (2^n - 1), which is said to perform better on processors (I'm not sure about that).
As the hashCode method is not specified to unambiguously identify elements, non-unique hashCodes are perfectly fine. Assuming uniqueness of hashCodes is a bug.
However, a HashMap can be described as a set of buckets, with each bucket holding a singly linked list of elements. The buckets are indexed by the hashCode. Hence providing identical hashCodes leads to fewer occupied buckets with longer lists. In the most extreme case (returning an arbitrary constant as hashCode) the map degenerates to a linked list.
When an object is searched for in a hash data structure, the hashCode is used to get the bucket index. For each object in this bucket the equals method is invoked, so long lists mean a large number of invocations of equals.
Conclusion: assuming that the hashCode method is correctly used, this cannot cause a program to malfunction. However, it may result in a severe performance penalty.
As the other answers explain well, it is allowed for hashCode to return the same value for different objects. This is not a cryptographic hash value, so it's easy to find examples of hashCode collisions.
However, I point out a problem in your code: if you have made the hashCode method yourself, you should definitely be using a better hash algorithm. Take a look at MurmurHash: http://en.wikipedia.org/wiki/MurmurHash. You want to use the 32-bit version. There are also Java implementations.
Yes, hash collisions can lead to performance issues. Therefore it's important to use a good hash algorithm. Additionally, for security, MurmurHash allows a seed value to make hash-collision denial-of-service attacks harder. You should generate the seed value you use randomly at the start of the program. Your implementation of the hashCode method is vulnerable to these hash-collision DoS attacks.

Deal with collisions using another hash function?

My question is not about the double hashing technique http://en.wikipedia.org/wiki/Double_hashing , which is a way to resolve collisions. It is about handling existing collisions in a hash table of strings. Say we have a collision: several strings in the same bucket, so now we must go through the bucket checking the strings. It seems it would make sense to calculate another hash function for fast string comparison (compare hash values for quick rejection). The hash key could be lazily computed and saved with the string. Have you used such a technique? Could you provide a reference? If not, do you think it's not worth doing since the performance gain is questionable? Some notes:
I put the tag "Java" since I did my measurements in Java: String.hashCode() in most cases outperforms String.equals() (and BTW greatly outperforms a manual hash code calculation: hashCode = 31 * hashCode + strInTable.charAt(i));
Of course, the same could be asked about any string comparison, not necessarily strings in a hash table. But I am considering a specific situation with huge amount of strings which are kept in hash table.
This probably makes sense if the strings in the bucket are somewhat similar (like in Rabin-Karp algorithm). Looking for your opinion in general situation.
Many hash-based collections store the hash value of each item in the collection, on the premise that since every item's hash will have been computed when it was added to the collection, and code which is looking for an item in a hashed collection will have to know its hash, comparing hash values will be a quick and easy way of reducing the cost of false hits. For example, if one has a 16-bucket hash table that contains four strings of 1,000 characters each, and will be searching for a lot of 1,000-character strings which match one of the table entries in all but the last few characters, more than 6% of searches will hit a bucket that contains a near-match string, but a much smaller fraction will hit a bucket that contains a string whose 32-bit hashCode matches that of the string being sought. Since comparisons of nearly-identical strings are expensive, comparing full 32-bit hash codes is helpful.
If one has large immutable collections which may need to be stored in hash tables and matched against other such collections, there may be some value in having such collections compute and cache longer hash functions, and having their equals methods compare the results of those longer hash functions before proceeding further. In such cases, computing a longer hash function will often be almost as fast as computing a shorter one. Further, not only will comparisons on the longer hash code greatly reduce the risks that false positives will cause needless "deep" comparisons, but computing longer hash functions and combining them into the reported hashCode() may greatly reduce the dangers of strongly-correlated hash collisions.
Comparing a hash only makes sense if the number of comparisons (lookups) is large compared to the number of entries. You would need a large hash (32 bits are not enough; you'd want at least 128 bits), and that would be expensive to calculate. You would want to amortize the cost of hashing each string over a large number of probes into the buckets.
As to whether it's worth it or not, it's highly context dependent. The only way to find out is to actually do it with your data and compare the performance of both methods.
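For what it's worth, here is a minimal sketch of the idea being asked about: wrap the String together with a lazily computed secondary 64-bit hash, and let equals() reject mismatches cheaply before falling back to the full character comparison. The class name and the simple polynomial hash with an arbitrary seed are illustrative assumptions, not a standard API, and whether the extra comparison pays off depends entirely on your data.
final class HashedKey {
    private final String value;
    private long longHash;          // 0 means "not computed yet" (the same trick String uses)

    HashedKey(String value) {
        this.value = value;
    }

    private long longHash() {
        long h = longHash;
        if (h == 0) {
            h = 1125899906842597L;  // arbitrary non-zero seed
            for (int i = 0; i < value.length(); i++) {
                h = 31 * h + value.charAt(i);
            }
            longHash = h;
        }
        return h;
    }

    @Override
    public int hashCode() {
        return value.hashCode();    // bucket selection is unchanged
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof HashedKey)) return false;
        HashedKey other = (HashedKey) obj;
        // Cheap 64-bit comparison first; only near-identical candidates reach
        // the expensive character-by-character String.equals().
        return longHash() == other.longHash() && value.equals(other.value);
    }
}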

Java HashMap detect collision

Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where a lot of collisions can take place? Of course, if you override the hashCode for an object and simply return a constant value, collisions are sure to occur. I'm not talking about that. I want to know in what other situations, besides the one previously mentioned, huge numbers of collisions occur without modifying the default hashCode implementation.
I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & (N - 1) (faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits, which were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously any hash where there are more than 2^32 possible input values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
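A quick way to check the recipe above (a throwaway sketch; the static Long.hashCode(long) has been available since Java 8):
public class LongHashCollisions {
    public static void main(String[] args) {
        // XOR of identical upper and lower halves cancels out to 0.
        System.out.println(Long.hashCode(0L));              // 0
        System.out.println(Long.hashCode(1L | (1L << 32))); // 0
        System.out.println(Long.hashCode(2L | (2L << 32))); // 0
        System.out.println(Long.hashCode(3L | (3L << 32))); // 0
    }
}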
The other two answers I see are good IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a big number of objects from your class, put them into the particular HashMap implementation as keys, and test CPU and memory load. 1 or 2 million entries are a good number to measure, but you get the best results if you test with your anticipated Map sizes.
I just looked at a class whose hashing function I doubted. So I decided to fill a HashMap with random objects of that type and test the number of collisions. I tested two hashCode() implementations of the class under investigation. So I wrote in Groovy the class you see at the bottom, extending the OpenJDK implementation of HashMap to count the number of collisions in the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash but collisions in the array holding the entries. The array index is calculated as hash & (length-1), which means that the shorter this array is, the more collisions you get. And the size of this array depends on the initialCapacity and loadFactor of the HashMap (it can increase when put() adds more data).
In the end, though, I decided that looking at these numbers makes little sense. The fact that HashMap is slower with a bad hashCode() method means that by just benchmarking insertion and retrieval of data from the Map you effectively know which hashCode() implementation is better; a rough timing sketch of that approach follows after the class below.
public class TestHashMap extends HashMap {

    public TestHashMap(int size) {
        super(size);
    }

    public TestHashMap() {
        super();
    }

    public int countCollidingEntries() {
        def fs = this.getClass().getSuperclass().getDeclaredFields();
        def table;
        def count = 0;
        for (java.lang.reflect.Field field : fs) {
            if (field.getName() == "table") {
                field.setAccessible(true);
                table = field.get(this);   // read HashMap's internal bucket array
                break;
            }
        }
        for (Object e : table) {
            if (e != null) {
                while (e.next != null) {
                    count++;
                    e = e.next;
                }
            }
        }
        return count;
    }
}
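Along the same lines, a rough timing sketch (plain Java, not a proper benchmark; the BadKey class and the sizes are made up for illustration): the same number of entries is inserted once with well-distributed Integer keys and once with keys whose hashCode is a constant, so every key lands in one bucket.
import java.util.HashMap;
import java.util.Map;

public class HashCodeLoadTest {
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // every key collides
    }

    public static void main(String[] args) {
        int n = 100_000;

        long t0 = System.nanoTime();
        Map<Integer, Integer> good = new HashMap<>();
        for (int i = 0; i < n; i++) good.put(i, i);
        System.out.println("Integer keys:       " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        long t1 = System.nanoTime();
        Map<BadKey, Integer> bad = new HashMap<>();
        for (int i = 0; i < n; i++) bad.put(new BadKey(i), i);
        System.out.println("constant-hash keys: " + (System.nanoTime() - t1) / 1_000_000 + " ms");
    }
}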

understanding of hash code

A hash function is important in implementing a hash table. I know that in Java, Object has its hash code, which might be generated from a weak hash function.
Following is one snippet that is a "supplemental hash function":
static int hash(Object x) {
    int h = x.hashCode();
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}
Can anybody help explain the fundamental idea of a hash algorithm? Is it to generate non-duplicate integers? If so, how do these bitwise operations accomplish that?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
Using more "human" language object hash is a short and compact value based on object's properties. That is if you have two objects that vary somehow - you can expect their hash values to be different. Good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
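A concrete example of biased low-order bits, mentioned in an earlier answer, is the hashCode of small Floats (1.0f, 2.0f, ...), whose low bits are all zero. A small sketch, assuming a 16-bucket table where the bucket would be hash & 15 if there were no smearing:
public class LowBitsDemo {
    public static void main(String[] args) {
        for (float f = 1.0f; f <= 8.0f; f++) {
            int h = Float.hashCode(f); // same bits as Float.floatToIntBits(f)
            // The low-order bits are all zero, so without smearing every value
            // would land in bucket 0 of a 16-bucket table.
            System.out.printf("%.1f  hash=%-12d  bucket(no smear)=%d%n", f, h, h & 15);
        }
    }
}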
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you started with a poor hashcode implementation but then improve hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
It could be anything you want as long as you adhere to the general contract described in the doc, which in my own words are:
If you call hashCode on an object 100 (N) times, it must return the same value every time, at least during that program execution (a subsequent program execution may return a different one).
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may be true, but it helps if it is not.
And that's it.
Depending on the nature of your class, the hashCode() may be very complex or very simple. For instance the String class, which may have millions of instances, needs a very good hashCode implementation, and uses prime numbers to reduce the possibility of collisions.
If for your class it does make sense to have a consecutive number, that's ok too, there is no reason why you should complicate it every time.

Why doesn't String's hashCode() cache 0?

I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:
public class Main {
    static void test(String s) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            s.hashCode();
        }
        System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
    }

    public static void main(String[] args) {
        String z = "Allocator redistricts; strict allocator redistricts strictly.";
        test(z);
        test(z.toUpperCase());
    }
}
Running this in ideone.com gives the following output:
Took 1470 ms.
Took 58 ms.
So my questions are:
Why doesn't String's hashCode() cache 0?
What is the probability that a Java string hashes to 0?
What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
Is this the best-practice way of caching values? (i.e. cache all except one?)
For your amusement, each line here is a string that hashes to 0:
pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.
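If you want to verify the claim yourself, a throwaway snippet like the following prints the hash of a few of the strings listed above (they are reported to hash to zero):
public class ZeroHashCheck {
    public static void main(String[] args) {
        String[] candidates = {
            "pollinating sandboxes",
            "amusement & hemophilias",
            "electrolysissweeteners.net"
        };
        for (String s : candidates) {
            System.out.println(s.hashCode() + "  <-  \"" + s + "\"");
        }
    }
}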
You're worrying about nothing. Here's a way to think about this issue.
Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.
And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).
In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.
Hooray! Now your app is... let's see... 0.0015% faster!
What used to take an entire day now takes only 23 hours, 57 minutes and 48 seconds!
And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.
Does this seem worth it to you?
EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!
It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)
I don't expect many strings hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.
As for whether this is a good idea in general: it's certainly an efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.
EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.
I think there's something important that the other answers so far are missing: the zero value exists so that the hashCode-caching mechanism works robustly in a multi-threaded environment.
If you had two variables, like cachedHashCode itself and an isHashCodeCalculated boolean to indicate whether cachedHashCode had been calculated, you'd need thread synchronization for things to work in a multithreaded environment. And synchronization would be bad for performance, especially since Strings are very commonly reused in multiple threads.
My understanding of the Java memory model is a little sketchy, but here's roughly what's going on:
When multiple threads access a variable (like the cached hashCode), there's no guarantee that each thread will see the latest value. If a variable starts at zero, then thread A updates it (sets it to a non-zero value), and thread B reads it shortly afterwards, thread B could still see the zero value.
There's another problem with accessing shared values from multiple threads (without synchronization) - you can end up trying to use an object that's only been partly initialized (constructing an object is not an atomic process). Multi-threaded reads and writes of 64-bit primitives like longs and doubles are not necessarily atomic either, so if two threads try to read and change the value of a long or a double, one thread can end up seeing something weird and partially set. Or something like that anyway. There are similar problems if you try to use two variables together, like cachedHashCode and isHashCodeCalculated - a thread can easily come along and see the latest version of one of those variables, but an older version of another.
The usual way to get around these multi-threading issues is to use synchronization. For example, you could put all access to the cached hashCode inside a synchronized block, or you could use the volatile keyword (although be careful with that because the semantics are a little confusing).
However, synchronization slows things down. Bad idea for something like a string hashCode. Strings are very often used as keys in HashMaps, so you need the hashCode method to perform well, including in multi-threaded environments.
Java primitives that are 32-bits or less, like int, are special. Unlike, say, a long (64-bit value), you can be sure that you will never read a partially initialized value of an int (32 bits). When you read an int without synchronization, you can't be sure that you'll get the latest set value, but you can be sure that the value you do get is a value that has explicitly been set at some point by your thread or another thread.
The hashCode caching mechanism in java.lang.String is set up to rely on point 5 above. You might understand it better by looking at the source of java.lang.String.hashCode(). Basically, with multiple threads calling hashCode at once, hashCode might end up being calculated multiple times (either if the calculated value is zero or if multiple threads call hashCode at once and both see a zero cached value), but you can be sure that hashCode() will always return the same value. So it's robust, and it's performant too (because there's no synchronization to act as a bottleneck in multi-threaded environments).
Like I said, my understanding of the Java memory model is a little sketchy, but I'm pretty sure I've got the gist of the above right. Ultimately it's a very clever idiom for caching the hashCode without the overhead of synchronization.
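For reference, the same racy-but-safe caching idiom can be used in your own immutable classes. A minimal sketch (assuming the fields are immutable and the computation is idempotent, so the benign race described above is harmless; as with String before JDK 13, a value that genuinely hashes to 0 simply gets recomputed on every call):
final class Point {
    private final int x;
    private final int y;
    private int hash;   // not volatile, not synchronized

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Point)) return false;
        Point p = (Point) obj;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {           // several threads may compute this concurrently;
            h = 31 * x + y;     // they all derive the same value from immutable state
            hash = h;
        }
        return h;
    }
}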
0 isn't cached as the implementation interprets a cached value of 0 as "cached value not yet initialised". The alternative would have been to use a java.lang.Integer, whereby null implied that the value was not yet cached. However, this would have meant an additional storage overhead.
Regarding the probability of a String's hash code being computed as 0 I would say the probability is quite low and can happen in the following cases:
The String is empty (although recomputing this hash code each time is effectively O(1)).
An overflow occurs whereby the final computed hash code is 0 (e.g. Integer.MAX_VALUE + h(c1) + h(c2) + ... h(cn) == 0).
The String contains only Unicode character 0. Very unlikely as this is a control character with no meaning apart from in the "paper tape world" (!):
From Wikipedia: "Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character without meaning otherwise."
This turns out to be a good question, related to a security vulnerability.
"When hashing a string, Java also caches the
hash value in the hash attribute, but only if the result is different from zero.
Thus, the target value zero is particularly interesting for an attacker as it prevents caching
and forces re-hashing."
Ten years later, and things have changed. I honestly can't believe this (but the geek in me is ultra-happy).
As you have noted, there are cases where String::hashCode for some Strings is zero, and this was not cached (we will get to that). A lot of people argued (including in this Q&A): why not add a field in java.lang.String, something like hashAlreadyComputed, and simply use that? The problem is obvious: extra space for every single String instance. There is, btw, a reason java-9 introduced Compact Strings: many benchmarks have shown that this is a rather (over)used class in the majority of applications. Adding more space? The decision was: no. Especially since the smallest possible addition would have been 1 byte, not 1 bit (for 32-bit JVMs, the extra space would have been 8 bytes: 1 for the flag, 7 for alignment).
So, Compact Strings came along in java-9, and if you look carefully (or care) they did add a field in java.lang.String: coder. Didn't I just argue against that? It's not that easy. It seems that the importance of Compact Strings outweighed the "extra space" argument. It is also important to say that the extra space matters for 32-bit VMs only (because there was no gap in alignment). In contrast, in jdk-8 the layout of java.lang.String is:
java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                               VALUE
      0    12          (object header)                           N/A
     12     4   char[] String.value                              N/A
     16     4      int String.hash                               N/A
     20     4          (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
Notice an important thing right there:
Space losses : ... 4 bytes total.
Because every Java Object is aligned (by how much depends on the JVM and some start-up flags like UseCompressedOops, for example), in String there is an unused gap of 4 bytes. So when coder was added, it simply took 1 byte without adding additional space. As such, after Compact Strings were added, the layout changed:
java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                               VALUE
      0    12          (object header)                           N/A
     12     4   byte[] String.value                              N/A
     16     4      int String.hash                               N/A
     20     1     byte String.coder                              N/A
     21     3          (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 3 bytes external = 3 bytes total
coder eats 1 byte and the gap shrank to 3 bytes. So the "damage" was already done in jdk-9. For 32-bit JVMs there was an increase of 8 bytes (1 for coder + 7 for the gap), and for 64-bit JVMs there was no increase: coder simply occupied some space from the gap.
And now, in jdk-13, they decided to leverage that gap, since it exists anyway. Let me just remind you that the probability of having a String with a zero hashCode is 1 in 4 billion; still, there are people who say: so what? Let's fix this! Voilà: the jdk-13 layout of java.lang.String:
java.lang.String object internals:
 OFFSET  SIZE      TYPE DESCRIPTION                               VALUE
      0    12           (object header)                           N/A
     12     4    byte[] String.value                              N/A
     16     4       int String.hash                               N/A
     20     1      byte String.coder                              N/A
     21     1   boolean String.hashIsZero                         N/A
     22     2           (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 2 bytes external = 2 bytes total
And here it is : boolean String.hashIsZero. And here it is in the code-base:
public int hashCode() {
    int h = hash;
    if (h == 0 && !hashIsZero) {
        h = isLatin1() ? StringLatin1.hashCode(value)
                       : StringUTF16.hashCode(value);
        if (h == 0) {
            hashIsZero = true;
        } else {
            hash = h;
        }
    }
    return h;
}
Wait! h == 0 and a hashIsZero field? Shouldn't that field be named something like hashAlreadyComputed? Why isn't the implementation something along the lines of:
@Override
public int hashCode() {
    if (!hashCodeComputed) {
        // or any other sane computation
        hash = 42;
        hashCodeComputed = true;
    }
    return hash;
}
Even if I read the comment under the source code:
// The hash or hashIsZero fields are subject to a benign data race,
// making it crucial to ensure that any observable result of the
// calculation in this method stays correct under any possible read of
// these fields. Necessary restrictions to allow this to be correct
// without explicit memory fences or similar concurrency primitives is
// that we can ever only write to one of these two fields for a given
// String instance, and that the computation is idempotent and derived
// from immutable state
It only made sense after I read this. Rather tricky, but this does one write at a time, lots more details in the discussion above.
Why doesn't String's hashCode() cache 0?
The value zero is reserved as meaning "the hash code is not cached".
What is the probability that a Java string hashes to 0?
According to the Javadoc, the formula for a String's hashcode is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string and n is the length of the string. (The hash of the empty String is defined to be zero as a special case.)
My intuition is that the hashcode function as above gives a uniform spread of String hash values across the range of int values. A uniform spread that would mean that the probability of a randomly generated String hashing to zero was 1 in 2^32.
What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
The best strategy is to ignore the issue. If you are repeatedly hashing the same String value, there is something rather strange about your algorithm.
Is this the best-practice way of caching values? (i.e. cache all except one?)
This is a space versus time trade-off. AFAIK, the alternatives are:
Add a cached flag to each String object, making every Java String take an extra word.
Use the top bit of the hash member as the cached flag. That way you can cache all hash values, but you only have half as many possible String hash values.
Don't cache hashcodes on Strings at all.
I think that the Java designers have made the right call for Strings, and I'm sure that they have done extensive profiling that confirms the soundness of their decision. However, it does not follow that this would always be the best way to deal with caching.
(Note that there are two "common" String values which hash to zero; the empty String, and the String consisting of just a NUL character. However, the cost of calculating the hashcodes for these values is small compared with the cost of calculating the hashcode for a typical String value.)
Well folks, it keeps 0 because if the string has zero length, the hash will end up as zero anyway.
And it doesn't take long to figure out that the length is zero, and so the hashcode must be zero too.
So, for your code review, here it is in all its Java 8 glory:
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
As you can see, this will always return a quick zero if the string is empty:
if (h == 0 && value.length > 0) ...
The "avoid 0" suggestion seems appropriate to recommend as best practice as it helps a genuine problem (seriously unexpected performance degradation in constructible cases that can be attacker supplied) for the meager cost of a branch operation prior to a write. There is some remaining 'unexpected performance degradation' that can be exercised if the only things going into a set hash to the special adjusted value. But this is at worst a 2x degradation rather than unbounded.
Of course, String's implementation can't be changed but there is no need to perpetuate the problem.

Categories

Resources