Java hashcode() collision for objects containing different but similar Strings - java

While verifying output data of my program, I identified cases for which hash codes of two different objects were identical. To get these codes, I used the following function:
int getHash( long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate ) {
    int result = 17;
    result = 31 * result + (int) (lID ^ (lID >>> 32));
    long temp;
    temp = Double.doubleToLongBits(dGR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    temp = Double.doubleToLongBits(dSR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
    result = 31 * result + (sCO != null ? sCO.hashCode() : 0);
    result = 31 * result + (sSearchDate != null ? sSearchDate.hashCode() : 0);
    return result;
}
These are two example cases:
getHash( 50122,"03/25/2015","03/26/2015",4.0,8.0,"03/24/15 06:01" )
getHash( 51114,"03/24/2015","03/25/2015",4.0,8.0,"03/24/15 06:01" )
I suppose this issue arises because I have three very similar strings in my data, and the differences in hash code between strings A and B and between B and C are of the same size, leading to an identical returned hash code.
The hashCode() implementation proposed by IntelliJ uses 31 as the multiplier for each variable that contributes to the final hash code. I was wondering why one does not use different values for each variable (like 33, 37, 41, which I have seen mentioned in other posts dealing with hash codes)? In my case, this would lead to a differentiation between my two objects.
But I'm wondering whether this could then lead to issues in other cases?
Any ideas or hints on this? Thank you very much!

The hashCode() contract allows different objects to have the same hash code. From the documentation:
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
But, since you've got a bunch of parameters for your hash, you may consider using Objects.hash() instead of doing your own implementation:
int getHash(long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate) {
    return Objects.hash(lID, sCI, sCO, dSR, dGR, sSearchDate);
}
For example:
Objects.hash(50122, "03/25/2015", "03/26/2015", 4.0, 8.0, "03/24/15 06:01")
Objects.hash(51114, "03/24/2015", "03/25/2015", 4.0, 8.0, "03/24/15 06:01")
Results in:
-733895022
-394580334
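In practice you would usually wire this straight into the class's own hashCode() together with a matching equals(). A rough sketch of that, where the class and field names are made up for illustration and only stand in for your object:
import java.util.Objects;

public class SearchRecord {
    private final long id;
    private final String checkIn;
    private final String checkOut;
    private final double someRate;
    private final double otherRate;
    private final String searchDate;

    public SearchRecord(long id, String checkIn, String checkOut,
                        double someRate, double otherRate, String searchDate) {
        this.id = id;
        this.checkIn = checkIn;
        this.checkOut = checkOut;
        this.someRate = someRate;
        this.otherRate = otherRate;
        this.searchDate = searchDate;
    }

    @Override
    public int hashCode() {
        // Objects.hash boxes the values and combines the fields with the 31 multiplier internally.
        return Objects.hash(id, checkIn, checkOut, someRate, otherRate, searchDate);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SearchRecord)) return false;
        SearchRecord other = (SearchRecord) o;
        return id == other.id
                && Double.compare(someRate, other.someRate) == 0
                && Double.compare(otherRate, other.otherRate) == 0
                && Objects.equals(checkIn, other.checkIn)
                && Objects.equals(checkOut, other.checkOut)
                && Objects.equals(searchDate, other.searchDate);
    }
}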

The code you have shown may add zero to the running result, for example via
result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
When several of these terms contribute zero, the computation degenerates towards a repeated multiplication
31 * 31 * 31 ...
which could destroy uniqueness.
However, the hashCode method is not intended to return unique values. It should simply provide a uniform distribution of values and be easy to compute (or be cached, as the String class does with its hashCode).
From a more theoretical point of view, a hashCode maps from a large set A into a smaller set B. Hence collisions (different elements from A mapping to the same value in B) are unavoidable. You could choose a set B that is bigger than A, but this would defeat the purpose of hashCode: performance optimization. Everything you achieve with hashCode you could also achieve with a linked list and some additional logic.
Prime numbers are chosen as multipliers because they give a better distribution; with non-prime factors, products coincide more easily (for example 4*3 = 12 = 2*6). 31 in particular is sometimes chosen because it is a Mersenne prime (2^5 - 1), which is said to perform better on processors since multiplying by it reduces to a shift and a subtraction (I'm not sure how much that matters in practice).
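For instance, the shift-and-subtract identity behind the 31 multiplier is easy to check:
public class MersenneMultiplier {
    public static void main(String[] args) {
        int x = 123_456;
        // 31 = 2^5 - 1, so 31 * x can be computed as a shift and a subtraction
        System.out.println(31 * x == (x << 5) - x);   // true
    }
}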
Since the hashCode method is not specified to identify elements unambiguously, non-unique hashCodes are perfectly fine. Assuming uniqueness of hashCodes is a bug.
However, a HashMap can be described as a set of buckets, with each bucket holding a singly linked list of elements. The buckets are indexed by the hashCode. Hence identical hashCodes lead to fewer buckets with longer lists. In the most extreme case (returning an arbitrary constant as hashCode) the map degenerates to a linked list.
When an object is looked up in a hash data structure, the hashCode is used to get the bucket index. For each object in this bucket the equals method is invoked, so long lists mean a large number of invocations of equals.
Conclusion: assuming that the hashCode method is used correctly, this cannot cause a program to malfunction. However, it may result in a severe performance penalty.
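A small sketch of that degenerate case (illustrative only; exact timings will vary, and recent JDKs soften the blow by treeifying overfull buckets):
import java.util.HashMap;
import java.util.Map;

public class ConstantHashCodeDemo {
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == this.id;
        }
        @Override public int hashCode() { return 42; }  // every key lands in the same bucket
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 50_000; i++) {
            map.put(new BadKey(i), i);                  // each put scans the single bucket
        }
        long start = System.nanoTime();
        map.get(new BadKey(49_999));                    // lookup is no longer O(1)
        System.out.println("lookup took " + (System.nanoTime() - start) + " ns");
    }
}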

As the other answers explain well, it is allowed for hashCode to return the same value for different objects. It is not a cryptographic hash value, so it's easy to find examples of hashCode collisions.
However, I point out a problem in your code: if you have made the hashCode method yourself, you should definitely be using a better hash algorithm. Take a look at MurmurHash: http://en.wikipedia.org/wiki/MurmurHash. You want to use the 32-bit version. There are also Java implementations.
Yes, hash collisions can lead to performance issues, so it's important to use a good hash algorithm. Additionally, for security, MurmurHash accepts a seed value that makes hash-collision denial-of-service attacks harder. You should generate that seed randomly at the start of the program. Your implementation of the hashCode method is vulnerable to these hash-collision DoS attacks.
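One possible way to do this, assuming the Guava library is on the classpath (the class, method names and the per-run random seed below are just my illustration of the idea, not a prescribed recipe; newer Guava versions prefer murmur3_32_fixed):
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.security.SecureRandom;

public class SeededHashSketch {
    // A fresh seed per JVM run makes precomputed collision sets useless to an attacker.
    private static final int SEED = new SecureRandom().nextInt();
    private static final HashFunction MURMUR = Hashing.murmur3_32(SEED);

    static int hash(long id, String ci, String co, double sr, double gr, String searchDate) {
        return MURMUR.newHasher()
                .putLong(id)
                .putDouble(sr)
                .putDouble(gr)
                .putUnencodedChars(ci == null ? "" : ci)
                .putUnencodedChars(co == null ? "" : co)
                .putUnencodedChars(searchDate == null ? "" : searchDate)
                .hash()
                .asInt();
    }

    public static void main(String[] args) {
        System.out.println(hash(50122, "03/25/2015", "03/26/2015", 4.0, 8.0, "03/24/15 06:01"));
        System.out.println(hash(51114, "03/24/2015", "03/25/2015", 4.0, 8.0, "03/24/15 06:01"));
    }
}
Note that a per-run seed means the hash values change between JVM runs, which is fine for in-memory hash tables but not for anything persisted.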

Related

use case for Double as HashMap Keys [duplicate]

I was thinking of using a Double as the key to a HashMap, but I know floating point comparisons are unsafe, and that got me thinking. Is the equals method on the Double class also unsafe? If it is, then that would mean the hashCode method is probably also incorrect. This would mean that using Double as the key to a HashMap would lead to unpredictable behavior.
Can anyone confirm any of my speculation here?
Short answer: Don't do it
Long answer: Here is how the key is going to be computed:
The actual key will be a java.lang.Double object, since keys must be objects. Here is its hashCode() method:
public int hashCode() {
long bits = doubleToLongBits(value);
return (int)(bits ^ (bits >>> 32));
}
The doubleToLongBits() method basically takes the 8 bytes of the double and represents them as a long. This means that tiny differences in how a double is computed change those bits, and you will have key misses.
If you can settle for a fixed number of digits after the decimal point, multiply by 10^(number of digits after the point) and convert to int (for example, for 2 digits multiply by 100).
It will be much safer.
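A minimal sketch of that scale-and-truncate idea (the toKey helper and the choice of two decimal digits are my own, purely for illustration):
import java.util.HashMap;
import java.util.Map;

public class FixedPointKeyDemo {
    // Keep two digits after the decimal point by storing the value as a scaled long.
    static long toKey(double value) {
        return Math.round(value * 100);
    }

    public static void main(String[] args) {
        Map<Long, String> map = new HashMap<>();
        map.put(toKey(1.1 + 2.3), "sum");          // 3.4000000000000004 -> 340
        System.out.println(map.get(toKey(3.4)));   // 3.4 -> 340, prints "sum"
    }
}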
I think you are right. Although the hash of a double is an int, the double itself can mess up the hash. That is why, as Josh Bloch mentions in Effective Java, when you use a double as an input to a hash function you should use doubleToLongBits(). Similarly, use floatToIntBits for floats.
In particular, to use a double as your hash, following Josh Bloch's recipe, you would do:
public int hashCode() {
int result = 17;
long temp = Double.doubleToLongBits(the_double_field);
result = 37 * result + ((int) (temp ^ (temp >>> 32)));
return result;
}
This is from Item 8 of Effective Java, "Always override hashCode when you override equals". It can be found in this pdf of the chapter from the book.
Hope this helps.
It depends on how you would be using it.
If you're happy with only being able to find the value based on the exact same bit pattern (or potentially an equivalent one, such as +/- 0 and various NaNs) then it might be okay.
In particular, all NaNs would end up being considered equal, but +0 and -0 would be considered different. From the docs for Double.equals:
Note that in most cases, for two instances of class Double, d1 and d2, the value of d1.equals(d2) is true if and only if d1.doubleValue() == d2.doubleValue() also has the value true. However, there are two exceptions:
If d1 and d2 both represent Double.NaN, then the equals method returns true, even though Double.NaN==Double.NaN has the value false.
If d1 represents +0.0 while d2 represents -0.0, or vice versa, the equal test has the value false, even though +0.0==-0.0 has the value true.
This definition allows hash tables to operate properly.
Most likely you're interested in "numbers very close to the key" though, which makes it a lot less viable. In particular if you're going to do one set of calculations to get the key once, then a different set of calculations to get the key the second time, you'll have problems.
The problem is not the hash code but the precision of the doubles. This will cause some strange results. Example:
double x = 371.4;
double y = 61.9;
double key = x + y; // expected 433.3
Map<Double, String> map = new HashMap<Double, String>();
map.put(key, "Sum of " + x + " and " + y);
System.out.println(map.get(433.3)); // prints null
The calculated value (key) is "433.29999999999995" which is not EQUALS to 433.3 and so you don't find the entry in the Map (the hash code probably is also different, but that is not the main problem).
If you use
map.get(key)
it should find the entry...
Short answer: It probably won't work.
Honest answer: It all depends.
Longer answer: The hash code isn't the issue, it's the nature of equal comparisons on floating point. As Nalandial and the commenters on his post point out, ultimately any match against a hash table still ends up using equals to pick the right value.
So the question is, are your doubles generated in such a way that you know that equals really means equals? If you read or compute a value, store it in the hash table, and then later read or compute the value using exactly the same computation, then Double.equals will work. But otherwise it's unreliable: 1.2 + 2.3 does not necessarily equal 3.5, it might equal 3.4999995 or whatever. (Not a real example, I just made that up, but that's the sort of thing that happens.) You can compare floats and doubles reasonably reliably for less or greater, but not for equals.
Maybe BigDecimal gets you where you want to go?
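A hedged sketch of that BigDecimal route, reusing the 371.4 + 61.9 numbers from the answer above (the fixed scale of one decimal digit is an arbitrary choice for illustration):
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.HashMap;
import java.util.Map;

public class BigDecimalKeyDemo {
    // Round every key to a fixed scale so "nearly equal" doubles collapse to the same key.
    static BigDecimal key(double d) {
        return BigDecimal.valueOf(d).setScale(1, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        Map<BigDecimal, String> map = new HashMap<>();
        map.put(key(371.4 + 61.9), "sum");          // 433.29999999999995 -> 433.3
        System.out.println(map.get(key(433.3)));    // prints "sum"
    }
}
Rounding every key through the same helper matters because BigDecimal.equals is scale-sensitive: 433.3 and 433.30 are different keys.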
The hash of the double is used, not the double itself.
Edit: Thanks, Jon, I actually didn't know that.
I'm not sure about this (you should just look at the source code of the Double object) but I would think any issues with floating point comparisons would be taken care of for you.
It depends on how you store and access your map. Yes, similar values could end up being slightly different and therefore not hash to the same value.
private static final double key1 = 1.1+1.3-1.6;
private static final double key2 = 123321;
...
map.get(key1);
would be all good, however
map.put(1.1+2.3, value);
...
map.get(5.0 - 1.6);
would be dangerous

Can two keys having different hashCode be a part of same bucket in HashMap in Java?

I have a HashMap. There are 16 buckets in it (by default). Now, is it possible for two keys with different hashCodes to end up in the same bucket? Or is a new bucket always created for a different hashCode, and this is how the HashMap grows its bucket count?
I have read many posts, but only confused myself.
Yes, it is possible. The number of buckets is much smaller than the number of possible hashCodes: the bucket count is proportional to the number of entries in the HashMap, while the number of possible hashCodes is the number of possible int values, which is much larger. The final mapping of a hashCode to a bucket is therefore done by some modulus-like operation, so multiple hashCodes may be mapped to the same bucket; for example, with 16 buckets, both the hashCodes 1 and 17 are mapped to the same bucket. (Note that by hashCode I don't mean the value returned by the hashCode method directly, since HashMap applies an additional function to that hashCode in order to improve the distribution of the hash codes.)
That's why hashCode alone is not enough to determine if the key we are looking for is present in the map - we have to use equals as well.
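A tiny illustration of that bucket selection (ignoring HashMap's extra scrambling step for simplicity):
public class BucketIndexDemo {
    public static void main(String[] args) {
        int buckets = 16;
        // With a power-of-two table, the bucket index is just the low bits of the hash.
        System.out.println(1 & (buckets - 1));    // 1
        System.out.println(17 & (buckets - 1));   // 1 -> different hashCodes, same bucket
    }
}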
Taken from How HashMap works in Java:
Since the internal array of HashMap is of fixed size, and if you keep storing objects, at some point of time hash function will return same bucket location for two different keys, this is called collision in HashMap. In this case, a linked list is formed at that bucket location and a new entry is stored as next node.
And then, when we want to get that object from the list, we need equals():
If we try to retrieve an object from this linked list, we need an extra check to find the correct value; this is done by the equals() method. Since each node contains an entry, HashMap keeps comparing the entry's key object with the passed key using equals(), and when it returns true, the Map returns the corresponding value.
hashCode() returns an integer in Java, so you have to map the integer range onto the bucket count. Since you are mapping from a bigger set to a smaller set, you will always have collisions.
If you look at HashMap source code you will find following method to map int to bucket length.
static int indexFor(int h, int length) {
return h & (length-1);
}
The hash code is preprocessed to produce uniform distribution using:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits. Note: Null keys always map to hash 0, thus index 0.
HashMap source

What hash function is better?

I am writing my own implementation of HashMap in Java. I use open addressing for collision resolution. For better key distribution I want to use a nice hash function on the int hashCode of the key. I don't know which hash function is best for this.
public int getIndex(K key) { return hash(key.hashCode()) % capacity; }
I need a hash function for the hashCode of the key.
Any hash that distributes the values you're expecting to receive evenly is a good hash function.
Your goal is to maximize performance (well, maximize performance while maintaining correctness). The primary concern there is to minimize bucket collisions. This means that the ideal hash is tailored to your input data - if you know what you'll receive, you can choose the hash the produces a minimal number of collisions and maybe even a cache-optimal access pattern.
However, that's not usually a realistic option, so you just choose a hash whose output is unbiased and unpredictable (one that behaves like a pseudorandom number generator, but deterministic). Some such functions are the "murmur" hash family.
The main problem with using % capacity is that it can return negative as well as positive values (Java's % takes the sign of the dividend, and hash codes can be negative).
HashMap avoids this issue by using a power of 2 and uses the following approach
public int getIndex(K key) { return hash(key.hashCode()) & (capacity-1); }
If the capacity is not a power of 2, you can ignore the high bit (which is often not so random anyway)
public int getIndex(K key) { return (hash(key.hashCode()) & 0x7FFFFFFF) % capacity; }
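A quick check of the sign issue these two variants avoid (the numbers are arbitrary):
public class NegativeHashDemo {
    public static void main(String[] args) {
        int h = -17, capacity = 16;
        // Java's % keeps the sign of the dividend, so a negative hash yields an invalid index.
        System.out.println(h % capacity);                  // -1
        System.out.println((h & 0x7FFFFFFF) % capacity);   // 15
        System.out.println(h & (capacity - 1));            // 15 (power-of-two mask)
    }
}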
The hash function actually used can matter. HashMap uses the following
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
I would use this, unless you have a good reason not to. E.g. for security reasons, if you have a service which could be the subject of a denial of service attack, you will want to use a different hash to prevent a malicious user from turning your HashMap into a LinkedList. Unfortunately you then have to use a different hashCode() as well, because an attacker can construct a long list of Strings whose underlying hash codes already collide, so scrambling the hash afterwards is too late.
Here is a list of strings which all have a hashCode() of 0; there is nothing a hash() function can do about that.
Why doesn't String's hashCode() cache 0?
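To see how cheap such collisions are to manufacture, here is a small sketch of my own (it relies only on the documented String hash formula: "Aa" and "BB" both hash to 2112, so concatenating n such blocks yields 2^n distinct strings that all collide):
import java.util.ArrayList;
import java.util.List;

public class CollidingStringsDemo {
    public static void main(String[] args) {
        List<String> colliding = new ArrayList<>();
        colliding.add("");
        for (int i = 0; i < 10; i++) {                 // 2^10 = 1024 strings after 10 rounds
            List<String> next = new ArrayList<>();
            for (String s : colliding) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            colliding = next;
        }
        long distinctHashes = colliding.stream().map(String::hashCode).distinct().count();
        System.out.println(colliding.size() + " strings, " + distinctHashes + " distinct hash code(s)");
    }
}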

Java HashMap detect collision

Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where lots of collisions can take place? Of course, if you override hashCode for an object and simply return a constant value, collisions are sure to occur; I'm not talking about that. I want to know in what situations, other than the one previously mentioned, huge numbers of collisions occur without modifying the default hashCode implementation.
I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & (N - 1) (faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in the Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits which were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
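A quick look at that ugly case (small whole-number floats really do share all their low-order bits):
public class FloatLowBitsDemo {
    public static void main(String[] args) {
        for (float f = 1.0f; f <= 4.0f; f += 1.0f) {
            // Float.hashCode() is essentially floatToIntBits(); note the zero low-order bits.
            System.out.printf("%.1f -> 0x%08X%n", f, Float.floatToIntBits(f));
        }
        // 1.0 -> 0x3F800000
        // 2.0 -> 0x40000000
        // 3.0 -> 0x40400000
        // 4.0 -> 0x40800000
    }
}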
Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously any hash where there are more than 2^32 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
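A quick check of the values listed above:
public class LongHashCollisionCheck {
    public static void main(String[] args) {
        // The high and low 32-bit halves cancel under XOR, so each of these hashes to 0.
        System.out.println(Long.valueOf(0L).hashCode());               // 0
        System.out.println(Long.valueOf(1L | (1L << 32)).hashCode());  // 0
        System.out.println(Long.valueOf(2L | (2L << 32)).hashCode());  // 0
    }
}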
The other two answers here are good IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a big number of objects from your class, put them in the particular HashMap implementation as keys, and test CPU and memory load. 1 or 2 million entries are a good number to measure, but you get the best results if you test with your anticipated Map sizes.
I recently looked at a class whose hashing function I doubted, so I decided to fill a HashMap with random objects of that type and test the number of collisions. I tested two hashCode() implementations of the class under investigation. To do so, I wrote in Groovy the class you see at the bottom, extending the OpenJDK implementation of HashMap to count the number of collisions in the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash but collisions in the array holding the entries. The array index is calculated as hash & (length-1), which means that the smaller this array is, the more collisions you get. And the size of this array depends on the initialCapacity and loadFactor of the HashMap (it can increase as put() adds more data).
In the end, though, I concluded that looking at these numbers makes little sense. The fact that HashMap is slower with a bad hashCode() method means that by simply benchmarking insertion and retrieval of data from the Map you effectively learn which hashCode() implementation is better.
public class TestHashMap extends HashMap {
    public TestHashMap(int size) {
        super(size);
    }
    public TestHashMap() {
        super();
    }
    public int countCollidingEntries() {
        def fs = this.getClass().getSuperclass().getDeclaredFields();
        def table;
        def count = 0;
        for (java.lang.reflect.Field field : fs) {
            if (field.getName() == "table") {
                field.setAccessible(true);
                table = field.get(super);
                break;
            }
        }
        for (Object e : table) {
            if (e != null) {
                while (e.next != null) {
                    count++
                    e = e.next;
                }
            }
        }
        return count;
    }
}

Bad idea to use String key in HashMap?

I understand that the String class' hashCode() method is not guaranteed to generate unique hash codes for distinct String-s. I see a lot of usage of putting String keys into HashMap-s (using the default String hashCode() method). A lot of this usage could result in significant application issues if a map put displaced a HashMap entry that was previously put onto the map with a truly distinct String key.
What are the odds that you will run into the scenario where String.hashCode() returns the same value for distinct String-s? How do developers work around this issue when the key is a String?
Developers do not have to work around the issue of hash collisions in HashMap in order to achieve program correctness.
There are a couple of key things to understand here:
Collisions are an inherent feature of hashing, and they have to be. The number of possible values (Strings in your case, but it applies to other types as well) is vastly bigger than the range of integers.
Every usage of hashing has a way to handle collisions, and the Java Collections (including HashMap) is no exception.
Hashing is not involved in equality testing. It is true that equal objects must have equal hashcodes, but the reverse is not true: many values will have the same hashcode. So don't try using a hashcode comparison as a substitute for equality. Collections don't. They use hashing to select a sub-collection (called a bucket in the Java Collections world), but they use .equals() to actually check for equality.
Not only do you not have to worry about collisions causing incorrect results in a collection, but for most applications, you also *usually* don't have to worry about performance - Java hashed Collections do a pretty good job of managing hashcodes.
Better yet, for the case you asked about (Strings as keys), you don't even have to worry about the hashcodes themselves, because Java's String class generates a pretty good hashcode. So do most of the supplied Java classes.
Some more detail, if you want it:
The way hashing works (in particular, in the case of hashed collections like Java's HashMap, which is what you asked about) is this:
The HashMap stores the values you give it in a collection of sub-collections, called buckets. These are actually implemented as linked lists. There are a limited number of these: iirc, 16 to start by default, and the number increases as you put more items into the map. There should always be more buckets than values. To provide one example, using the defaults, if you add 100 entries to a HashMap, there will be 256 buckets.
Every value which can be used as a key in a map must be able to generate an integer value, called the hashcode.
The HashMap uses this hashcode to select a bucket. Ultimately, this means taking the integer value modulo the number of buckets, but before that, Java's HashMap has an internal method (called hash()), which tweaks the hashcode to reduce some known sources of clumping.
When looking up a value, the HashMap selects the bucket, and then searches for the individual element by a linear search of the linked list, using .equals().
So: you don't have to work around collisions for correctness, and you usually don't have to worry about them for performance, and if you're using native Java classes (like String), you don't have to worry about generating the hashcode values either.
In the case where you do have to write your own hashcode method (which means you've written a class with a compound value, like a first name/last name pair), things get slightly more complicated. It's quite possible to get it wrong here, but it's not rocket science. First, know this: the only thing you must do in order to assure correctness is to assure that equal objects yield equal hashcodes. So if you write a hashcode() method for your class, you must also write an equals() method, and you must examine the same values in each.
It is possible to write a hashcode() method which is bad but correct, by which I mean that it would satisfy the "equal objects must yield equal hashcodes" constraint, but still perform very poorly, by having a lot of collisions.
The canonical degenerate worst case of this would be to write a method which simply returns a constant value (e.g., 3) for all cases. This would mean that every value would be hashed into the same bucket.
It would still work, but performance would degrade to that of a linked list.
Obviously, you won't write such a terrible hashcode() method. If you're using a decent IDE, it's capable of generating one for you. Since StackOverflow loves code, here's the code for the firstname/lastname class above.
public class SimpleName {
    private String firstName;
    private String lastName;

    public SimpleName(String firstName, String lastName) {
        super();
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((firstName == null) ? 0 : firstName.hashCode());
        result = prime * result
                + ((lastName == null) ? 0 : lastName.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SimpleName other = (SimpleName) obj;
        if (firstName == null) {
            if (other.firstName != null)
                return false;
        } else if (!firstName.equals(other.firstName))
            return false;
        if (lastName == null) {
            if (other.lastName != null)
                return false;
        } else if (!lastName.equals(other.lastName))
            return false;
        return true;
    }
}
I direct you to the answer here. While it is not a bad idea to use strings (@CPerkins explained why, perfectly), storing the values in a HashMap with integer keys is better, since it is generally quicker (although unnoticeably) and has a lower chance (actually, no chance) of collisions.
See this chart of collisions using 216553 keys in each case (stolen from this post, reformatted for our discussion):
Hash               Lowercase       Random UUID     Numbers
=================  ==============  ==============  ==============
Murmur             145 ns          259 ns          92 ns
                   6 collis        5 collis        0 collis
FNV-1a             152 ns          504 ns          86 ns
                   4 collis        4 collis        0 collis
FNV-1              184 ns          730 ns          92 ns
                   1 collis        5 collis        0 collis*
DBJ2a              158 ns          443 ns          91 ns
                   5 collis        6 collis        0 collis***
DJB2               156 ns          437 ns          93 ns
                   7 collis        6 collis        0 collis***
SDBM               148 ns          484 ns          90 ns
                   4 collis        6 collis        0 collis**
CRC32              250 ns          946 ns          130 ns
                   2 collis        0 collis        0 collis
Avg Time per key   0.8 ps          2.5 ps          0.44 ps
Collisions (%)     0.002%          0.002%          0%
Of course, the number of integers is limited to 2^32, whereas there is no limit to the number of strings (and there is no theoretical limit to the number of keys that can be stored in a HashMap). If you use a long (or even a float), collisions will be inevitable, and therefore it is no "better" than a string. However, even despite hash collisions, put() and get() will always put/get the correct key-value pair (see the edit below).
In the end, it really doesn't matter, so use whatever is more convenient. But if convenience makes no difference, and you do not intend to have more than 2^32 entries, I suggest you use ints as keys.
EDIT
While the above is definitely true, NEVER use "StringKey".hashCode() to generate a key in place of the original String key for performance reasons - two different strings can have the same hashCode, causing overwriting in your put() calls. Java's implementation of HashMap is smart enough to handle strings (any type of key, actually) with the same hashCode automatically, so it is wise to let Java handle these things for you.
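A tiny demonstration of why that substitution goes wrong, using the well-known colliding pair "Aa" and "BB" (both hash to 2112):
import java.util.HashMap;
import java.util.Map;

public class HashAsKeyPitfall {
    public static void main(String[] args) {
        Map<Integer, String> byHash = new HashMap<>();
        byHash.put("Aa".hashCode(), "value for Aa");
        byHash.put("BB".hashCode(), "value for BB");   // same int key -> silently overwrites
        System.out.println(byHash.size());             // 1

        Map<String, String> byString = new HashMap<>();
        byString.put("Aa", "value for Aa");
        byString.put("BB", "value for BB");            // same hash, but equals() keeps them apart
        System.out.println(byString.size());           // 2
    }
}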
I strongly suspect that the HashMap.put method does not determine whether the key is the same by just looking at String.hashCode.
There is definitely going to be a chance of a hash collision, so one would expect that the String.equals method will also be called to be sure that the Strings are truly equal, if there is indeed a case where the two Strings have the same value returned from hashCode.
Therefore, the new key String would only be judged to be the same key String as one that is already in the HashMap if and only if the value returned by hashCode is equal, and the equals method returns true.
Also to add, this thought would also be true for classes other than String, as the Object class itself already has the hashCode and equals methods.
Edit
So, to answer the question, no, it would not be a bad idea to use a String for a key to a HashMap.
This is not an issue, it's just how hashtables work. It's provably impossible to have distinct hashcodes for all distinct strings, because there are far more distinct strings than integers.
As others have written, hash collisions are resolved via the equals() method. The only problem this can cause is degeneration of the hashtable, leading to bad performance. That's why Java's HashMap has a load factor, a ratio between buckets and inserted elements which, when exceeded, will cause rehashing of the table with twice the number of buckets.
This generally works very well, but only if the hash function is good, i.e. does not result in more than the statistically expected number of collisions for your particular input set. String.hashCode() is good in this regard, but this was not always so. Allegedly, prior to Java 1.2 it only included every n'th character. This was faster, but caused predictable collisions for all Strings sharing those n'th characters - very bad if you're unlucky enough to have such regular input, or if someone wants to mount a DoS attack on your app.
You are talking about hash collisions. Hash collisions are an issue regardless of the type being hashCode'd. All classes that use hashCode (e.g. HashMap) handle hash collisions just fine. For example, HashMap can store multiple objects per bucket.
Don't worry about it unless you are calling hashCode yourself. Hash collisions, though rare, don't break anything.
