Hash Collision or Hashing Collision in HashMap is not a new topic and I've come across several blogs and discussion boards explaining how to produce a Hash Collision or how to avoid it, in ways that are either ambiguous or overly detailed. I recently came across this question in an interview. I had a lot of things to explain, but I think it was really hard to give a precise explanation. Sorry if my questions are repeated here; please route me to the precise answer:
What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?
What exactly causes Hash Collision - the bad definition of custom class' hashCode() method, OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone, OR is it not up to the developers and many popular java libraries also have classes which can cause Hash Collision?
Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?
Does Java generate or at least try to generate unique hashCode per class during object initiation? If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?
I'll be grateful if you could please share your answers to one or all of these questions.
What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?
It's a feature. It arises out of the nature of a hashCode: a mapping from a large value space to a much smaller value space. There are going to be collisions, by design and intent.
What exactly causes Hash Collision - the bad definition of custom class' hashCode() method,
A bad design can make it worse, but it is endemic in the notion.
OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone,
No.
OR is it not up to the developers and many popular java libraries also have classes which can cause Hash Collision?
This doesn't really make sense. Hashes are bound to collide sooner or later, and poor algorithms can make it sooner. That's about it.
Does anything go wrong or unexpected when Hash Collision happens?
Not if the hash table is competently written. A hash collision only means that the hashCode is not unique, which puts you into calling equals(), and the more duplicates there are the worse the performance.
I mean is there any reason why we should avoid Hash Collision?
You have to trade off ease of computation against spread of values. There is no single black and white answer.
Does Java generate or at least try to generate unique hashCode per class during object initiation?
No. 'Unique hash code' is a contradiction in terms.
If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?
The question is meaningless. If you're using String you don't have any choice about the hashing algorithm, and you are also using a class whose hashCode has been slaved over by experts for twenty or more years.
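For reference, the algorithm those experts settled on is the one documented for String.hashCode(): s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. A minimal sketch of a hand-rolled version next to the built-in value (the class name is invented for the demo):

public class StringHashDemo {
    // Hand-rolled version of the documented String.hashCode() formula:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hashing";
        System.out.println(s.hashCode());   // built-in value
        System.out.println(manualHash(s));  // same value
    }
}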
Actually, I think hash collisions are normal. Let's talk through a case: we have 1,000,000 big numbers (a set S of values x, where each x can be up to 2^64), and we want to map this set S into the range [0, 1000000).
But how? Use a hash!
Define a hash function f(x) = x mod 1000000. Now every x in S is mapped into [0, 1000000), but you will find that many numbers in S map to the same value. For example, every number of the form k * 1000000 + y lands on y, because (k * 1000000 + y) % 1000000 = y. So this is a hash collision.
And how do we deal with collisions? In the case above, it is very difficult to eliminate them, because collisions are a matter of probability. We can choose a more elaborate, better hash function, but we can never say for certain that we have eliminated collisions. We should make the effort to find a better hash function to reduce collisions, because collisions increase the time it costs to find something via the hash.
Broadly, there are two ways to deal with hash collisions. A linked list (separate chaining) is the more direct way: if two numbers above get the same value after the hash function, we attach a linked list to that bucket, and every colliding value is put on that bucket's list. The other way (open addressing) is to simply find a new position for the later number: if the number 1000005 has taken position 5, and 2000005 also gets value 5, it cannot be located at position 5, so it moves ahead until it finds an empty position to take.
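To make the chaining idea concrete, here is a rough sketch using the same mod-1000000 hash; the class names are invented for illustration, and a real HashMap is far more sophisticated:

import java.util.LinkedList;

// Illustrative only: a tiny fixed-size table using f(x) = x mod SIZE
// with a linked list per bucket to hold colliding values.
class SimpleChainedTable {
    private static final int SIZE = 1_000_000;

    @SuppressWarnings("unchecked")
    private final LinkedList<Long>[] buckets = new LinkedList[SIZE];

    private int indexOf(long x) {
        return (int) (x % SIZE);          // the hash function from the example
    }

    void add(long x) {
        int i = indexOf(x);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        buckets[i].add(x);                // colliding values simply share the bucket's list
    }

    boolean contains(long x) {
        int i = indexOf(x);
        return buckets[i] != null && buckets[i].contains(x);
    }
}

public class ChainingDemo {
    public static void main(String[] args) {
        SimpleChainedTable table = new SimpleChainedTable();
        table.add(1_000_005L);            // both hash to bucket 5...
        table.add(2_000_005L);            // ...so they end up on the same list
        System.out.println(table.contains(1_000_005L)); // true
        System.out.println(table.contains(2_000_005L)); // true
    }
}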
For the last question: Does Java generate or at least try to generate unique hashCode per class during object initiation?
The hashCode of Object is typically implemented by converting the internal address of the object into an integer, so in practice you can usually expect different objects to have different hash codes if you use Object's hashCode() (though this is not guaranteed by the specification).
What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?
A hash collision is exactly that: two objects whose hashCode values collide...
What exactly causes Hash Collision - the bad definition of custom class' hashCode() method, OR to leave the equals() method
un-overridden while imperfectly overriding the hashCode() method
alone, OR is it not up to the developers and many popular java
libraries also have classes which can cause Hash Collision?
No, collisions may happen because they are governed by mathematical probability, and in such cases the birthday paradox is the best way to explain why they occur sooner than you might expect.
Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?
No. The String class in Java is a very well developed class, yet you don't need to search very hard to find a collision (check the hashCode of the Strings "Aa" and "BB" -> both hash to 2112).
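You can check that collision yourself; the printed values assume the standard, documented String.hashCode() algorithm:

public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());    // 2112
        System.out.println("BB".hashCode());    // 2112
        System.out.println("Aa".equals("BB"));  // false: same hash, different strings
    }
}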
to summarize:
a hashCode collision is harmless if you know what it is for and why it is not the same as an ID used to prove equality
What exactly is Hash Collision - is it a feature, or common phenomenon
which is mistakenly done but good to avoid?
Neither... both... It is a common phenomenon, not something done by mistake, and it is good to avoid where possible.
What exactly causes Hash Collision - the bad definition of custom
class' hashCode() method, OR to leave the equals() method
un-overridden while imperfectly overriding the hashCode() method
alone, OR is it not up to the developers and many popular java
libraries also have classes which can cause Hash Collision?
By poorly designing your hashCode() method, you can produce too many collisions. Leaving your equals() method un-overridden should not directly affect the number of collisions. Many popular Java libraries have classes that can cause collisions (nearly all classes, actually).
Does anything go wrong or unexpected when Hash Collision happens? I
mean is there any reason why we should avoid Hash Collision?
There is degradation in performance, that is a reason to avoid them, but the program should continue to work.
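A rough sketch of that degradation, using an invented key class whose hashCode() deliberately returns a constant so every key collides; the map still behaves correctly, it just has to do far more equals() work per lookup:

import java.util.HashMap;
import java.util.Map;

public class CollisionCostDemo {
    // Deliberately bad: every key collides into the same bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // constant -> all keys collide
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 20_000; i++) {
            map.put(new BadKey(i), i);
        }
        long start = System.nanoTime();
        // Still returns the right value; with every key in one bucket the
        // lookup just has to wade through many candidates.
        System.out.println(map.get(new BadKey(19_999)));
        System.out.printf("lookup took %d ns%n", System.nanoTime() - start);
    }
}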
Does Java generate or at least try to generate unique hashCode per
class during object initiation? If no, is it right to rely on Java
alone to ensure that my program would not run into Hash Collision for
JRE classes? If not right, then how to avoid hash collision for
hashmaps with final classes like String as key?
Java doesn't try to generate a unique hash code during object initialization, but it has a default implementation of hashCode() and equals(). The default implementation only tells you whether two object references point to the same instance; it doesn't rely on the content (field values) of the objects. That is why the String class has its own implementation.
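A small sketch of that difference, with an invented class that keeps the default identity-based implementation, next to String's content-based one:

public class DefaultVsStringHash {
    static final class Box {           // no equals/hashCode overrides: identity semantics
        final String value;
        Box(String value) { this.value = value; }
    }

    public static void main(String[] args) {
        Box b1 = new Box("hello");
        Box b2 = new Box("hello");
        // Default implementation: two distinct instances, almost certainly different hash codes.
        System.out.println(b1.hashCode() == b2.hashCode()); // usually false
        System.out.println(b1.equals(b2));                  // false: identity comparison

        // String overrides both: equal content means equal hash code.
        String s1 = new String("hello");
        String s2 = new String("hello");
        System.out.println(s1.hashCode() == s2.hashCode()); // true
        System.out.println(s1.equals(s2));                  // true
    }
}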
Related
What I understand about the hashCode method for objects in Java: it is used to calculate the hash code of an object, which in turn is used to calculate the index / bucket location of the object in a hashed data structure like HashMap.
So would it be correct to say that a class that is not going to be used with a hashed data structure doesn't need to have hashCode() implemented? In other words, is overriding the equals() method alone enough for non-hashed data structures?
Also, please correct me if my assumption is wrong.
In theory, you are correct: if you know that your object will never be used in any way that requires the hash code, then not implementing hashCode will not cause anything bad to happen.
In practice there are reasons not to rely on this fact:
Code changes, and objects that were initially planned to only ever exist in non-hashed structures get put into sets or used as keys to maps, because requirements change. If you haven't implemented hashCode by then, things can go bad.
If you implement equals and you don't implement hashCode, then you are almost certainly breaking the contract of hashCode that requires it to be consistent with equals. If your code breaks a contract that other code depends on, then that other code can silently fail in unexpected/weird ways.
In order to avoid mistakes, it's usually best to let your IDE generate equals anyway, and if you do that, generating the appropriate hashCode at the same time is no extra effort.
Note that all of this assumes you even want to implement equals: If you don't care about equality of an object entirely (which is actually very common, not every type needs a specific equality definition), then you can just leave equals and hashCode out of your code and it's all conforming ('though your type might not match the "intuitive" equality definition of your type then).
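For completeness, here is a minimal sketch of an equals/hashCode pair that keeps the contract, roughly what an IDE would generate (the Point class is just an example):

import java.util.Objects;

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Must use the same fields as equals() so equal objects get equal hash codes.
        return Objects.hash(x, y);
    }
}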
So I am working on some data-structure, whose operations tend to generate lots of different instances of a certain type. The datasets can be large enough to potentially yield millions of such objects. It is crucial that I "memoize" them, since there are recurring patterns in these calculations.
Typically the way memoization is done is that you simply have a set (say, a HashMap with its keys as its values) of all instances ever created. Any time an operation would return a result, instead the set is searched for an existing, identical object. If the object is found, it is returned instead, and the result we looked up with instantly becomes garbage.
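As a sketch of the canonicalizing pattern just described, using a plain HashMap (the Memoizer name and shape are invented for illustration; the HashMap shortcomings discussed below still apply):

import java.util.HashMap;
import java.util.Map;

// Illustrative canonicalizing cache: every structurally-equal instance
// maps to one shared canonical instance.
class Memoizer<T> {
    private final Map<T, T> canonical = new HashMap<>();

    T intern(T candidate) {
        // If an equal object already exists, return it; otherwise remember this one.
        T existing = canonical.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }
}

class MemoizerDemo {
    public static void main(String[] args) {
        Memoizer<String> memo = new Memoizer<>();
        String a = memo.intern(new String("abc"));
        String b = memo.intern(new String("abc"));
        System.out.println(a == b); // true: the second lookup returned the first instance
    }
}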
Now, this is straight-forward enough. However, HashMap is not a perfect fit for this use case. There are two points I will address here:
Lack of support for custom equality and hashing function
Lack of support for "equivalent key" lookup.
The first has already been discussed (question 5453226), though not sufficiently for my purposes; solutions in the form of "just wrap your key types with another object" are a non-starter. If the key is a relatively small object (e.g. an array or a string of small size) the overhead cost of these wrappers can be nearly 2X.
To illustrate the second point, let's say the data type I'd like to memoize is a (typically small) int[]. Suppose I have made an operation and it has yielded a result, but it is not exactly in an int[]; rather, it consists of an int[] x and, separately, a length int length. What I would like to do now is to look up in the set an array that is equal (as a sequence of ints) to the data in x[0..length-1]. In principle, there should be no problem accomplishing this, provided that the user supplies equality and hashing predicates that match the ones used for the key type itself. In principle, the "compatible key" (x and length here) doesn't even have to be materialized as an object.
The lack of "compatible key" lookups may cause to program to create temporary key objects that are likely to be thrown away afterwards.
Problem 1 prevents me from using something like int[] as a key type in a Map, since it doesn't have the hashing and equality functions that I want. Further, I actually want to use shallow / reference-only comparisons and hashing on these objects once I'm past the memoization step, since then x is an identical object to y iff x == y. Here we have the same key type, but different desired equality/hashing behavior.
In C++, the functionality in (1) is offered out of the box by most hash-table based containers. (2) is given by some container types, for example the associative container templates in the Boost library.
A third function I'd be interested in (less important) is the insert_check/insert_commit idea, where we first check whether a matching key exists, and we also get some implementation-defined marker back (e.g. bucket index). Then, if we do create a new key and want to insert it, we give back the marker and it is inserted at the right place in the data structure. There is no reason why the hash of the same key should be computed twice.
If the compiler (or JIT) is clever enough to avoid double lookup in this scenario, that is a preferable solution - it's easier not to have to worry about it. I just never actually tested if it is.
Are there any libraries known to offer any of these capabilities in Java? Do I have to roll my own? Am I missing something?
I have two simple wrapper classes around an Integer field, where I had to override equals() and hashCode(). In the end, they both use the same algorithm for hashCode(), so if the Integer field is the same, the hash codes collide.
Since the Objects are different types does this matter, or should I only care if I expect to mix them as keys in the same HashMap?
hashCode() being equal for two objects says "there's a chance these objects are equal, take a closer look by calling equals()". As long as the equals() methods for those classes are correct, the hash codes being the same is not a problem.
The general rule for hashCode() is that if two objects are equal, their hash codes should also be equal. Note that the rule is not "if two objects have the same hash code, then they should be equal."
If you are likely to have a hash map with objects of both types with the same values, then that is obviously going to be a potential performance problem. HashMap and the like don't look at the actual runtime class - and indeed there isn't a standard way to tell whether two objects of different classes can be equal (for instance, Lists with the same values in the same order generated by ArrayList and Arrays.asList should compare equal). For HashMap, I'm guessing the hit won't be too bad, but it could be worse for, say, a probing implementation where there is a significant gain from getting a hit on first inspection.
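A minimal sketch of the situation in the question, with invented wrapper types whose hash codes collide but which never compare equal, so a HashMap still keeps them apart:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CrossTypeCollisionDemo {
    static final class UserId {
        final Integer value;
        UserId(Integer value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof UserId && Objects.equals(((UserId) o).value, value);
        }
        @Override public int hashCode() { return value.hashCode(); }
    }

    static final class OrderId {
        final Integer value;
        OrderId(Integer value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof OrderId && Objects.equals(((OrderId) o).value, value);
        }
        @Override public int hashCode() { return value.hashCode(); } // same algorithm: collides with UserId
    }

    public static void main(String[] args) {
        Map<Object, String> map = new HashMap<>();
        map.put(new UserId(7), "user");
        map.put(new OrderId(7), "order");            // same hash code, different type, not equal
        System.out.println(map.get(new UserId(7)));  // user
        System.out.println(map.get(new OrderId(7))); // order
    }
}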
Let Abstract be an abstract class, and A1, A2, ..., An concrete classes that inherit from Abstract. Each Ai has a list of Abstract and a pre-defined set of primitive types known at compile time; let's assume we have a hash function for them, and there are no 'loops' in the structure of each concrete element.
Two elements e1 and e2 are identical if they have the same values for the predefined primitives, and if for each Abstract in e1's list there exists an Abstract in e2's list that is identical to it (in other words, order is not important).
I am looking for a good hash heuristic for this kind of problem. It needn't be (and, as far as I know, can't be) a perfect hash function, but it should be good and easy to compute at run time.
I'll be glad if someone can give me some guidelines how to implement such a function, or direct me to an article that addresses this problem.
PS I am writing in Java, and I assume (correct me if I am wrong) the built in hash() won't be good enough for this problem.
EDIT:
the lists and primitives are fixed after construction, but are unknown at compile time.
If these lists can change after they are constructed, it would be a bad idea to base the hash function on them. Imagine if you stuck your object into a HashMap, and then changed part of it. You would no longer be able to locate it in the HashMap because its hashCode would be different.
You should only base the result of hashCode on immutable values. If you don't have any immutable values in your object, your best bet would probably be to simply use the basic Object.hashCode(), although you'll lose out on equality testing.
If these objects are immutable, however, then I recommend choosing some kind of sort order for your elements. Then you can compute a hash code across your lists, knowing that it will be the same even if the lists are in different orders, because you are sorting the values before hashing.
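One possible sketch of that idea, assuming the objects are immutable: hash the primitive fields, hash each child, sort the child hashes so order doesn't matter, and combine. The class and method names here are invented:

import java.util.Arrays;
import java.util.List;

// Illustrative only: an order-insensitive hash over a list of children,
// combined with the element's own primitive fields. The "no loops"
// assumption from the question keeps the recursion finite.
abstract class AbstractElement {
    abstract List<AbstractElement> children();
    abstract int primitivesHash();        // hash of the pre-defined primitive fields

    @Override
    public int hashCode() {
        int[] childHashes = children().stream()
                .mapToInt(AbstractElement::hashCode)
                .toArray();
        Arrays.sort(childHashes);         // sorting makes the result order-independent
        return 31 * primitivesHash() + Arrays.hashCode(childHashes);
    }
}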
Use Google Guava's utilities... Objects.hashCode() is great. Also, the source is available, and they have solved the problem you state, so you can take a look at their solution.
I have a Java HashMap whose keys are instances of java.lang.Object, that is: the keys are of different types. The hashCode values of two key objects of different types are likely to be the same when they contain identical variable values.
In order to improve the performance of the get method for my HashMap, I'm inclined to mix the name of the Java type into the hashCode methods of my key objects. I have not seen examples of this elsewhere, and so my this-might-be-wacky alarm went off. Do you think mixing the type into hashCode is a good idea? Should I mix in the class name, or the hashCode of the relevant Class object?
I wouldn't mix the type name in - but if you're controlling the hashCode algorithm already, why not just change it so that they won't clash? For example, if you're using the common "add and multiply" approach, you could start off with different base cases or use different multipliers.
Before you worry about this too much though, have you actually measured how often you're really getting collisions with real data? Is this definitely a problem, or are you just concerned that it might be a problem?
I think your this-might-be-wacky alarm should have gone off when you decided to have keys of different types. But let's assume this is a case where Object is really the way to go.
You should try it without mixing in the type name and stress test the performance if you find that this particular lookup is determined to be a hotspot in the system. Chances are the performance doesn't matter that much.
Like Jon implied, the performance of the hash map is improved by reducing collisions. Mixing in the type name is just as likely to increase collisions as it is to reduce them. To keep your hashmap in peak condition, you want the likelihood of any particular hashcode to be about the same as any other over the domain of valid key values. So the probability of a hashcode of 10 should be about the same as the probability of 100 or any other number. That way the hash table buckets fill evenly (in all likelihood). So whether you have an object of type A or type B should not matter; all that matters is the probability distribution of the hashcodes over all occurring key values.
Years later...
Apart from it being a premature optimization, it's not a bad idea and the overhead is tiny. Choy's recommendation to profile first is surely good in general, but sometimes a simple optimization takes much less time than the profiling. This seems to be such a case.
I'd use a different multiplier as already suggested and mix in getClass().hashCode().
Or maybe getClass().getName().hashCode(), as it stays consistent across JVM invocations, which might be helpful if you want a reproducible HashMap iteration order for easier debugging. Note that you should never rely on such reproducibility and that there are quite a few things that can destroy it.
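A sketch of that variant, with an invented key class:

public final class TypedKey {
    private final int value;

    public TypedKey(int value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TypedKey && ((TypedKey) o).value == value;
    }

    @Override
    public int hashCode() {
        // Mix in the class name's hash so keys of different types with the
        // same field values are less likely to land in the same bucket.
        // getName().hashCode() is stable across JVM runs; getClass().hashCode() is not.
        return 31 * getClass().getName().hashCode() + value;
    }
}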