I have a token class that uses object identity (as in equals just returns tokenA == tokenB). I'd like to use it in a TreeSet. This means I need to implement a comparison between two tokens that is compatible with reference equality. I don't care about the specific implementation, so long as it is consistent with equals and fulfills the contract (as per TreeSet: "Note that the ordering maintained by a set (whether or not an explicit comparator is provided) must be consistent with equals if it is to correctly implement the Set interface.")
Note: these tokens are created on multiple threads, and may be compared on different threads than they were created on.
What would be the best method to go about doing so?
Ideas I've tried:
Using System.identityHashCode - the problem with this is that it is not guaranteed that two different objects will always have different hash codes. And due to the birthday paradox, you only need about 77k tokens before a collision becomes more likely than not (assuming that System.identityHashCode is uniformly distributed over the entire 32-bit range, which may not be true...)
Using a comparator over the default Object.toString method for each token. Unfortunately, under the hood this just uses the hash code (same thing as above).
Using an int or long unique value (read: static counter -> instance variable). This bloats the size, and makes multithreading a pain (not to mention making object creation effectively single-threaded). An AtomicInteger / AtomicLong for the static counter helps somewhat, but it's the size bloat that's more annoying here. (See the sketch after this list.)
Using System.identityHashCode and a static disambiguation map for any collisions. This works, but is rather complex. Also, Java by default doesn't have a ConcurrentWeakValueHashMultiMap (isn't that a mouthful), which means that I've got to pull in an external dependency (or write my own - probably using something similar to this) to do so, or suffer a (slow) memory leak, or use finalizers (ugh). (And I don't know if anyone implements such a thing...)
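For concreteness, here is a minimal sketch of the counter idea under the stated trade-offs (the Token class name and field names are hypothetical, not from the question):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: each token carries a unique sequence number, so the
// ordering is total and compareTo() returns 0 exactly when both references
// are the same object (i.e. consistent with identity-based equals).
final class Token implements Comparable<Token> {
    private static final AtomicLong NEXT_ID = new AtomicLong();
    private final long id = NEXT_ID.getAndIncrement(); // the per-instance size bloat

    @Override
    public int compareTo(Token other) {
        return Long.compare(id, other.id);
    }
}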
By the way, I can't simply punt the problem and assume unique objects have unique hash codes. That's what I was doing, but the assertion fired in the comparator, and so I dug into it, and, lo and behold, on my machine the following:
import java.util.HashMap;
import java.util.Map;

public class Size {
    public static void main(String[] args) {
        // Map from identity hash code to the index of the object that produced it.
        Map<Integer, Integer> soFar = new HashMap<>();
        for (int i = 1; i <= 1_000_000; i++) {
            TokenA t = new TokenA();
            int ihc = System.identityHashCode(t);
            if (soFar.containsKey(ihc)) {
                System.out.println("Collision: " + ihc + " # object #" + soFar.get(ihc) + " & " + i);
                break;
            }
            soFar.put(ihc, i);
        }
    }
}

class TokenA {
}
prints
Collision: 2134400190 # object #62355 & 105842
So collisions definitely do exist.
So, any suggestions?
There is no magic:
Here is the problem: tokenA == tokenB compares identity; tokenA.equals(tokenB) compares whatever is defined in .equals() for that class, regardless of identity.
So two objects can have .equals() return true and not be the same object instance; they don't even have to be the same type or share a super type.
There are no shortcuts:
Implementing compareTo() means comparing whatever attributes of the objects you want to compare. You just have to write the code and make it do what you want, but compareTo() is probably not what you want. compareTo() is for comparisons: if your two things are not < or > each other in some meaningful way, then Comparable and Comparator<T> are not what you want.
Equals that is identity is simple:
@Override
public boolean equals(Object o)
{
    return this == o;
}
(This is exactly what Object.equals(Object) already does, so for pure identity semantics you do not need to override equals at all.)
Related
Had a discussion with an interviewer regarding the internal implementation of Java HashMaps and how it would behave if we override equals() but not the hashCode() method for an Employee<Emp_ID, Emp_Name> object.
I was told that the hashCode for two different objects would never be the same for the default Object.hashCode() implementation, unless we overrode hashCode() ourselves.
From what I remembered, I told him that the Java hashCode contract says that two different objects "may" have the same hashCode(), not that they "must".
According to my interviewer, the default Object.hashCode() never returns the same value for two different objects. Is this true?
Is it even remotely possible to write code that demonstrates this? From what I understand, Object.hashCode() can produce 2^30 unique values; how does one produce a collision, with such a low probability of collision, to demonstrate that two different objects can get the same hashCode() from the Object class's method?
Or is he right: with the default Object.hashCode() implementation, will we never have a collision, i.e. can two different objects never have the same hashCode? If so, why don't the many Java manuals say so explicitly?
How can I write some code to demonstrate this? Because by demonstrating this, I could also prove that a bucket in a HashMap can contain different hashCodes. (I tried to show him the debugger with the HashMap expanded, but he told me that this was just the logical implementation and not the internal algorithm.)
2^30 unique values sounds like a lot, but the birthday problem means we don't need many objects to get a collision.
The following program works for me in about a second and gives a collision between objects 196 and 121949. I suspect it will depend heavily on your system configuration, compiler version, etc.
As you can see from the implementation of the Hashable class, every one is guaranteed to be unique, and yet there are still collisions.
class HashCollider
{
    static class Hashable
    {
        private static int curr_id = 0;
        public final int id;

        Hashable()
        {
            id = curr_id++; // every instance gets a distinct id
        }
    }

    public static void main(String[] args)
    {
        final int NUM_OBJS = 200000; // birthday problem suggests this will be plenty
        Hashable[] objs = new Hashable[NUM_OBJS];
        for (int i = 0; i < NUM_OBJS; ++i) objs[i] = new Hashable();

        // Compare every pair; bail out at the first hash-code collision.
        for (int i = 0; i < NUM_OBJS; ++i)
        {
            for (int j = i + 1; j < NUM_OBJS; ++j)
            {
                if (objs[i].hashCode() == objs[j].hashCode())
                {
                    System.out.println("Objects with IDs " + objs[i].id
                            + " and " + objs[j].id + " collided.");
                    System.exit(0);
                }
            }
        }
        System.out.println("No collision");
    }
}
If you have a large enough heap (assuming 64 bit address space) and objects are small enough (the smallest object size on a 64 bit JVM is 8 bytes), then you will be able to represent more than 2^32 objects that are reachable at the same time. At that point, the objects' identity hashcodes cannot be unique.
However, you don't need a monstrous heap. If you create a large enough pool of objects (e.g. in a large array) and randomly delete and recreate them, it is (I think) guaranteed that you will get a hashcode collision ... if you continue doing this long enough.
The default algorithm for hashcode in older versions of Java is based on the address of the object when hashcode is first called. If the garbage collector moves an object, and another one is created at the original address of the first one, and identityHashCode is called, then the two objects will have the same identity hashcode.
The current (Java 8) default algorithm uses a PRNG. The "birthday paradox" formula will tell you the probability that one object's identity hashcode is the same as one or more of the others'.
The -XX:hashCode=n option that @BastianJ mentioned has the following behavior:
hashCode == 0: Returns a freshly generated pseudo-random number
hashCode == 1: XORs the object address with a pseudo-random number that changes occasionally.
hashCode == 2: The hashCode is always 1! (Hence @BastianJ's "cheat" answer.)
hashCode == 3: The hashcode is an ascending sequence number.
hashCode == 4: The bottom 32 bits of the object address
hashCode >= 5: This is the default algorithm for Java 8. It uses Marsaglia's xor-shift PRNG with a thread-specific seed.
If you have downloaded the OpenJDK Java 8 source code, you will find the implementation in hotspot/src/share/vm/runtime/synchronizer.cpp. Look for the get_next_hash() function.
So that is another way to prove it. Show him the source code!
Use the Oracle JVM and set -XX:hashCode=2. If I remember correctly, this chooses the default implementation to be "constant 1". Just for the purpose of proving you're right.
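Assuming the flag is accepted by your HotSpot build (its availability varies across versions; treat that as an assumption), a minimal demo would be:

// Hypothetical demo: run with -XX:hashCode=2. If the flag takes effect, every
// identity hash code is the same constant, so this should print "true".
public class ConstantHash {
    public static void main(String[] args) {
        System.out.println(new Object().hashCode() == new Object().hashCode());
    }
}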
I have little to add to Michael's answer (+1) except a bit of code golfing and statistics.
The Wikipedia article on the Birthday problem that Michael linked to has a nice table of the number of events necessary to get a collision, with a desired probability, given a value space of a particular size. For example, Java's hashCode has 32 bits, giving a value space of 4 billion. To get a collision with a probability of 50%, about 77,000 events are necessary.
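For reference, that 77,000 figure follows from the standard birthday approximation for a 50% collision probability over N = 2^32 values:

n ≈ √(2N ln 2) = √(2 · 2^32 · ln 2) ≈ 77,163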
Here's a simple way to find two instances of Object that have the same hashCode:
static int findCollision() {
    Map<Integer, Object> map = new HashMap<>();
    Object n, o;
    do {
        n = new Object();
        o = map.put(n.hashCode(), n); // put returns the previous object with this hash, if any
    } while (o == null);
    assert n != o && n.hashCode() == o.hashCode();
    return map.size() + 1; // number of objects created before the collision
}
This returns the number of attempts it took to get a collision. I ran this a bunch of times and generated some statistics:
System.out.println(
    IntStream.generate(HashCollisions::findCollision)
             .limit(1000)
             .summaryStatistics());
IntSummaryStatistics{count=1000, sum=59023718, min=635, average=59023.718000, max=167347}
This seems quite in line with the numbers from the Wikipedia table. Incidentally, this took only about 10 seconds to run on my laptop, so this is far from a pathological case.
You were right in the first place, but it bears repeating: hash codes are not unique!
I have a project with many bean classes like ItemBean:
public class ItemBean
{
    private String name;
    private int id;

    getters/setters...
}
I wrote a custom equals method because two items should be treated as equal if they have the same id and name, regardless of whether they're the same object in memory or not. I'm now looking into writing a custom hashCode() function. I looked at other stackoverflow questions and this tutorial, but they seem overly general whereas I'm looking for best practices for simple bean classes.
I came up with this method:
Uses caching
Uses all attributes that are involved in the equals method of ItemBean
Uses the 17/31 'magical number' primes as described in the other stackoverflow question.
Implemented method:
private int cachedHashCode; // assumed field, not shown in the snippet; 0 means "not yet computed"

public final int hashCode()
{
    if (cachedHashCode == 0)
    {
        int result = 17;
        result = 31 * (result + id);
        // Note: if the computed hash happens to be 0, it is recomputed on every call.
        cachedHashCode = 31 * (result + name.hashCode());
    }
    return cachedHashCode;
}
Is it good practice to base your hashCode method like this on all the attributes of a class that make it unique? If not, what are the disadvantages of this method, and what are better alternatives? If one of my bean classes has 10 attributes instead of only 2, is XOR'ing ten attributes a costly operation that should be avoided or not?
From the JavaDoc of Object.hashCode():
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
This can be achieved by using all members in hashCode that are used in equals and vice-versa.
The algorithms described in What is a best practice of writing hash function in java? are worth following.
I would not worry about performance. ^ is a very basic operator that can be, and is, optimized by the JVM.
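As a concrete illustration (a sketch based on the question's description, not the asker's actual code; the equals shown is assumed from "same id and name"), java.util.Objects ties the hashCode fields to the equals fields since Java 7:

import java.util.Objects;

public class ItemBean {
    private String name;
    private int id;

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ItemBean)) return false;
        ItemBean other = (ItemBean) o;
        return id == other.id && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name); // same fields as equals, same spirit as the 17/31 recipe
    }
}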
I was reading Effective Java Item 9 and decided to run the example code myself. But it works slightly differently depending on how I insert a new object, and I don't understand what exactly is going on inside. The PhoneNumber class looks like:
public class PhoneNumber {
    private final short areaCode;
    private final short prefix;
    private final short lineNumber;

    public PhoneNumber(int areaCode, int prefix, int lineNumber) {
        this.areaCode = (short) areaCode;
        this.prefix = (short) prefix;
        this.lineNumber = (short) lineNumber;
    }

    @Override public boolean equals(Object o) {
        if (o == this) return true;
        if (!(o instanceof PhoneNumber)) return false;
        PhoneNumber pn = (PhoneNumber) o;
        return pn.lineNumber == lineNumber && pn.prefix == prefix && pn.areaCode == areaCode;
    }
}
Then, according to the book, and as happened when I tried it:
public static void main(String[] args) {
    HashMap<PhoneNumber, String> phoneBook = new HashMap<PhoneNumber, String>();
    phoneBook.put(new PhoneNumber(707, 867, 5309), "Jenny");
    System.out.println(phoneBook.get(new PhoneNumber(707, 867, 5309)));
}
This prints "null" and it's explained in the book because HashMap has an optimization that caches the hash code associated with each entry and doesn't check for object equality if the hash codes don't match. It makes sense to me. But when I do this:
public static void main(String[] args) {
    PhoneNumber p1 = new PhoneNumber(707, 867, 5309);
    phoneBook.put(p1, "Jenny");
    System.out.println(phoneBook.get(new PhoneNumber(707, 867, 5309)));
}
Now it returns "Jenny". Can you explain why it didn't fail in the second case?
The observed behaviour might depend on the Java version and vendor used to run the application: since the general contract of Object.hashCode() is violated, the result is implementation dependent.
A possible explanation (taking one possible implementation of HashMap):
The HashMap class in its internal implementation puts objects (keys) into different buckets based on their hashcode. When you query an element, or check whether a key is contained in the map, the proper bucket is first looked up based on the hashcode of the queried key. Inside the bucket, objects are checked sequentially, and inside a bucket only the equals() method is used to compare elements.
So if you do not override Object.hashcode(), it is nondeterministic whether 2 different objects produce default hashcodes that select the same bucket. If by chance they "point" to the same bucket, you will still be able to find the key if the equals() method says they are equal. If by chance they "point" to 2 different buckets, you will not find the key even if the equals() method says they are equal.
hashcode() must be overridden to be consistent with your overridden equals() method. Only then is the proper, expected and consistent working of HashMap guaranteed.
Read the javadoc of Object.hashcode() for the contract that you must not violate. The main point is that if equals() returns true for another object, the hashcode() method must return the same value for both of these objects.
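For the book's example, the usual fix (a sketch following the 17/31 recipe discussed above; the exact constants are a convention, not a requirement) is to derive hashCode from the same fields that equals compares:

@Override public int hashCode() {
    int result = 17;                  // arbitrary non-zero seed
    result = 31 * result + areaCode;  // same fields as equals, in order
    result = 31 * result + prefix;
    result = 31 * result + lineNumber;
    return result;
}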
Can you explain why it didn't fail in the second case?
In a nutshell, it is not guaranteed to fail. The two objects in the second example could end up having the same hash code (purely by coincidence or, more likely, due to compiler optimizations or due to how the default hashCode() works in your JVM). This would lead to the behaviour you describe.
For what it's worth, I cannot reproduce this behaviour with my compiler/JVM.
In your case, by coincidence, the JVM produced the same hashCode for both objects. When I ran your code, my JVM gave null in both cases. So your problem is caused by the JVM, not the code.
It is better to override hashCode() each and every time you override the equals() method.
I haven't read Effective Java, I read SCJP by Kathy Sierra. So if you need more details then you can read this book. It's nice.
Your last code snippet does not compile because you haven't declared phoneBook.
Both main methods should work exactly the same. There is a 1 in 16 chance that it will print Jenny, because a newly created HashMap has a default size of 16. In detail, that means that only the lower 4 bits of the hashCode are checked. If they are equal, the equals method is used.
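To make the "lower 4 bits" point concrete, here is a sketch of how a Java 8 HashMap with the default capacity of 16 picks a bucket (the XOR spreading step mirrors HashMap.hash(); the method name is mine):

static int bucketIndex(int hashCode) {
    int h = hashCode ^ (hashCode >>> 16); // fold high bits into low bits (HashMap.hash())
    return h & (16 - 1);                  // index = low 4 bits of the spread hash
}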
I ran the code below on HotSpot JDK 1.6 on Windows XP.
I ran it twice and I got the results below.
So basically it seems that Object.hashCode() also has conflicts?
It looks like it's not returning the memory address in the VM.
However, a comment in the JDK says the values should be distinct; can anyone explain?
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
@return a hash code value for this object.
@see java.lang.Object#equals(java.lang.Object)
@see java.util.Hashtable
This is the first result:
i,hashcode(): 361,9578500
i,hashcode(): 1886,9578500
conflict:1886, 361
i,hashcode(): 1905,14850080
i,hashcode(): 2185,14850080
conflict:2185, 1905
9998
This is the 2nd result:
i,hashcode(): 361,5462872
i,hashcode(): 1886,29705835
conflict:1887, 362
i,hashcode(): 1905,9949222
i,hashcode(): 2185,2081190
conflict:2186, 1906
9998
10000
My code:
@Test
public void testAddr()
{
    Set<Integer> s = new TreeSet<Integer>();
    Map<Integer, Integer> m = new TreeMap<Integer, Integer>();
    Set<Object> os = new HashSet<Object>();
    for (int i = 0; i < 10000; ++i)
    {
        Object o = new Object();
        os.add(o);
        Integer h = o.hashCode();
        if ((i == 361) || (i == 1886) || (i == 2185) || (i == 1905))
        {
            System.out.println("i,hashcode(): " + i + "," + h);
        }
        if (s.contains(h))
        {
            System.out.println("conflict:" + i + ", " + m.get(h));
        }
        else
        {
            s.add(h);
            m.put(h, i);
        }
    }
    System.out.println(s.size()); // number of distinct hash codes seen

    int c = 0;
    for (Object o : os)
    {
        c++;
    }
    System.out.println(c); // number of distinct objects retained
}
hashCode() is supposed to be used for placing objects in hash tables. The rule for hashCode is not that it should never generate conflicts, although that is a desirable property, but that equal objects must have equal hash codes. This does not preclude non-equal objects from having equal hash codes.
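To see how weak the uniqueness requirement really is: even a constant hashCode satisfies the contract, since equal objects trivially agree; it merely ruins hash-table performance. A legal (if awful) sketch:

@Override
public int hashCode() {
    return 42; // every instance collides; lookups degrade to scanning one bucket
}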
You have found a case where the default Object.hashCode() implementation does generate equal hash codes for non-equal objects. The hash code of an object is required not to change unless some field affecting that object's equality with others changes. One possible cause is that the garbage collector rearranged memory so that a later instantiation of o landed at the same location as an earlier instantiation of o: you allocated two objects o in the loop, and between the two allocations the garbage collector moved the old o out of its location in memory, so the new o was allocated there. Then, even though the hash code of the old o cannot change, the hash code of the new o is the address where it is stored in memory, which happens to equal the hash code of the old o.
It's an unfortunately common misinterpretation of the API docs. From a still-unfixed (1 vote) bug filed some time ago:
(spec) System.identityHashCode doc inadequate, Object.hashCode default implementation docs mislead
[...]
From Usenet discussions and Open Source Software it appears that
many, perhaps majority, of programmers take this to mean that the
default implementation, and hence System.identityHashCode, will
produce unique hashcodes.
The suggested implementation technique is not even appropriate to
modern handleless JVMs, and should go the same way as JVM Spec Chapter
9.
The qualification "As much as is reasonably practical," is, in
practice, insufficient to make clear that hashcodes are not, in
practice, distinct.
It is possible that a long-running program may create, call hashCode() upon, and abandon many billions of objects during the time that it runs. Thus, it would be mathematically impossible to ensure that once some object's hashCode returned a particular number, no other object would ever return that same number for the life of the program. Even if hashCode() somehow managed to return unique values for the first 4,294,967,296 objects, it would have no choice but to return an already-used value for the next one (since the previous call would have used the last remaining formerly-unused value).
The fact that hashCode() clearly cannot guarantee that hash values won't get reused for the life of the program does not mean it couldn't guarantee that hash codes won't get reused during the lifetime of the objects in question. Indeed, for some memory-management schemes, such a guarantee could be made relatively cheaply. For example, the 1984 Macintosh split the heap into two parts, one of which held fixed-sized object descriptors, and one of which held variable-sized object data. The object descriptors, once created, would never move; if any objects were deleted, the space used by their descriptors would get reused when new objects were created. Under such a scheme, the address of an object descriptor would represent a unique and unchanging representation of its identity for as long as the object existed, and could thus be used as a hashCode() value. Unfortunately, such schemes tend to have more overhead than some other approaches in which objects have no fixed address associated with them.
The comment does not say that it is distinct.
It says that it is distinct as much as is reasonably practical.
Apparently, you found a case where it wasn't practical.
Hash codes do not have to be unique, just consistent. That said, more often than not they are fairly unique.
In addition to your excerpt above Object has the following to say.
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
Object Doc
I stumbled across the source of AtomicInteger and realized that
new AtomicInteger(0).equals(new AtomicInteger(0))
evaluates to false.
Why is this? Is it some "defensive" design choice related to concurrency issues? If so, what could go wrong if it was implemented differently?
(I do realize I could use get and == instead.)
This is partly because an AtomicInteger is not a general purpose replacement for an Integer.
The java.util.concurrent.atomic package summary states:
Atomic classes are not general purpose replacements for
java.lang.Integer and related classes. They do not define methods
such as hashCode and compareTo. (Because atomic variables are
expected to be mutated, they are poor choices for hash table keys.)
hashCode is not overridden, and the same is the case with equals. This is in part due to a far larger rationale, discussed in the mailing-list archives, on whether AtomicInteger should extend Number or not.
One of the reasons why an AtomicXXX class is not a drop-in replacement for a primitive, and why it does not implement the Comparable interface, is that it is pointless to compare two instances of an AtomicXXX class in most scenarios. If two threads can access and mutate the value of an AtomicInteger, the comparison result is already invalid before you use it whenever a thread mutates one of the values. The same rationale holds for the equals method: the result of an equality test (one that depends on the value of the AtomicInteger) is only valid until a thread mutates one of the AtomicIntegers in question.
On the face of it, it seems like a simple omission, but maybe it does make some sense to just use the identity equals provided by Object.equals.
For instance:
AtomicInteger a = new AtomicInteger(0);
AtomicInteger b = new AtomicInteger(0);
assert a.equals(b);
seems reasonable, but b isn't really a; it is designed to be a mutable holder for a value and therefore can't really replace a in a program.
also:
assert a.equals(b);
assert a.hashCode() == b.hashCode();
should work, but what if b's value changes in between?
If this is the reason it's a shame it wasn't documented in the source for AtomicInteger.
As an aside: A nice feature might also have been to allow AtomicInteger to be equal to an Integer.
AtomicInteger a = new AtomicInteger(25);
if (a.equals(25)) {
    // woot
}
The trouble is that it would mean that, in order for equals to remain symmetric in this case, Integer would have to accept AtomicInteger in its equals too.
I would argue that because the point of an AtomicInteger is that operations can be done atomically, it would be hard to ensure that the two values are compared atomically, and because AtomicIntegers are generally counters, you'd get some odd behaviour.
So without making the equals method synchronised, you couldn't be sure that the value of the atomic integer hadn't changed by the time equals returned. However, as the whole point of an atomic integer is to avoid synchronisation, you'd end up with little benefit.
I suspect that comparing the values is a no-go since there's no way to do it atomically in a portable fashion (without locks, that is).
And if there's no atomicity then the variables could compare equal even they never contained the same value at the same time (e.g. if a changed from 0 to 1 at exactly the same time as b changed from 1 to 0).
AtomicInteger inherits from Object and not Integer, and it uses standard reference equality check.
If you google you will find this discussion of this exact case.
Imagine if equals were overridden and you put one in a HashMap and then changed the value. Bad things would happen :)
equals is not only used for equality but also to meet its contract with hashCode, i.e. in hash collections. The only safe approach for hash collections is for mutable objects not to be dependent on their contents; i.e. for mutable keys a HashMap is the same as using an IdentityMap. This way the hashCode, and whether two objects are equal, does not change when the key's contents change.
So new StringBuilder().equals(new StringBuilder()) is also false.
To compare the contents of two AtomicInteger, you need ai.get() == ai2.get() or ai.intValue() == ai2.intValue()
Lets say that you had a mutable key where the hashCode and equals changed based on the contents.
static class BadKey {
    int num;

    @Override
    public int hashCode() {
        return num;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof BadKey && num == ((BadKey) obj).num;
    }

    @Override
    public String toString() {
        return "Bad Key " + num;
    }
}

public static void main(String... args) {
    Map<BadKey, Integer> map = new LinkedHashMap<BadKey, Integer>();
    for (int i = 0; i < 10; i++) {
        BadKey bk1 = new BadKey();
        bk1.num = i;
        map.put(bk1, i); // key hashes to the bucket for i
        bk1.num = 0;     // mutate the key after insertion
    }
    System.out.println(map);
}
prints
{Bad Key 0=0, Bad Key 0=1, Bad Key 0=2, Bad Key 0=3, Bad Key 0=4, Bad Key 0=5, Bad Key 0=6, Bad Key 0=7, Bad Key 0=8, Bad Key 0=9}
As you can see we now have 10 keys, all equal and with the same hashCode!
equals is correctly implemented: an AtomicInteger instance can only equal itself, as only that very same instance will provably store the same sequence of values over time.
Please recall that Atomic* classes act as reference types (just like java.lang.ref.*), meant to wrap an actual, "useful" value. Unlike it is the case in functional languages (see e.g. Clojure's Atom or Haskell's IORef), the distinction between references and values is rather blurry in Java (blame mutability), but it is still there.
Considering the current wrapped value of an Atomic class as the criterion for equality is quite clearly a misconception, as it would imply that new AtomicInteger(1).equals(1).
One limitation of Java is that there is no means of distinguishing a mutable-class instance which can and will be mutated from a mutable-class instance which will never be exposed to anything that might mutate it(*). References to things of the former type should only be considered equal if they refer to the same object, while references to things of the latter type should often be considered equal if they refer to objects with equivalent state. Because Java only allows one override of the virtual equals(object) method, designers of mutable classes have to guess whether enough instances will fit the latter pattern (i.e. be held in such a way that they'll never be mutated) to justify having equals() and hashCode() behave in a fashion suitable for such usage.
In the case of something like Date, there are a lot of classes which encapsulate a reference to a Date that is never going to be modified, and which want to have their own equivalence relation incorporate the value-equivalence of the encapsulated Date. As such, it makes sense for Date to override equals and hashCode to test value equivalence. On the other hand, holding a reference to an AtomicInteger that is never going to be modified would be silly, since the whole purpose of that type centers around mutability. An AtomicInteger instance which is never going to be mutated may, for all practical purposes, simply be an Integer.
(*) Any requirement that a particular instance never mutate is only binding as long as either (1) information about its identity hash value exists somewhere, or (2) more than one reference to the object exists somewhere in the universe. If neither condition applies to the instance referred to by Foo, replacing Foo with a reference to a clone of Foo would have no observable effect. Consequently, one would be able to mutate the instance without violating a requirement that it "never mutate" by pretending to replace Foo with a clone and mutating the "clone".