HashSet<POJO>.contains misbehaves - java

As part of a Hadoop Mapper, I have a HashSet<MySimpleObject> that contains instances of a very simple class with only two integer attributes. As one should, I customised hashCode() and equals():
public class MySimpleObject {
private int i1, i2;
public void set(int i1, int i2) {
this.i1 = i1;
this.i2 = i2;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + i1;
result = prime * result + i2;
return result;
}
@Override
public boolean equals(Object obj) {
if (obj == null) return false;
if (this == obj) return true;
if ( obj.getClass() != MySimpleObject.class ) return false;
MySimpleObject other = (MySimpleObject)obj;
return (this.i1 == other.i1) && (this.i2 == other.i2);
}
}
Somehow, sometimes, calls to mySet.contains(aSimpleObj) return true though the set actually doesn't contain this value.
I understand how hashCode() is first used to split instances into buckets and equals() only called to compare instances within a given bucket.
I tried to change the prime value in hashCode() to spread instances differently into the buckets, and saw that contains() still sometimes returned a wrong result, but not for the same previously failing value.
It also seems that this value was then correctly identified as being outwith the set; I therefore suspect something is wrong with the equality check rather than the hashing, but I may be wrong...
I'm at a total loss here, and out of ideas. Can anyone shed light on this at all?
----- edit -----
some clarifications:
i1 & i2 are never updated after construction for the instances that were added to the set (though they are sometimes updated, elsewhere in the code, for other instances of that same class);
the set is potentially quite large (i.e. can reach nearly 15K entries) and I wonder if the issue could be linked to this (bucket overflow, e.g.?).

I bet you have trouble coming up with a concise reproduction of this bug.
Your code as shown looks right. I think the objects in your collection are being mutated, and this fact is hidden from you by other code.
You could debug this by temporarily making these changes:
Add a field boolean hashCodeCalled = false to your class.
When hashCode() is called, set hashCodeCalled = true.
When a setter is called and that boolean is true, throw an exception or log the current stack trace.
Alternatively, you could refactor your code such that these instances are immutable, and I bet the problem disappears.
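To illustrate the kind of corruption suspected above, here is a minimal sketch. It assumes a hypothetical Pair class standing in for MySimpleObject (same hashCode() arithmetic, a set() mutator) and shows one way a HashSet can stop agreeing with its own contents once a stored element is mutated. It is not a reproduction of the asker's actual bug.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MutableKeyDemo {
    // Hypothetical stand-in for MySimpleObject (name and constructor are assumptions)
    static final class Pair {
        private int i1, i2;

        Pair(int i1, int i2) { set(i1, i2); }

        void set(int i1, int i2) { this.i1 = i1; this.i2 = i2; }

        @Override public int hashCode() { return 31 * (31 + i1) + i2; }

        @Override public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof Pair)) return false;
            Pair other = (Pair) obj;
            return i1 == other.i1 && i2 == other.i2;
        }
    }

    // Returns { contains(old value), contains(mutated instance), size is still 1 }
    static boolean[] run() {
        Set<Pair> set = new HashSet<>();
        Pair p = new Pair(1, 2);
        set.add(p);          // stored under the hash of (1, 2)
        p.set(3, 4);         // mutated while inside the set
        return new boolean[] {
            set.contains(new Pair(1, 2)), // hash matches, equals() no longer does
            set.contains(p),              // equals() matches, stored hash does not
            set.size() == 1               // the element is still physically there
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

Both lookups fail even though the set still holds one element, which is why making the class immutable (or never mutating instances after add()) removes this whole class of problems.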

Related

Comparing two large lists in java

I have two ArrayLists with 1000 objects in each of them. I need to remove all elements in list 1 which are also in list 2. Currently I am running 2 nested loops, which results in 1000 x 1000 operations in the worst case.
List<DataClass> dbRows = object1.get("dbData");
List<DataClass> modifiedData = object1.get("dbData");
List<DataClass> dbRowsForLog = object2.get("dbData");
for (DataClass newDbRows : dbRows) {
boolean found=false;
for (DataClass oldDbRows : dbRowsForLog) {
if (newDbRows.equals(oldDbRows)) {
found=true;
modifiedData.remove(oldDbRows);
break;
}
}
}
public class DataClass{
private int categoryPosition;
private int subCategoryPosition;
private Timestamp lastUpdateTime;
private String lastModifiedUser;
// + so many other variables
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
DataClass dataClassRow = (DataClass) o;
return categoryPosition == dataClassRow.categoryPosition
&& subCategoryPosition == dataClassRow.subCategoryPosition && (lastUpdateTime.compareTo(dataClassRow.lastUpdateTime)==0?true:false)
&& stringComparator(lastModifiedUser,dataClassRow.lastModifiedUser);
}
public String toString(){
return "DataClass[categoryPosition="+categoryPosition+",subCategoryPosition="+subCategoryPosition
+",lastUpdateTime="+lastUpdateTime+",lastModifiedUser="+lastModifiedUser+"]";
}
public static boolean stringComparator(String str1, String str2){
return (str1 == null ? str2 == null : str1.equals(str2));
}
public int hashCode() {
int hash = 7;
hash = 31 * hash + (int) categoryPosition;
hash = 31 * hash + (int) subCategoryPosition;
hash = 31 * hash + (lastModifiedUser == null ? 0 : lastModifiedUser.hashCode());
return hash;
}
}
The best workaround I could think of is to create 2 sets of strings by calling the toString() method of DataClass and compare the strings. It will result in 1000 (for making set 1) + 1000 (for making set 2) + 1000 (searching in the set) = 3000 operations. I am stuck on Java 7. Is there any better way to do this? Thanks.
Let Java's built-in collection classes handle most of the optimization for you by taking advantage of a HashSet. Its contains method runs in O(1) expected time, provided hashCode() spreads the elements reasonably well. I would highly recommend looking up how it achieves this because it's very interesting.
List<DataClass> a = object1.get("dbData");
HashSet<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
return a;
And it's all done for you.
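As a self-contained sketch of the same idea, the snippet below uses Integer elements instead of DataClass and a hypothetical helper named difference. Unlike the snippet above, it copies the first list rather than modifying it in place; that is a choice made for the example, not something the answer requires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ListDifferenceDemo {
    // Returns the elements of a that do not occur in b, keeping a's order.
    static <T> List<T> difference(List<T> a, List<T> b) {
        Set<T> lookup = new HashSet<>(b);   // one pass over b
        List<T> result = new ArrayList<>(a);
        result.removeAll(lookup);           // one pass over a, hash lookups into b
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4);
        List<Integer> b = Arrays.asList(2, 4, 5);
        System.out.println(difference(a, b)); // prints [1, 3]
    }
}
```

The element type must have equals() and hashCode() that agree with each other, exactly as the caveat below says for DataClass.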
EDIT: caveat
In order for this to work, DataClass needs to override Object::hashCode in a way that is consistent with equals. Otherwise, you can't use any of the hash-based collection algorithms.
EDIT 2: implementing hashCode
An object's hash code does not need to change every time an instance variable changes. The hash code only needs to reflect the instance variables that determine equality.
For example, imagine each object had a unique field private final UUID id. In this case, you could determine if two objects were the same by simply testing the id value. Fields like lastUpdateTime and lastModifiedUser would provide information about the object, but two instances with the same id would refer to the same object, even if the lastUpdateTime and lastModifiedUser of each were different.
The point is that if you really want to optimize this, include as few fields as possible in the hash computation. From your example, it seems like categoryPosition and subCategoryPosition might be enough.
Whatever fields you choose to include, the simplest way to compute a hash code from them is to use Objects.hash (available since Java 7) rather than running the numbers yourself.
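A minimal sketch of that advice, assuming a hypothetical DataRow class with three of the fields from the question: equals() compares all of them, while hashCode() deliberately uses only the two position fields. This is legal because equal objects still get equal hash codes.

```java
import java.util.Objects;

// Hypothetical cut-down version of DataClass, for illustration only
class DataRow {
    final int categoryPosition;
    final int subCategoryPosition;
    final String lastModifiedUser;

    DataRow(int categoryPosition, int subCategoryPosition, String lastModifiedUser) {
        this.categoryPosition = categoryPosition;
        this.subCategoryPosition = subCategoryPosition;
        this.lastModifiedUser = lastModifiedUser;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        DataRow other = (DataRow) o;
        return categoryPosition == other.categoryPosition
                && subCategoryPosition == other.subCategoryPosition
                && Objects.equals(lastModifiedUser, other.lastModifiedUser); // null-safe
    }

    @Override
    public int hashCode() {
        // Only a subset of the fields used by equals(): fewer fields to hash,
        // at the cost of collisions between rows that differ only in the user.
        return Objects.hash(categoryPosition, subCategoryPosition);
    }
}
```

Rows that differ only in lastModifiedUser land in the same bucket and are then told apart by equals(), so correctness is kept and only lookup speed is affected.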
This is a set difference operation, A - B (retain only the elements of set A that are not in set B).
If using a Set is fine, then we can do it as shown below. We can use an ArrayList in place of the Set as well, but then each remove/retain check has to scan the entire other list.
Set<DataClass> a = new HashSet<>(object1.get("dbData"));
Set<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
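If sorted ordering is needed, use a TreeSet (which requires DataClass to implement Comparable, or a Comparator to be supplied); to keep insertion order, use a LinkedHashSet.
If possible, have object1.get("dbData") and object2.get("dbData") return a Set directly; that skips one more intermediate collection creation.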

How to implement a compareTo() method when consistent with Equal and hashcode

I have a class Product, which has three fields:
class Product implements Comparable<Product>{
private Type type; // Type is an enum
Set<Attribute> attributes; // Attribute is a regular class
ProductName name; // ProductName is another enum
}
I used Eclipse to automatically generate the equals() and hashCode() methods:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((attributes == null) ? 0 : attributes.hashCode());
result = prime * result + ((type == null) ? 0 : type.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
Product other = (Product) obj;
if (attributes == null) {
if (other.attributes != null)
return false;
} else if (!attributes.equals(other.attributes))
return false;
if (type != other.type)
return false;
return true;
}
Now in my application I need to sort a Set of Product, so I need to implement the Comparable interface and compareTo method:
@Override
public int compareTo(Product other){
int diff = type.hashCode() - other.getType().hashCode();
if (diff > 0) {
return 1;
} else if (diff < 0) {
return -1;
}
diff = attributes.hashCode() - other.getAttributes().hashCode();
if (diff > 0) {
return 1;
} else if (diff < 0) {
return -1;
}
return 0;
}
Does this implementation make sense? What if I just want to sort the products based on the String values of "type" and "attributes"? How would I implement that?
Edit:
The reason I want to sort a Set of Product is that I have a JUnit test which asserts on the string values of a HashSet. My goal is to get the same output order every time by sorting the set. Otherwise, even if the Set's values are the same, the assertion will fail because of the unpredictable output order of a set.
Edit2:
Through the discussion, it's clear that asserting on the String values of a HashSet isn't good practice in unit tests. For my situation I currently have a sort() function that sorts the HashSet String values in natural ordering, so it consistently outputs the same String value for my unit tests, and that suffices for now. Thanks all.
Judging from the comments here, you don't need a Comparator (or Comparable) at all, because:
1) You are using a HashSet, which does not use a Comparator. It is not ordered.
2) You just need to make sure that two HashSets containing Products are equal. It means they are same size and contain the same set of Products.
Since you already added hashCode and equals methods to Product all you need to do is call equals method on those HashSets.
HashSet<Product> set1 = ...
HashSet<Product> set2 = ...
assertTrue( set1.equals(set2) );
This implementation does not seem to be consistent. You have no control over what the hash codes look like (an enum's hashCode(), for example, is identity-based and may differ from run to run). If you have obj1 < obj2 according to compareTo in the first run, the next time you start your JVM it could be the other way around, obj1 > obj2.
The only thing that you really know is that if diff == 0 then the hash codes match, which is what you get for equal objects (although unequal objects can share a hash code too). However, you can also just use the equals method for that check.
It is now up to you how you define when obj1 < obj2 or obj1 > obj2. Just make sure that it is consistent.
By the way, you know that the current implementation does not include ProductName name in the equals check? I don't know if that is intended, hence the remark.
The question is, what do you know about those attributes? Maybe they implement Comparable (for example if they are Numbers); then you can order according to their compareTo method. If you know nothing at all about the objects, it will be hard to build up a consistent ordering.
If you just want them to be ordered consistently but the ordering itself does not play any role, you could just give them ids at creation time and sort by them. At this point you could indeed use the hash codes if it does not matter that they can change between JVM runs, but only then.
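One way to get an ordering that is stable across JVM runs and consistent with equals() is to compare the values themselves rather than their hash codes. The sketch below is only an illustration under stated assumptions: Type is a made-up enum, attributes are reduced to plain strings, and the ordering (type name first, then the sorted attribute values) is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class ProductOrderingDemo {
    enum Type { BOOK, FOOD } // hypothetical enum values

    static final class Product implements Comparable<Product> {
        final Type type;
        final Set<String> attributes; // assumption: attributes reduced to strings

        Product(Type type, String... attributes) {
            this.type = type;
            this.attributes = new HashSet<>(Arrays.asList(attributes));
        }

        private List<String> sortedAttributes() {
            List<String> sorted = new ArrayList<>(attributes);
            Collections.sort(sorted);
            return sorted;
        }

        @Override
        public int compareTo(Product other) {
            // 1. Order by the enum constant's name, which never changes between runs.
            int byType = type.name().compareTo(other.type.name());
            if (byType != 0) return byType;
            // 2. Then compare the sorted attribute values element by element.
            List<String> mine = sortedAttributes();
            List<String> theirs = other.sortedAttributes();
            int common = Math.min(mine.size(), theirs.size());
            for (int i = 0; i < common; i++) {
                int c = mine.get(i).compareTo(theirs.get(i));
                if (c != 0) return c;
            }
            return Integer.compare(mine.size(), theirs.size());
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Product)) return false;
            Product p = (Product) o;
            return type == p.type && attributes.equals(p.attributes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(type.name(), attributes);
        }
    }
}
```

With this shape, compareTo() returns 0 exactly when equals() returns true, so the class behaves the same way in a TreeSet and in a HashSet.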

Java overriding equals() and hashcode() for two interchangeable integers

I'm overriding the equals and hashcode methods for a simple container object for two ints. Each int reflects the index of another object (it doesn't matter what that object is). The point of the class is to represent a connection between the two objects.
The direction of the connection doesn't matter, therefore the equals method should return true regardless of which way round the two ints are in the object E.g.
connectionA = new Connection(1,2);
connectionB = new Connection(1,3);
connectionC = new Connection(2,1);
connectionA.equals(connectionB); // returns false
connectionA.equals(connectionC); // returns true
Here is what I have (modified from the source code for Integer):
public class Connection {
// Simple container for two numbers which are connected.
// Two Connection objects are equal regardless of the order of from and to.
int from;
int to;
public Connection(int from, int to) {
this.from = from;
this.to = to;
}
// Modified from Integer source code
@Override
public boolean equals(Object obj) {
if (obj instanceof Connection) {
Connection connectionObj = (Connection) obj;
return ((from == connectionObj.from && to == connectionObj.to) || (from == connectionObj.to && to == connectionObj.from));
}
return false;
}
@Override
public int hashCode() {
return from*to;
}
}
This does work however my question is: Is there a better way to achieve this?
My main worry is that the hashCode() method will return the same hash code for any two pairs of integers which multiply to the same number. E.g.
3*4 = 12
2*6 = 12 // same!
The documentation, http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#hashCode(), states that
It is not required that if two objects are unequal according to the
equals(java.lang.Object) method, then calling the hashCode method on
each of the two objects must produce distinct integer results.
However, the programmer should be aware that producing distinct
integer results for unequal objects may improve the performance of
hashtables.
If anyone can see a simple way of reducing the number of matching hashcodes then I would be appreciative of an answer.
Thanks!
Tim
PS I'm aware that there is a java.sql.Connection which could cause some import annoyances. The object actually has a more specific name in my application but for brevity I shortened it to Connection here.
Three solutions that would "work" have been proposed. (By work, I mean that they satisfy the basic requirement of a hash code, namely that equal inputs give equal outputs while different inputs mostly give different outputs, and they also satisfy the OP's additional "symmetry" requirement.)
These are:
# 1
return from ^ to;
# 2
return to*to+from*from;
# 3
int res = 17;
res = res * 31 + Math.min(from, to);
res = res * 31 + Math.max(from, to);
return res;
The first one has the problem that the range of the output is bounded by the range of the actual input values. So for instance, if we assume that the inputs are both non-negative numbers less than 2^i and 2^j respectively, then the output will be less than 2^max(i,j). That is likely to give you poor "dispersion" [1] in your hash table ... and a higher rate of collisions. (There is also a problem when from == to: the result is always 0!)
The second and third ones are better than the first, but you are still liable to get more collisions than is desirable if from and to are small.
I would suggest a 4th alternative if it is critical that you minimize collisions for small values of from and to.
# 4
int res = Math.max(from, to);
res = (res << 16) | (res >>> 16); // exchange top and bottom 16 bits.
res = res ^ Math.min(from, to);
return res;
This has the advantage that if from and to are both in the range 0..2^16-1, you get a unique hashcode for each distinct (unordered) pair.
1 - I don't know if this is the correct technical term for this ...
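The uniqueness claim for the 4th alternative can be checked by brute force on a small range. The sketch below wraps that hash in a hypothetical helper class; Integer.rotateLeft(res, 16) is used as an equivalent spelling of the "exchange top and bottom 16 bits" line.

```java
import java.util.HashSet;
import java.util.Set;

class PairHashDemo {
    // Alternative # 4 from the answer above
    static int hash(int from, int to) {
        int res = Math.max(from, to);
        res = Integer.rotateLeft(res, 16); // same as (res << 16) | (res >>> 16)
        return res ^ Math.min(from, to);
    }

    // Counts distinct hash codes over all unordered pairs 0 <= a <= b < limit.
    static int distinctHashes(int limit) {
        Set<Integer> seen = new HashSet<>();
        for (int a = 0; a < limit; a++) {
            for (int b = a; b < limit; b++) {
                seen.add(hash(a, b));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int limit = 100;
        int pairs = limit * (limit + 1) / 2; // number of unordered pairs
        System.out.println(distinctHashes(limit) + " distinct hashes for " + pairs + " pairs");
    }
}
```

For values below 2^16 the rotation simply moves the larger number into the upper half of the int, so the two numbers never overlap and no two unordered pairs can collide.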
This is a widely accepted approach:
@Override
public int hashCode() {
int res = 17;
res = res * 31 + Math.min(from, to);
res = res * 31 + Math.max(from, to);
return res;
}
I think something like
@Override
public int hashCode() {
return to*to+from*from;
}
is good enough
Typically I use XOR for hashcode method.
@Override
public int hashCode() {
return from ^ to;
}
I wonder why nobody offered the usually best solution: Normalize your data:
Connection(int from, int to) {
this.from = Math.min(from, to);
this.to = Math.max(from, to);
}
If it's impossible, then I'd suggest something like
27644437 * (from+to) + Math.min(from, to)
By using a multiplier different from 31, you avoid collisions like in this question.
By using a big multiplier you spread the numbers better.
By using an odd multiplier you ensure that the multiplication is bijective (i.e., no information gets lost).
By using a prime you gain nothing at all, but everyone does it and it has no disadvantage.
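A minimal sketch of the normalization idea, using a hypothetical class name NormalizedConnection so it does not clash with the question's Connection. Once the smaller value is always stored in from, plain field-by-field equals() and any ordinary order-dependent hash are automatically symmetric; the multiplier is borrowed from the suggestion above and is not essential.

```java
// Hypothetical variant of the question's Connection class
final class NormalizedConnection {
    final int from; // always the smaller endpoint
    final int to;   // always the larger endpoint

    NormalizedConnection(int from, int to) {
        this.from = Math.min(from, to);
        this.to = Math.max(from, to);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof NormalizedConnection)) return false;
        NormalizedConnection other = (NormalizedConnection) obj;
        // No need to test the swapped case: both objects are already normalized.
        return from == other.from && to == other.to;
    }

    @Override
    public int hashCode() {
        return 27644437 * from + to;
    }
}
```

The trade-off is that the object no longer remembers which endpoint was passed first, which is fine here because the direction of the connection does not matter.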
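Java 1.7+ has Objects.hash. Note that Objects.hash(from, to) depends on the argument order, so the two values have to be put into a fixed order first to stay consistent with the symmetric equals():
@Override
public int hashCode() {
return Objects.hash(Math.min(from, to), Math.max(from, to));
}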

what would be a good hash function for an integer tuple?

I have this class...
public class StartStopTouple {
public int iStart;
public int iStop;
public int iHashCode;
public StartStopTouple(String start, String stop) {
this.iStart = Integer.parseInt(start);
this.iStop = Integer.parseInt(stop);
}
@Override
public boolean equals(Object theObject) {
// check if 'theObject' is null
if (theObject == null) {
return false;
}
// check if 'theObject' is a reference to 'this' StartStopTouple... essentially they are the same Object
if (this == theObject) {
return true;
}
// check if 'theObject' is of the correct type as 'this' StartStopTouple
if (!(theObject instanceof StartStopTouple)) {
return false;
}
// cast 'theObject' to the correct type: StartStopTouple
StartStopTouple theSST = (StartStopTouple) theObject;
// check if the (start,stop) pairs match, then the 'theObject' is equal to 'this' Object
if (this.iStart == theSST.iStart && this.iStop == theSST.iStop) {
return true;
} else {
return false;
}
} // equal() end
@Override
public int hashCode() {
return iHashCode;
}
}
... and I define equality between such Objects only if iStart and iStop in one Object are equal to iStart and iStop in the other Object.
So since I've overridden equals(), I need to override hashCode() but I'm not sure how to define a good hash function for this class. What would be a good way to create a hash code for this class using iStart and iStop?
I'd be tempted to use this, particularly since you're going to memoize it:
Long.valueOf((((long) iStart) << 32) | (iStop & 0xFFFFFFFFL)).hashCode(); // mask so a negative iStop cannot overwrite the upper 32 bits
From Bloch's "Effective Java":
int iHashCode = 17;
iHashCode = 31 * iHashCode + iStart;
iHashCode = 31 * iHashCode + iStop;
Note: 31 is chosen because the multiplication by 31 can be optimized by the VM into a shift and a subtraction (31 * i == (i << 5) - i). (But performance is not a concern in your case since, as mentioned by @Ted Hopp, you are only computing the value once.)
Note: it does not matter if iHashCode rolls over past the largest int.
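Put together with the memoization mentioned above, the recipe could look like the following sketch. StartStop is a hypothetical, immutable variant of the question's class that takes ints directly; the hash is computed once in the constructor, which is only safe because the fields are final.

```java
// Hypothetical immutable variant of StartStopTouple
final class StartStop {
    private final int start;
    private final int stop;
    private final int hashCode; // memoized, safe because the fields never change

    StartStop(int start, int stop) {
        this.start = start;
        this.stop = stop;
        int h = 17;
        h = 31 * h + start;
        h = 31 * h + stop;  // int overflow simply wraps around, which is fine
        this.hashCode = h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StartStop)) return false;
        StartStop other = (StartStop) o;
        return start == other.start && stop == other.stop;
    }

    @Override
    public int hashCode() {
        return hashCode;
    }
}
```

Because the multiplier weights start and stop differently, swapping the two values produces a different hash code, unlike the plain XOR variant.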
The simplest might be best:
iHashCode = iStart ^ iStop;
This is the XOR of the two values.
Note that this gives equal hash codes when start and stop are swapped.
As another possibility, you can do:
iHashCode = ((iStart << 16) | (iStart >>> 16)) ^ iStop;
This first rotates start by 16 bits and then XORs stop with it, so the least significant bits of the two values are kept apart in the XOR. (If start is always in the range 0 to 2^16 - 1, roughly 65k, you can omit the (iStart >>> 16) part.)

Treeset.contains() problem

So I've been struggling with a problem for a while now, figured I might as well ask for help here.
I'm adding Ticket objects to a TreeSet; Ticket implements Comparable and overrides the equals(), hashCode() and compareTo() methods. I need to check if an object is already in the TreeSet using contains(). After adding 2 elements to the set it all checks out fine, yet after adding a third it gets messed up.
I run this little piece of code after adding a third element to the TreeSet (verkoopLijst); Ticket temp2 is the object I'm checking for.
Ticket temp2 = new Ticket(boeking, TicketType.STANDAARD, 1,1);
System.out.println(verkoop.getVerkoopLijst().first().hashCode());
System.out.println(temp2.hashCode());
System.out.println(verkoop.getVerkoopLijst().first().equals(temp2));
System.out.println(verkoop.getVerkoopLijst().first().compareTo(temp2));
System.out.println(verkoop.getVerkoopLijst().contains(temp2));
returns this:
22106622
22106622
true
0
false
Now my question would be how this is even possible?
Edit:
public class Ticket implements Comparable{
private int rijNr, stoelNr;
private TicketType ticketType;
private Boeking boeking;
public Ticket(Boeking boeking, TicketType ticketType, int rijNr, int stoelNr){
//setters
}
@Override
public int hashCode(){
return boeking.getBoekingDatum().hashCode();
}
@Override
@SuppressWarnings("EqualsWhichDoesntCheckParameterClass")
public boolean equals(Object o){
Ticket t = (Ticket) o;
if(this.boeking.equals(t.getBoeking())
&&
this.rijNr == t.getRijNr() && this.stoelNr == t.getStoelNr()
&&
this.ticketType.equals(t.getTicketType()))
{
return true;
}
else return false;
}
/*I adjusted compareTo this way because I need to make sure there are no duplicate Tickets in my treeset. Treeset seems to call CompareTo() to check for equality before adding an object to the set, instead of equals().
*/
@Override
public int compareTo(Object o) {
int output = 0;
if (boeking.compareTo(((Ticket) o).getBoeking())==0)
{
if(this.equals(o))
{
return output;
}
else return 1;
}
else output = boeking.compareTo(((Ticket) o).getBoeking());
return output;
}
//Getters & Setters
}
On compareTo contract
The problem is in your compareTo. Here's an excerpt from the documentation:
Implementor must ensure sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) for all x and y.
Your original code is reproduced here for reference:
// original compareTo implementation with bug marked
@Override
public int compareTo(Object o) {
int output = 0;
if (boeking.compareTo(((Ticket) o).getBoeking())==0)
{
if(this.equals(o))
{
return output;
}
else return 1; // BUG!!!! See explanation below!
}
else output = boeking.compareTo(((Ticket) o).getBoeking());
return output;
}
Why is the return 1; a bug? Consider the following scenario:
Given Ticket t1, t2
Given t1.boeking.compareTo(t2.boeking) == 0
Given t1.equals(t2) return false
Now we have both of the following:
t1.compareTo(t2) returns 1
t2.compareTo(t1) returns 1
That last consequence is a violation of the compareTo contract.
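The scenario can be reproduced with a stripped-down class. In this sketch the Ticket is hypothetical and simplified: an int stands in for the Boeking field and only the seat number is kept, but compareTo() has the same shape as the original, including the return 1 branch.

```java
class BrokenCompareDemo {
    // Hypothetical, simplified Ticket: an int stands in for the Boeking field
    static final class Ticket implements Comparable<Ticket> {
        final int boeking;
        final int stoelNr;

        Ticket(int boeking, int stoelNr) {
            this.boeking = boeking;
            this.stoelNr = stoelNr;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Ticket)) return false;
            Ticket t = (Ticket) o;
            return boeking == t.boeking && stoelNr == t.stoelNr;
        }

        @Override
        public int hashCode() {
            return 31 * boeking + stoelNr;
        }

        // Same structure as the original compareTo, bug included
        @Override
        public int compareTo(Ticket o) {
            int byBoeking = Integer.compare(boeking, o.boeking);
            if (byBoeking == 0) {
                return this.equals(o) ? 0 : 1; // never returns -1 in this branch
            }
            return byBoeking;
        }
    }

    public static void main(String[] args) {
        Ticket t1 = new Ticket(7, 1);
        Ticket t2 = new Ticket(7, 2);
        System.out.println(t1.compareTo(t2)); // 1
        System.out.println(t2.compareTo(t1)); // 1 as well: each claims to be the greater one
    }
}
```

A TreeSet decides at every node whether to go left or right based on that sign, so with two tickets that each claim to be greater than the other, a lookup can walk into the wrong subtree and miss an element that is present.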
Fixing the problem
First and foremost, you should have taken advantage of the fact that Comparable<T> is a parameterizable generic type. That is, instead of:
// original declaration; uses raw type!
public class Ticket implements Comparable
it'd be much more appropriate to instead declare something like this:
// improved declaration! uses parameterized Comparable<T>
public class Ticket implements Comparable<Ticket>
Now we can write our compareTo(Ticket) (no longer compareTo(Object)). There are many ways to rewrite this, but here's a rather simplistic one that works:
@Override public int compareTo(Ticket t) {
int v;
v = this.boeking.compareTo(t.boeking);
if (v != 0) return v;
v = compareInt(this.rijNr, t.rijNr);
if (v != 0) return v;
v = compareInt(this.stoelNr, t.stoelNr);
if (v != 0) return v;
// ticketType is an enum, so use its own compareTo (declaration order)
v = this.ticketType.compareTo(t.ticketType);
if (v != 0) return v;
return 0;
}
private static int compareInt(int i1, int i2) {
if (i1 < i2) {
return -1;
} else if (i1 > i2) {
return +1;
} else {
return 0;
}
}
Now we can also define equals(Object) in terms of compareTo(Ticket) instead of the other way around:
@Override public boolean equals(Object o) {
return (o instanceof Ticket) && (this.compareTo((Ticket) o) == 0);
}
Note the structure of the compareTo: it has multiple return statements, but in fact, the flow of logic is quite readable. Note also how the priority of the sorting criteria is explicit, and easily reorderable should you have different priorities in mind.
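For completeness, here is a runnable sketch of the corrected approach with a TreeSet. The class is a simplified, hypothetical Ticket: a String stands in for Boeking, TicketType has made-up constants, and hashCode() is derived from the same fields so that all three methods agree.

```java
import java.util.Objects;
import java.util.TreeSet;

class FixedTicketDemo {
    enum TicketType { STANDAARD, VIP } // VIP is a made-up second constant

    // Hypothetical, simplified Ticket: a String stands in for Boeking
    static final class Ticket implements Comparable<Ticket> {
        final String boeking;
        final TicketType ticketType;
        final int rijNr;
        final int stoelNr;

        Ticket(String boeking, TicketType ticketType, int rijNr, int stoelNr) {
            this.boeking = boeking;
            this.ticketType = ticketType;
            this.rijNr = rijNr;
            this.stoelNr = stoelNr;
        }

        @Override
        public int compareTo(Ticket t) {
            int v = boeking.compareTo(t.boeking);
            if (v != 0) return v;
            v = Integer.compare(rijNr, t.rijNr);
            if (v != 0) return v;
            v = Integer.compare(stoelNr, t.stoelNr);
            if (v != 0) return v;
            return ticketType.compareTo(t.ticketType);
        }

        @Override
        public boolean equals(Object o) {
            return (o instanceof Ticket) && compareTo((Ticket) o) == 0;
        }

        @Override
        public int hashCode() {
            return Objects.hash(boeking, ticketType.name(), rijNr, stoelNr);
        }
    }

    // Adds three tickets for the same booking, then looks up a fresh equal instance.
    static boolean containsAfterThree() {
        TreeSet<Ticket> verkoopLijst = new TreeSet<>();
        verkoopLijst.add(new Ticket("b1", TicketType.STANDAARD, 1, 1));
        verkoopLijst.add(new Ticket("b1", TicketType.STANDAARD, 1, 2));
        verkoopLijst.add(new Ticket("b1", TicketType.STANDAARD, 1, 3));
        return verkoopLijst.contains(new Ticket("b1", TicketType.STANDAARD, 1, 1));
    }

    public static void main(String[] args) {
        System.out.println(containsAfterThree());
    }
}
```

Integer.compare (Java 7+) replaces the hand-written compareInt helper here; the behaviour is the same.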
Related questions
What is a raw type and why shouldn't we use it?
How to sort an array or ArrayList ASC first by x and then by y?
Should a function have only one return statement?
This could happen if your compareTo method isn't consistent. I.e. if a.compareTo(b) > 0, then b.compareTo(a) must be < 0. And if a.compareTo(b) > 0 and b.compareTo(c) > 0, then a.compareTo(c) must be > 0. If those aren't true, TreeSet can get all confused.
Firstly, if you are using a TreeSet, the actual behavior of your hashCode methods won't affect the results. TreeSet does not rely on hashing.
Really we need to see more code; e.g. the actual implementations of the equals and compareTo methods, and the code that instantiates the TreeSet.
However, if I was to guess, it would be that you have overloaded the equals method by declaring it with the signature boolean equals(Ticket other). That would lead to the behavior that you are seeing. To get the required behavior, you must override the method; e.g.
@Override
public boolean equals(Object other) { ...
(It is a good idea to put in the @Override annotation to make it clear that the method overrides a method in the superclass, or implements a method in an interface. If your method isn't actually an override, then you'll get a compilation error ... which would be a good thing.)
EDIT
Based on the code that you have added to the question, the problem is not overload vs override. (As I said, I was only guessing ...)
It is most likely that the compareTo and equals are incorrect. It is still not entirely clear exactly where the bug is because the semantics of both methods depends on the compareTo and equals methods of the Boeking class.
The first if statement of the Ticket.compareTo looks highly suspicious. It looks like the return 1; could cause t1.compareTo(t2) and t2.compareTo(t1) to both return 1 for some tickets t1 and t2 ... and that would definitely be wrong.
