Comparing two large lists in Java

I have two ArrayLists with 1000 objects in each of them. I need to remove all elements from list 1 that are also present in list 2. Currently I am running two nested loops, which results in 1000 x 1000 operations in the worst case.
List<DataClass> dbRows = object1.get("dbData");
List<DataClass> modifiedData = object1.get("dbData");
List<DataClass> dbRowsForLog = object2.get("dbData");
for (DataClass newDbRows : dbRows) {
boolean found=false;
for (DataClass oldDbRows : dbRowsForLog) {
if (newDbRows.equals(oldDbRows)) {
found=true;
modifiedData.remove(oldDbRows);
break;
}
}
}
public class DataClass{
private int categoryPosition;
private int subCategoryPosition;
private Timestamp lastUpdateTime;
private String lastModifiedUser;
// + so many other variables
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
DataClass dataClassRow = (DataClass) o;
return categoryPosition == dataClassRow.categoryPosition
&& subCategoryPosition == dataClassRow.subCategoryPosition && (lastUpdateTime.compareTo(dataClassRow.lastUpdateTime)==0?true:false)
&& stringComparator(lastModifiedUser,dataClassRow.lastModifiedUser);
}
public String toString(){
return "DataClass[categoryPosition="+categoryPosition+",subCategoryPosition="+subCategoryPosition
+",lastUpdateTime="+lastUpdateTime+",lastModifiedUser="+lastModifiedUser+"]";
}
public static boolean stringComparator(String str1, String str2){
return (str1 == null ? str2 == null : str1.equals(str2));
}
public int hashCode() {
int hash = 7;
hash = 31 * hash + (int) categoryPosition;
hash = 31 * hash + (int) subCategoryPosition;
hash = 31 * hash + (lastModifiedUser == null ? 0 : lastModifiedUser.hashCode());
return hash;
}
}
The best workaround I could think of is to create 2 sets of strings by calling the toString() method of DataClass and to compare the strings. That would take 1000 (for making set 1) + 1000 (for making set 2) + 1000 (searching in the set) = 3000 operations. I am stuck on Java 7. Is there any better way to do this? Thanks.

Let Java's built-in collection classes handle most of the optimization for you by taking advantage of a HashSet. Its contains method runs in O(1) expected time, provided hashCode spreads the elements reasonably well. I would highly recommend looking up how it achieves this because it's very interesting.
List<DataClass> a = object1.get("dbData");
HashSet<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
return a;
And it's all done for you.
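As a rough, self-contained sketch of the same idea (the class name ListDifferenceDemo and the method difference are made up for illustration, and Strings stand in for DataClass rows), the whole operation is one pass to build the set plus one hash lookup per element of the first list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ListDifferenceDemo {
    // Returns the elements of 'source' that do not occur in 'toRemove',
    // keeping the original order of 'source'.
    static <T> List<T> difference(List<T> source, List<T> toRemove) {
        Set<T> lookup = new HashSet<T>(toRemove);  // one pass to build the set
        List<T> result = new ArrayList<T>(source); // copy, so the input list is untouched
        result.removeAll(lookup);                  // one hash lookup per element
        return result;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("r1", "r2", "r3", "r4");
        List<String> b = Arrays.asList("r2", "r4", "r9");
        System.out.println(difference(a, b)); // prints [r1, r3]
    }
}
```

Copying the source list first is a design choice of this sketch; calling removeAll directly on the original list, as in the answer, avoids the copy but mutates it.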
EDIT: caveat
In order for this to work, DataClass needs to override Object.hashCode consistently with equals. Otherwise, you can't use any of the hash-based collections.
EDIT 2: implementing hashCode
An object's hash code does not need to change every time an instance variable changes. The hash code only needs to reflect the instance variables that determine equality.
For example, imagine each object had a unique field private final UUID id. In this case, you could determine if two objects were the same by simply testing the id value. Fields like lastUpdateTime and lastModifiedUser would provide information about the object, but two instances with the same id would refer to the same object, even if the lastUpdateTime and lastModifiedUser of each were different.
The point is that if you really want to optimize this, include as few fields as possible in the hash computation. From your example, it seems like categoryPosition and subCategoryPosition might be enough.
Whatever fields you choose to include, the simplest way to compute a hash code from them is to use Objects.hash (available since Java 7) rather than running the numbers yourself.
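A minimal sketch of that advice, assuming a trimmed-down stand-in for DataClass (the name PositionRow and its constructor are hypothetical): equals compares three fields, while hashCode deliberately uses only two of them.

```java
import java.util.Objects;

// Hypothetical, trimmed-down stand-in for DataClass.
class PositionRow {
    private final int categoryPosition;
    private final int subCategoryPosition;
    private final String lastModifiedUser;

    PositionRow(int categoryPosition, int subCategoryPosition, String lastModifiedUser) {
        this.categoryPosition = categoryPosition;
        this.subCategoryPosition = subCategoryPosition;
        this.lastModifiedUser = lastModifiedUser;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        PositionRow other = (PositionRow) o;
        return categoryPosition == other.categoryPosition
                && subCategoryPosition == other.subCategoryPosition
                && Objects.equals(lastModifiedUser, other.lastModifiedUser); // null-safe
    }

    @Override
    public int hashCode() {
        // Only a subset of the fields compared in equals(). That is allowed:
        // equal objects still always produce equal hash codes.
        return Objects.hash(categoryPosition, subCategoryPosition);
    }
}
```

Two rows that differ only in lastModifiedUser then share a hash code but are still told apart by equals, which costs one extra comparison in that bucket.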

This is a set difference operation, A - B (retain only the elements of set A that are not in set B).
If using a Set is fine, we can do it as below. We could use an ArrayList in place of the Set as well, but then every remove/retain check has to scan the entire other list.
Set<DataClass> a = new HashSet<>(object1.get("dbData"));
Set<DataClass> b = new HashSet<>(object2.get("dbData"));
a.removeAll(b);
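If sorted order is needed, use a TreeSet (DataClass must then implement Comparable, or you must supply a Comparator). To keep insertion order, use a LinkedHashSet.
If possible, have object1.get("dbData") and object2.get("dbData") return a Set directly; that skips the creation of one more intermediate collection.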

Related

How to implement a compareTo() method when consistent with Equal and hashcode

I have a class Product, which has three variables:
class Product implements Comparable<Product>{
private Type type; // Type is an enum
Set<Attribute> attributes; // Attribute is a regular class
ProductName name; // ProductName is another enum
}
I used Eclipse to automatically generate the equals() and hashCode() methods:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((attributes == null) ? 0 : attributes.hashCode());
result = prime * result + ((type == null) ? 0 : type.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
Product other = (Product) obj;
if (attributes == null) {
if (other.attributes != null)
return false;
} else if (!attributes.equals(other.attributes))
return false;
if (type != other.type)
return false;
return true;
}
Now in my application I need to sort a Set of Product, so I need to implement the Comparable interface and compareTo method:
@Override
public int compareTo(Product other){
int diff = type.hashCode() - other.getType().hashCode();
if (diff > 0) {
return 1;
} else if (diff < 0) {
return -1;
}
diff = attributes.hashCode() - other.getAttributes().hashCode();
if (diff > 0) {
return 1;
} else if (diff < 0) {
return -1;
}
return 0;
}
Does this implementation make sense? What if I just want to sort the products based on the String values of "type" and "attributes"? How would I implement that?
Edit:
The reason I want to sort a Set of Products is that I have a JUnit test which asserts on the string values of a HashSet. My goal is to keep the output in the same order by sorting the set. Otherwise, even if the Set's values are the same, the assertion will fail because a HashSet's iteration order is not predictable.
Edit2:
Through the discussion, it's clear that asserting the equality of the String values of a HashSet isn't a good idea in unit tests. For my situation I currently use a sort() function that sorts the HashSet's String values in natural order, so it consistently outputs the same String value for my unit tests, and that suffices for now. Thanks all.
Judging from all the comments here, it looks like you don't need to use a Comparator at all, because:
1) You are using HashSet that does not work with Comparator. It is not ordered.
2) You just need to make sure that two HashSets containing Products are equal. It means they are same size and contain the same set of Products.
Since you already added hashCode and equals methods to Product all you need to do is call equals method on those HashSets.
HashSet<Product> set1 = ...
HashSet<Product> set2 = ...
assertTrue( set1.equals(set2) );
This implementation does not seem to be consistent. You have no control over what the hash codes look like. If you have obj1 < obj2 according to compareTo in the first run, the next time you start your JVM it could be the other way around, obj1 > obj2.
The only thing that you really know is that if diff == 0 then the objects are considered to be equal. However you can also just use the equals method for that check.
It is now up to you how you define when obj1 < obj2 or obj1 > obj2. Just make sure that it is consistent.
By the way, are you aware that the current implementation does not include ProductName name in the equals check? I don't know if that is intended, hence the remark.
The question is, what do you know about those attributes? Maybe they implement Comparable (for example if they are Numbers); then you can order according to their compareTo method. If you know nothing at all about the objects, it will be hard to build up a consistent ordering.
If you just want them to be ordered consistently but the ordering itself does not play any role, you could just give them ids at creation time and sort by them. At this point you could indeed use the hashcodes if it does not matter that it can change between JVM calls, but only then.
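To make the "consistent ordering" point concrete, here is a hedged sketch. It assumes a simplified product whose attributes are plain Strings (the class SortableProduct, its Type enum, and its constructor are hypothetical, not the asker's real classes). It orders by the enum constant's name and then by the sorted attributes, so the result does not depend on hash codes and agrees with equals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified Product: attributes are plain Strings here.
class SortableProduct implements Comparable<SortableProduct> {
    enum Type { BOOK, FOOD }

    private final Type type;
    private final TreeSet<String> attributes; // TreeSet keeps the attributes sorted

    SortableProduct(Type type, String... attributes) {
        this.type = type;
        this.attributes = new TreeSet<String>(Arrays.asList(attributes));
    }

    @Override
    public int compareTo(SortableProduct other) {
        // 1) Order by the enum constant's name, which is stable across JVM runs.
        int byType = type.name().compareTo(other.type.name());
        if (byType != 0) {
            return byType;
        }
        // 2) Then compare the sorted attributes element by element.
        Iterator<String> mine = attributes.iterator();
        Iterator<String> theirs = other.attributes.iterator();
        while (mine.hasNext() && theirs.hasNext()) {
            int byAttribute = mine.next().compareTo(theirs.next());
            if (byAttribute != 0) {
                return byAttribute;
            }
        }
        // 3) If one attribute set is a prefix of the other, the shorter one sorts first.
        return Integer.compare(attributes.size(), other.attributes.size());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SortableProduct other = (SortableProduct) o;
        return type == other.type && attributes.equals(other.attributes);
    }

    @Override
    public int hashCode() {
        return 31 * type.name().hashCode() + attributes.hashCode();
    }

    @Override
    public String toString() {
        return type + attributes.toString();
    }

    public static void main(String[] args) {
        List<SortableProduct> products = new ArrayList<SortableProduct>();
        products.add(new SortableProduct(Type.FOOD, "fresh"));
        products.add(new SortableProduct(Type.BOOK, "used", "hardcover"));
        products.add(new SortableProduct(Type.BOOK, "hardcover"));
        Collections.sort(products); // uses compareTo
        System.out.println(products);
    }
}
```

With this ordering, compareTo returns 0 exactly when equals returns true, which is what sorted collections such as TreeSet expect.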

Correct way to implement Map<MyObject,ArrayList<MyObject>>

I was asked this in an interview. Using Google Guava or MultiMap is not an option.
I have a class
public class Alpha
{
String company;
int local;
String title;
}
I have many instances of this class (in order of millions). I need to process them and at the end find the unique ones and their duplicates.
e.g.
instance --> instance1, instance5, instance7 (instance1 has instance5 and instance7 as duplicates)
instance2 --> instance2 (no duplicates for instance 2)
My code works fine
declare datastructure
HashMap<Alpha,ArrayList<Alpha>> hashmap = new HashMap<Alpha,ArrayList<Alpha>>();
Add instances
for (Alpha x : arr)
{
ArrayList<Alpha> list = hashmap.get(x); ///<<<<---- doubt about this. comment#1
if (list == null)
{
list = new ArrayList<Alpha>();
hashmap.put(x, list);
}
list.add(x);
}
Print instances and their duplicates.
for (Alpha x : hashmap.keySet())
{
ArrayList<Alpha> list = hashmap.get(x); //<<< doubt about this. comment#2
System.out.println(x + "<---->");
for(Alpha y : list)
{
System.out.print(y);
}
System.out.println();
}
Question: My code works, but why? When I do hashmap.get(x) (comment#1 in the code), it is possible that two different instances have the same hash code. In that case, I would add 2 different objects to the same List.
When I retrieve it (comment#2), I should get a List which has 2 different instances, and when I iterate over the list, I should see at least one instance which is not a duplicate of the key but still exists in the list. I don't. Why? I even tried returning a constant value from my hashCode function, and it still works fine.
If you want to see my implementation of equals and hashCode,let me know.
Bonus question: Any way to optimize it?
Edit:
@Override
public boolean equals(Object obj) {
if (obj==null || obj.getClass()!=this.getClass())
return false;
if (obj==this)
return true;
Alpha guest = (Alpha)obj;
return guest.getLocal()==this.getLocal()
&& guest.getCompany() == this.getCompany()
&& guest.getTitle() == this.getTitle();
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (title==null?0:title.hashCode());
result = prime * result + local;
result = prime * result + (company==null?0:company.hashCode());
return result;
}
it is possible that two different instances might have same hashcode
Yes, but the hashCode method is only used to identify the index (bucket) where the element is stored. Two or more keys could have the same hashCode, which is why they are also compared using equals.
From Map#containsKey javadoc:
Returns true if this map contains a mapping for the specified key. More formally, returns true if and only if this map contains a mapping for a key k such that (key==null ? k==null : key.equals(k)). (There can be at most one such mapping.)
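A small experiment along these lines, assuming a made-up key class whose hashCode always returns the same constant (CollisionDemo, Key, and group are illustrative names, not part of the question's code): every key collides, yet the grouping stays correct because the map falls back to equals inside the bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CollisionDemo {
    // Hypothetical key whose hash code is the same for every instance.
    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return 42; // every key lands in the same bucket
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Same grouping loop as in the question: one list per distinct key.
    static Map<Key, List<Key>> group(List<Key> items) {
        Map<Key, List<Key>> groups = new HashMap<Key, List<Key>>();
        for (Key item : items) {
            List<Key> list = groups.get(item); // bucket chosen by hashCode, match decided by equals
            if (list == null) {
                list = new ArrayList<Key>();
                groups.put(item, list);
            }
            list.add(item);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<Key, List<Key>> groups = group(Arrays.asList(new Key("a"), new Key("b"), new Key("a")));
        System.out.println(groups.size()); // prints 2
    }
}
```

The constant hash code only hurts performance, since all lookups end up searching a single bucket; it does not affect correctness.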
Some enhancements to your current code:
Code to interfaces: declare the variable as Map and instantiate it as HashMap, and do the same with List and ArrayList.
Compare Strings and objects in general using the equals method. == compares references; equals compares the data stored in the object, depending on the implementation of this method. So, change the code in Alpha#equals:
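public boolean equals(Object obj) {
    if (obj == this)
        return true;
    if (obj == null || obj.getClass() != this.getClass())
        return false;
    Alpha guest = (Alpha) obj;
    // local is a primitive int, so == is correct there; the Strings may be null (hashCode allows it)
    return guest.getLocal() == this.getLocal()
        && (company == null ? guest.getCompany() == null : company.equals(guest.getCompany()))
        && (title == null ? guest.getTitle() == null : title.equals(guest.getTitle()));
}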
When iterating over all the key/value pairs of a map, use Map#entrySet instead of keySet plus Map#get. Since get is expected to be O(1) you won't save much, but it avoids one redundant lookup per key:
for (Map.Entry<Alpha, ArrayList<Alpha>> entry : hashmap.entrySet()) {
List<Alpha> list = entry.getValue();
System.out.println(entry.getKey() + "<---->");
for(Alpha y : list) {
System.out.print(y);
}
System.out.println();
}
Use equals along with hashCode to resolve collisions.
Steps:
First, hashCode() selects the bucket on the basis of the title.
If the title is the same, equals() is consulted, here based on the company name, to resolve the collision.
Sample code
class Alpha {
String company;
int local;
String title;
public Alpha(String company, int local, String title) {
this.company = company;
this.local = local;
this.title = title;
}
@Override
public int hashCode() {
return title.hashCode();
}
@Override
public boolean equals(Object obj) {
if (obj instanceof Alpha) {
return this.company.equals(((Alpha) obj).company);
}
return false;
}
}
...
Map<Alpha, ArrayList<Alpha>> hashmap = new HashMap<Alpha, ArrayList<Alpha>>();
hashmap.put(new Alpha("a", 1, "t1"), new ArrayList<Alpha>());
hashmap.put(new Alpha("b", 2, "t1"), new ArrayList<Alpha>());
hashmap.put(new Alpha("a", 3, "t1"), new ArrayList<Alpha>());
System.out.println("Size : "+hashmap.size());
Output
Size : 2

Java overriding equals() and hashcode() for two interchangeable integers

I'm overriding the equals and hashcode methods for a simple container object for two ints. Each int reflects the index of another object (it doesn't matter what that object is). The point of the class is to represent a connection between the two objects.
The direction of the connection doesn't matter, therefore the equals method should return true regardless of which way round the two ints are in the object E.g.
connectionA = new Connection(1,2);
connectionB = new Connection(1,3);
connectionC = new Connection(2,1);
connectionA.equals(connectionB); // returns false
connectionA.equals(connectionC); // returns true
Here is what I have (modified from the source code for Integer):
public class Connection {
// Simple container for two numbers which are connected.
// Two Connection objects are equal regardless of the order of from and to.
int from;
int to;
public Connection(int from, int to) {
this.from = from;
this.to = to;
}
// Modifed from Integer source code
@Override
public boolean equals(Object obj) {
if (obj instanceof Connection) {
Connection connectionObj = (Connection) obj;
return ((from == connectionObj.from && to == connectionObj.to) || (from == connectionObj.to && to == connectionObj.from));
}
return false;
}
@Override
public int hashCode() {
return from*to;
}
}
This does work; however, my question is: is there a better way to achieve this?
My main worry is that the hashCode() method will return the same hash code for any two pairs of integers whose products are equal. E.g.
3*4 = 12
2*6 = 12 // same!
The documentation, http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#hashCode(), states that
It is not required that if two objects are unequal according to the
equals(java.lang.Object) method, then calling the hashCode method on
each of the two objects must produce distinct integer results.
However, the programmer should be aware that producing distinct
integer results for unequal objects may improve the performance of
hashtables.
If anyone can see a simple way of reducing the number of matching hashcodes then I would be appreciative of an answer.
Thanks!
Tim
PS I'm aware that there is a java.sql.Connection which could cause some import annoyances. The object actually has a more specific name in my application but for brevity I shortened it to Connection here.
Three solutions that would "work" have been proposed. (By work, I mean that they satisfy the basic requirement of a hash code, namely that equal objects give equal outputs, that they usually give different outputs for different inputs, and that they also satisfy the OP's additional "symmetry" requirement.)
These are:
# 1
return from ^ to;
# 2
return to*to+from*from;
# 3
int res = 17;
res = res * 31 + Math.min(from, to);
res = res * 31 + Math.max(from, to);
return res;
The first one has the problem that the range of the output is bounded by the range of the actual input values. So for instance if we assume that the inputs are both non-negative numbers less than 2^i and 2^j respectively, then the output will be less than 2^max(i,j). That is likely to give you poor "dispersion" [1] in your hash table ... and a higher rate of collisions. (There is also a problem when from == to: the result is always 0!)
The second and third ones are better than the first, but you are still liable to get more collisions than is desirable if from and to are small.
I would suggest a 4th alternative if it is critical that you minimize collisions for small values of from and to.
#4
int res = Math.max(from, to);
res = (res << 16) | (res >>> 16); // exchange top and bottom 16 bits.
res = res ^ Math.min(from, to);
return res;
This has the advantage that if from and to are both in the range 0..2^16-1, you get a unique hashcode for each distinct (unordered) pair.
[1] - I don't know if this is the correct technical term for this ...
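The uniqueness claim for the 4th alternative can be checked by brute force. The following is only a sketch (PairHashDemo, pairHash, and distinctHashes are illustrative names): it counts the distinct hash codes over all unordered pairs below a limit and compares that with the number of pairs.

```java
import java.util.HashSet;
import java.util.Set;

class PairHashDemo {
    // The 4th alternative from the answer above, wrapped in a method.
    static int pairHash(int from, int to) {
        int res = Math.max(from, to);
        res = (res << 16) | (res >>> 16); // exchange top and bottom 16 bits
        return res ^ Math.min(from, to);
    }

    // Counts distinct hash codes over all unordered pairs 0 <= a <= b < limit.
    static int distinctHashes(int limit) {
        Set<Integer> seen = new HashSet<Integer>();
        for (int a = 0; a < limit; a++) {
            for (int b = a; b < limit; b++) {
                seen.add(pairHash(a, b));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int limit = 300;
        int pairs = limit * (limit + 1) / 2; // number of unordered pairs, including a == b
        System.out.println(distinctHashes(limit) + " distinct hash codes for " + pairs + " pairs");
    }
}
```

Because the larger value ends up in the top 16 bits and the smaller one in the bottom 16 bits, no two different unordered pairs below 2^16 can share a result.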
This is widely accepted approach:
@Override
public int hashCode() {
int res = 17;
res = res * 31 + Math.min(from, to);
res = res * 31 + Math.max(from, to);
return res;
}
I think something like
@Override
public int hashCode() {
return to*to+from*from;
}
is good enough
Typically I use XOR for hashcode method.
@Override
public int hashCode() {
return from ^ to;
}
I wonder why nobody offered the usually best solution: Normalize your data:
Connection(int from, int to) {
this.from = Math.min(from, to);
this.to = Math.max(from, to);
}
If it's impossible, then I'd suggest something like
27644437 * (from+to) + Math.min(from, to)
By using a multiplier different from 31, you avoid collisions like in this question.
By using a big multiplier you spread the numbers better.
By using an odd multiplier you ensure that the multiplication is bijective (i.e., no information gets lost).
By using a prime you gain nothing at all, but everyone does it and it has no disadvantage.
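The normalization idea can be sketched as a complete class. This is a hypothetical variant of Connection (renamed UndirectedConnection, with fields renamed low and high); once the constructor has ordered the two values, equals and hashCode no longer need any special handling for direction.

```java
final class UndirectedConnection {
    private final int low;  // always the smaller endpoint
    private final int high; // always the larger endpoint

    UndirectedConnection(int from, int to) {
        // Normalize once, so (1, 2) and (2, 1) are stored identically.
        this.low = Math.min(from, to);
        this.high = Math.max(from, to);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof UndirectedConnection)) return false;
        UndirectedConnection other = (UndirectedConnection) obj;
        return low == other.low && high == other.high;
    }

    @Override
    public int hashCode() {
        // Large odd multiplier, as suggested above; int overflow is harmless here.
        return 27644437 * low + high;
    }

    public static void main(String[] args) {
        UndirectedConnection a = new UndirectedConnection(1, 2);
        UndirectedConnection c = new UndirectedConnection(2, 1);
        System.out.println(a.equals(c) && a.hashCode() == c.hashCode()); // prints true
    }
}
```

The trade-off is that the original direction is lost; if the application ever needs to know which endpoint was "from", the unnormalized values would have to be kept as well.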
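Java 1.7+ has Objects.hash. Note that it is order-sensitive: Objects.hash(from, to) alone would give different hash codes for (1,2) and (2,1) even though equals treats them as equal, so pass the two values in a fixed order:
@Override
public int hashCode() {
return Objects.hash(Math.min(from, to), Math.max(from, to));
}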

what would be a good hash function for an integer tuple?

I have this class...
public class StartStopTouple {
public int iStart;
public int iStop;
public int iHashCode;
public StartStopTouple(String start, String stop) {
this.iStart = Integer.parseInt(start);
this.iStop = Integer.parseInt(stop);
}
@Override
public boolean equals(Object theObject) {
// check if 'theObject' is null
if (theObject == null) {
return false;
}
// check if 'theObject' is a reference to 'this' StartStopTouple... essentially they are the same Object
if (this == theObject) {
return true;
}
// check if 'theObject' is of the correct type as 'this' StartStopTouple
if (!(theObject instanceof StartStopTouple)) {
return false;
}
// cast 'theObject' to the correct type: StartStopTouple
StartStopTouple theSST = (StartStopTouple) theObject;
// check if the (start,stop) pairs match, then the 'theObject' is equal to 'this' Object
if (this.iStart == theSST.iStart && this.iStop == theSST.iStop) {
return true;
} else {
return false;
}
} // equal() end
@Override
public int hashCode() {
return iHashCode;
}
}
... and I define equality between such Objects only if iStart and iStop in one Object are equal to iStart and iStop in the other Object.
So since I've overridden equals(), I need to override hashCode() but I'm not sure how to define a good hash function for this class. What would be a good way to create a hash code for this class using iStart and iStop?
I'd be tempted to use this, particularly since you're going to memoize it:
Long.valueOf((((long) iStart) << 32) | iStop).hashCode();
From Bloch's "Effective Java":
int iHashCode = 17;
iHashCode = 31 * iHashCode + iStart;
iHashCode = 31 * iHashCode + iStop;
Note: 31 is chosen because the multiplication by 31 can be optimized by the VM into a shift and a subtraction (31 * i == (i << 5) - i). (But performance is not a concern in your case since, as mentioned by @Ted Hopp, you are only computing the value once.)
Note: it does not matter if iHashCode rolls over past the largest int.
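Put together, the Bloch-style computation with a cached value might look like the sketch below. StartStopPair is a hypothetical, immutable rewrite of the question's class that takes ints directly; caching is only safe here because both fields are final.

```java
final class StartStopPair {
    private final int start;
    private final int stop;
    private final int hash; // computed once in the constructor; valid because the fields never change

    StartStopPair(int start, int stop) {
        this.start = start;
        this.stop = stop;
        int h = 17;
        h = 31 * h + start; // overflow past Integer.MAX_VALUE simply wraps around
        h = 31 * h + stop;
        this.hash = h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StartStopPair)) return false;
        StartStopPair other = (StartStopPair) o;
        return start == other.start && stop == other.stop;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(new StartStopPair(1, 2).hashCode()); // prints 16370
        System.out.println(new StartStopPair(2, 1).hashCode()); // prints 16400
    }
}
```

Unlike a plain XOR, this scheme gives (1, 2) and (2, 1) different hash codes, which matches an equals that treats the pair as ordered.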
the simplest might be best
iHashCode = iStart^iStop;
the XOR of the two values
note this will give equal hashcodes when start and stop are swapped
as another possibility you can do
iHashCode = ((iStart<<16)|(iStart>>>16))^iStop;
This first rotates start by 16 bits and then XORs stop with it, so the least significant bits of the two values end up in different halves of the result. (If start is never larger than 65535, i.e. always below 2^16, you can omit the (iStart>>>16) part.)

Compound String key in HashMap

We are storing a String key in a HashMap that is a concatenation of three String fields and a boolean field. Problem is duplicate keys can be created if the delimiter appears in the field value.
So to get around this, based on advice in another post, I'm planning on creating a key class which will be used as the HashMap key:
class TheKey {
public final String k1;
public final String k2;
public final String k3;
public final boolean k4;
public TheKey(String k1, String k2, String k3, boolean k4) {
this.k1 = k1; this.k2 = k2; this.k3 = k3; this.k4 = k4;
}
public boolean equals(Object o) {
TheKey other = (TheKey) o;
//return true if all four fields are equal
}
public int hashCode() {
return ???;
}
}
My questions are:
What value should be returned from hashCode(). The map will hold a total of about 30 values. Of those 30, there are about 10 distinct values of k1 (some entries share the same k1 value).
To store this key class as the HashMap key, does one only need to override the equals() and hashCode() methods? Is anything else required?
Just hashCode and equals should be fine. The hashCode could look something like this:
public int hashCode() {
int hash = 17;
hash = hash * 31 + k1.hashCode();
hash = hash * 31 + k2.hashCode();
hash = hash * 31 + k3.hashCode();
hash = hash * 31 + (k4 ? 0 : 1);
return hash;
}
That's assuming none of the keys can be null, of course. Typically you could use 0 as the "logical" hash code for a null reference in the above equation. Two useful methods for compound equality/hash code which needs to deal with nulls:
public static boolean equals(Object o1, Object o2) {
if (o1 == o2) {
return true;
}
if (o1 == null || o2 == null) {
return false;
}
return o1.equals(o2);
}
public static int hashCode(Object o) {
return o == null ? 0 : o.hashCode();
}
Using the latter method in the hash algorithm at the start of this answer, you'd end up with something like:
public int hashCode() {
int hash = 17;
hash = hash * 31 + ObjectUtil.hashCode(k1);
hash = hash * 31 + ObjectUtil.hashCode(k2);
hash = hash * 31 + ObjectUtil.hashCode(k3);
hash = hash * 31 + (k4 ? 0 : 1);
return hash;
}
In Eclipse you can generate hashCode and equals by Alt-Shift-S h.
Ask Eclipse 3.5 to create the hashcode and equals methods for you :)
This is how a well-formed key class with equals and hashCode should look (generated with IntelliJ IDEA, with null checks enabled):
class TheKey {
public final String k1;
public final String k2;
public final String k3;
public final boolean k4;
public TheKey(String k1, String k2, String k3, boolean k4) {
this.k1 = k1;
this.k2 = k2;
this.k3 = k3;
this.k4 = k4;
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
TheKey theKey = (TheKey) o;
if (k4 != theKey.k4) return false;
if (k1 != null ? !k1.equals(theKey.k1) : theKey.k1 != null) return false;
if (k2 != null ? !k2.equals(theKey.k2) : theKey.k2 != null) return false;
if (k3 != null ? !k3.equals(theKey.k3) : theKey.k3 != null) return false;
return true;
}
@Override
public int hashCode() {
int result = k1 != null ? k1.hashCode() : 0;
result = 31 * result + (k2 != null ? k2.hashCode() : 0);
result = 31 * result + (k3 != null ? k3.hashCode() : 0);
result = 31 * result + (k4 ? 1 : 0);
return result;
}
}
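To see why a key class sidesteps the delimiter problem from the question, here is a small sketch. CompoundKeyDemo, CompoundKey, and joined are illustrative names, "|" is an assumed delimiter, and the key is a compact variant of TheKey written with Java 7's java.util.Objects. Two different field combinations produce the same concatenated String but remain distinct as key objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class CompoundKeyDemo {
    // Hypothetical null-safe variant of TheKey.
    static final class CompoundKey {
        final String k1;
        final String k2;
        final String k3;
        final boolean k4;

        CompoundKey(String k1, String k2, String k3, boolean k4) {
            this.k1 = k1;
            this.k2 = k2;
            this.k3 = k3;
            this.k4 = k4;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CompoundKey)) return false;
            CompoundKey other = (CompoundKey) o;
            return k4 == other.k4
                    && Objects.equals(k1, other.k1)
                    && Objects.equals(k2, other.k2)
                    && Objects.equals(k3, other.k3);
        }

        @Override
        public int hashCode() {
            return Objects.hash(k1, k2, k3, k4);
        }
    }

    // The old approach: concatenate the fields with an assumed "|" delimiter.
    static String joined(String k1, String k2, String k3, boolean k4) {
        return k1 + "|" + k2 + "|" + k3 + "|" + k4;
    }

    public static void main(String[] args) {
        Map<String, Integer> byString = new HashMap<String, Integer>();
        byString.put(joined("a|b", "c", "d", true), 1);
        byString.put(joined("a", "b|c", "d", true), 2); // same String, overwrites the first entry

        Map<CompoundKey, Integer> byKey = new HashMap<CompoundKey, Integer>();
        byKey.put(new CompoundKey("a|b", "c", "d", true), 1);
        byKey.put(new CompoundKey("a", "b|c", "d", true), 2); // distinct key, kept separately

        System.out.println(byString.size() + " vs " + byKey.size()); // prints 1 vs 2
    }
}
```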
The implementation of your hashCode() doesn't matter much unless you make it super stupid. You could very well just return the sum of all the strings hash codes (truncated to an int) but you should make sure you fix this:
If your hash code implementation is slow, consider caching it in the instance. Depending on how long your key objects stick around and how they are used with the hash table when you get things out of it, you may not want to spend longer than necessary calculating the same value over and over again. If you stick with Jon's implementation of hashCode() there is probably no need for it, as String already caches its hashCode() for you.
This is, however, more general advice: since the mid 90's I've seen quite a few developers get stung by slow (and even worse, changing) hashCode() implementations.
Don't be sloppy when you create the equals() implementation. Your equals() above will be both ineffective and flawed. First of all you don't need to compare the values if the objects have different hash codes. You should also return false (and not a null pointer exception) if you get a null as the argument.
The rules are simple, this page will walk you through them.
Edit:
I have to ask one more thing... You say "Problem is duplicate keys can be created if the delimiter appears in the field value". Why is that?
If the format is key+delimiter+key+delimiter+key it really doesn't matter if there are one or more delimiters in the keys unless you get really unlucky with a combination of two keys and in that case you probably should have selected another delimiter (there are quite a few to choose from in unicode).
Anyway, Jon is right in his comment below... Don't do caching "until you've proven it's a good thing". It is a good practice always.
Have you taken a look at the specifications of hashCode()? Perhaps this will give you a better idea of what the function should return.
I do not know if this is an option for you but apache commons library provides an implementation for MultiKeyMap
For the hashCode, you could instead use something like
k1.hashCode() ^ k2.hashCode() ^ k3.hashCode() ^ Boolean.valueOf(k4).hashCode()
XOR is entropy-preserving, and this incorporates k4's hashCode in a much better way than the previous suggestions. Just having one bit of information from k4 means that if all your composite keys have identical k1, k2, k3 and only differing k4s, your hash codes will all be identical and you'll get a degenerate HashMap.
I thought your main concern was speed (based on your original post)? Why don't you just make sure you use a separator which does not occur in your (handful of) field values? Then you can just create a String key using concatenation and do away with all this 'key-class' hocus pocus. Smells like serious over-engineering to me.
