I want to know the fastest way to compare 2 objects in Java 8.
I have 2 objects of the same class with 100 properties.
What is the fastest way to find the properties that have different values, apart from a compareTo() method in which the properties are checked one by one?
You may optimize the equals method so that it bails out as soon as a difference is found.
If the object is immutable, you may cache its hashCode value and compare the hash values as the first step in the equals method.
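For illustration, a minimal sketch of that cached-hash shortcut (the class and field names here are made up; a real class would cover all 100 properties):

import java.util.Objects;

public final class Snapshot { // immutable, so the hash can be cached safely
    private final String name;
    private final String address;
    private final int cachedHash;

    public Snapshot(String name, String address) {
        this.name = name;
        this.address = address;
        this.cachedHash = Objects.hash(name, address); // computed once
    }

    @Override
    public int hashCode() {
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Snapshot)) {
            return false;
        }
        Snapshot o = (Snapshot) other;
        if (cachedHash != o.cachedHash) {
            return false; // different hashes: cannot be equal, skip the field-by-field check
        }
        return Objects.equals(name, o.name) && Objects.equals(address, o.address);
    }
}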
Another way to optimize would be to pick out some of the equals checks that will most likely return false and separate them from all the others. This would give the JIT compiler a chance to inline the fast track. Note that this will only improve performance when the equals method is called often enough to get compiled and the fast track is actually inlined. The latter depends on its size and other factors, so there is no guarantee, and you would need to verify and experiment a bit with a microbenchmark tool like JMH.
Having all the comparisons in one method reduces the likelihood of inlining, since a whole method with 100 comparisons is most likely too big to be inlined. The JIT compiler's profiling works at the method level, so either a complete method gets inlined or it does not.
Note that this is already advanced micro-optimization. Do this only when your comparison is used frequently and there is a real need for optimization. I used this approach successfully in one of my projects, where we had a high-load scenario with tight time constraints, and we did it only because we had run out of other possible optimizations. So think twice about whether you really want to optimize here.
Example:
@Override
public boolean equals(Object other) {
    // fast track that may get inlined as long as it is
    // not too big
    if (!equalsFast(other)) {
        return false;
    }
    // slow track that will not be inlined but only
    // called sometimes
    return equalsOthers(other);
}
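What the two halves might look like is sketched below; the class name MyValue and the choice of id and type as the "most likely to differ" fields are assumptions made purely for illustration:

// inside the hypothetical MyValue class; uses java.util.Objects
private boolean equalsFast(Object other) {
    if (!(other instanceof MyValue)) {
        return false;
    }
    MyValue o = (MyValue) other;
    // the handful of checks that most often detect a difference
    return this.id == o.id && Objects.equals(this.type, o.type);
}

private boolean equalsOthers(Object other) {
    // only reached when the fast track did not find a difference
    MyValue o = (MyValue) other; // safe: equalsFast already verified the type
    return Objects.equals(this.name, o.name)
            && Objects.equals(this.description, o.description)
            // ... the remaining ~96 comparisons ...
            && this.version == o.version;
}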
A basic performance question from someone coming from C/C++.
I'm using a Collection (ArrayDeque) simply to hold, add, and remove items by identity. I know the contract is for the collection to use equals() when checking equality, for example during remove(obj), but in my case I want reference semantics (like IdentityHashMap, but I don't need the map). So I am fine with simply knowing that I will never override equals() on any of the objects held inside the collection (which is declared to hold an interface).
Coming from native programming, I can't help asking myself: will the compiled code of remove(obj) traverse the items and perform a virtual call on Object.equals() only to end up comparing addresses? Since I'm storing interface references, there is no way (?) to optimise this using final so the compiler doesn't bother making the useless calls (i.e. inlines them) - but now I'm getting ahead of myself, because such an optimisation may not be necessary anyway and the JVM has other means (devirtualisation?) to generate optimal code in this case.
Assuming my code needs the level of optimisation that can be obtained by thinking about this aspect in the first place - is my understanding correct? What is a good design for this case?
Making the method final won't avoid the virtual call, because the invokevirtual opcode will be used anyway and there is no way for the JVM to tell whether the method was final or not.
The good news is that the JVM might be able to inline it or avoid the virtual call if it doesn't see the method overridden anywhere on the classpath, so your performance will improve as your program runs.
When you use the remove method, it will call the equals method for comparison. Ideally, you should override equals and hashCode when you rely on such methods; otherwise the default implementation, which compares references, is used. It is highly recommended to provide your own equals and hashCode implementations when using the methods of the Collections framework.
Regarding performance, yes, you are right - all the objects in the collection will be scanned linearly until the correct match is encountered. It is a linear search, so this removal operation takes O(n) time.
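A minimal sketch of that point (the interface and class names are made up): as long as none of the stored classes override equals(), the inherited Object.equals() is a pure reference comparison, so remove(obj) already removes by identity.

import java.util.ArrayDeque;

public class IdentityRemoveDemo {

    interface Task { }                          // the interface the collection is declared to hold
    static class PrintTask implements Task { }  // deliberately does NOT override equals/hashCode

    public static void main(String[] args) {
        ArrayDeque<Task> tasks = new ArrayDeque<>();
        Task a = new PrintTask();
        Task b = new PrintTask();
        tasks.add(a);
        tasks.add(b);

        tasks.remove(a); // Object.equals() compares references, so exactly 'a' is removed

        System.out.println(tasks.size());      // prints 1
        System.out.println(tasks.contains(b)); // prints true
    }
}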
Is there any difference between doing
if (numberOfEntries >= array.length) {do stuff}; // Check if array is full directly
over doing something like
private boolean isArrayFull() {
    return numberOfEntries >= array.length;
}
if (isArrayFull()) {do stuff}; // Call a check function
Over large arrays, many iterations, and any other execution environment, is there any difference between these approaches other than readability and code duplication, if I need to check whether the array is full anywhere else?
Forget about performance. That is negligible.
But if you are doing the check in many places, a util method like isArrayFull() makes sense, because if you later add more conditions to the check, changing the function is reflected everywhere.
As said above, first make your design good and then track down performance issues using proper tools. The Java JIT performs inlining optimisations, so there is no difference.
The JIT aggressively inlines methods, removing the overhead of method calls
from https://techblug.wordpress.com/2013/08/19/java-jit-compiler-inlining/
Note: the explanation below is not specific to any language; it is generic.
The difference shows up when you analyze the options at the machine level. A function call is really some JMP operations and a lot of PUSH/POP operations on the CPU, whereas an IF is usually a single CMP operation, which is much cheaper than anything that happens during a function call.
If your IFs usually return the same result, I wouldn't worry about it, as the CPU optimizes IFs very well by predicting the outcome, as long as they are "predictable" (they usually return true, usually return false, or follow some pattern of true/false).
I would go with the plain IFs only in cases where even a negligible improvement in performance is a big deal.
In cases like web applications, reducing code redundancy to keep the code manageable and readable is far more important than an optimization that saves a few instructions at the machine level.
Simple question, asked mostly out of curiosity about what Java compilers are smart enough to do. I know not all compilers are built equally, but I'm wondering whether others feel it's reasonable to expect this optimization on most compilers I'm likely to run against, not whether it works on a specific version or on all versions.
So let's say that I have some tree structure and I want to collect all the descendants of a node. There are two easy ways to do this recursively.
The more natural method, for me, would be something like this:
public Set<Node> getDescendants() {
    Set<Node> descendants = new HashSet<Node>();
    descendants.addAll(getChildren());
    for (Node child : getChildren()) {
        descendants.addAll(child.getDescendants());
    }
    return descendants;
}
However, assuming no compiler optimizations and a decent-sized tree, this could get rather expensive. On each recursive call I create and fully populate a set, only to return that set up the stack so the calling method can add its contents to its own descendants set, discarding the set that was just built and populated in the recursive call.
So I'm creating many sets just to have them discarded as soon as I return their contents. Not only do I pay a minor initialization cost for building the sets, but I also pay the more substantial cost of moving all the contents of one set into the larger set. In large trees most of my time is spent moving Nodes around in memory from set A to set B. I think this even makes my algorithm O(n^2) instead of O(n) due to the time spent copying Nodes, though it may work out to O(n log(n)) if I sat down to do the math.
I could instead have a simple getDescendants method that calls a helper method that looks like this:
public Set<Node> getDescendants() {
    Set<Node> descendants = new HashSet<Node>();
    getDescendantsHelper(descendants);
    return descendants;
}

public Set<Node> getDescendantsHelper(Set<Node> descendants) {
    descendants.addAll(getChildren());
    for (Node child : getChildren()) {
        child.getDescendantsHelper(descendants);
    }
    return descendants;
}
This ensures that I only ever create one set and don't waste time copying from one set to another. However, it requires writing two methods instead of one and generally feels a little more cumbersome.
The question is, do I need to do option two if I'm worried about optimizing this sort of method? Or can I reasonably expect the Java compiler, or the JIT, to recognize that I am only creating temporary sets for the convenience of returning their contents to the calling method, and to avoid the wasteful copying between sets?
edit: cleaned up a bad copy-paste job which led to my sample method adding everything twice. You know something is bad when your 'optimized' code is slower than your regular code.
The question is, do I need to do option two if I'm worried about optimizing this sort of method?
Definitely yes. If performance is a concern (and most of the time it is not!), then you need it.
The compiler optimizes a lot, but on a very different scale. Basically, it works on one method at a time and optimizes the most commonly used path therein. Due to heavy inlining it can sort of optimize across method calls, but nothing like the above.
It can also optimize away needless allocations, but only in very simple cases. Maybe something like
int sum(int... a) {
    int result = 0;
    for (int x : a) result += x;
    return result;
}
Calling sum(1, 2, 3) means allocating an int[3] for the varargs argument, and this allocation can be eliminated (whether the compiler really does so is a different question). It could even find out that the result is a constant (which I doubt it really does). If the result doesn't get used, it can perform dead code elimination (this happens rather often).
Your example involves allocating a whole HashSet and all its entries, and is several orders of magnitude more complicated. The compiler has no idea how a HashSet works and it can't find out, e.g., that after m.addAll(m1) the set m contains all members of m1. No way.
This is an algorithmic optimization rather than a low-level one. That's what humans are still needed for.
For things the compiler could do (but currently fails to), see e.g. these questions of mine concerning associativity and bounds checks.
In the following piece of code we call listType.getDescription() twice:
for (ListType listType : this.listTypeManager.getSelectableListTypes())
{
    if (listType.getDescription() != null)
    {
        children.add(new SelectItem(listType.getId(), listType.getDescription()));
    }
}
I would tend to refactor the code to use a single variable:
for (ListType listType : this.listTypeManager.getSelectableListTypes())
{
    String description = listType.getDescription();
    if (description != null)
    {
        children.add(new SelectItem(listType.getId(), description));
    }
}
My understanding is that the JVM is somehow optimized for the original code, and especially for nested calls like children.add(new SelectItem(listType.getId(), listType.getDescription()));.
Comparing the two options, which one is the preferred method and why? That is, in terms of memory footprint, performance, readability/ease, and other criteria that don't come to my mind right now.
When does the latter snippet become more advantageous than the former? That is, is there any (approximate) number of listType.getDescription() calls at which using a temporary local variable becomes more desirable, given that listType.getDescription() always requires some stack operations to push the this reference?
I'd nearly always prefer the local variable solution.
Memory footprint
A single local variable costs 4 or 8 bytes. It's a reference and there's no recursion, so let's ignore it.
Performance
If this is a simple getter, the JVM can memoize it itself, so there's no difference. If it's an expensive call that can't be optimized, memoizing manually makes it faster.
Readability
Follow the DRY principle. In your case it hardly matters, as the local variable name is about as long as the method call, but for anything more complicated it improves readability, since you don't have to find the 10 differences between the two expressions. You know they're the same, so make that clear by using the local variable.
Correctness
Imagine your SelectItem does not accept nulls and your program is multithreaded. The value of listType.getDescription() can change in the meantime and you're toast.
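For illustration, the race in the original snippet is between the null check and the second call (assuming another thread can mutate the ListType):

if (listType.getDescription() != null) {
    // another thread may set the description to null right here,
    // so this second call can hand a null to SelectItem after all
    children.add(new SelectItem(listType.getId(), listType.getDescription()));
}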
Debugging
Having a local variable containing an interesting value is an advantage.
The only thing you win by omitting the local variable is saving one line. So I'd do it only in cases where it really doesn't matter:
very short expression
no possible concurrent modification
simple private final getter
I think option two is definitely better because it improves the readability and maintainability of your code, which is the most important thing here. This kind of micro-optimization won't really help you unless you're writing an application where every millisecond is important.
I'm not sure either is preferred. What I would prefer is clearly readable code over performant code, especially when that performance gain is negligible. In this case I suspect there's next to no noticeable difference (especially given the JVM's optimisations and code-rewriting capabilities)
In the context of imperative languages, the value returned by a function call cannot be memoized (see http://en.m.wikipedia.org/wiki/Memoization) because there is no guarantee that the function has no side effects. Accordingly, your strategy does indeed avoid a function call, at the expense of allocating a temporary variable to store a reference to the value returned by the function call.
In addition to being slightly more efficient (which does not really matter unless the function is called many times in a loop), I would opt for your style due to better code readability.
I agree with everything. About readability, I'd like to add something:
I see lots of programmers doing things like:
if (item.getFirst().getSecond().getThird().getForth() == 1 ||
        item.getFirst().getSecond().getThird().getForth() == 2 ||
        item.getFirst().getSecond().getThird().getForth() == 3)
Or even worse:
item.getFirst().getSecond().getThird().setForth(item2.getFirst().getSecond().getThird().getForth());
If you are calling the same chain of 10 getters several times, please use an intermediate variable. It's just much easier to read and debug.
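For example, the first snippet above becomes much clearer with a single intermediate variable (reusing the made-up getters from that example):

int forth = item.getFirst().getSecond().getThird().getForth();
if (forth == 1 || forth == 2 || forth == 3) {
    // ...
}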
I would agree with the local-variable approach for readability only if the local variable's name is self-documenting. Calling it "description" wouldn't be enough (which description?). Calling it "selectableListTypeDescription" would make it clear. I would also suggest that the loop variable be named "selectableListType" (especially if the "listTypeManager" has accessors for other ListTypes).
The other reason would be if there's no guarantee that this is single-threaded or that your list is immutable.
Is it good practice to use String#intern() in the equals method of a class? Suppose we have the class:
public class A {

    private String field;
    private int number;

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final A other = (A) obj;
        if ((this.field == null) ? (other.field != null) : !this.field.equals(other.field)) {
            return false;
        }
        if (this.number != other.number) {
            return false;
        }
        return true;
    }
}
Will it be faster to use field.intern() != other.field.intern() instead of !this.field.equals(other.field)?
No! Using String.intern() implicitly like this is not a good idea:
It will not be faster. As a matter of fact, it will be slower due to the use of a hash table in the background. A get() operation on a hash table involves a final equality check, which is exactly what you want to avoid in the first place. And used like this, intern() will be called each and every time you call equals() on your class.
String.intern() has a lot of memory/GC implications that you should not implicitly force on users of this class.
If you want to avoid full blown equality checks when possible, consider the following avenues:
If you know that the set of strings is limited and you have repeated equality checks, you can call intern() on the field at object creation, so that any subsequent equality checks come down to an identity comparison.
Use an explicit HashMap or WeakHashMap instead of intern() to avoid storing the strings in the GC permanent generation - this was an issue in older JVMs; I'm not sure whether it is still a valid concern. (A sketch of such a hand-rolled pool follows this list.)
Keep in mind that if the set of strings is unbounded, you will have memory issues.
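As a sketch of the explicit-map alternative mentioned in the second point above (the class name StringPool is made up), a hand-rolled pool keeps the canonical instances under your control instead of in the JVM-wide intern table. Note that this simple version holds strong references, so its entries live as long as the pool object does:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StringPool {
    private final Map<String, String> pool = new ConcurrentHashMap<>();

    // Returns a canonical instance for s: equal inputs yield the same reference.
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}

Fields canonicalized this way at construction time let later equals() calls short-circuit on the reference check inside String.equals().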
That said, all this sounds like premature optimization to me. String.equals() is pretty fast in the general case, since it compares the string lengths before comparing the strings themselves. Have you profiled your code?
Good practice: Nope. You're doing something tricky, and that makes for brittle, less readable code. Unless this equals() method needs to be crazy performant (and your performance tests validate that it is in fact faster), it's not worth it.
Faster: Could be. But don't forget that you can have unintended side effects from using the intern() method: http://www.onkarjoshi.com/blog/213/6-things-to-remember-about-saving-memory-with-the-string-intern-method/
Any benefit gained by performing an identity comparison on the interned Strings is likely to be outweighed by the associated cost of interning the Strings.
In the above case you could consider interning the String when you instantiate the class, provided the field is constant (in which case you should also mark it final). You could also check for null on instantiation to avoid having to check on each call to equals (assuming you disallow null Strings).
However, in general these types of micro-optimisation offer little gain in performance.
Let's go through this one step at a time...
The idea here is that if you use String#intern, you'll be given a canonical representation of that String. A pool of Strings is kept internally and each entry is guaranteed to be unique for that pool with regard to equals. If you call intern() on a String, then either a previously pooled identical String is going to be returned, or the String you called intern on is going to be pooled and returned.
So if we have two Strings s1 and s2, and we assume neither is null, then the following two lines of code are equivalent:
s1.equals(s2);
s1.intern() == s2.intern();
Let's investigate two assumptions we've made now:
s1.intern() and s2.intern() really will return the same object if s1.equals(s2) evaluates to true.
Using the == operator on two interned references to the same String will be more efficient than using the equals method.
The first assumption is probably the most dangerous of all. The JavaDoc for the intern method tells us that using this method will return a canonical representation for an internally kept pool of Strings. But it doesn't tell us anything about that pool. Once an entry has been added to the pool, can it ever be removed again? Will the pool keep growing indefinitely or will entries occasionally be culled to make it act as a limited-size cache? You'd have to check the actual specifications of the Java Language and Virtual Machine to get any certainty, if they offer it at all. Having to check specs for a limited optimization is usually a big warning sign. Checking the source code for Sun's JDK 7, I see that intern is specified as a native method. So not only is the implementation likely to be vendor-specific, it might vary across platforms as well for VMs from the same vendor. All bets are off regarding stuff that's not in the spec.
On to our second assumption. Let's consider for a moment what it would take to intern a String... First of all, we'll need to check if the String is already in the pool. We'll assume they've tried to get an O(1) complexity going there to keep this fast by using some hashing scheme. But that's assuming we've got a hash of the String. Since this is a native method, I'm not certain what would be used... Some hash of the native representation or simply what hashCode() returns. I know from the source code of Sun's JDK that a String instance caches its hash code. It'll only be calculated the first time the method is called, and after that the calculated value will be returned. So at the very least, a hash must be calculated at least once if we're to use that. Getting a reliable hash of a String will probably involve arithmetic on each and every character, which can be expensive for lengthy values. Even once we have the hash, and thus a set of Strings that are candidates for being matches in the interned pool, we'd still have to verify whether one of these really is an exact match, which would involve... an equality check. Meaning going through each and every character of the Strings and seeing if they match, if trivial cases like unequal length can't be applied first. Worse still, we might have to do this for more than one other String like we'd do with a regular equals, since multiple Strings in the pool might have the same hash or end up in the same hash bucket.
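Purely as a conceptual sketch (the real pool lives in native code and will differ), an intern-style lookup has to do roughly the following, and it already contains the equality check we were hoping to avoid:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConceptualPool {
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    String intern(String s) {
        int h = s.hashCode(); // may require a full pass over the characters
        List<String> bucket = buckets.computeIfAbsent(h, k -> new ArrayList<>());
        for (String candidate : bucket) {
            if (candidate.equals(s)) { // the char-by-char comparison is back
                return candidate;
            }
        }
        bucket.add(s);
        return s;
    }
}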
So, the stuff we need to do to find out whether a String was already interned sounds suspiciously like what equals would need to do. Basically, we've gained nothing and might even have made our equals implementation more expensive. At least, if we're going to call intern each and every time. So maybe we should intern the String right away and simply always use that interned reference. Let's check how class A would look if that were the case. I'm assuming the String field is initialized on construction:
public class A {

    private final String field;

    public A(final String s) {
        field = s.intern();
    }
}
That's looking a little more sensible. Any Strings that are passed to the constructor and are equal will end up being the same reference. Now we can safely use == between the field field of A instances for equality checks, right?
Well, it'd be useless. Why? If you check the source for equals in class String, you'll find that any implementation made by someone with half a brain will first do a == check to catch the trivial case where the instance and the argument are the same reference. That could save a potentially heavy char-by-char comparison. I know the JDK 7 source I'm using for reference does this. So you're still better off using equals, because it does that reference check anyway.
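Roughly, the shape of that implementation is the following (a simplified sketch, not the verbatim JDK source):

// simplified sketch of what String.equals does; written as a helper because String is final
static boolean stringEqualsSketch(String a, Object anObject) {
    if (a == anObject) {            // the reference check interning would rely on is already here
        return true;
    }
    if (anObject instanceof String) {
        String b = (String) anObject;
        if (a.length() == b.length()) {
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) {
                    return false;
                }
            }
            return true;
        }
    }
    return false;
}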
The second reason this'd be a bad idea is that first point way up above... We simply don't know if the instances are going to be kept in the pool indefinitely. Check this scenario, which may or may not occur depending on JVM implementation:
String s1 = ... //Somehow gets passed a non-interned "test" value
A a1 = new A(s1);
//Lots of time passes... winter comes and goes and spring returns the land to a lush green...
String s2 = ... //Somehow gets passed a non-interned "test" value
A a2 = new A(s2);
a1.equals(a2); //Totally returns the wrong result
What happened? Well, if it turns out the interned String pool is sometimes culled of certain entries, then that first construction of an A could have s1 interned, only to see it removed from the pool and later replaced by that s2 instance. Since s1 and s2 are conceivably different instances, the == check fails. Can this happen? I've got no idea. I certainly won't go check the specs and native code to find out. But will the programmer who's going through your code with a debugger, trying to find out why the hell "test" is not considered the same as "test"?
It's no problem if we're using equals. It'll catch the same instance case early for optimal results, which will benefit us when we've interned our Strings, but we won't have to worry about cases where the instances still end up being different because then equals is gonna do the classic compare work. It just goes to show that it's best not to second-guess the actual runtime implementation or compiler, because these things were made by people who know the specs like the back of their hands and really worry about performance.
So String interning manually can be of benefit when you make sure that...
you're not interning each and every time, but just intern a String once, such as when initializing a field, and then keep using that interned instance;
you still use equals to make sure implementation details won't ruin your day and your code doesn't actually rely on that interning, instead relying on the implementation of the method to catch the trivial cases.
After keeping this in mind, surely it's worth using intern()? Well, we still don't know how expensive intern() is. It's a native method so it might be really fast. But we're not sure unless we check the code for our target platform and JVM implementation. We've also had to make sure we understand exactly what interning does and what assumptions we've made about it. Are you sure the next person reading your code will have the same level of understanding? They might be bewildered about this weird method they've never seen before that dabbles in JVM internals and might spend an hour reading the same gibberish I'm typing right now, instead of getting work done.
That's the problem right there... Before, it was simple. You used equals and were done. Now, you've added another little thing that can nestle itself in your mind and cause you to wake up screaming one night because you've just realized that oh my God you've forgotten to take out one of the == uses and that piece of code is used in a routine controlling the killer bots' appraisal of citizen disobedience and you've heard its JVM isn't too solid!
Donald Knuth is famously credited with the quote...
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Knuth was clever enough to add in that 97% detail. Sometimes, thoroughly micro-optimizing a small portion of code can make a big difference; say, if that piece of code takes up 30% of the program's runtime. The problem with micro-optimizations is that they tend to rely on assumptions. When you start using intern() and believe that from then on it'll be safe to make reference equality checks, you've made a hell of a lot of assumptions. And even if you go down to the implementation level to check that they're right, are you sure they'll still hold in the next JRE version?
I myself have used intern() manually. Did it in some piece of code where the same handful of Strings are gonna end up in hundreds if not thousands of object instances as fields. Those fields are gonna be used as keys in HashMaps and are frequently used while doing some validation over those instances. I figured interning was worth it for two purposes: reducing memory overhead by making all those equal Strings one single instance and speeding up the map lookups, since they're using hashCode() and equals. But I've made damn sure that you can take all those intern() calls out of the code and everything will still work fine. The interning is just some icing on the cake in this case, a little extra that may or may not make a bit of difference along the road. But it's not an essential part of my code's correctness.
Long post, eh? Why'd I go through the trouble of typing all of this up? To show you that if you make micro-optimizations, you'd better know damn well what you're doing and willing to document it so thoroughly that you might as well not have bothered.
This is hard to say, given that you have not specified the hardware. Timing tests are difficult to get right and are not universal. Have you done a timing test yourself?
My feeling is that the intern pattern would not be faster, as each string would need to be matched against a possible string in a dictionary of all interned strings.