Java collection performance when comparing items

A basic performance question from someone coming from C/C++.
I'm using a Collection (ArrayDeque) to simply hold, add, remove items by identity. I know the contract is for the collection to use equals() when checking equality, for example during remove(obj), but in my case I want to use reference semantics (like IdentityHashMap but don't need the map). So I am fine to just know that I will never override the equals() on any of the objects held inside the collection (which is declared to hold an interface).
Coming from native programming I can't avoid asking myself, will the compiled code of remove(obj) traverse items and perform a virtual call on Object.equals() only to end up comparing addresses? Since I'm storing interface references, there is no way (?) to optimise this using final so the compiler doesn't bother making the useless calls (i.e. inline them) - but now I'm getting ahead of myself because it may be such optimisation is not necessary anyway and JVM has other means (devirtualisation?) to generate optimal code in this case.
Assuming my code needs the level of optimisation that can be obtained by thinking about this aspect in the first place - is my understanding correct? What is a good design for this case?

Making the method final won't avoid the virtual call at the bytecode level: javac emits the invokevirtual opcode either way, and the call site itself does not record whether the target method is final.
The good news is that the JVM may be able to inline the call or avoid the virtual dispatch at run time if it can see that the method is not overridden in any loaded class, so performance will improve as your program runs and the JIT compiler kicks in.

When you use the remove method, it will call the equals method for comparison. Ideally, you should override the equals and hashCode methods when you rely on such methods. Otherwise the default implementation inherited from Object is used, which simply compares references. It is highly recommended to define your own implementation of equals and hashCode when using methods of Collections that depend on equality.
Regarding the performance, yes you are right: the objects in the collection will be scanned linearly until a match is found. It is a linear search, hence removal takes O(n) time.
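If reference semantics are what is wanted, one option is to avoid equals() entirely rather than rely on it never being overridden. The sketch below is only an illustration under assumptions: Item and LooseItem are hypothetical stand-ins for the asker's interface and one of its implementations, and removeByIdentity and newIdentitySet are made-up helper names. The first helper scans the deque with ==, the second builds a Set view over an IdentityHashMap.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.Set;

final class IdentityRemovalSketch {

    /** Hypothetical element type standing in for the asker's interface. */
    interface Item {}

    /** Hypothetical implementation whose equals() treats every instance as equal. */
    static final class LooseItem implements Item {
        @Override public boolean equals(Object o) { return o instanceof LooseItem; }
        @Override public int hashCode() { return 1; }
    }

    /** Removes the first element that is the very same object as target (==, never equals()). */
    static <T> boolean removeByIdentity(ArrayDeque<T> deque, T target) {
        for (Iterator<T> it = deque.iterator(); it.hasNext(); ) {
            if (it.next() == target) {
                it.remove();
                return true;
            }
        }
        return false;
    }

    /** A Set with reference semantics, backed by an IdentityHashMap. */
    static <T> Set<T> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<>());
    }
}
```

The deque version is still a linear scan; the identity set trades ordering for constant-time add, remove and contains on average.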

Related

Fastest way to compare 2 large objects in java

I want to know what would be the fastest way to compare 2 objects in java 8.
I have 2 objects of the same class with 100 properties.
What is the fastest way to find the properties which have different values apart from the compareTo() method in which they are checked for the properties one by one.

You may optimize the equals method so that it bails out as soon as a difference is found.
If the object is immutable, you may cache its hashCode value and compare the hash values as the first step in the equals method: different hash codes mean the objects cannot be equal.
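A minimal sketch of the cached-hash idea, assuming an immutable class with only three fields instead of 100. The class name ImmutableRecordSketch and its fields are hypothetical; the point is only the order of the checks.

```java
import java.util.Objects;

final class ImmutableRecordSketch {
    private final String name;
    private final int quantity;
    private final long timestamp;
    // Cached once in the constructor; safe only because every field is final.
    private final int hash;

    ImmutableRecordSketch(String name, int quantity, long timestamp) {
        this.name = name;
        this.quantity = quantity;
        this.timestamp = timestamp;
        this.hash = Objects.hash(name, quantity, timestamp);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImmutableRecordSketch)) return false;
        ImmutableRecordSketch other = (ImmutableRecordSketch) o;
        // Different hashes prove inequality; equal hashes prove nothing,
        // so the field-by-field comparison below is still required.
        if (hash != other.hash) return false;
        return quantity == other.quantity
                && timestamp == other.timestamp
                && Objects.equals(name, other.name);
    }
}
```

The && chain already stops at the first mismatch, so cheap primitive comparisons are placed before the String comparison.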

Another way to optimize would be to pick out some of the equals checks that will most likely return false and separate them from all others. This would give the JIT compiler a chance to inline the fast track. Note that this will only improve performance when the equals method is called often enough to get compiled and the fast track is actually inlined. The latter depends on its size and also on other factors. So there is no guarantee, and you would need to verify and experiment a bit with a microbenchmark tool like JMH.
Having all the comparisons in one method reduces the likelihood of inlining, since the whole method with 100 comparisons is most likely too big for inlining. The JIT compiler's profiling works at method level, so either a complete method gets inlined or it does not get inlined.
Note that this is already advanced micro-optimization. Do this only when your comparison is used frequently and there is a real need for optimization. I used this approach successfully in one of my projects, where we had a high-load scenario with tight time constraints. We did this only because we ran out of other possible optimizations. So think twice about whether you really want to optimize here.
Example:
public boolean equals(Object other) {
    // fast track that may get inlined as long as it is
    // not too big
    if (!equalsFast(other)) {
        return false;
    }
    // slow track that will not be inlined but only
    // called sometimes
    return equalsOthers(other);
}
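The snippet above leaves equalsFast and equalsOthers undefined. One way they might be filled in is sketched below, assuming a hypothetical class SplitEqualsSketch with four fields where id and version are the ones expected to differ most often. Whether the small method really gets inlined depends on the JVM and has to be measured.

```java
import java.util.Objects;

final class SplitEqualsSketch {
    // Hypothetical fields; id and version are assumed to differ most often.
    private final long id;
    private final int version;
    private final String description;
    private final String notes;

    SplitEqualsSketch(long id, int version, String description, String notes) {
        this.id = id;
        this.version = version;
        this.description = description;
        this.notes = notes;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof SplitEqualsSketch)) return false;
        SplitEqualsSketch that = (SplitEqualsSketch) other;
        // Small fast track first; the larger slow track runs only if it passes.
        return equalsFast(that) && equalsOthers(that);
    }

    /** Small method holding the checks most likely to fail. */
    private boolean equalsFast(SplitEqualsSketch that) {
        return id == that.id && version == that.version;
    }

    /** Remaining checks; with 100 fields this would be the large method. */
    private boolean equalsOthers(SplitEqualsSketch that) {
        return Objects.equals(description, that.description)
                && Objects.equals(notes, that.notes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, version, description, notes);
    }
}
```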

How to decide between lambda iteration and normal loop?

Since the introduction of Java 8 I got really hooked on lambdas and started using them whenever possible, mostly to get accustomed to them. One of the most common usages is when we want to iterate and act upon a collection of objects, in which case I either resort to forEach or stream(). I rarely write the old for(T t : Ts) loop and I almost forgot about the for(int i = 0.....).
However, we were discussing this with my supervisor the other day and he told me that lambdas aren't always the best choice and can sometimes hinder performance. From a lecture I had seen on this new feature I got the feeling that lambda iterations are always fully optimized by the compiler and will (always?) be better than bare iterations, but he begs to differ. Is this true? If yes how do I distinguish between the best solution in each scenario?
P.S: I'm not talking about cases where it is recommended to apply parallelStream. Obviously those will be faster.

Performance depends on so many factors, that it’s hard to predict. Normally, we would say, if your supervisor claims that there was a problem with performance, your supervisor is in charge of explaining what problem.
One thing someone might be afraid of is that behind the scenes, a class is generated for each lambda creation site (with the current implementation), so if the code in question is executed only once, this might be considered a waste of resources. This harmonizes with the fact that lambda expressions have a higher initialization overhead than ordinary imperative code (we are not comparing to inner classes here), so inside class initializers, which only run once, you might consider avoiding them. This is also in line with the fact that you should never use parallel streams in class initializers, so this potential advantage isn’t available here anyway.
For ordinary, frequently executed code that is likely to be optimized by the JVM, these problems do not arise. As you supposed correctly, classes generated for lambda expressions get the same treatment (optimizations) as other classes. At these places, calling forEach on collections bears the potential of being more efficient than a for loop.
The temporary object instances created for an Iterator or the lambda expression are negligible; however, it might be worth noting that a foreach loop will always create an Iterator instance whereas lambda expressions do not always do so. While the default implementation of Iterable.forEach will create an Iterator as well, some of the most often used collections take the opportunity to provide a specialized implementation, most notably ArrayList.
The ArrayList’s forEach is basically a for loop over an array, without any Iterator. It will then invoke the accept method of the Consumer, which will be a generated class containing a trivial delegation to the synthetic method containing the code of your lambda expression. To optimize the entire loop, the horizon of the optimizer has to span the ArrayList’s loop over an array (a common idiom recognizable for an optimizer), the synthetic accept method containing a trivial delegation and the method containing your actual code.
In contrast, when iterating over the same list using a foreach loop, an Iterator implementation is created containing the ArrayList iteration logic, spread over two methods, hasNext() and next(), and instance variables of the Iterator. The loop will repeatedly invoke the hasNext() method to check the end condition (index<size) and next(), which will recheck the condition before returning the element, as there is no guarantee that the caller properly invokes hasNext() before next(). Of course, an optimizer is capable of removing this duplication, but that requires more effort than not having it in the first place. So to get the same performance as the forEach method, the optimizer’s horizon has to span your loop code, the nontrivial hasNext() implementation and the nontrivial next() implementation.
Similar things may apply to other collections having a specialized forEach implementation as well. This also applies to Stream operations, if the source provides a specialized Spliterator implementation, which does not spread the iteration logic over two methods like an Iterator.
So if you want to discuss the technical aspects of foreach vs. forEach(…), you may use this information.
But as said, these aspects describe only potential performance aspects as the work of the optimizer and other runtime environmental aspects may change the outcome completely. I think, as a rule of thumb, the smaller the loop body/action is, the more appropriate is the forEach method. This harmonizes perfectly with the guideline of avoiding overly long lambda expressions anyway.
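For reference, the iteration styles being compared might look like the sketch below. The class and method names are made up for illustration, and nothing here says which variant is faster; that would have to be measured, for example with JMH.

```java
import java.util.List;

final class IterationStylesSketch {

    /** foreach loop: compiles to Iterator.hasNext()/next() calls on the list. */
    static long sumWithLoop(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Iterable.forEach: the collection drives the loop and calls the lambda per element. */
    static long sumWithForEach(List<Integer> values) {
        // A lambda can only capture effectively final locals, hence the one-element array.
        long[] sum = {0};
        values.forEach(v -> sum[0] += v);
        return sum[0];
    }

    /** Stream pipeline: iteration is driven by the source's Spliterator. */
    static long sumWithStream(List<Integer> values) {
        return values.stream().mapToLong(Integer::longValue).sum();
    }
}
```

All three return the same result for the same list; they differ only in which code owns the iteration logic.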

It depends on specific implementation.
In general, the forEach method and a foreach loop over an Iterator usually have pretty similar performance, as they use a similar level of abstraction. stream() is usually slower (often by 50-70%) as it adds another level that provides access to the underlying collection.
The advantages of stream() generally are the possible parallelism and the easy chaining of operations, with a lot of reusable ones provided by the JDK.

Why does Java's Area#equals method not override Object#equals?

I just ran into a problem caused by Java's java.awt.geom.Area#equals(Area) method. The problem can be simplified to the following unit test:
@org.junit.Test
public void testEquals() {
    java.awt.geom.Area a = new java.awt.geom.Area();
    java.awt.geom.Area b = new java.awt.geom.Area();
    assertTrue(a.equals(b)); // -> true
    java.lang.Object o = b;
    assertTrue(a.equals(o)); // -> false
}
After some head scratching and debugging, I finally saw in the JDK source, that the signature of the equals method in Area looks like this:
public boolean equals(Area other)
Note that it does not @Override the normal equals method from Object, but instead just overloads the method with a more concrete type. Thus, the two calls in the example above end up calling different implementations of equals.
As this behavior has been present since Java 1.2, I assume it is not considered a bug. I am, therefore, more interested in finding out why the decision was made to not properly override the equals method, but at the same time provide an overloaded variant. (Another hint that this was an actual decision is the absence of an overridden hashCode() method.)
My only guess would be that the authors feared that the slow equals implementation for areas is unsuitable for comparing equality when placing Areas in Set, Map, etc. data structures. (In the above example, you could add a to a HashSet, and although b is equal to a, calling contains(b) will fail.) Then again, why did they not just name the questionable method in a way that does not clash with such a fundamental concept as the equals method?
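The effect can be reproduced without AWT. The sketch below uses a hypothetical class Region that mimics only the shape of the API described above: it declares equals(Region) and does not override equals(Object) or hashCode(). Overload resolution happens at compile time from the static type of the argument, which is why the two calls differ.

```java
import java.util.HashSet;
import java.util.Set;

final class OverloadVsOverrideSketch {

    /** Hypothetical stand-in: an overloaded equals(Region), no equals(Object) override. */
    static final class Region {
        final int size;

        Region(int size) {
            this.size = size;
        }

        // Overload, not override: the parameter type is Region, not Object.
        public boolean equals(Region other) {
            return other != null && size == other.size;
        }
    }

    /** Static type Region: the compiler binds to equals(Region). */
    static boolean viaStaticTypeRegion(Region a, Region b) {
        return a.equals(b);
    }

    /** Static type Object: the compiler binds to Object.equals(Object), i.e. a == b. */
    static boolean viaStaticTypeObject(Region a, Object b) {
        return a.equals(b);
    }

    /** HashSet only ever calls hashCode() and equals(Object), so the overload is ignored. */
    static boolean foundInHashSet(Region a, Region b) {
        Set<Region> set = new HashSet<>();
        set.add(a);
        return set.contains(b);
    }
}
```

With two distinct Region instances of the same size, only the first method reports them as equal.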

RealSkeptic linked to JDK-4391558 in a comment above. The comment in that bug explains the reasoning:
The problem with overriding equals(Object) is that you must also
override hashCode() to return a value which guarantees that equals()
is true only if the hashcodes of the two objects are also equal.
but:
The problem here is that Area.equals(Area) does not perform a very
straight-forward comparison. It painstakingly examines each and every
piece of geometry in the two Areas and tests to see if they cover the
same enclosed spaces. Two Area objects could have a completely
different description of the same enclosed space and equals(Area)
would detect that they were the same.
So basically we're left with an array of not-so-pleasant options, such as:
deprecate equals(Area) and create an alternate name for that
operation, such as "areasEqual" so as to avoid the confusion.
Unfortunately, the old method would remain and would be linkable and
would trap many people who were intending to invoke the equals(Object)
version.
or:
deprecate equals(Area) and change its implementation to be exactly
that of equals(Object) so as to avoid semantic problems if the wrong
method is called. Create a new method with a different name to avoid
confusion to implement the old functionality provided by equals(Area).
or:
implement equals(Object) to call equals(Area) and implement a dummy
hashCode() which honors the equals/hashCode contract in a degenerate
way by returning a constant. This would make the hashCode method
essentially useless and make Area objects nearly useless as keys in a
HashMap or Hashtable.
or other ways to modify the equals(Area) behavior that would either change its semantics or make it inconsistent with hashCode.
Looks like changing this method is deemed by the maintainers to be neither feasible (because neither option outlined in the bug comment quite solves the problem) nor important (since the method, as implemented, is quite slow and would probably only ever return true when comparing an instance of an Area with itself, as the commenter suggests).

"Why does Java's Area#equals method not override Object#equals?"
Because overriding is not required when a method is overloaded with a parameter of a different type.
An overriding method must have the same name, the same number of parameters, and the same parameter types as the method in the parent class (and a compatible return type); only the definition of the method differs.
Area#equals(Area) does not meet that requirement, so it is an overload rather than an override. A method is an overload when either of these holds:
1.) The number of parameters is different for the methods.
2.) The parameter types are different (like changing a parameter that was a float to an int).
Here the second rule applies: the parameter type is Area instead of Object.
"why did they not just name the questionable method in a way that does not clash with such a fundamental concept as the equals method?"
Because this could trip people up going into the future. If we had a time machine to the 90's we could do it without this concern.

Time complexity measure of JDK class methods

Is there an established way of measuring (or getting an existing measure of) the complexity of a JDK class method? Is javap representative of time complexity, and to what degree? In particular, I am interested in the complexity of Arrays.sort() but also some other collection manipulation methods.
E.g. I am trying to compare two implementations for performance: one uses Arrays.sort() and one doesn't. The javap disassembly for the one that doesn't returns a lot more steps (twice as many), but I am not sure whether the disassembly for the one that does excludes the Arrays.sort() steps. IOW, does the javap output for a method include a recursive measure of the methods invoked within it, or just that method?
Also, is there a way, without modifying and recompiling the Java code itself, to find how many loop iterations were done when a certain base Java method was invoked on specific parameters? E.g. measure the number of iterations of Arrays.sort('A', 'r', 'T', 'f')?

I would not expect javap to be even a little bit representative of actual speed.
The Javadoc specifies the algorithmic complexity, but if you care about constant factors there is absolutely no way to realistically compare constant factors except with actual benchmarks.
You can't get any information on what was done when Arrays.sort is called on a primitive array, but by passing a custom Comparator that counts the number of calls, you can count the number of comparisons made when sorting an object array. (That said, object arrays are sorted with a different sorting algorithm -- specifically a stable one -- and primitive arrays are sorted with a Quicksort variant.)
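The counting Comparator mentioned above might be sketched as follows. CountingSortSketch and countComparisons are made-up names; the only JDK call relied on is Arrays.sort(T[], Comparator). The count reflects whatever algorithm the running JDK uses for object arrays, so it can differ between versions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicLong;

final class CountingSortSketch {

    /**
     * Sorts the array in place with the given order and returns how many
     * times the comparator was invoked during the sort.
     */
    static <T> long countComparisons(T[] array, Comparator<? super T> order) {
        AtomicLong calls = new AtomicLong();
        Arrays.sort(array, (a, b) -> {
            calls.incrementAndGet(); // one tick per comparison made by the sort
            return order.compare(a, b);
        });
        return calls.get();
    }
}
```

For the example from the question, the characters would have to be boxed, e.g. Character[] data = {'A', 'r', 'T', 'f'}, because a Comparator cannot be passed when sorting a char[].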

You can use the output from javap to determine where loops occur: you want to find the goto instructions. This post gives a comprehensive explanation of that identification.
From the post:
Before considering any loop start/exit instrumentation, you should
look into the definitions of what entry, exit and successors are.
Although a loop will only have one entry point, it may have multiple
exit points and/or multiple successors, typically caused by break
statements (sometimes with labels), return statements and/or
exceptions (explicitly caught or not). While you haven't given details
regarding the kind of instrumentations you're investigating, it's
certainly worth considering where you want to insert code (if that's
what you want to do). Typically, some instrumentation may have to be
done before each exit statement or instead of each successor statement
(in which case you'll have to move the original statement).

Arrays.sort() for primitives uses a tuned quicksort. For Object arrays it uses mergesort (but this depends on the implementation).
From: Arrays
For example, the algorithm used by sort(Object[]) does not have to be
a mergesort, but it does have to be stable
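What the stability requirement means in practice can be seen with a small sketch. StableSortSketch and sortByLength are hypothetical names; the strings are ordered by length only, so elements of equal length must keep their original relative order.

```java
import java.util.Arrays;
import java.util.Comparator;

final class StableSortSketch {

    /** Returns a copy sorted by length only; ties keep their input order because the sort is stable. */
    static String[] sortByLength(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }
}
```

Calling sortByLength("bb", "a", "cc", "d") yields a, d, bb, cc: "a" stays ahead of "d" and "bb" ahead of "cc", as in the input.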

Java equals(): to reflect or not to reflect

This question is specifically related to overriding the equals() method for objects with a large number of fields. First off, let me say that this large object cannot be broken down into multiple components without violating OO principles, so telling me "no class should have more than x fields" won't help.
Moving on, the problem came to fruition when I forgot to check one of the fields for equality. Therefore, my equals method was incorrect. Then I thought to use reflection:
--code removed because it was too distracting--
The purpose of this post isn't necessarily to refactor the code (this isn't even the code I am using), but instead to get input on whether or not this is a good idea.
Pros:
If a new field is added, it is automatically included
The method is much more terse than 30 if statements
Cons:
If a new field is added, it is automatically included, sometimes this is undesirable
Performance: This has to be slower, I don't feel the need to break out a profiler
Whitelisting certain fields to ignore in the comparison is a little ugly
Any thoughts?

If you did want to whitelist for performance reasons, consider using an annotation to indicate which fields to compare. Also, this implementation won't work if your fields don't have good implementations for equals().
P.S. If you go this route for equals(), don't forget to do something similar for hashCode().
P.P.S. I trust you already considered HashCodeBuilder and EqualsBuilder.

Use Eclipse, FFS!
Delete the hashCode and equals methods you have.
Right click on the file.
Select Source->Generate hashCode() and equals()...
Done! No more worries about reflection.
Repeat for each field added, you just use the outline view to delete your two methods, and then let Eclipse autogenerate them.

If you do go the reflection approach, EqualsBuilder is still your friend:
public boolean equals(Object obj) {
    return EqualsBuilder.reflectionEquals(this, obj);
}

Here's a thought if you're worried about:
1/ Forgetting to update your big series of if-statements for checking equality when you add/remove a field.
2/ The performance of doing this in the equals() method.
Try the following:
a/ Revert back to using the long sequence of if-statements in your equals() method.
b/ Have a single function which contains a list of the fields (in a String array) and which will check that list against reality (i.e., the reflected fields). It will throw an exception if they don't match.
c/ In your constructor for this object, have a synchronized run-once call to this function (similar to a singleton pattern). In other words, if this is the first object constructed by this class, call the checking function described in (b) above.
The exception will make it immediately obvious when you run your program if you haven't updated your if-statements to match the reflected fields; then you fix the if-statements and update the field list from (b) above.
Subsequent construction of objects will not do this check and your equals() method will run at its maximum possible speed.
Try as I might, I haven't been able to find any real problems with this approach (greater minds may exist on StackOverflow) - there's an extra condition check on each object construction for the run-once behaviour but that seems fairly minor.
If you try hard enough, you could still get your if-statements out of step with your field-list and reflected fields but the exception will ensure your field list matches the reflected fields and you just make sure you update the if-statements and field list at the same time.
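A rough sketch of steps (a) to (c), with one substitution stated plainly: instead of a synchronized run-once call in the constructor, it uses a static initializer, which the JVM runs exactly once per class. The class FieldListCheckSketch, its three fields and the helper verifyFieldList are all hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class FieldListCheckSketch {
    // Hand-maintained list of the fields that equals() compares (step b).
    private static final String[] COMPARED_FIELDS = {"width", "height", "label"};

    // Runs once when the class is initialized (replaces the run-once constructor call of step c).
    static {
        verifyFieldList(FieldListCheckSketch.class, COMPARED_FIELDS);
    }

    private final int width;
    private final int height;
    private final String label;

    FieldListCheckSketch(int width, int height, String label) {
        this.width = width;
        this.height = height;
        this.label = label;
    }

    /** Throws if the listed names differ from the declared instance fields of the class. */
    static void verifyFieldList(Class<?> type, String... expected) {
        Set<String> declared = new TreeSet<>();
        for (Field f : type.getDeclaredFields()) {
            if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
                declared.add(f.getName());
            }
        }
        Set<String> listed = new TreeSet<>(Arrays.asList(expected));
        if (!declared.equals(listed)) {
            throw new IllegalStateException(
                    "equals() field list " + listed + " does not match declared fields " + declared);
        }
    }

    // Step a: plain, explicit comparisons with no reflection on the hot path.
    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof FieldListCheckSketch)) return false;
        FieldListCheckSketch other = (FieldListCheckSketch) obj;
        return width == other.width
                && height == other.height
                && Objects.equals(label, other.label);
    }

    @Override
    public int hashCode() {
        return Objects.hash(width, height, label);
    }
}
```

Adding a field without updating COMPARED_FIELDS makes class initialization fail, which surfaces as an ExceptionInInitializerError the first time the class is used.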

You can always annotate the fields you do/do not want in your equals method, that should be a straightforward and simple change to it.
Performance is obviously related to how often the object is actually compared, but a lot of frameworks use hash maps, so your equals may be being used more than you think.
Also, speaking of hash maps, you have the same issue with the hashCode method.
Finally, do you really need to compare all of the fields for equality?

You have a few bugs in your code.
You cannot assume that this and obj are the same class. Indeed, it's explicitly allowed for obj to be any other class. You could start with if ( !(obj instanceof myClass) ) return false; however this is still not correct because obj could be an instance of a subclass with additional fields that might matter.
You have to support null values for obj with a simple if ( obj == null ) return false;
You can't treat null and empty string as equal. Instead treat null specially. The simplest way here is to start by comparing Field.get(obj) == Field.get(this). If they are both null or both happen to point to the same object, this is fast. (Note: This is also an optimization, which you need since this is a slow routine.) If this fails, you can use the fast if ( Field.get(obj) == null || Field.get(this) == null ) return false; to handle cases where exactly one is null. Finally you can use the usual equals().
You're not using foundMismatch
I agree with Hank that HashCodeBuilder and EqualsBuilder are a better way to go. It's easy to maintain, not a lot of boilerplate code, and you avoid all these issues.
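The three-step null handling described above, applied to two already-extracted field values, might look like this sketch. NullSafeCompareSketch and sameValue are made-up names; java.util.Objects.equals performs the same sequence of checks and is usually the simpler choice.

```java
import java.util.Objects;

final class NullSafeCompareSketch {

    /** Compares two field values without treating null as equal to anything but null. */
    static boolean sameValue(Object left, Object right) {
        if (left == right) {
            return true;  // both null, or the very same object
        }
        if (left == null || right == null) {
            return false; // exactly one side is null
        }
        return left.equals(right); // both non-null: fall back to the usual equals()
    }

    /** Equivalent result using the JDK helper. */
    static boolean sameValueViaJdk(Object left, Object right) {
        return Objects.equals(left, right);
    }
}
```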

You could use Annotations to exclude fields from the check
e.g.
@IgnoreEquals
String fieldThatShouldNotBeCompared;
And then of course you check the presence of the annotation in your generic equals method.
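A generic equals along those lines could be sketched as below. Everything here is an assumption for illustration: IgnoreEquals is a hypothetical annotation (it needs RUNTIME retention to be visible through reflection), Sample is a made-up class, and reflectiveEquals only looks at fields declared directly on the class, not at inherited ones.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Objects;

final class AnnotatedEqualsSketch {

    /** Hypothetical marker; RUNTIME retention is required for isAnnotationPresent to see it. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface IgnoreEquals {}

    /** Made-up example class with one excluded field. */
    static final class Sample {
        String name;
        int count;
        @IgnoreEquals
        String fieldThatShouldNotBeCompared;

        Sample(String name, int count, String ignored) {
            this.name = name;
            this.count = count;
            this.fieldThatShouldNotBeCompared = ignored;
        }

        @Override
        public boolean equals(Object obj) {
            return reflectiveEquals(this, obj);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, count); // must ignore the same field as equals()
        }
    }

    /** Compares all non-static declared fields that are not marked with @IgnoreEquals. */
    static boolean reflectiveEquals(Object self, Object other) {
        if (self == other) return true;
        if (other == null || self.getClass() != other.getClass()) return false;
        try {
            for (Field f : self.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())
                        || f.isSynthetic()
                        || f.isAnnotationPresent(IgnoreEquals.class)) {
                    continue;
                }
                f.setAccessible(true);
                if (!Objects.equals(f.get(self), f.get(other))) {
                    return false;
                }
            }
            return true;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Array-typed fields would need Arrays.equals or Objects.deepEquals instead of Objects.equals; the sketch does not handle them.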

If you have access to the names of the fields, why don't you make it a standard that fields you don't want to include always start with "local" or "nochk" or something like that.
Then you blacklist all fields that begin with this (code is not so ugly then).
I don't doubt it's a little slower. You need to decide whether you want to swap ease-of-updates against execution speed.

Take a look at org.apache.commons.lang3.builder.EqualsBuilder:
http://commons.apache.org/proper/commons-lang/javadocs/api-3.2/org/apache/commons/lang3/builder/EqualsBuilder.html
