Performance difference between assignment and conditional test - java

This question is specifically geared towards the Java language, but I would not mind feedback about this being a general concept if so. I would like to know which operation might be faster, or if there is no difference between assigning a variable a value and performing tests for values. For this issue we could have a large series of Boolean values that will have many requests for changes. I would like to know if testing for the need to change a value would be considered a waste when weighed against the speed of simply changing the value during every request.
public static void main(String[] args){
Boolean array[] = new Boolean[veryLargeValue];
for(int i = 0; i < array.length; i++) {
array[i] = randomTrueFalseAssignment;
}
for(int i = 400; i < array.length - 400; i++) {
testAndChange(array, i);
}
for(int i = 400; i < array.length - 400; i++) {
justChange(array, i);
}
}
This could be the testAndChange method
public static void testAndChange(Boolean[] pArray, int ind) {
if(pArray)
pArray[ind] = false;
}
This could be the justChange method
public static void justChange(Boolean[] pArray, int ind) {
pArray[ind] = false;
}
If we were to end up with the very rare case that every value within the range supplied to the methods were false, would there be a point where one method would eventually become slower than the other? Is there a best practice for issues similar to this?
Edit: I wanted to add this to help clarify this question a bit more. I realize that the data type can be factored into the answer as larger or more efficient datatypes can be utilized. I am more focused on the task itself. Is the task of a test "if(aConditionalTest)" is slower, faster, or indeterminable without additional informaiton (such as data type) than the task of an assignment "x=avalue".

As #TrippKinetics points out, there is a semantical difference between the two methods. Because you use Boolean instead of boolean, it is possible that one of the values is a null reference. In that case the first method (with the if-statement) will throw an exception while the second, simply assigns values to all the elements in the array.
Assuming you use boolean[] instead of Boolean[]. Optimization is an undecidable problem. There are very rare cases where adding an if-statement could result in better performance. For instance most processors use cache and the if-statement can result in the fact that the executed code is stored exactly on two cache-pages where without an if on more resulting in cache faults. Perhaps you think you will save an assignment instruction but at the cost of a fetch instruction and a conditional instruction (which breaks the CPU pipeline). Assigning has more or less the same cost as fetching a value.
In general however, one can assume that adding an if statement is useless and will nearly always result in slower code. So you can quite safely state that the if statement will slow down your code always.
More specifically on your question, there are faster ways to set a range to false. For instance using bitvectors like:
long[] data = new long[(veryLargeValue+0x3f)>>0x06];//a long has 64 bits
//assign random values
int low = 400>>0x06;
int high = (veryLargeValue-400)>>0x06;
data[low] &= 0xffffffffffffffff<<(0x3f-(400&0x3f));
for(int i = low+0x01; i < high; i++) {
data[i] = 0x00;
}
data[high] &= 0xffffffffffffffff>>(veryLargeValue-400)&0x3f));
The advantage is that a processor can perform operations on 32- or 64-bits at once. Since a boolean is one bit, by storing bits into a long or int, operations are done in parallel.

Related

Avoid indirection and redundant method calls

I am currently reviewing a PullRequest that contains this:
- for (int i = 0; i < outgoingMassages.size(); i++) {
+ for (int i = 0, size = outgoingMassages.size(); i < size; i++)
https://github.com/criticalmaps/criticalmaps-android/pull/52
somehow it feels wrong to me - would think that the vm is doing these optimisations - but cannot really say for sure. Would love to get some input if this change can make sense - or a confirmation that this is done on VM-side.
No, is not sure that the VM will change your code from
- for (int i = 0; i < outgoingMassages.size(); i++) {
to
+ for (int i = 0, size = outgoingMassages.size(); i < size; i++)
In your for loop it is possible that the outgoingMassages will change its size. So This optimization can't be applied by the JVM. Also another thread can change the outgoingMassages size if this is a shared resource.
The JVM can change the code only if the behaviour doesn't change. For example it can substitute a list of string concatenations with a sequence of append to StringBuilder, or it can inline a simple method call, or can calculate a value out of the loop if it is a constant value.
The VM will not do this optimizations. Since it is possible that the size()-Method does not return the same result each call. So the method must be called each iteration.
However if size is a simple getter-method the performance impact is very small. Probably not measurable. (In few cases it may enable Java to use parallelization which could make a difference then, but this depends on the content of the loop).
The bigger difference may be to ensure that the for-loop has an amount of iterations which is known in advance. It doesn't seem make sense to me in this example. But maybe the called method could return changing results which are unwanted?
If the size() method on your collection is just giving the value of a private field, then the VM will optimise most of this away (but not quite all). It'll do it by inlining the size() method, so that it just becomes access to that field.
The remaining bit that won't get optimised is that size in the new code will get treated as final, and therefore constant, whereas the field picked up from the collection won't be treated as final (perhaps it's modified from another thread). So in the original case the field will get read in every iteration, but in the new case it won't.
It is likely that any decent optimiser - either at the VM or in the compiler - will recognise:
class Messages {
int size;
public int size() {
return size;
}
}
public void test() {
Messages outgoingMassages = new Messages();
for (int i = 0; i < outgoingMassages.size(); i++) {
}
}
and optimise it to
for (int i = 0; i < outgoingMassages.size; i++) {
doing the extra - untested - optimisation should therefore be considered evil.
The method invocation will happen on each iteration of the loop and is not free of costs. Since you cannot predict how often this happens calling it once will always be less. It is a minor optimization but you should not rely on the compiler to do the optimization for you.
Further, the member outgoingMassages ..
private ArrayList<OutgoingChatMessage> outgoingMassages ..
... should be an interface:
private List<OutgoingChatMessage> outgoingMassages ..
Then, calling .size() will become a virtual method. To find out the specific object class the method table will be invoked for all classes of the hierarchy. This is not free of costs again.

How can I evaluate a hash table implementation? (Using HashMap as reference)

Problem:
I need to compare 2 hash table implementations (well basically HashMap with another one) and make a reasonable conclusion.
I am not interested in 100% accuracy but just being in the right direction in my estimation.
I am interested in the difference not only per operation but mainly on the hashtable as a "whole".
I don't have a strict requirement on speed so if the other implementation is reasonably slower I can accept it but I do expect/require that the memory usage be better (since one of the hashtables is backed by primitive table).
What I did so far:
Originally I created my own custom "benchmark" with loops and many calls to hint for gc to get a feeling of the difference but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (MapInterface is just a wrapper so I can switch among implementations.):
int[] keys = new int[10000000];
String[] values = new String[10000000];
for(int i = 0; i < keys.length; ++i) {
keys[i] = i;
values[i] = "" + i;
}
if(operation.equals("put", keys, values)) {
runPutOperation(map);
}
public static long[] runOperation(MapInterface map, Integer[] keys, String[] values) {
long min = Long.MAX_VALUE;
long max = Long.MIN_VALUE;
long run = 0;
for(int i = 0; i < 10; ++i) {
long start = System.currentTimeMillis();
for(int i = 0; i < keys.length; ++i) {
map.put(keys[i], values[i]);
}
long total = System.currentTimeMillis() - start;
System.out.println(total/1000d + " seconds");
if(total < min) {
min = time;
}
if(total > max) {
max = time;
}
run += time;
map = null;
map = createNewHashMap();
hintsToGC();
}
return new long[] {min, max, run};
}
public void hintsToGC() {
for(int i = 0; i < 20; ++i) {
System.out.print(". ");
System.gc();
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
private HashMapInterface<String> createNewHashMap() {
if(jdk) {
return new JDKHashMapWrapper<String>();
}
else {
return new AlternativeHashMapWrapper<String>();
}
}
public class JDKHashMapWrapper implements HashMapInterface<String> {
HashMap<Integer, String> hashMap;
JDKHashMapWrapper() {
hashMap = new HashMap<Integer, String>();
}
public String put(Integer key, String value) {
return hashMap.put(key, value);
}
//etc
}
(I want to test put, get, contains and the memory utilization)
Can I be sure by using my approach that I can get reasonable measurements?
If not what would be the most appropriate tool to use and how?
Update:
- I also test with random numbers (also ~10M random numbers) using SecureRandom.
- When the hash table resizes I print the logical size of the hash table/size of the actual table to get the load factor
Update:
For my specific case, where I am interested also in integers what can of pitfalls are there with my approach?
UPDATE after #dimo414 comments:
Well at a minimum the hashtable as a "whole" isn't meaningful
I mean how the hashtable behaves under various loads both at runtime and in memory consumption.
Every data structure is a tradeoff of different methods
I agree. My trade-off is an acceptable access penalty for memory improvement
You need to identify what features you're interested in verifying
1) put(key, value);
2) get(key, value);
3) containsKey(key);
4) all the above when having many entries in the hash table
Some key consideration for using Hash tables is the size of the "buckets" allocation, the collision resolution strategy, and the shape of your data. Essentially, a Hash table takes the key supplied by the application and then hashes it to a value less than or equal to the number of allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and return the right value. For example, one could have a sorted linked list for each bucket and that list is searched.
If your data happens to have a lot of collisions, then your performance will suffer, because the Hash table implementation will spend too much time resolving the collision. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" if the number of entries gets larger than a certain amount - I imagine this is an expensive operation that is worth avoiding.
Since your key data is the positive integers from 1 to 10M, your test data looks good. I would also ensure that the different hash tables implementations were initialized to the same bucket size for a given test, otherwise it's not a fair comparison. Finally, I would vary the bucket size over a pretty significant range and rerun the tests to see how the implementations changed their behavior.
As I understand you are interested in both operations execution time and memory consumption of the maps in the test.
I will start with memory consumption as this seams not to be answered at all. What I propose is to use a small library called Classmexer. I personally used it when I need to get the 100% correct memory consumption of any object. It has the java agent approach (because it's using the Instrumentation API), which means that you need to add it as the parameter to the JVM executing your tests:
-javaagent: [PATH_TO]/classmexer.jar
The usage of the Classmexer is very simple. At any point of time you can get the memory consumption in bytes by executing:
MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL)
Note that with visibility filter you can specify if the memory calculation should be done for the object (our map) plus all other reachable object through references. That's what VisibilityFilter.ALL is for. However, this would mean that the size you get back includes all objects you used for keys and values. Thus if you have 100 Integer/String entries the reported size will include those as well.
For the timing aspect I would propose JMH tool, as this tool is made for micro bench-marking. There are plenty examples online, for example this article has map testing examples that can guide you pretty good.
Note that I should be careful when do you call the Classmexer's Memory Util as it will interfere with the time results if you call it during the time measuring. Furthermore, I am sure that there are many other tools similar to Classmexer, but I like it because it small and simple.
I was just doing something similar to this, and I ended up using the built in profiler in the Netbeans IDE. You can get really detailed info on both CPU and memory usage. I had originally written all my code in Eclipse, but Netbeans has an import feature for bringing in Eclipse projects and it set it all up no problem, if that is possibly your situation too.
For timing, you might also look at the StopWatch class in Apache Commons. It's a much more intuitive way of tracking time on targeted operations, e.g.:
StopWatch myMapTimer = new StopWatch();
HashMap<Integer, Integer> hashMap = new HashMap<>();
myMapTimer.start();
for (int i = 0; i < numElements; i++)
hashMap.put(i, i);
myMapTimer.stop();
System.out.println(myMapTimer.getTime()); // time will be in milliseconds

What is better/faster in Java: 2 method calls or 1 object call

I'm afraid this is a terribly stupid question. However, I can't find an answer to it and therefore require some help :)
Let's start with a simplification of my real problem:
Assume I have a couple of boxes each filled with a mix of different gems.
I'm now creating an object gem which has the attribute colour and a method getColour to get the colour of the gem.
Further I'm creating an object box which has a list of gems as attribute and a method getGem to get a gem from that list.
What I want to do now is to count all gems in all boxes by colour. Now I could either do something like
int sapphire = 0;
int ruby = 0;
int emerald = 0;
for(each box = i)
for(each gem = j)
if(i.getGem(j).getColour().equals(“blue”)) sapphire++;
else if(i.getGem(j).getColour().equals(“red”)) ruby++;
else if(i.getGem(j).getColour().equals(“green”)) emerald++;
or I could do
int sapphire = 0;
int ruby = 0;
int emerald = 0;
String colour;
for(each box = i)
for(each gem = j)
colour = i.getGem(j).getColour();
if(colour.equals(“blue”)) sapphire++;
else if(colour.equals(“red”)) ruby++;
else if(colour.equals(“green”)) emerald++;
My question is now if both is essentially the same or should one be preferred over the other? I understand that a lot of unnecessary new string objects are produced in the second case, but do I get a speed advantage in return as colour is more “directly” available?
I would dare to make a third improvement:
int sapphire = 0;
int ruby = 0;
int emerald = 0;
for(each box = i) {
for(each gem = j) {
String colour = i.getGem(j).getColour();
if(“blue”.equals(colour)) sapphire++;
else if(“red”.equals(colour)) ruby++;
else if(“green”.equals(colour)) emerald++;
}
}
I use a local variable inside the for-loop. Why? Because you probably need it only there.
It is generally better to put STATIC_STRING.equals(POSSIBLE_NULL_VALUE).
This has the advantage: easier to read and should have no performance problem. If you have a performance problem, then you should consider looking somewhere else in your code. Related to this: this answer.
conceptually both codes have equal complexity i.e.: O(i*j). But if calling a method and get a returned value are considered to be two processes then the complexity of your first code will be 4*O(i*j).(consider O(i*j) as a function) and of your second code will be O(i*(j+2)). although this complexity difference is not considerable enough but if you are comparing then yes your first code is more complex and not a good programming style.
The cost of your string comparisons is going to wipe out all other considerations in this sort of approach.
You would be better off using something else (for example an enum). That would also expand automatically.
(Although your for each loop isn't proper Java syntax anyway so that's a bit odd).
enum GemColour {
blue,
red,
green
}
Then in your count function:
Map<GemColour, Integer> counts = new EnumMap<GemColour, Integer>(GemColour.class);
for (Box b: box) {
for (Gem g: box.getGems() {
Integer count = counts.get(g.getColour());
if (count == null) {
count=1;
} else {
count+=1;
}
counts.put(g.getColour(), count);
}
}
Now it will automatically extend to any new colors you add without you needing to make any code changes. It will also be much faster as it does a single integer comparison rather than a string comparison and uses that to put the correct value into the correct place in the map (which behind the scenes is just an array).
To get the counts just do, for example:
counts.get(GemColour.blue);
As has been pointed out in the comments the java Stream API would allow you to do all of this in one line:
boxes.stream().map(Box::getGems).flatMap(Collection::stream).collect(groupingBy‌​‌​(Gem::getColour, counting()))
It's less easy to understand what it is doing that way though.

Strange performance drop after innocent changes to a trivial program

Imagine you want to count how many non-ASCII chars a given char[] contains. Imagine, the performance really matters, so we can skip our favorite slogan.
The simplest way is obviously
int simpleCount() {
int result = 0;
for (int i = 0; i < string.length; i++) {
result += string[i] >= 128 ? 1 : 0;
}
return result;
}
Then you think that many inputs are pure ASCII and that it could be a good idea to deal with them separately. For simplicity assume you write just this
private int skip(int i) {
for (; i < string.length; i++) {
if (string[i] >= 128) break;
}
return i;
}
Such a trivial method could be useful for more complicated processing and here it can't do no harm, right? So let's continue with
int smartCount() {
int result = 0;
for (int i = skip(0); i < string.length; i++) {
result += string[i] >= 128 ? 1 : 0;
}
return result;
}
It's the same as simpleCount. I'm calling it "smart" as the actual work to be done is more complicated, so skipping over ASCII quickly makes sense. If there's no or a very short ASCII prefix, it can costs a few cycles more, but that's all, right?
Maybe you want to rewrite it like this, it's the same, just possibly more reusable, right?
int smarterCount() {
return finish(skip(0));
}
int finish(int i) {
int result = 0;
for (; i < string.length; i++) {
result += string[i] >= 128 ? 1 : 0;
}
return result;
}
And then you ran a benchmark on some very long random string and get this
The parameters determine the ASCII to non-ASCII ratio and the average length of a non-ASCII sequence, but as you can see they don't matter. Trying different seeds and whatever doesn't matter. The benchmark uses caliper, so the usual gotchas don't apply. The results are fairly repeatable, the tiny black bars at the end denote the minimum and maximum times.
Does anybody have an idea what's going on here? Can anybody reproduce it?
Got it.
The difference is in the possibility for the optimizer/CPU to predict the number of loops in for. If it is able to predict the number of repeats up front, it can skip the actual check of i < string.length. Therefore the optimizer needs to know up front how often the condition in the for-loop will succeed and therefore it must know the value of string.length and i.
I made a simple test, by replacing string.length with a local variable, that is set once in the setup method. Result: smarterCount has runtime of about simpleCount. Before the change smarterCount took about 50% longer then simpleCount. smartCount did not change.
It looks like the optimizer looses the information of how many loops it will have to do when a call to another method occurs. That's the reason why finish() immediately ran faster with the constant set, but not smartCount(), as smartCount() has no clue about what i will be after the skip() step. So I did a second test, where I copied the loop from skip() into smartCount().
And voilà, all three methods return within the same time (800-900 ms).
My tentative guess would be that this is about branch prediction.
This loop:
for (int i = 0; i < string.length; i++) {
result += string[i] >= 128 ? 1 : 0;
}
Contains exactly one branch, the backward edge of the loop, and it is highly predictable. A modern processor will be able to accurately predict this, and so fill its whole pipeline with instructions. The sequence of loads is also highly predictable, so it will be able to pre-fetch everything the pipelined instructions need. High performance results.
This loop:
for (; i < string.length - 1; i++) {
if (string[i] >= 128) break;
}
Has a dirty great data-dependent conditional branch sitting in the middle of it. That is much harder for the processor to predict accurately.
Now, that doesn't entirely make sense, because (a) the processor will surely quickly learn that the break branch will usually not be taken, (b) the loads are still predictable, and so just as pre-fetchable, and (c) after that loop exits, the code goes into a loop which is identical to the loop which goes fast. So i wouldn't expect this to make all that much difference.

Java: why is computing faster than assigning value (int)?

The 2 following versions of the same function (which basically tries to recover a password by brute force) do not give same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = length - 1;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
char word[];
int refi, i, indexes[];
for (int length = 1; length <= MAX_LENGTH; length++)
{
refi = length - 1;
word = new char[length];
indexes = new int[length];
indexes[length - 1] = 1;
while(true)
{
i = refi;
while ((++indexes[i]) == N_CHARS)
{
word[i] = CHARS[indexes[i] = 0];
if (--i < 0)
break;
}
if (i < 0)
break;
word[i] = CHARS[indexes[i]];
if (isValid(word))
return word;
}
}
return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compare to version 1:
i = length -1;
However, it's the opposite: version 1 is faster by over 3%!
Does someone knows why? Is that due to some optimization done by the compiler?
Thank you all for your answers that far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
Key
It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, even then the CPU can do plenty of tricks to optimise it future eg. it turns the x86 code in RISC style instructions to actually execute.
A computation takes as little as one cycle and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles and for L2, L3, main memory it takes 11, 40-75, 200 cycles.
Storing values to avoid a simple calculation is actually slower in many cases. BTW using division and modulus is quite expensive and caching this value can be worth it when micro-tuning your code.
The correct answer should be retrievable by a deassembler (i mean .class -> .java converter),
but my guess is that the compiler might have decided to get rid of iref altogether and decided to store length - 1 an auxiliary register.
I'm more of a c++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;
Comparing running times of codes does not give exact or quarantine results
First of all, it is not the way comparing performances like this. A running time analysis is needed here. Both 2 codes have same loop structure and their running time are the same. You may have different running times when you run codes. However, they mostly differ with cache hits, I/O times, thread & process schedules. There is no quarantine that code is always completed in a exact time.
However, there is still differences in your code, to understand the difference you should look into your CPU architecture. I can explain it according to x86 architecture basically.
What happens behind the scenes?
i = refi;
CPU takes refi and i to its registers from ram. there is 2 access to ram if the values in not in the cache. and value of i will be written to the ram. However, it always takes different times according to thread & process schedules. Furrhermore, if the values are in virtual memory it wil take longer time.
i = length -1;
CPU also access i and length from ram or cache. there is same number of accesses. In addition, there is a subtraction here which means extra CPU cycles. That is why you think this one take longer time to complete. It is expected, but the issues that i mentioned above explain why this take longer time.
Summation
As i explain this is not the way of comparing performance. I think, there is no real difference between these codes. There are lots of optimizations inside CPU and also in compiler. You can see optimized codes if you decompile .class files.
My advice is it is better to minimize BigO running time analysis. If you find better algorithms it is the best way of optimizing codes. In case you still have bottlenecks in your code, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code.
If you really care about the performance, create an appropriate micro-benchmark.

Categories

Resources