Ternary Conditional causes weird CPU usage in Java - java

I was running this bit of code to compare performance of 3 equivalent methods of calculating wraparound coordinates:
public class Test {
private static final float MAX = 1000000;
public static void main(String[] args) {
long time = System.currentTimeMillis();
for (float i = -MAX; i <= MAX; i++) {
for (float j = 100; j < 10000; j++) {
method1(i, j);
//method2(i, j);
//method3(i, j);
}
}
System.out.println(System.currentTimeMillis() - time);
}
private static float method1(float value, float max) {
value %= max + 1;
return (value < 0) ? value + max : value;
}
private static float method2(float value, float max) {
value %= max + 1;
if (value < 0)
value += max;
return value;
}
private static float method3(float value, float max) {
return ((value % max) + max) % max;
}
}
I ran this code three times with method1, method2, and method3 (one method per test).
What was peculiar was that method1 spawned several java processes, and managed to utilize nearly 100% of my dual core cpu, while method2 and method3 spawned only 1 process and therefore could only take up 25% of my cpu (since there are 4 virtual cores). Why would this happen?
Here are the details of my machine:
Java 1.6.0_65
Macbook Air 13" late 2013

After reviewing the various comments and failing at consistently replicating the behavior. I have decided that this is indeed a fluke of the JRE. Thanks to the comments, I have learned that benchmarking this way is not very useful and I should use a Micro-benchmarking framework such as Google's caliper to test such minute differences.
Some resources I found:
Google's java micro-benchmark page
IBM page discussing a flawed benchmark and common pitfalls
Oracle's page on micro-benchmarking
Stackoverflow question on micro-benchmarking

Related

Java Benchmark for recursive stairs climbing puzzle

This now very common algorithm question was asked by a proctor during a whiteboard exam session. My job was to observe, listen to and objectively judge the answers given, but I had neither control over this question asked nor could interact with the person answering.
There was five minutes given to analyze the problem, where the candidate could write bullet notes, pseudo code (this was allowed during actual code-writing as well as long as it was clearly indicated, and people including pseudo-code as comments or TODO tasks before figuring out the algorithm got bonus points).
"A child is climbing up a staircase with n steps, and can hop either 1 step, 2 steps, or 3 steps at a time. Implement a method to count how many possible ways the child can jump up the stairs."
The person who got this question couldn't get started on the recursion algorithm on the spot, so the proctor eventually piece-by-piece led him to HIS solution, which in my opinion was not optimal (well, different from my chosen solution making it difficult to grade someone objectively with respect to code optimization).
Proctor:
public class Staircase {
public static int stairs;
public Staircase() {
int a = counting(stairs);
System.out.println(a);
}
static int counting(int n) {
if (n < 0)
return 0;
else if (n == 0)
return 1;
else
return counting(n - 1) + counting(n - 2) + counting(n - 3);
}
public static void main(String[] args) {
Staircase child;
long t1 = System.nanoTime();
for (int i = 0; i < 30; i++) {
stairs = i;
child = new Staircase();
}
System.out.println("Time:" + ((System.nanoTime() - t1)/1000000));
}
}
//
Mine:
public class Steps {
public static int stairs;
int c2 = 0;
public Steps() {
int a = step2(0);
System.out.println(a);
}
public static void main(String[] args) {
Steps steps;
long t1 = System.nanoTime();
for (int i = 0; i < 30; i++) {
stairs = i;
steps = new Steps();
}
System.out.println("Time:" + ((System.nanoTime() - t1) / 1000000));
}
public int step2(int c) {
if (c + 1 < stairs) {
if (c + 2 <= stairs) {
if (c + 3 <= stairs) {
step2(c + 3);
}
step2(c + 2);
}
step2(c + 1);
} else {
c2++;
}
return c2;
}
}
OUTPUT:
Proctor: Time: 356
Mine: Time: 166
Could someone clarify which algorithm is better/ more optimal? The execution time of my algorithm appears to be less than half as long, (but I am referencing and updating an additional integer which i thought was rather inconsequential) and it allows for setting arbitrary starting and ending step without needing to first now their difference (although for anything higher than n=40 you will need a beast of a CPU).
My question: (feel free to ignore the above example) How do you properly benchmark a similar recursion-based problem (tower of Hanoi etc.). Do you just look at the timing, or take other things into consideration (heap?).
Teaser: You may perform this computation easily in less than one millisecond. Details follow...
Which one is "better"?
The question of which algorithm is "better" may refer to the execution time, but also to other things, like the implementation style.
The Staircase implementation is shorter, more concise and IMHO more readable. And more importantly: It does not involve a state. The c2 variable that you introduced there destroys the advantages (and beauty) of a purely functional recursive implementation. This may easily be fixed, although the implementation then already becomes more similar to the Staircase one.
Measuring performance
Regarding the question about execution time: Properly measuring execution time in Java is tricky.
Related reading:
How do I write a correct micro-benchmark in Java?
Java theory and practice: Anatomy of a flawed microbenchmark
HotSpot Internals
In order to properly and reliably measure execution times, there exist several options. Apart from a profiler, like VisualVM, there are frameworks like JMH or Caliper, but admittedly, using them may be some effort.
For the simplest form of a very basic, manual Java Microbenchmark you have to consider the following:
Run the algorithms multiple times, to give the JIT a chance to kick in
Run the algorithms alternatingly and not only one after the other
Run the algorithms with increasing input size
Somehow save and print the results of the computation, to prevent the computation from being optimized away
Don't print anything to the console during the benchmark
Consider that timings may be distorted by the garbage collector (GC)
Again: These are only rules of thumb, and there may still be unexpected results (refer to the links above for more details). But with this strategy, you usually obtain a good indication about the performance, and at least can see whether it's likely that there really are significant differences between the algorithms.
The differences between the approaches
The Staircase implementation and the Steps implementation are not very different.
The main conceptual difference is that the Staircase implementation is counting down, and the Steps implementation is counting up.
The main difference that actually affects the performance is how the Base Case is handled (see Recursion on Wikipedia). In your implementation, you avoid calling the method recursively when it is not necessary, at the cost of some additional if statements. The Staircase implementation uses a very generic treatment of the base case, by just checking whether n < 0.
One could consider an "intermediate" solution that combines ideas from both approaches:
class Staircase2
{
public static int counting(int n)
{
int result = 0;
if (n >= 1)
{
result += counting(n-1);
if (n >= 2)
{
result += counting(n-2);
if (n >= 3)
{
result += counting(n-3);
}
}
}
else
{
result += 1;
}
return result;
}
}
It's still recursive without a state, and sums up the intermediate results, avoiding many of the "useless" calls by using some if queries. It's already noticably faster than the original Staircase implementation, but still a tad slower than the Steps implementation.
Why both solutions are slow
For both implementations, there's not really anything to be computed. The method consists of few if statements and some additions. The most expensive thing here is actually the recursion itself, with the deeeeply nested call tree.
And that's the key point here: It's a call tree. Imagine what it is computing for a given number of steps, as a "pseudocode call hierarchy":
compute(5)
compute(4)
compute(3)
compute(2)
compute(1)
compute(0)
compute(0)
compute(1)
compute(0)
compute(0)
compute(2)
compute(1)
compute(0)
compute(0)
compute(1)
compute(0)
compute(3)
compute(2)
compute(1)
compute(0)
compute(0)
compute(1)
compute(0)
compute(0)
compute(2)
compute(1)
compute(0)
compute(0)
One can imagine that this grows exponentially when the number becomes larger. And all the results are computed hundreds, thousands or or millions of times. This can be avoided
The fast solution
The key idea to make the computation faster is to use Dynamic Programming. This basically means that intermediate results are stored for later retrieval, so that they don't have to be computed again and again.
It's implemented in this example, which also compares the execution time of all approaches:
import java.util.Arrays;
public class StaircaseSteps
{
public static void main(String[] args)
{
for (int i = 5; i < 33; i++)
{
runStaircase(i);
runSteps(i);
runDynamic(i);
}
}
private static void runStaircase(int max)
{
long before = System.nanoTime();
long sum = 0;
for (int i = 0; i < max; i++)
{
sum += Staircase.counting(i);
}
long after = System.nanoTime();
System.out.println("Staircase up to "+max+" gives "+sum+" time "+(after-before)/1e6);
}
private static void runSteps(int max)
{
long before = System.nanoTime();
long sum = 0;
for (int i = 0; i < max; i++)
{
sum += Steps.step(i);
}
long after = System.nanoTime();
System.out.println("Steps up to "+max+" gives "+sum+" time "+(after-before)/1e6);
}
private static void runDynamic(int max)
{
long before = System.nanoTime();
long sum = 0;
for (int i = 0; i < max; i++)
{
sum += StaircaseDynamicProgramming.counting(i);
}
long after = System.nanoTime();
System.out.println("Dynamic up to "+max+" gives "+sum+" time "+(after-before)/1e6);
}
}
class Staircase
{
public static int counting(int n)
{
if (n < 0)
return 0;
else if (n == 0)
return 1;
else
return counting(n - 1) + counting(n - 2) + counting(n - 3);
}
}
class Steps
{
static int c2 = 0;
static int stairs;
public static int step(int c)
{
c2 = 0;
stairs = c;
return step2(0);
}
private static int step2(int c)
{
if (c + 1 < stairs)
{
if (c + 2 <= stairs)
{
if (c + 3 <= stairs)
{
step2(c + 3);
}
step2(c + 2);
}
step2(c + 1);
}
else
{
c2++;
}
return c2;
}
}
class StaircaseDynamicProgramming
{
public static int counting(int n)
{
int results[] = new int[n+1];
Arrays.fill(results, -1);
return counting(n, results);
}
private static int counting(int n, int results[])
{
int result = results[n];
if (result == -1)
{
result = 0;
if (n >= 1)
{
result += counting(n-1, results);
if (n >= 2)
{
result += counting(n-2, results);
if (n >= 3)
{
result += counting(n-3, results);
}
}
}
else
{
result += 1;
}
}
results[n] = result;
return result;
}
}
The results on my PC are as follows:
...
Staircase up to 29 gives 34850335 time 310.672814
Steps up to 29 gives 34850335 time 112.237711
Dynamic up to 29 gives 34850335 time 0.089785
Staircase up to 30 gives 64099760 time 578.072582
Steps up to 30 gives 64099760 time 204.264142
Dynamic up to 30 gives 64099760 time 0.091524
Staircase up to 31 gives 117897840 time 1050.152703
Steps up to 31 gives 117897840 time 381.293274
Dynamic up to 31 gives 117897840 time 0.084565
Staircase up to 32 gives 216847936 time 1929.43348
Steps up to 32 gives 216847936 time 699.066728
Dynamic up to 32 gives 216847936 time 0.089089
Small changes in the order of statements or so ("micro-optimizations") may have a small impact, or make a noticable difference. But using an entirely different approach can make the real difference.

Java Math.min/max performance

EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)
I've read a bit about how HotSpot has some "intrinsics" that injects in the code, specially for Java standard Math libs (from here)
So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (specially since I've heard min/max can compile to branchless asm).
public class OpsMath {
public static final int max(final int a, final int b) {
if (a > b) {
return a;
}
return b;
}
}
That's my implementation. From another SO question I've read that using the ternary operator uses an extra register, I haven't found significant differences between doing an if block and using a ternary operator (ie, return ( a > b ) ? a : b ).
Allocating a 8Mb int array (ie, 2 million values), and randomizing it, I do the following test:
try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
int max = Integer.MIN_VALUE;
for ( int i = 0; i < array.length; ++i )
{
max = OpsMath.max( max, array[i] );
// max = Math.max( max, array[i] );
}
}
I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.
'max' is added to a list outside the benchmark block and printed later, so to avoid the JVM optimizing the whole block away.
The array is randomized each time the test runs.
Running the test 6 times, it gives these results:
Java standard Math:
millis to max 9.242167
millis to max 2.1566199999999998
millis to max 2.046396
millis to max 2.048616
millis to max 2.035761
millis to max 2.001044
So fairly stable after the first run, and running the tests again gives similar results.
OpsMath:
millis to max 8.65418
millis to max 1.161559
millis to max 0.955851
millis to max 0.946642
millis to max 0.994543
millis to max 0.9469069999999999
Again, very stable results after the first run.
The question is: Why? Thats quite a big difference there. And I have no idea why. Even if I implement my max() method exactly like Math.max() (ie, return (a >= b) ? a : b ) I still get better results! It makes no sense.
Specs:
CPU: Intel i5 2500, 3,3Ghz.
Java Version: JDK 8 (public march 18 release), x64.
Debian Jessie (testing release) x64.
I have yet to try with 32 bit JVM.
EDIT: Self contained test as requested. Added a line to force the JVM to preload Math and OpsMath classes. That eliminates the 18ms cost of the first iteration for OpsMath test.
// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " +
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
int max = Integer.MIN_VALUE;
// Randomize the array.
for ( int i = 0; i < array.length; ++i )
{
array[i] = r.nextInt();
}
final long start = System.nanoTime();
for ( int i = 0; i < array.length; ++i )
{
max = Math.max( array[i], max );
// OpsMath.max() method implemented as described.
// max = OpsMath.max( array[i], max );
}
// Calc time.
final double end = (System.nanoTime() - start);
// Store results.
times.add( Double.valueOf( end ) );
results.add( Integer.valueOf( max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
System.out.println( "IT" + i + " result: " + results.get( i ) );
System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}
Java Math.max result:
IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047
OpsMath.max result:
IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984
Still the same overall results. I've tried with randomizing the array only once, and repeating the tests over the same array, I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.
It's hard to tell why Math.max is slower than a Ops.max, but it's easy to tell why this benchmark strongly favors branching to conditional moves: On the n-th iteration, the probability of
Math.max( array[i], max );
being not equal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n and given
final int[] array = new int[(8*1024*1024)/4];
it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability, it always take the same amount of time to execute. The conditional move instruction is faster than branch prediction if the branch is very hard to predict. On the other hand, branch prediction is faster if the branch can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to best and worst case of branching.1
In your case all but first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves as the branch is rather guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).
This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
1 Both conditional move and predicted branch take one cycle. The problem with the former is that it needs its two operands and this takes additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.
As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in comments. My my own benchmark shows that conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.
When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:
0x00007f4b65425c50: mov %r11d,%edi ;*getstatic array
; - foo146::bench#81 (line 40)
0x00007f4b65425c53: mov 0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov 0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov 0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov 0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov 0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov 0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov 0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov 0x1c(%rax,%rdx,4),%r14d ;*iaload
; - foo146::bench#86 (line 40)
0x00007f4b65425c79: cmp %edi,%r8d
0x00007f4b65425c7c: cmovl %edi,%r8d
0x00007f4b65425c80: cmp %r8d,%r10d
0x00007f4b65425c83: cmovl %r8d,%r10d
0x00007f4b65425c87: cmp %r10d,%ecx
0x00007f4b65425c8a: cmovl %r10d,%ecx
0x00007f4b65425c8e: cmp %ecx,%r14d
0x00007f4b65425c91: cmovl %ecx,%r14d
0x00007f4b65425c95: cmp %r14d,%esi
0x00007f4b65425c98: cmovl %r14d,%esi
0x00007f4b65425c9c: cmp %esi,%ebx
0x00007f4b65425c9e: cmovl %esi,%ebx
0x00007f4b65425ca1: cmp %ebx,%r9d
0x00007f4b65425ca4: cmovl %ebx,%r9d
0x00007f4b65425ca8: cmp %r9d,%r11d
0x00007f4b65425cab: cmovl %r9d,%r11d ;*invokestatic max
; - foo146::bench#88 (line 40)
0x00007f4b65425caf: add $0x8,%edx ;*iinc
; - foo146::bench#92 (line 39)
0x00007f4b65425cb2: cmp $0x1ffff9,%edx
0x00007f4b65425cb8: jl 0x00007f4b65425c50
Apart from the weirdly-placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times that does mostly what you'd expect---loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, here it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls or 8 separate cmovl chains to be merged after the loop is done.
With the explicit OpsMath.max, it turns into a ratsnest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, here it outputs an 8-deep cmovle chain instead. EDIT: As #maaartinus points out, said ratsnest of branches is actually faster on some machines because the branch predictor works its magic on them and these are well-predicted branches.
I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues; you have to run it a lot more times than you are and you have to factor your code differently if you want to time Hotspot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well Hotspot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.
Using JDK 8:
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
On Ubuntu 13.10
I ran the following:
import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance {
private final BiFunction<Integer, Integer, Integer> max;
private final int[] array;
public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
this.max = max;
this.array = array;
}
public double time() {
long start = System.nanoTime();
int m = Integer.MIN_VALUE;
for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);
m = Integer.MIN_VALUE;
for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);
// total time over number of calls to max
return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
}
public double averageTime(int repeats) {
double cumulativeTime = 0;
for (int i = 0; i < repeats; i++)
cumulativeTime += time();
return (double) cumulativeTime / (double) repeats;
}
public static void main(String[] args) {
int size = 1000000;
Random random = new Random(123123123L);
int[] array = new int[size];
for (int i = 0; i < size; i++) array[i] = random.nextInt();
double tMath = new MaxPerformance(Math::max, array).averageTime(100);
double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);
System.out.println("Java Math: " + tMath);
System.out.println("Alt 1: " + tAlt1);
System.out.println("Alt 2: " + tAlt2);
}
public static int max1(final int a, final int b) {
if (a >= b) return a;
return b;
}
public static int max2(final int a, final int b) {
return (a >= b) ? a : b; // same as JDK implementation
}
}
And I got the following results (average nanoseconds taken for each call to max):
Java Math: 15.443555810000003
Alt 1: 14.968298919999997
Alt 2: 16.442204045
So on a long run it looks like the second implementation is the fastest, although by a relatively small margin.
In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent from the previous one. This can be done by using two randomized arrays instead of one as in this benchmark:
import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
private final BiFunction<Integer, Integer, Integer> max;
private final int[] array1, array2;
public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
this.max = max;
this.array1 = array1;
this.array2 = array2;
if (array1.length != array2.length) throw new IllegalArgumentException();
}
public double time() {
long start = System.nanoTime();
int m = Integer.MIN_VALUE;
for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
m += m; // to avoid optimizations!
return ((double) (System.nanoTime() - start)) / (double) array1.length;
}
public double averageTime(int repeats) {
// warm up rounds:
double tmp = 0;
for (int i = 0; i < 10; i++) tmp += time();
tmp *= 2.0;
double cumulativeTime = 0;
for (int i = 0; i < repeats; i++)
cumulativeTime += time();
return cumulativeTime / (double) repeats;
}
public static void main(String[] args) {
int size = 1000000;
Random random = new Random(123123123L);
int[] array1 = new int[size];
int[] array2 = new int[size];
for (int i = 0; i < size; i++) {
array1[i] = random.nextInt();
array2[i] = random.nextInt();
}
double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);
System.out.println("Java Math: " + tMath);
System.out.println("Alt 1: " + tAlt1);
System.out.println("Alt 2: " + tAlt2);
}
public static int max1(final int a, final int b) {
if (a >= b) return a;
return b;
}
public static int max2(final int a, final int b) {
return (a >= b) ? a : b; // same as JDK implementation
}
}
Which gave me:
Java Math: 15.346468170000005
Alt 1: 16.378737519999998
Alt 2: 20.506475350000006
The way your test is set up makes a huge difference on the results. The JDK version seems to be the fastest in this scenario. This time by a relatively large margin compared to the previous case.
Somebody mentioned Caliper. Well if you read the wiki, one the first things they say about micro-benchmarking is not to do it: this is because it's hard to get accurate results in general. I think this is a clear example of that.
Here's a branchless min operation, max can be implemented by replacing diff=a-b with diff=b-a.
public static final long min(final long a, final long b) {
final long diff = a - b;
// All zeroes if a>=b, all ones if a<b because the sign bit is propagated
final long mask = diff >> 63;
return (a & mask) | (b & (~mask));
}
It should be as fast as streaming the memory because the CPU operations should be hidden by the sequential memory read latency.

Is bitwise operation faster than modulo/reminder operator in Java?

I read in couple of blogs that in Java modulo/reminder operator is slower than bitwise-AND. So, I wrote the following program to test.
public class ModuloTest {
public static void main(String[] args) {
final int size = 1024;
int index = 0;
long start = System.nanoTime();
for(int i = 0; i < Integer.MAX_VALUE; i++) {
getNextIndex(size, i);
}
long end = System.nanoTime();
System.out.println("Time taken by Modulo (%) operator --> " + (end - start) + "ns.");
start = System.nanoTime();
final int shiftFactor = size - 1;
for(int i = 0; i < Integer.MAX_VALUE; i++) {
getNextIndexBitwise(shiftFactor, i);
}
end = System.nanoTime();
System.out.println("Time taken by bitwise AND --> " + (end - start) + "ns.");
}
private static int getNextIndex(int size, int nextInt) {
return nextInt % size;
}
private static int getNextIndexBitwise(int size, int nextInt) {
return nextInt & size;
}
}
But in my runtime environment (MacBook Pro 2.9GHz i7, 8GB RAM, JDK 1.7.0_51) I am seeing otherwise. The bitwise-AND is significantly slower, in fact twice as slow than the remainder operator.
I would appreciate it if someone can help me understand if this is intended behavior or I am doing something wrong?
Thanks,
Niranjan
Your code reports bitwise-and being much faster on each Mac I've tried it on, both with Java 6 and Java 7. I suspect the first portion of the test on your machine happened to coincide with other activity on the system. You should try running the test multiple times to verify you aren't seeing distortions based on that. (I would have left this as a 'comment' rather than an 'answer', but apparently you need 50 reputation to do that -- quite silly, if you ask me.)
For starters, logical conjunction trick only works with Nature Number dividends and power of 2 divisors. So, if you need negative dividends, floats, or non-powers of 2, sick with the default % operator.
My tests (with JIT warmup and 1M random iterations), on an i7 with a ton of cores and bus load of ram show about 20% better performance from the bitwise operation. This can very per run, depending how the process scheduler runs the code.
using Scala 2.11.8 on JDK 1.8.91
4Ghz i7-4790K, 8 core AMD, 32GB PC3 19200 ram, SSD
This example in particular will always give you a wrong result. Moreover, I believe that any program which is calculating the modulo by a power of 2 will be faster than bitwise AND.
REASON: When you use N % X where X is kth power of 2, only last k bits are considered for modulo, whereas in case of the bitwise AND operator the runtime actually has to visit each bit of the number under question.
Also, I would like to point out the Hot Spot JVM's optimizes repetitive calculations of similar nature(one of the examples can be branch prediction etc). In your case, the method which is using the modulo just returns the last 10 bits of the number because 1024 is the 10th power of 2.
Try using some prime number value for size and check the same result.
Disclaimer: Micro benchmarking is not considered good.
Is this method missing something?
public static void oddVSmod(){
float tests = 100000000;
oddbit(tests);
modbit(tests);
}
public static void oddbit(float tests){
for(int i=0; i<tests; i++)
if((i&1)==1) {System.out.print(" "+i);}
System.out.println();
}
public static void modbit(float tests){
for(int i=0; i<tests; i++)
if((i%2)==1) {System.out.print(" "+i);}
System.out.println();
}
With that, i used netbeans built-in profiler (advanced-mode) to run this. I set var tests up to 10X10^8, and every time, it showed that modulo is faster than bitwise.
Thank you all for valuable inputs.
#pamphlet: Thank you very much for the concerns, but negative comments are fine with me. I confess that I did not do proper testing as suggested by AndyG. AndyG could have used a softer tone, but its okay, sometimes negatives help seeing the positive. :)
That said, I changed my code (as shown below) in a way that I can run that test multiple times.
public class ModuloTest {
public static final int SIZE = 1024;
public int usingModuloOperator(final int operand1, final int operand2) {
return operand1 % operand2;
}
public int usingBitwiseAnd(final int operand1, final int operand2) {
return operand1 & operand2;
}
public void doCalculationUsingModulo(final int size) {
for(int i = 0; i < Integer.MAX_VALUE; i++) {
usingModuloOperator(1, size);
}
}
public void doCalculationUsingBitwise(final int size) {
for(int i = 0; i < Integer.MAX_VALUE; i++) {
usingBitwiseAnd(i, size);
}
}
public static void main(String[] args) {
final ModuloTest moduloTest = new ModuloTest();
final int invocationCount = 100;
// testModuloOperator(moduloTest, invocationCount);
testBitwiseOperator(moduloTest, invocationCount);
}
private static void testModuloOperator(final ModuloTest moduloTest, final int invocationCount) {
for(int i = 0; i < invocationCount; i++) {
final long startTime = System.nanoTime();
moduloTest.doCalculationUsingModulo(SIZE);
final long timeTaken = System.nanoTime() - startTime;
System.out.println("Using modulo operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
}
}
private static void testBitwiseOperator(final ModuloTest moduloTest, final int invocationCount) {
for(int i = 0; i < invocationCount; i++) {
final long startTime = System.nanoTime();
moduloTest.doCalculationUsingBitwise(SIZE);
final long timeTaken = System.nanoTime() - startTime;
System.out.println("Using bitwise operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
}
}
}
I called testModuloOperator() and testBitwiseOperator() in mutual exclusive way. The result was consistent with the idea that bitwise is faster than modulo operator. I ran each of the calculation 100 times and recorded the execution times. Then removed first five and last five recordings and used rest to calculate the avg. time. And, below are my test results.
Using modulo operator, the avg. time for 90 runs: 8388.89ns.
Using bitwise-AND operator, the avg. time for 90 runs: 722.22ns.
Please suggest if my approach is correct or not.
Thanks again.
Niranjan

Java 8 nested loops with streams & performance

In order to practise the Java 8 streams I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a,b < 100) and takes ~0.135s on my Core i5 760.
public static int digitSum(BigInteger x)
{
int sum = 0;
for(char c: x.toString().toCharArray()) {sum+=Integer.valueOf(c+"");}
return sum;
}
#Test public void solve()
{
int max = 0;
for(int i=1;i<100;i++)
for(int j=1;j<100;j++)
max = Math.max(max,digitSum(BigInteger.valueOf(i).pow(j)));
System.out.println(max);
}
My solution, which I expected to be faster because of the paralellism actually took 0.25s (0.19s without the parallel()):
int max = IntStream.range(1,100).parallel()
.map(i -> IntStream.range(1, 100)
.map(j->digitSum(BigInteger.valueOf(i).pow(j)))
.max().getAsInt()).max().getAsInt();
My questions
did I do the conversion right or is there a better way to convert nested loops to stream calculations?
why is the stream variant so much slower than the old one?
why did the parallel() statement actually increased the time from 0.19s to 0.25s?
I know that microbenchmarks are fragile and parallelism is only worth it for big problems but for a CPU, even 0.1s is an eternity, right?
Update
I measure with the Junit 4 framework in Eclipse Kepler (it shows the time taken for executing a test).
My results for a,b<1000 instead of 100:
traditional loop 186s
sequential stream 193s
parallel stream 55s
Update 2
Replacing sum+=Integer.valueOf(c+""); with sum+= c - '0'; (thanks Peter!) shaved off 10 whole seconds of the parallel method, bringing it to 45s. Didn't expect such a big performance impact!
Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much as it reduced the time only to 44.8s (yes, it adds a and b=0 but I think this won't impact the performance much):
int max = IntStream.range(0, 3).parallel().
.map(m -> IntStream.range(0,250)
.map(i -> IntStream.range(1, 1000)
.map(j->.digitSum(BigInteger.valueOf(250*m+i).pow(j)))
.max().getAsInt()).max().getAsInt()).max().getAsInt();
I have created a quick and dirty micro benchmark based on your code. The results are:
loop: 3192
lambda: 3140
lambda parallel: 868
So the loop and lambda are equivalent and the parallel stream significantly improves the performance. I suspect your results are unreliable due to your benchmarking methodology.
public static void main(String[] args) {
int sum = 0;
//warmup
for (int i = 0; i < 100; i++) {
solve();
solveLambda();
solveLambdaParallel();
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solve();
}
long end = System.nanoTime();
System.out.println("loop: " + (end - start) / 1_000_000);
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solveLambda();
}
long end = System.nanoTime();
System.out.println("lambda: " + (end - start) / 1_000_000);
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solveLambdaParallel();
}
long end = System.nanoTime();
System.out.println("lambda parallel : " + (end - start) / 1_000_000);
}
System.out.println(sum);
}
public static int digitSum(BigInteger x) {
int sum = 0;
for (char c : x.toString().toCharArray()) {
sum += Integer.valueOf(c + "");
}
return sum;
}
public static int solve() {
int max = 0;
for (int i = 1; i < 100; i++) {
for (int j = 1; j < 100; j++) {
max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
}
}
return max;
}
public static int solveLambda() {
return IntStream.range(1, 100)
.map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
.max().getAsInt();
}
public static int solveLambdaParallel() {
return IntStream.range(1, 100)
.parallel()
.map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
.max().getAsInt();
}
I have also run it with jmh which is more reliable than manual tests. The results are consistent with above (micro seconds per call):
Benchmark Mode Mean Units
c.a.p.SO21968918.solve avgt 32367.592 us/op
c.a.p.SO21968918.solveLambda avgt 31423.123 us/op
c.a.p.SO21968918.solveLambdaParallel avgt 8125.600 us/op
The problem you have is you are looking at sub-optimal code. When you have code which might be heavily optimised you are very dependant on whether the JVM is smart enough to optimise your code. Loops have been around much longer and are better understood.
One big difference in your loop code, is you working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache friendly and you have very short lived objects. In the stream() case you are building up collections for which there more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.
why is the stream variant so much slower than the old one?
Loops are fairly well optimised having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.
why did the parallel() statement actually increased the time from 0.19s to 0.25s?
Most likely you have a bottle neck on a shared resource. You create quite a bit of garbage but this is usually fairly concurrent. Using more threads, only guarantees you will have more overhead, it doesn't ensure you can take advantage of the extra CPU power you have.

Lookup table Fast Sigmoidal function

private static double [] sigtab = new double[1001]; // values of f(x) for x values
static {
for(int i=0; i<1001; i++) {
double ifloat = i;
ifloat /= 100;
sigtab[i] = 1.0/(1.0 + Math.exp(-ifloat));
}
}
public static double fast_sigmoid (double x) {
if (x <= -10)
return 0.0;
else if (x >= 10)
return 1.0;
else {
double normx = Math.abs(x*100);
int i = (int)normx;
double lookup = sigtab[i] + (sigtab[i+1] - sigtab[i])*(normx - Math.floor(normx));
if (x > 0)
return lookup;
else // (x < 0)
return (1 - lookup);
}
}
Anyone know why this "fast sigmoid" actually runs slower than the exact version using Math.exp?
You should profile your code, but I'll bet it's the call to Math.floor taking around half your CPU cycles (it is slow because it calls the native method StrictMath.floor(double), incurring the JNI overhead.)
It is possible to compute (less-accurate) versions of sigmoid functions faster than the (exact) hardware implementations. Here's an example for tanh, which should be easy to transform to your function (is it expit(-x)?)
Two tricks that are used here are often useful in LUT-based approximations:
Simulate rounding by adding a large constant (forcing the FPU will truncate it, having too few bits to represent the sum)
Make your table size a power of 2 (means one less multiply per call)
public static float fastTanH(float x) {
if (x<0) return -fastTanH(-x);
if (x>8) return 1f;
float xp = TANH_FRAC_BIAS + x;
short ind = (short) Float.floatToRawIntBits(xp);
float tanha = TANH_TAB[ind];
float b = xp - TANH_FRAC_BIAS;
x -= b;
return tanha + x * (1f - tanha*tanha);
}
private static final int TANH_FRAC_EXP = 6; // LUT precision == 2 ** -6 == 1/64
private static final int TANH_LUT_SIZE = (1 << TANH_FRAC_EXP) * 8 + 1;
private static final float TANH_FRAC_BIAS =
Float.intBitsToFloat((0x96 - TANH_FRAC_EXP) << 23);
private static float[] TANH_TAB = new float[TANH_LUT_SIZE];
static {
for (int i = 0; i < TANH_LUT_SIZE; ++ i) {
TANH_TAB[i] = (float) Math.tanh(i / 64.0);
}
}
Do you mean looking up in an array of double elements and performing some calculus should be faster than calculating it on the spot?
Altough the CPU only has basic operations, it can handle an exponentiation pretty easily. I'd say in less than 5 basic operations.
What you are doing here is somehow complex and requires actually having to go fetch some elements in the memory. 64bits*1001 surely fits in your cache but cache access time certainly does not match registry access time.
This case does not surprise me in the least.

Categories

Resources