I read in a couple of blogs that in Java the modulo/remainder operator is slower than bitwise AND. So I wrote the following program to test it.
public class ModuloTest {
    public static void main(String[] args) {
        final int size = 1024;

        long start = System.nanoTime();
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndex(size, i);
        }
        long end = System.nanoTime();
        System.out.println("Time taken by Modulo (%) operator --> " + (end - start) + "ns.");

        start = System.nanoTime();
        final int shiftFactor = size - 1;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndexBitwise(shiftFactor, i);
        }
        end = System.nanoTime();
        System.out.println("Time taken by bitwise AND --> " + (end - start) + "ns.");
    }

    private static int getNextIndex(int size, int nextInt) {
        return nextInt % size;
    }

    private static int getNextIndexBitwise(int size, int nextInt) {
        return nextInt & size;
    }
}
But in my runtime environment (MacBook Pro 2.9GHz i7, 8GB RAM, JDK 1.7.0_51) I am seeing otherwise: the bitwise AND is significantly slower, in fact twice as slow as the remainder operator.
I would appreciate it if someone could help me understand whether this is intended behavior or whether I am doing something wrong.
Thanks,
Niranjan
Your code reports bitwise-AND being much faster on each Mac I've tried it on, with both Java 6 and Java 7. I suspect the first portion of the test on your machine happened to coincide with other activity on the system. You should try running the test multiple times to verify you aren't seeing distortions based on that. (I would have left this as a 'comment' rather than an 'answer', but apparently you need 50 reputation to do that -- quite silly, if you ask me.)
For starters, the logical-conjunction trick only works with natural-number dividends and power-of-2 divisors. So if you need negative dividends, floats, or divisors that aren't powers of 2, stick with the default % operator.
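A quick illustration of that caveat (my own sketch, not part of the original answer): for a non-negative dividend and a power-of-2 divisor the two expressions agree, but for a negative dividend they do not, because Java's % can return a negative remainder.
public class ModVsAnd {
    public static void main(String[] args) {
        final int m = 1024; // power of 2
        System.out.println(37 % m);        // 37
        System.out.println(37 & (m - 1));  // 37  -- same result
        System.out.println(-37 % m);       // -37 (remainder keeps the dividend's sign)
        System.out.println(-37 & (m - 1)); // 987 -- differs for negative dividends
    }
}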
My tests (with JIT warmup and 1M random iterations), on an i7 with a ton of cores and a busload of RAM, show about 20% better performance from the bitwise operation. This can vary per run, depending on how the process scheduler runs the code.
Using Scala 2.11.8 on JDK 1.8.91
4GHz i7-4790K, 8-core AMD, 32GB PC3-19200 RAM, SSD
This example in particular will always give you a wrong result. Moreover, I believe that any program calculating a modulo by a power of 2 will be faster than the bitwise-AND version.
REASON: When you use N % X where X is the kth power of 2, only the last k bits are considered for the modulo, whereas in the case of the bitwise-AND operator the runtime actually has to visit each bit of the number in question.
Also, I would like to point out that the HotSpot JVM optimizes repetitive calculations of a similar nature (branch prediction being one example). In your case, the method using the modulo just returns the last 10 bits of the number, because 1024 is 2^10.
Try using some prime number for size and check whether you get the same result.
Disclaimer: microbenchmarking is not considered good practice.
Is this method missing something?
public static void oddVSmod() {
    float tests = 100000000;
    oddbit(tests);
    modbit(tests);
}

public static void oddbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i & 1) == 1) { System.out.print(" " + i); }
    System.out.println();
}

public static void modbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i % 2) == 1) { System.out.print(" " + i); }
    System.out.println();
}
With that, I used the NetBeans built-in profiler (advanced mode) to run this. I set tests as high as 10×10^8, and every time it showed that modulo is faster than bitwise AND.
Thank you all for the valuable input.
@pamphlet: Thank you very much for the concern, but negative comments are fine with me. I confess that I did not do proper testing, as AndyG suggested. AndyG could have used a softer tone, but it's okay; sometimes negatives help one see the positive. :)
That said, I changed my code (as shown below) in a way that I can run that test multiple times.
public class ModuloTest {
    public static final int SIZE = 1024;

    public int usingModuloOperator(final int operand1, final int operand2) {
        return operand1 % operand2;
    }

    public int usingBitwiseAnd(final int operand1, final int operand2) {
        return operand1 & operand2;
    }

    public void doCalculationUsingModulo(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingModuloOperator(i, size);
        }
    }

    public void doCalculationUsingBitwise(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingBitwiseAnd(i, size);
        }
    }

    public static void main(String[] args) {
        final ModuloTest moduloTest = new ModuloTest();
        final int invocationCount = 100;
        // testModuloOperator(moduloTest, invocationCount);
        testBitwiseOperator(moduloTest, invocationCount);
    }

    private static void testModuloOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingModulo(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using modulo operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }

    private static void testBitwiseOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingBitwise(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using bitwise operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }
}
I called testModuloOperator() and testBitwiseOperator() in a mutually exclusive way (commenting one out while running the other). The results were consistent with the idea that bitwise AND is faster than the modulo operator. I ran each calculation 100 times and recorded the execution times, then removed the first five and last five recordings and used the rest to calculate the average time. Below are my test results.
Using modulo operator, the avg. time for 90 runs: 8388.89ns.
Using bitwise-AND operator, the avg. time for 90 runs: 722.22ns.
Please let me know whether or not my approach is correct.
Thanks again.
Niranjan
Related
I took a shot at solving the Hackerland Radio Transmitters programming challenge.
To summarize, the challenge goes as follows:
Hackerland is a one-dimensional city with n houses, where each house i is located at some xi on the x-axis. The Mayor wants to install radio transmitters on the roofs of the city's houses. Each transmitter has a range, k, meaning it can transmit a signal to all houses ≤ k units of distance away.
Given a map of Hackerland and the value of k, can you find the minimum number of transmitters needed to cover every house?
My implementation is as follows:
package biz.tugay;

import java.util.*;

public class HackerlandRadioTransmitters {

    public static int minNumOfTransmitters(int[] houseLocations, int transmitterRange) {
        // Sort and remove duplicates..
        houseLocations = uniqueHouseLocationsSorted(houseLocations);
        int towerCount = 0;
        for (int nextHouseNotCovered = 0; nextHouseNotCovered < houseLocations.length; ) {
            final int towerLocation = HackerlandRadioTransmitters.findNextTowerIndex(houseLocations, nextHouseNotCovered, transmitterRange);
            towerCount++;
            nextHouseNotCovered = HackerlandRadioTransmitters.nextHouseNotCoveredIndex(houseLocations, towerLocation, transmitterRange);
            if (nextHouseNotCovered == -1) {
                break;
            }
        }
        return towerCount;
    }

    public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
        final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
        final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
        int towerIndex = houseNotCoveredIndex;
        int loop = 0;
        while (true) {
            loop++;
            if (towerIndex == houseLocations.length - 1) {
                break;
            }
            if (farthestHouseLocationAllowed >= houseLocations[towerIndex + 1]) {
                towerIndex++;
                continue;
            }
            break;
        }
        System.out.println("findNextTowerIndex looped : " + loop);
        return towerIndex;
    }

    public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
        final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
        int notCoveredHouseIndex = towerIndex + 1;
        int loop = 0;
        while (notCoveredHouseIndex < houseLocations.length) {
            loop++;
            final int locationOfHouseBeingChecked = houseLocations[notCoveredHouseIndex];
            if (locationOfHouseBeingChecked > towerCoversUntil) {
                break; // Tower does not cover the house anymore, break the loop..
            }
            notCoveredHouseIndex++;
        }
        if (notCoveredHouseIndex == houseLocations.length) {
            notCoveredHouseIndex = -1;
        }
        System.out.println("nextHouseNotCoveredIndex looped : " + loop);
        return notCoveredHouseIndex;
    }

    public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
        Arrays.sort(houseLocations);
        final HashSet<Integer> integers = new HashSet<>();
        final int[] houseLocationsUnique = new int[houseLocations.length];
        int innerCounter = 0;
        for (int houseLocation : houseLocations) {
            if (integers.contains(houseLocation)) {
                continue;
            }
            houseLocationsUnique[innerCounter] = houseLocation;
            integers.add(houseLocationsUnique[innerCounter]);
            innerCounter++;
        }
        return Arrays.copyOf(houseLocationsUnique, innerCounter);
    }
}
I am pretty sure this implementation is correct. But please see the detail in the functions: findNextTowerIndex and nextHouseNotCoveredIndex: they walk the array one by one!
One of my tests is as follows:
static void test_01() throws FileNotFoundException {
    final long start = System.currentTimeMillis();
    final File file = new File("input.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
    final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381);
    assert minNumOfTransmitters == 1;
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}
where input.txt can be downloaded from here. (It is not the most important detail in this question, but still..) So we have an array of 73382 houses, and I deliberately set the transmitter range so the methods I have loop a lot:
Here is a sample output from this test in my machine:
findNextTowerIndex looped : 38213
nextHouseNotCoveredIndex looped : 13785
Took: 359 milliseconds..
I also have this test, which does not assert anything, but just keeps time:
static void test_02() throws FileNotFoundException {
    final long start = System.currentTimeMillis();
    for (int i = 0; i < 400; i++) {
        final File file = new File("input.txt");
        final Scanner scanner = new Scanner(file);
        int[] houseLocations = new int[73382];
        for (int counter = 0; counter < 73382; counter++) {
            houseLocations[counter] = scanner.nextInt();
        }
        final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
        final int transmitterRange = ThreadLocalRandom.current().nextInt(1, 70000);
        final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
    }
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}
where I randomly create 400 transmitter ranges and run the program 400 times. I get run times as follows on my machine..
Took: 20149 milliseconds..
So then I said, why don't I use binary search instead of walking the array, and I changed my implementations as follows:
public static int findNextTowerIndex(final int[] houseLocations, final int houseNotCoveredIndex, final int transmitterRange) {
    final int houseLocationWeWantToCover = houseLocations[houseNotCoveredIndex];
    final int farthestHouseLocationAllowed = houseLocationWeWantToCover + transmitterRange;
    int nextTowerIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, farthestHouseLocationAllowed);
    if (nextTowerIndex < 0) {
        nextTowerIndex = -nextTowerIndex;
        nextTowerIndex = nextTowerIndex - 2;
    }
    return nextTowerIndex;
}

public static int nextHouseNotCoveredIndex(final int[] houseLocations, final int towerIndex, final int transmitterRange) {
    final int towerCoversUntil = houseLocations[towerIndex] + transmitterRange;
    int nextHouseNotCoveredIndex = Arrays.binarySearch(houseLocations, 0, houseLocations.length, towerCoversUntil);
    if (-nextHouseNotCoveredIndex > houseLocations.length) {
        return -1;
    }
    if (nextHouseNotCoveredIndex < 0) {
        nextHouseNotCoveredIndex = -(nextHouseNotCoveredIndex + 1);
        return nextHouseNotCoveredIndex;
    }
    return nextHouseNotCoveredIndex + 1;
}
and I am expecting a great performance boost, as now I will loop at most log(N) times instead of N.. So test_01 outputs:
Took: 297 milliseconds..
Remember, it was Took: 359 milliseconds.. before. And for test_02:
Took: 18047 milliseconds..
So I always get values around 20 seconds with the array-walking implementation and 18-19 seconds for the binary search implementation.
I was expecting a much better performance gain using Arrays.binarySearch, but obviously that is not the case. Why is this? What am I missing? Do I need an array with more than 73382 elements to see the benefit, or is it irrelevant?
Edit #01
After huck_cussler's comment, I tried doubling and tripling the dataset I have (with random numbers) and running test_02 again (of course tripling the array sizes in the test itself as well..). For the linear implementation the times go like this:
Took: 18789 milliseconds..
Took: 34396 milliseconds..
Took: 53504 milliseconds..
For the binary search implementation, I got values as follows:
Took: 18644 milliseconds..
Took: 33831 milliseconds..
Took: 52886 milliseconds..
Your timing includes the retrieval of data from your hard drive. This could be taking the majority of your runtime. Omit the data load from your timing to get a more accurate comparison of your two approaches. Imagine if it takes up 18 seconds and you're comparing 18.644 vs 18.789 (0.77% improvement) instead of 0.644 vs 0.789 (18.38% improvement).
If you have a linear operation O(n), such as loading a binary structure, and you combine it with a binary search O(log n), you end up with O(n). If you trust Big O notation, then you should expect O(n + log n) to not be significantly different from O(2 * n) as they both reduce to O(n).
Also, a binary search may perform better or worse than a linear search depending on the density of houses between towers. Consider, say 1024 homes with a tower evenly dispersed every 4 homes. A linear search will step 4 times per tower, while a binary search will take log2(1024)=10 steps per tower.
One more thing... your minNumOfTransmitters method is sorting the already-sorted array passed into it from test_01 and test_02. That resorting step takes longer than your searches themselves, which further obscures the timing differences between your two search algorithms.
======
I created a small timing class to give a better picture of what's happening. I've removed the line of code from minNumOfTransmitters to prevent it from rerunning the sort, and added a boolean param to select whether to use your binary version. It totals the sum of times for 400 iterations, separating out each step. The results on my system illustrate that the load time dwarfs the sort time, which in turn dwarfs the solve time.
Load: 22.565s
Sort: 4.518s
Linear: 0.012s
Binary: 0.003s
It's easy to see how optimizing that last step doesn't make much difference in overall runtime.
private static class Timing {
    public long load = 0;
    public long sort = 0;
    public long solve1 = 0;
    public long solve2 = 0;

    private String secs(long millis) {
        return String.format("%3d.%03ds", millis / 1000, millis % 1000);
    }

    public String toString() {
        return " Load: " + secs(load) + "\n Sort: " + secs(sort) + "\nLinear: " + secs(solve1) + "\nBinary: " + secs(solve2);
    }

    public void add(Timing timing) {
        load += timing.load;
        sort += timing.sort;
        solve1 += timing.solve1;
        solve2 += timing.solve2;
    }
}
static Timing test_01() throws FileNotFoundException {
    Timing timing = new Timing();
    long start = System.currentTimeMillis();
    final File file = new File("c:\\path\\to\\xnpwdiG3.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    timing.load += System.currentTimeMillis() - start;

    start = System.currentTimeMillis();
    final int[] uniqueHouseLocationsSorted = HackerlandRadioTransmitters.uniqueHouseLocationsSorted(houseLocations);
    timing.sort = System.currentTimeMillis() - start;

    start = System.currentTimeMillis();
    final int minNumOfTransmitters = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, false);
    timing.solve1 = System.currentTimeMillis() - start;

    start = System.currentTimeMillis();
    final int minNumOfTransmittersBin = HackerlandRadioTransmitters.minNumOfTransmitters(uniqueHouseLocationsSorted, 73381, true);
    timing.solve2 = System.currentTimeMillis() - start;

    return timing;
}
In your time measurement you include operations that are much slower than the array search, namely filesystem I/O and array sorting.
I/O in general (reading/writing from filesystem, network communication) is by orders of magnitude slower than operations that involve only CPU and RAM access.
Let's rewrite your test in a way that does not read the file on every loop iteration:
static void test_02() throws FileNotFoundException {
    final File file = new File("input.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    scanner.close();
    final int rounds = 400;
    final int[] uniqueHouseLocationsSorted = uniqueHouseLocationsSorted(houseLocations);
    final int transmitterRange = 73381;
    final long start = System.currentTimeMillis();
    for (int i = 0; i < rounds; i++) {
        final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
    }
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds..");
}
Notice in this version of the test the file is read only once and time measuring starts after that.
With the above, I get Took: 1700 milliseconds.. (more or less a few millis) for both the iterative version and the binary search. So we still can't see that binary search is faster. That's because almost all of that time goes into sorting the array 400 times.
Now let's remove the line that sorts the input array from the minNumOfTransmitters method. We sort the array (once) anyway at the beginning of the test.
Now we can see that things are much faster. After removing the line houseLocations = uniqueHouseLocationsSorted(houseLocations) from minNumOfTransmitters I get: Took: 68 milliseconds.. for the iterative version. Clearly, since this duration is already very small, we will not see a significant difference with the binary search version.
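For reference, here is a sketch of minNumOfTransmitters after that change; it is the original method from the question with only the re-sorting line removed, so the caller is now responsible for passing in a sorted, de-duplicated array:
public static int minNumOfTransmitters(int[] houseLocations, int transmitterRange) {
    // Note: no uniqueHouseLocationsSorted call here anymore.
    int towerCount = 0;
    for (int nextHouseNotCovered = 0; nextHouseNotCovered < houseLocations.length; ) {
        final int towerLocation = findNextTowerIndex(houseLocations, nextHouseNotCovered, transmitterRange);
        towerCount++;
        nextHouseNotCovered = nextHouseNotCoveredIndex(houseLocations, towerLocation, transmitterRange);
        if (nextHouseNotCovered == -1) {
            break;
        }
    }
    return towerCount;
}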
So let's increase the number of loop rounds to: 100000.
Now I get Took: 2121 milliseconds.. for the iterative version and Took: 36 milliseconds.. for the binary search version.
Because we have now isolated what we measure and focus on the array searches, rather than including much slower operations, we can see the big performance difference in favor of binary search.
If you want to see how many times binary search enters its while loop, you can implement it yourself and add a counter:
private static int binarySearch0(int[] a, int fromIndex, int toIndex, int key) {
    int low = fromIndex;
    int high = toIndex - 1;
    int loop = 0;
    while (low <= high) {
        loop++;
        int mid = (low + high) >>> 1;
        int midVal = a[mid];
        if (midVal < key) {
            low = mid + 1;
        } else if (midVal > key) {
            high = mid - 1;
        } else {
            return mid; // key found
        }
    }
    System.out.println("binary search looped " + loop + " times");
    return -(low + 1); // key not found.
}
The method is copied from the Arrays class in the JDK - I just added the loop counter and the println.
When the length of the array to search is 73382, the loop enters only 16 times.
That is exactly what we expect: log2(73382) ≈ 16.
I agree with the other answers that the main issue with your tests is that they measure the wrong things: I/O and sorting. But I don't think the suggested tests are good. My suggestion is the following:
static void test_02() throws FileNotFoundException {
    final File file = new File("43620487.txt");
    final Scanner scanner = new Scanner(file);
    int[] houseLocations = new int[73382];
    for (int counter = 0; counter < 73382; counter++) {
        houseLocations[counter] = scanner.nextInt();
    }
    final int[] uniqueHouseLocationsSorted = uniqueHouseLocationsSorted(houseLocations);
    final Random random = new Random(0); // fixed seed to have the same sequences in all tests
    long sum = 0;
    // warm up
    for (int i = 0; i < 100; i++) {
        final int transmitterRange = random.nextInt(70000) + 1;
        final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
        sum += minNumOfTransmitters;
    }
    // actual measure
    final long start = System.currentTimeMillis();
    for (int i = 0; i < 4000; i++) {
        final int transmitterRange = random.nextInt(70000) + 1;
        final int minNumOfTransmitters = minNumOfTransmitters(uniqueHouseLocationsSorted, transmitterRange);
        sum += minNumOfTransmitters;
    }
    final long end = System.currentTimeMillis();
    System.out.println("Took: " + (end - start) + " milliseconds. Sum = " + sum);
}
Note also that I removed all System.out.println calls from findNextTowerIndex and nextHouseNotCoveredIndex, as well as the uniqueHouseLocationsSorted call from minNumOfTransmitters, as they affect performance testing too.
So what I think is important here:
Move all I/O and sorting out of the measurement loop
Perform some warm up outside of measurement
Use the same random sequence for all measurements
Don't discard the result of the calculation, so the JIT can't optimize that call away altogether
With such a test I see about a 10x difference on my machine: around 80ms vs. around 8ms.
And if you really want to do performance tests in Java, you should consider using JMH, the Java Microbenchmark Harness.
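A minimal JMH sketch for this problem might look as follows (assuming JMH is on the classpath; the class name, benchmark method and the loadAndPrepare helper are my own hypothetical names, not from the question):
import org.openjdk.jmh.annotations.*;

// Hypothetical JMH benchmark for minNumOfTransmitters. JMH handles the
// warmup, iteration and dead-code issues discussed above; returning the
// result keeps the JIT from optimizing the call away.
@State(Scope.Benchmark)
public class TransmitterBenchmark {

    int[] uniqueHouseLocationsSorted;

    @Setup
    public void setup() {
        // Load input.txt, sort and de-duplicate once, outside the measurement.
        uniqueHouseLocationsSorted = loadAndPrepare("input.txt"); // hypothetical helper
    }

    @Benchmark
    public int linearScan() {
        return minNumOfTransmitters(uniqueHouseLocationsSorted, 35000);
    }
}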
I agree with the other answers: the I/O time is the biggest problem, sorting is second, and the search itself consumes the least time.
I also agree with phatfingers's example: a binary search can sometimes be worse than a linear search in your problem, because the linear search makes only one pass over all elements in total (n comparisons), while the binary search runs once per tower (O(log n) × #towers). One suggestion is to not start the binary search from 0, but from the current location:
int nextTowerIndex = Arrays.binarySearch(houseLocations, houseNotCoveredIndex + 1, houseLocations.length, farthestHouseLocationAllowed)
Then it should be roughly O(log n) × #towers / 2.
Going even further, you could calculate how many houses each tower covers on average, first compare against that average, and then start the binary search from houseNotCoveredIndex + avg + 1, but I'm not sure the performance would be much better.
P.S.: For the sort-and-unique step you can use a TreeSet:
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
    final Set<Integer> integers = new TreeSet<>();
    for (int houseLocation : houseLocations) {
        integers.add(houseLocation);
    }
    int[] unique = new int[integers.size()];
    int i = 0;
    for (Integer loc : integers) {
        unique[i] = loc;
        i++;
    }
    return unique;
}
uniqueHouseLocationsSorted is not efficient; andy's solution seems better, but I think this could improve the time spent (note that I did not test the code):
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
    int size = houseLocations.length;
    if (size == 0) return null; // you have to check for null later, or maybe throw an exception here
    Arrays.sort(houseLocations);
    final int[] houseLocationsUnique = new int[size];
    int previous = houseLocationsUnique[0] = houseLocations[0];
    int innerCounter = 1;
    for (int i = 1; i < size; i++) {
        int houseLocation = houseLocations[i];
        if (houseLocation == previous) continue; // since elements are sorted this is faster
        previous = houseLocationsUnique[innerCounter++] = houseLocation;
    }
    return Arrays.copyOf(houseLocationsUnique, innerCounter);
}
Consider also using an ArrayList, as copying the array takes time.
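As a further alternative (my own untested sketch, not from the answers above): on Java 8 the same sort-and-deduplicate contract can be written as a stream pipeline.
import java.util.Arrays;

// Same contract as uniqueHouseLocationsSorted: distinct() drops duplicates,
// sorted() orders the result. Concise, though the stream machinery adds
// some overhead compared to the hand-written loop.
public static int[] uniqueHouseLocationsSorted(final int[] houseLocations) {
    return Arrays.stream(houseLocations).distinct().sorted().toArray();
}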
In my computer architecture class I just learned that running an algebraic expression involving multiplication through a multiplication circuit can be more costly than running it through an addition circuit, if the number of required multiplications is less than 3, e.g. 3x. If I'm doing this type of computation a few billion times, does it pay off to write it as x + x + x, or does the JIT optimizer take care of this?
I wouldn't expect there to be a huge difference writing it one way or the other.
The compiler will probably take care of making all of those equivalent.
You can try each method and measure how long it takes; that could give you a good hint to answer your own question.
Here's some code that does the same calculations 10 million times using different approaches (x + x + x, 3*x, and a bit shift followed by a subtraction).
They all seem to take approximately the same amount of time, as measured by System.nanoTime.
Sample output for one run:
sum : 594599531
mult : 568783654
shift : 564081012
You can also take a look at this question, which talks about how the compiler's optimizations can probably handle those and more complex cases: Is shifting bits faster than multiplying and dividing in Java? .NET?
Code:
import java.util.Random;

public class TestOptimization {
    public static void main(String args[]) {
        Random rn = new Random();
        long l1 = 0, l2 = 0, l3 = 0;

        long nano1 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l1 += sum(num);
        }
        long nano2 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l2 += mult(num);
        }
        long nano3 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l3 += shift(num);
        }
        long nano4 = System.nanoTime();

        System.out.println(l1);
        System.out.println(l2);
        System.out.println(l3);
        System.out.println("sum : " + (nano2 - nano1));
        System.out.println("mult : " + (nano3 - nano2));
        System.out.println("shift : " + (nano4 - nano3));
    }

    private static long sum(long x) {
        return x + x + x;
    }

    private static long mult(long x) {
        return 3 * x;
    }

    private static long shift(long x) {
        return (x << 2) - x;
    }
}
This now very common algorithm question was asked by a proctor during a whiteboard exam session. My job was to observe, listen to and objectively judge the answers given, but I had neither control over this question asked nor could interact with the person answering.
Five minutes were given to analyze the problem, during which the candidate could write bullet notes or pseudocode (this was allowed during actual code-writing as well, as long as it was clearly indicated, and people including pseudocode as comments or TODO tasks before figuring out the algorithm got bonus points).
"A child is climbing up a staircase with n steps, and can hop either 1 step, 2 steps, or 3 steps at a time. Implement a method to count how many possible ways the child can jump up the stairs."
The person who got this question couldn't get started on the recursion algorithm on the spot, so the proctor eventually led him, piece by piece, to HIS solution, which in my opinion was not optimal (well, different from my chosen solution, making it difficult to grade someone objectively with respect to code optimization).
Proctor:
public class Staircase {
    public static int stairs;

    public Staircase() {
        int a = counting(stairs);
        System.out.println(a);
    }

    static int counting(int n) {
        if (n < 0)
            return 0;
        else if (n == 0)
            return 1;
        else
            return counting(n - 1) + counting(n - 2) + counting(n - 3);
    }

    public static void main(String[] args) {
        Staircase child;
        long t1 = System.nanoTime();
        for (int i = 0; i < 30; i++) {
            stairs = i;
            child = new Staircase();
        }
        System.out.println("Time:" + ((System.nanoTime() - t1) / 1000000));
    }
}
Mine:
public class Steps {
    public static int stairs;
    int c2 = 0;

    public Steps() {
        int a = step2(0);
        System.out.println(a);
    }

    public static void main(String[] args) {
        Steps steps;
        long t1 = System.nanoTime();
        for (int i = 0; i < 30; i++) {
            stairs = i;
            steps = new Steps();
        }
        System.out.println("Time:" + ((System.nanoTime() - t1) / 1000000));
    }

    public int step2(int c) {
        if (c + 1 < stairs) {
            if (c + 2 <= stairs) {
                if (c + 3 <= stairs) {
                    step2(c + 3);
                }
                step2(c + 2);
            }
            step2(c + 1);
        } else {
            c2++;
        }
        return c2;
    }
}
OUTPUT:
Proctor: Time: 356
Mine: Time: 166
Could someone clarify which algorithm is better or more optimal? The execution time of my algorithm appears to be less than half as long (though I am referencing and updating an additional integer, which I thought was rather inconsequential), and it allows setting arbitrary starting and ending steps without needing to first know their difference (although for anything higher than n=40 you will need a beast of a CPU).
My question (feel free to ignore the above example): how do you properly benchmark a similar recursion-based problem (Tower of Hanoi, etc.)? Do you just look at the timing, or do you take other things into consideration (heap?)?
Teaser: You may perform this computation easily in less than one millisecond. Details follow...
Which one is "better"?
The question of which algorithm is "better" may refer to the execution time, but also to other things, like the implementation style.
The Staircase implementation is shorter, more concise and IMHO more readable. And more importantly: It does not involve a state. The c2 variable that you introduced there destroys the advantages (and beauty) of a purely functional recursive implementation. This may easily be fixed, although the implementation then already becomes more similar to the Staircase one.
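One possible fix along those lines (a sketch of my own, not the candidate's code): return the count instead of accumulating it in a field.
// Stateless variant of step2 (hypothetical rewrite): the number of ways is
// returned and summed, instead of being accumulated in the c2 field.
public static int countWays(int c, int stairs) {
    if (c + 1 >= stairs) {
        return 1; // at the top, or one forced single step away: exactly one way
    }
    int count = countWays(c + 1, stairs);
    if (c + 2 <= stairs) {
        count += countWays(c + 2, stairs);
    }
    if (c + 3 <= stairs) {
        count += countWays(c + 3, stairs);
    }
    return count;
}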
Measuring performance
Regarding the question about execution time: Properly measuring execution time in Java is tricky.
Related reading:
How do I write a correct micro-benchmark in Java?
Java theory and practice: Anatomy of a flawed microbenchmark
HotSpot Internals
In order to properly and reliably measure execution times, there exist several options. Apart from a profiler, like VisualVM, there are frameworks like JMH or Caliper, but admittedly, using them may be some effort.
For the simplest form of a very basic, manual Java Microbenchmark you have to consider the following:
Run the algorithms multiple times, to give the JIT a chance to kick in
Run the algorithms alternatingly and not only one after the other
Run the algorithms with increasing input size
Somehow save and print the results of the computation, to prevent the computation from being optimized away
Don't print anything to the console during the benchmark
Consider that timings may be distorted by the garbage collector (GC)
Again: These are only rules of thumb, and there may still be unexpected results (refer to the links above for more details). But with this strategy, you usually obtain a good indication about the performance, and at least can see whether it's likely that there really are significant differences between the algorithms.
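As a rough illustration, a hand-rolled harness following those rules might look like this (a sketch only, still no substitute for JMH or Caliper; it uses the Staircase.counting and Steps.step entry points from the complete example further below):
// Alternate the two implementations, grow the input, keep the results
// alive, and print only outside the measured regions.
long keepAlive = 0;
for (int n = 10; n <= 30; n += 5) {            // increasing input size
    long timeA = 0, timeB = 0;
    for (int round = 0; round < 20; round++) { // repeated, alternating runs
        long t = System.nanoTime();
        keepAlive += Staircase.counting(n);
        timeA += System.nanoTime() - t;

        t = System.nanoTime();
        keepAlive += Steps.step(n);
        timeB += System.nanoTime() - t;
    }
    System.out.println(n + ": staircase=" + timeA + "ns, steps=" + timeB + "ns");
}
System.out.println(keepAlive); // prevent dead-code elimination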
The differences between the approaches
The Staircase implementation and the Steps implementation are not very different.
The main conceptual difference is that the Staircase implementation is counting down, and the Steps implementation is counting up.
The main difference that actually affects the performance is how the Base Case is handled (see Recursion on Wikipedia). In your implementation, you avoid calling the method recursively when it is not necessary, at the cost of some additional if statements. The Staircase implementation uses a very generic treatment of the base case, by just checking whether n < 0.
One could consider an "intermediate" solution that combines ideas from both approaches:
class Staircase2
{
    public static int counting(int n)
    {
        int result = 0;
        if (n >= 1)
        {
            result += counting(n - 1);
            if (n >= 2)
            {
                result += counting(n - 2);
                if (n >= 3)
                {
                    result += counting(n - 3);
                }
            }
        }
        else
        {
            result += 1;
        }
        return result;
    }
}
It's still recursive without state, and sums up the intermediate results, avoiding many of the "useless" calls by using some if checks. It's already noticeably faster than the original Staircase implementation, but still a tad slower than the Steps implementation.
Why both solutions are slow
For both implementations, there's not really anything to be computed. The method consists of a few if statements and some additions. The most expensive thing here is actually the recursion itself, with the deeeeply nested call tree.
And that's the key point here: It's a call tree. Imagine what it is computing for a given number of steps, as a "pseudocode call hierarchy":
compute(5)
  compute(4)
    compute(3)
      compute(2)
        compute(1)
          compute(0)
        compute(0)
      compute(1)
        compute(0)
      compute(0)
    compute(2)
      compute(1)
        compute(0)
      compute(0)
    compute(1)
      compute(0)
  compute(3)
    compute(2)
      compute(1)
        compute(0)
      compute(0)
    compute(1)
      compute(0)
    compute(0)
  compute(2)
    compute(1)
      compute(0)
    compute(0)
One can imagine that this grows exponentially as the number becomes larger, and that the same results are computed hundreds, thousands or millions of times. This can be avoided.
The fast solution
The key idea to make the computation faster is to use Dynamic Programming. This basically means that intermediate results are stored for later retrieval, so that they don't have to be computed again and again.
It's implemented in this example, which also compares the execution time of all approaches:
import java.util.Arrays;

public class StaircaseSteps
{
    public static void main(String[] args)
    {
        for (int i = 5; i < 33; i++)
        {
            runStaircase(i);
            runSteps(i);
            runDynamic(i);
        }
    }

    private static void runStaircase(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += Staircase.counting(i);
        }
        long after = System.nanoTime();
        System.out.println("Staircase up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }

    private static void runSteps(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += Steps.step(i);
        }
        long after = System.nanoTime();
        System.out.println("Steps up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }

    private static void runDynamic(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += StaircaseDynamicProgramming.counting(i);
        }
        long after = System.nanoTime();
        System.out.println("Dynamic up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }
}

class Staircase
{
    public static int counting(int n)
    {
        if (n < 0)
            return 0;
        else if (n == 0)
            return 1;
        else
            return counting(n - 1) + counting(n - 2) + counting(n - 3);
    }
}

class Steps
{
    static int c2 = 0;
    static int stairs;

    public static int step(int c)
    {
        c2 = 0;
        stairs = c;
        return step2(0);
    }

    private static int step2(int c)
    {
        if (c + 1 < stairs)
        {
            if (c + 2 <= stairs)
            {
                if (c + 3 <= stairs)
                {
                    step2(c + 3);
                }
                step2(c + 2);
            }
            step2(c + 1);
        }
        else
        {
            c2++;
        }
        return c2;
    }
}

class StaircaseDynamicProgramming
{
    public static int counting(int n)
    {
        int results[] = new int[n + 1];
        Arrays.fill(results, -1);
        return counting(n, results);
    }

    private static int counting(int n, int results[])
    {
        int result = results[n];
        if (result == -1)
        {
            result = 0;
            if (n >= 1)
            {
                result += counting(n - 1, results);
                if (n >= 2)
                {
                    result += counting(n - 2, results);
                    if (n >= 3)
                    {
                        result += counting(n - 3, results);
                    }
                }
            }
            else
            {
                result += 1;
            }
        }
        results[n] = result;
        return result;
    }
}
The results on my PC are as follows:
...
Staircase up to 29 gives 34850335 time 310.672814
Steps up to 29 gives 34850335 time 112.237711
Dynamic up to 29 gives 34850335 time 0.089785
Staircase up to 30 gives 64099760 time 578.072582
Steps up to 30 gives 64099760 time 204.264142
Dynamic up to 30 gives 64099760 time 0.091524
Staircase up to 31 gives 117897840 time 1050.152703
Steps up to 31 gives 117897840 time 381.293274
Dynamic up to 31 gives 117897840 time 0.084565
Staircase up to 32 gives 216847936 time 1929.43348
Steps up to 32 gives 216847936 time 699.066728
Dynamic up to 32 gives 216847936 time 0.089089
Small changes in the order of statements ("micro-optimizations") may have a small impact, or make a noticeable difference. But using an entirely different approach can make the real difference.
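For completeness, a sketch of one more such approach (my addition, not part of the answer above): the memoized recursion can itself be replaced by a bottom-up loop that keeps only the last three values, giving O(n) time and O(1) space.
// Iterative, bottom-up version of counting(n): as i advances, a holds
// counting(i), b holds counting(i-1) and c holds counting(i-2).
static long countingIterative(int n) {
    if (n < 0) {
        return 0;
    }
    long a = 1, b = 0, c = 0; // counting(0), counting(-1), counting(-2)
    for (int i = 1; i <= n; i++) {
        long next = a + b + c;
        c = b;
        b = a;
        a = next;
    }
    return a;
}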
I was running this bit of code to compare performance of 3 equivalent methods of calculating wraparound coordinates:
public class Test {
    private static final float MAX = 1000000;

    public static void main(String[] args) {
        long time = System.currentTimeMillis();
        for (float i = -MAX; i <= MAX; i++) {
            for (float j = 100; j < 10000; j++) {
                method1(i, j);
                //method2(i, j);
                //method3(i, j);
            }
        }
        System.out.println(System.currentTimeMillis() - time);
    }

    private static float method1(float value, float max) {
        value %= max + 1;
        return (value < 0) ? value + max : value;
    }

    private static float method2(float value, float max) {
        value %= max + 1;
        if (value < 0)
            value += max;
        return value;
    }

    private static float method3(float value, float max) {
        return ((value % max) + max) % max;
    }
}
I ran this code three times with method1, method2, and method3 (one method per test).
What was peculiar was that method1 spawned several Java processes and managed to utilize nearly 100% of my dual-core CPU, while method2 and method3 spawned only one process and therefore could only take up 25% of my CPU (since there are 4 virtual cores). Why would this happen?
Here are the details of my machine:
Java 1.6.0_65
Macbook Air 13" late 2013
After reviewing the various comments and failing to consistently replicate the behavior, I have decided that this is indeed a fluke of the JRE. Thanks to the comments, I have learned that benchmarking this way is not very useful and that I should use a micro-benchmarking framework such as Google's Caliper to test such minute differences.
Some resources I found:
Google's java micro-benchmark page
IBM page discussing a flawed benchmark and common pitfalls
Oracle's page on micro-benchmarking
Stackoverflow question on micro-benchmarking
In order to practise the Java 8 streams I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a,b < 100) and takes ~0.135s on my Core i5 760.
public static int digitSum(BigInteger x)
{
    int sum = 0;
    for (char c : x.toString().toCharArray()) { sum += Integer.valueOf(c + ""); }
    return sum;
}

@Test
public void solve()
{
    int max = 0;
    for (int i = 1; i < 100; i++)
        for (int j = 1; j < 100; j++)
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
    System.out.println(max);
}
My solution, which I expected to be faster because of the parallelism, actually took 0.25s (0.19s without the parallel()):
int max = IntStream.range(1, 100).parallel()
        .map(i -> IntStream.range(1, 100)
                .map(j -> digitSum(BigInteger.valueOf(i).pow(j)))
                .max().getAsInt())
        .max().getAsInt();
My questions
did I do the conversion right or is there a better way to convert nested loops to stream calculations?
why is the stream variant so much slower than the old one?
why did the parallel() statement actually increase the time from 0.19s to 0.25s?
I know that microbenchmarks are fragile and parallelism is only worth it for big problems but for a CPU, even 0.1s is an eternity, right?
Update
I measure with the Junit 4 framework in Eclipse Kepler (it shows the time taken for executing a test).
My results for a,b<1000 instead of 100:
traditional loop 186s
sequential stream 193s
parallel stream 55s
Update 2
Replacing sum += Integer.valueOf(c + ""); with sum += c - '0'; (thanks, Peter!) shaved 10 whole seconds off the parallel method, bringing it to 45s. I didn't expect such a big performance impact!
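For reference, the helper after that replacement looks like this:
// digitSum using c - '0': no temporary String and no Integer boxing per
// digit, which is where those 10 seconds went.
public static int digitSum(BigInteger x) {
    int sum = 0;
    for (char c : x.toString().toCharArray()) {
        sum += c - '0';
    }
    return sum;
}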
Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much, as it reduced the time only to 44.8s (yes, it adds the a=0 and b=0 cases, but I think this won't impact the performance much):
int max = IntStream.range(0, 4).parallel()
        .map(m -> IntStream.range(0, 250)
                .map(i -> IntStream.range(1, 1000)
                        .map(j -> digitSum(BigInteger.valueOf(250 * m + i).pow(j)))
                        .max().getAsInt())
                .max().getAsInt())
        .max().getAsInt();
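An alternative to this manual partitioning (my suggestion, not from the original post) is to run the parallel stream inside a dedicated ForkJoinPool, which bounds the stream's parallelism at the pool's size; note this relies on an implementation detail of parallel streams rather than documented API behavior, and the method name below is my own.
import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

// Sketch: a stream submitted to a dedicated 4-thread pool runs with that
// pool's parallelism instead of the common pool's default.
static int maxDigitSumWithBoundedParallelism() {
    ForkJoinPool pool = new ForkJoinPool(4);
    try {
        return pool.submit(() ->
                IntStream.range(1, 1000).parallel()
                        .map(i -> IntStream.range(1, 1000)
                                .map(j -> digitSum(BigInteger.valueOf(i).pow(j)))
                                .max().getAsInt())
                        .max().getAsInt()
        ).join();
    } finally {
        pool.shutdown();
    }
}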
I have created a quick and dirty micro benchmark based on your code. The results are:
loop: 3192
lambda: 3140
lambda parallel: 868
So the loop and lambda are equivalent and the parallel stream significantly improves the performance. I suspect your results are unreliable due to your benchmarking methodology.
public static void main(String[] args) {
    int sum = 0;
    // warmup
    for (int i = 0; i < 100; i++) {
        solve();
        solveLambda();
        solveLambdaParallel();
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solve();
        }
        long end = System.nanoTime();
        System.out.println("loop: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambda();
        }
        long end = System.nanoTime();
        System.out.println("lambda: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambdaParallel();
        }
        long end = System.nanoTime();
        System.out.println("lambda parallel : " + (end - start) / 1_000_000);
    }
    System.out.println(sum);
}

public static int digitSum(BigInteger x) {
    int sum = 0;
    for (char c : x.toString().toCharArray()) {
        sum += Integer.valueOf(c + "");
    }
    return sum;
}

public static int solve() {
    int max = 0;
    for (int i = 1; i < 100; i++) {
        for (int j = 1; j < 100; j++) {
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
        }
    }
    return max;
}

public static int solveLambda() {
    return IntStream.range(1, 100)
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

public static int solveLambdaParallel() {
    return IntStream.range(1, 100)
            .parallel()
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}
I have also run it with JMH, which is more reliable than manual tests. The results are consistent with the above (microseconds per call):
Benchmark                             Mode       Mean  Units
c.a.p.SO21968918.solve                avgt  32367.592  us/op
c.a.p.SO21968918.solveLambda          avgt  31423.123  us/op
c.a.p.SO21968918.solveLambdaParallel  avgt   8125.600  us/op
The problem you have is that you are looking at sub-optimal code. When you have code which might be heavily optimised, you are very dependent on whether the JVM is smart enough to optimise your code. Loops have been around much longer and are better understood.
One big difference in your loop code is that your working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache-friendly and you have very short-lived objects. In the stream() case you are building up collections, so there is more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.
why is the stream variant so much slower than the old one?
Loops are fairly well optimised, having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.
why did the parallel() statement actually increased the time from 0.19s to 0.25s?
Most likely you have a bottleneck on a shared resource. You create quite a bit of garbage, but this is usually fairly concurrent. Using more threads only guarantees you will have more overhead; it doesn't ensure you can take advantage of the extra CPU power you have.