Java 8 nested loops with streams & performance

In order to practise the Java 8 streams I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a,b < 100) and takes ~0.135s on my Core i5 760.
public static int digitSum(BigInteger x)
{
int sum = 0;
for(char c: x.toString().toCharArray()) {sum+=Integer.valueOf(c+"");}
return sum;
}
@Test public void solve()
{
int max = 0;
for(int i=1;i<100;i++)
for(int j=1;j<100;j++)
max = Math.max(max,digitSum(BigInteger.valueOf(i).pow(j)));
System.out.println(max);
}
My solution, which I expected to be faster because of the parallelism, actually took 0.25s (0.19s without the parallel()):
int max = IntStream.range(1,100).parallel()
.map(i -> IntStream.range(1, 100)
.map(j->digitSum(BigInteger.valueOf(i).pow(j)))
.max().getAsInt()).max().getAsInt();
My questions
did I do the conversion right or is there a better way to convert nested loops to stream calculations?
why is the stream variant so much slower than the old one?
why did the parallel() statement actually increase the time from 0.19s to 0.25s?
I know that microbenchmarks are fragile and parallelism is only worth it for big problems but for a CPU, even 0.1s is an eternity, right?
Update
I measure with the Junit 4 framework in Eclipse Kepler (it shows the time taken for executing a test).
My results for a,b<1000 instead of 100:
traditional loop 186s
sequential stream 193s
parallel stream 55s
Update 2
Replacing sum+=Integer.valueOf(c+""); with sum+= c - '0'; (thanks Peter!) shaved 10 whole seconds off the parallel method, bringing it to 45s. Didn't expect such a big performance impact!
Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much as it reduced the time only to 44.8s (yes, it adds a and b=0 but I think this won't impact the performance much):
int max = IntStream.range(0, 3).parallel()
.map(m -> IntStream.range(0,250)
.map(i -> IntStream.range(1, 1000)
.map(j->digitSum(BigInteger.valueOf(250*m+i).pow(j)))
.max().getAsInt()).max().getAsInt()).max().getAsInt();

I have created a quick and dirty micro benchmark based on your code. The results are:
loop: 3192
lambda: 3140
lambda parallel: 868
So the loop and lambda are equivalent and the parallel stream significantly improves the performance. I suspect your results are unreliable due to your benchmarking methodology.
public static void main(String[] args) {
int sum = 0;
//warmup
for (int i = 0; i < 100; i++) {
solve();
solveLambda();
solveLambdaParallel();
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solve();
}
long end = System.nanoTime();
System.out.println("loop: " + (end - start) / 1_000_000);
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solveLambda();
}
long end = System.nanoTime();
System.out.println("lambda: " + (end - start) / 1_000_000);
}
{
long start = System.nanoTime();
for (int i = 0; i < 100; i++) {
sum += solveLambdaParallel();
}
long end = System.nanoTime();
System.out.println("lambda parallel : " + (end - start) / 1_000_000);
}
System.out.println(sum);
}
public static int digitSum(BigInteger x) {
int sum = 0;
for (char c : x.toString().toCharArray()) {
sum += Integer.valueOf(c + "");
}
return sum;
}
public static int solve() {
int max = 0;
for (int i = 1; i < 100; i++) {
for (int j = 1; j < 100; j++) {
max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
}
}
return max;
}
public static int solveLambda() {
return IntStream.range(1, 100)
.map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
.max().getAsInt();
}
public static int solveLambdaParallel() {
return IntStream.range(1, 100)
.parallel()
.map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
.max().getAsInt();
}
I have also run it with jmh, which is more reliable than manual tests. The results are consistent with the above (microseconds per call):
Benchmark Mode Mean Units
c.a.p.SO21968918.solve avgt 32367.592 us/op
c.a.p.SO21968918.solveLambda avgt 31423.123 us/op
c.a.p.SO21968918.solveLambdaParallel avgt 8125.600 us/op
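As for the conversion itself, one alternative that some find easier to read is to flatten the two ranges into a single stream with flatMap and call max() once at the end. The sketch below is not from the posts above: the class name DigitSumStreams and the limit parameter are made up for illustration, and digitSum uses the c - '0' conversion mentioned in the question's second update.

```java
import java.math.BigInteger;
import java.util.stream.IntStream;

final class DigitSumStreams {

    // Sum of the decimal digits of x (x is assumed to be non-negative).
    static int digitSum(BigInteger x) {
        int sum = 0;
        for (char c : x.toString().toCharArray()) {
            sum += c - '0';
        }
        return sum;
    }

    // Plain nested loops over 1 <= a, b < limit.
    static int maxLoop(int limit) {
        int max = 0;
        for (int a = 1; a < limit; a++) {
            for (int b = 1; b < limit; b++) {
                max = Math.max(max, digitSum(BigInteger.valueOf(a).pow(b)));
            }
        }
        return max;
    }

    // Same computation with the inner range flattened into one stream.
    static int maxFlat(int limit) {
        return IntStream.range(1, limit)
                .flatMap(a -> IntStream.range(1, limit)
                        .map(b -> digitSum(BigInteger.valueOf(a).pow(b))))
                .max()
                .orElse(0); // empty range: fall back to 0 like the loop version
    }
}
```

Calling DigitSumStreams.maxFlat(100) covers the same a, b < 100 range as the question. Whether this is faster or slower than the nested map()/max() form is something to measure rather than assume.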

The problem you have is that you are looking at sub-optimal code. When you have code which might be heavily optimised, you are very dependent on whether the JVM is smart enough to optimise your code. Loops have been around much longer and are better understood.
One big difference in your loop code is that your working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache friendly and you have very short-lived objects. In the stream() case you are building up collections, so there is more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.
why is the stream variant so much slower than the old one?
Loops are fairly well optimised having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.
why did the parallel() statement actually increased the time from 0.19s to 0.25s?
Most likely you have a bottleneck on a shared resource. You create quite a bit of garbage, but this is usually fairly concurrent. Using more threads only guarantees you will have more overhead; it doesn't ensure you can take advantage of the extra CPU power you have.

Related

Parallelizing Sieve of Eratosthenes in Java

I am trying to make a parallel implementation of the Sieve of Eratosthenes. I made a boolean list which gets filled up with true's for the given size. Whenever a prime is found, all multiples of that prime are marked false in the boolean list.
The way I am trying to make this algorithm parallel is by firing up a new thread while still filtering the initial prime number. For example, the algorithm starts with prime = 2. In the filtering for loop, when i reaches prime * prime, I run another for loop that checks every number between the prime (2) and prime * prime (4). If that index in the boolean list is still true, I fire up another thread to filter that prime number.
The nested for loop creates more and more overhead as the primes to filter grow, so I limited it to run only when the prime number is < 100. I am assuming that by that time, the 100 million numbers will be somewhat filtered. The problem here is that this way, the primes to be filtered stay just under 9500 primes, while the algorithm stops at 10000 primes (prime * prime < size(100m)). I also think this is not at all the correct way to go about it. I have searched a lot online, but didn't manage to find any examples of parallel Java implementations of the sieve.
My code looks like this:
Main class:
public class Main {
private static ListenableQueue<Integer> queue = new ListenableQueue<>(new LinkedList<>());
private static ArrayList<Integer> primes = new ArrayList<>();
private static boolean serialList[];
private static ArrayList<Integer> serialPrimes = new ArrayList<>();
private static ExecutorService exec = Executors.newFixedThreadPool(10);
private static int size = 100000000;
private static boolean list[] = new boolean[size];
private static int lastPrime = 2;
public static void main(String[] args) {
Arrays.fill(list, true);
parallel();
}
public static void parallel() {
Long startTime = System.nanoTime();
int firstPrime = 2;
exec.submit(new Runner(size, list, firstPrime));
}
public static void parallelSieve(int size, boolean[] list, int prime) {
int queuePrimes = 0;
for (int i = prime; i * prime <= size; i++) {
try {
list[i * prime] = false;
if (prime < 100) {
if (i == prime * prime && queuePrimes <= 1) {
for (int j = prime + 1; j < i; j++) {
if (list[j] && j % prime != 0 && j > lastPrime) {
lastPrime = j;
startNewThread(j);
queuePrimes++;
}
}
}
}
} catch (ArrayIndexOutOfBoundsException ignored) { }
}
}
private static void startNewThread(int newPrime) {
if ((newPrime * newPrime) < size) {
exec.submit(new Runner(size, list, newPrime));
}
else {
exec.shutdown();
for (int i = 2; i < list.length; i++) {
if (list[i]) {
primes.add(i);
}
}
}
}
}
Runner class:
public class Runner implements Runnable {
private int arraySize;
private boolean[] list;
private int k;
public Runner(int arraySize, boolean[] list, int k) {
this.arraySize = arraySize;
this.list = list;
this.k = k;
}
@Override
public void run() {
Main.parallelSieve(arraySize, list, k);
}
}
I feel like there is a much simpler way to solve this...
Do you guys have any suggestions as to how I can make this parallelization working and maybe a bit simpler?
Creating a performant concurrent implementation of an algorithm like the Sieve of Eratosthenes is somewhat more difficult than creating a performant single-threaded implementation. The reason is that you need to find a way to partition the work in a way that minimises communication and interference between the parallel worker threads.
If you achieve complete isolation then you can hope for a speed increase approaching the number of logical processors available, or about one order of magnitude on a typical modern PC. By contrast, using a decent single-threaded implementation of the sieve will give you a speedup of at least two to three orders of magnitude. One simple cop-out would be to simply load the data from a file when needed, or to shell out to a decent prime-sieving program like Kim Walisch's PrimeSieve.
Even if we only want to look at the parallelisation problem, it is still necessary to have some insight into the algorithm itself and into the machine it runs on.
The most important aspect is that modern computers have deep cache hierarchies where only the L1 cache - typically 32 KB - is accessible at full speed and all other memory accesses incur significant penalties. Translated to the Sieve of Eratosthenes this means that you need to sieve your target range one 32 KB window at a time, instead of striding each prime over many megabytes. The small primes up to the square root of the target range end must be sieved before the parallel dance begins, but then each segment or window can be sieved independently.
Sieving a given window or segment necessitates determining the start offsets for the small primes that you want to sieve by, which means at least one modulo division per small prime per window, and division is an extremely slow operation. However, if you sieve consecutive segments instead of arbitrary windows placed anywhere in the range, then you can keep the end offsets for each prime in a vector and use them as start offsets for the next segment, thus eliminating the expensive computation of the start offset.
Thus, one promising parallelisation strategy for the Sieve of Eratosthenes would be to give each worker thread a contiguous group of 32 KB blocks to sieve, so that the start offset calculation needs to happen only once per worker. This way there cannot be memory access contention between workers, since each has its own independent subrange of the target range.
However, before you begin to parallelise - i.e., make your code more complex - you should first slim it down and reduce the work to be done to the absolute essentials. For example, take a look at this fragment from your code:
for (int i = prime; i * prime <= size; i++)
list[i * prime] = false;
Instead of recomputing loop bounds in every iteration and indexing with a multiplication, check the loop variable against a precomputed, loop-invariant value and reduce the multiplication to iterated addition:
for (int o = prime * prime; o <= size; o += prime)
list[o] = false;
There are two simple sieve-specific optimisations that can give significant speed boosts.
1) Leave the even numbers out of your sieve and pull the prime 2 out of thin air when needed. Bingo, you just doubled your performance.
2) Instead of sieving each segment by the small odd primes 3, 5, 7 and so on, blast a precomputed pattern over the segment (or even the whole range). This saves time because these small primes make many, many steps in each segment and account for the lion's share of sieving time.
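To make the first of these two points concrete, here is a minimal Java sketch of an odds-only sieve. It is only an illustration under my own assumptions (the class and method names are invented, and it is neither segmented nor pattern-based): slot i of the array stands for the odd number 2*i + 1, and the prime 2 is added by hand.

```java
import java.util.ArrayList;
import java.util.List;

final class OddsOnlySieve {

    // Returns all primes <= n. Slot i of the array stands for the odd number 2*i + 1.
    static List<Integer> primesUpTo(int n) {
        List<Integer> primes = new ArrayList<>();
        if (n < 2) {
            return primes;
        }
        primes.add(2);                        // the only even prime, added by hand
        int slots = (n - 1) / 2 + 1;          // one slot per odd number 1, 3, 5, ... <= n
        boolean[] composite = new boolean[slots];
        for (int i = 1; (2L * i + 1) * (2L * i + 1) <= n; i++) {
            if (!composite[i]) {
                int p = 2 * i + 1;
                // p*p is odd; a step of 2*p in value is a step of p in slot index.
                for (int j = (p * p) / 2; j < slots; j += p) {
                    composite[j] = true;
                }
            }
        }
        for (int i = 1; i < slots; i++) {     // slot 0 is the number 1, not a prime
            if (!composite[i]) {
                primes.add(2 * i + 1);
            }
        }
        return primes;
    }
}
```

The array holds half as many flags as a plain sieve for the same n, and the inner loop never touches an even multiple.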
There are more possible optimisations including a couple more low-hanging fruit but either the returns are diminishing or the effort curve rises steeply. Try searching Code Review for 'sieve'. Also, don't forget that you're fighting a Java compiler in addition to the algorithmic problem and the machine architecture, i.e. things like array bounds checking which your compiler may or may not be able to hoist out of loops.
To give you a ballpark figure: a single-threaded segmented odds-only sieve with precomputed patterns can sieve the whole 32-bit range in 2 to 4 seconds in C#, depending on how much TLC you apply in addition to things mentioned above. Your much smaller problem of primes up to 100000000 (1e8) is solved in less than 100 ms on my aging notebook.
Here's some code that shows how windowed sieving works. For clarity I left off all optimisations like odds-only representation or wheel-3 stepping when reading out the primes and so on. It's C# but that should be similar enough to Java to be readable.
Note: I called the sieve array eliminated because a true value indicates a crossed-off number (saves filling the array with all true at the beginning and it is more logical anyway).
static List<uint> small_primes_between (uint m, uint n)
{
m = Math.Max(m, 2);
if (m > n)
return new List<uint>();
Trace.Assert(n - m < int.MaxValue);
uint sieve_bits = n - m + 1;
var eliminated = new bool[sieve_bits];
foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
{
uint start = prime * prime, stride = prime;
if (start >= m)
start -= m;
else
start = (stride - 1) - (m - start - 1) % stride;
for (uint j = start; j < sieve_bits; j += stride)
eliminated[j] = true;
}
return remaining_numbers(eliminated, m);
}
//---------------------------------------------------------------------------------------------
static List<uint> remaining_numbers (bool[] eliminated, uint sieve_base)
{
var result = new List<uint>();
for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
if (!eliminated[i])
result.Add(sieve_base + i);
return result;
}
//---------------------------------------------------------------------------------------------
static List<uint> small_primes_up_to (uint n)
{
Trace.Assert(n < int.MaxValue); // size_t is int32_t in .Net (!)
var eliminated = new bool[n + 1]; // +1 because indexed by numbers
eliminated[0] = true;
eliminated[1] = true;
for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
if (!eliminated[i])
for (uint j = i * i; j <= n; j += i)
eliminated[j] = true;
return remaining_numbers(eliminated, 0);
}
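For readers who would rather see this in Java, below is a rough sketch of the same windowed idea with the windows handed to a parallel stream. It is my own simplified illustration rather than a port of the code above: the names (SegmentedSieve, countWindow, WINDOW) are invented, it only counts primes, it keeps every number instead of odds only, and it recomputes the start offset for each prime in each window, which is exactly the per-window division cost described earlier. Each window gets its own private array, so the workers share nothing but the read-only list of small primes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.LongStream;

final class SegmentedSieve {

    // Window size is an assumption: 32 * 1024 flags, one boolean per number.
    static final int WINDOW = 32 * 1024;

    // Plain sieve for the small primes <= n, returned in ascending order.
    static List<Integer> smallPrimesUpTo(int n) {
        boolean[] eliminated = new boolean[n + 1];
        List<Integer> primes = new ArrayList<>();
        for (int i = 2; i <= n; i++) {
            if (!eliminated[i]) {
                primes.add(i);
                for (long j = (long) i * i; j <= n; j += i) {
                    eliminated[(int) j] = true;
                }
            }
        }
        return primes;
    }

    // Counts the primes in [lo, hi] using an array private to this call.
    static long countWindow(long lo, long hi, List<Integer> smallPrimes) {
        boolean[] eliminated = new boolean[(int) (hi - lo + 1)];
        for (int p : smallPrimes) {
            long square = (long) p * p;
            if (square > hi) {
                break; // larger primes cannot cross off anything in this window
            }
            // First multiple of p inside the window, but never below p*p.
            long start = Math.max(square, (lo + p - 1) / p * p);
            for (long j = start; j <= hi; j += p) {
                eliminated[(int) (j - lo)] = true;
            }
        }
        long count = 0;
        for (boolean e : eliminated) {
            if (!e) {
                count++;
            }
        }
        return count;
    }

    // Counts primes <= n; windows are independent, so they may run in parallel.
    static long countPrimes(long n) {
        if (n < 2) {
            return 0;
        }
        List<Integer> smallPrimes = smallPrimesUpTo((int) Math.sqrt((double) n) + 1);
        long windows = (n - 2) / WINDOW + 1;
        return LongStream.range(0, windows)
                .parallel()
                .map(w -> {
                    long lo = 2 + w * WINDOW;
                    long hi = Math.min(lo + WINDOW - 1, n);
                    return countWindow(lo, hi, smallPrimes);
                })
                .sum();
    }
}
```

A call such as SegmentedSieve.countPrimes(100_000_000L) would cover the range from the question. Giving each worker a contiguous run of windows and carrying the offsets over, as suggested above, would be the next refinement.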

Java Math.min/max performance

EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)
I've read a bit about how HotSpot has some "intrinsics" that it injects into the code, especially for the Java standard Math libs (from here)
So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (especially since I've heard min/max can compile to branchless asm).
public class OpsMath {
public static final int max(final int a, final int b) {
if (a > b) {
return a;
}
return b;
}
}
That's my implementation. From another SO question I've read that using the ternary operator uses an extra register, but I haven't found significant differences between doing an if block and using a ternary operator (ie, return ( a > b ) ? a : b ).
Allocating a 8Mb int array (ie, 2 million values), and randomizing it, I do the following test:
try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
int max = Integer.MIN_VALUE;
for ( int i = 0; i < array.length; ++i )
{
max = OpsMath.max( max, array[i] );
// max = Math.max( max, array[i] );
}
}
I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.
'max' is added to a list outside the benchmark block and printed later, so to avoid the JVM optimizing the whole block away.
The array is randomized each time the test runs.
Running the test 6 times, it gives these results:
Java standard Math:
millis to max 9.242167
millis to max 2.1566199999999998
millis to max 2.046396
millis to max 2.048616
millis to max 2.035761
millis to max 2.001044
So fairly stable after the first run, and running the tests again gives similar results.
OpsMath:
millis to max 8.65418
millis to max 1.161559
millis to max 0.955851
millis to max 0.946642
millis to max 0.994543
millis to max 0.9469069999999999
Again, very stable results after the first run.
The question is: Why? That's quite a big difference there. And I have no idea why. Even if I implement my max() method exactly like Math.max() (ie, return (a >= b) ? a : b ) I still get better results! It makes no sense.
Specs:
CPU: Intel i5 2500, 3,3Ghz.
Java Version: JDK 8 (public march 18 release), x64.
Debian Jessie (testing release) x64.
I have yet to try with 32 bit JVM.
EDIT: Self contained test as requested. Added a line to force the JVM to preload Math and OpsMath classes. That eliminates the 18ms cost of the first iteration for OpsMath test.
// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " +
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
int max = Integer.MIN_VALUE;
// Randomize the array.
for ( int i = 0; i < array.length; ++i )
{
array[i] = r.nextInt();
}
final long start = System.nanoTime();
for ( int i = 0; i < array.length; ++i )
{
max = Math.max( array[i], max );
// OpsMath.max() method implemented as described.
// max = OpsMath.max( array[i], max );
}
// Calc time.
final double end = (System.nanoTime() - start);
// Store results.
times.add( Double.valueOf( end ) );
results.add( Integer.valueOf( max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
System.out.println( "IT" + i + " result: " + results.get( i ) );
System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}
Java Math.max result:
IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047
OpsMath.max result:
IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984
Still the same overall results. I've tried with randomizing the array only once, and repeating the tests over the same array, I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.
It's hard to tell why Math.max is slower than OpsMath.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: On the n-th iteration, the probability of
Math.max( array[i], max );
being not equal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n and given
final int[] array = new int[(8*1024*1024)/4];
it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability; it always takes the same amount of time to execute. The conditional move instruction is faster than branch prediction if the branch is very hard to predict. On the other hand, branch prediction is faster if the branch can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to the best and worst case of branching. [1]
In your case all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is practically guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).
This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
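To get a feel for how rarely the "new maximum" branch is taken, one can simply count it. The sketch below is my own illustration (the class name and the fixed seed are arbitrary choices); it fills an array of the same size as in the question and counts the updates of the running maximum. For independent random values the expected count grows only like ln(n), which is about 15 for two million elements.

```java
import java.util.Random;

final class RunningMaxUpdates {

    // Counts how often the "new maximum" branch is taken while scanning the array.
    static int countUpdates(int[] values) {
        int max = Integer.MIN_VALUE;
        int updates = 0;
        for (int v : values) {
            if (v > max) {
                max = v;
                updates++;
            }
        }
        return updates;
    }

    public static void main(String[] args) {
        int[] array = new int[(8 * 1024 * 1024) / 4];
        Random r = new Random(42); // fixed seed, chosen arbitrarily
        for (int i = 0; i < array.length; i++) {
            array[i] = r.nextInt();
        }
        System.out.println("elements:     " + array.length);
        System.out.println("branch taken: " + countUpdates(array));
    }
}
```

A sorted ascending array is the opposite extreme: there the branch is taken on every element, which is just as easy to predict.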
[1] Both conditional move and predicted branch take one cycle. The problem with the former is that it needs its two operands, and this takes an additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.
As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in comments. My own benchmark shows that conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.
When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:
0x00007f4b65425c50: mov %r11d,%edi ;*getstatic array
; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov 0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov 0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov 0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov 0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov 0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov 0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov 0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov 0x1c(%rax,%rdx,4),%r14d ;*iaload
; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp %edi,%r8d
0x00007f4b65425c7c: cmovl %edi,%r8d
0x00007f4b65425c80: cmp %r8d,%r10d
0x00007f4b65425c83: cmovl %r8d,%r10d
0x00007f4b65425c87: cmp %r10d,%ecx
0x00007f4b65425c8a: cmovl %r10d,%ecx
0x00007f4b65425c8e: cmp %ecx,%r14d
0x00007f4b65425c91: cmovl %ecx,%r14d
0x00007f4b65425c95: cmp %r14d,%esi
0x00007f4b65425c98: cmovl %r14d,%esi
0x00007f4b65425c9c: cmp %esi,%ebx
0x00007f4b65425c9e: cmovl %esi,%ebx
0x00007f4b65425ca1: cmp %ebx,%r9d
0x00007f4b65425ca4: cmovl %ebx,%r9d
0x00007f4b65425ca8: cmp %r9d,%r11d
0x00007f4b65425cab: cmovl %r9d,%r11d ;*invokestatic max
; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add $0x8,%edx ;*iinc
; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp $0x1ffff9,%edx
0x00007f4b65425cb8: jl 0x00007f4b65425c50
Apart from the weirdly-placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times that does mostly what you'd expect---loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, here it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls or 8 separate cmovl chains to be merged after the loop is done.
With the explicit OpsMath.max, it turns into a ratsnest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, here it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, said ratsnest of branches is actually faster on some machines because the branch predictor works its magic on them and these are well-predicted branches.
I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues; you have to run it a lot more times than you are and you have to factor your code differently if you want to time Hotspot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well Hotspot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.
Using JDK 8:
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
On Ubuntu 13.10
I ran the following:
import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance {
private final BiFunction<Integer, Integer, Integer> max;
private final int[] array;
public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
this.max = max;
this.array = array;
}
public double time() {
long start = System.nanoTime();
int m = Integer.MIN_VALUE;
for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);
m = Integer.MIN_VALUE;
for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);
// total time over number of calls to max
return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
}
public double averageTime(int repeats) {
double cumulativeTime = 0;
for (int i = 0; i < repeats; i++)
cumulativeTime += time();
return (double) cumulativeTime / (double) repeats;
}
public static void main(String[] args) {
int size = 1000000;
Random random = new Random(123123123L);
int[] array = new int[size];
for (int i = 0; i < size; i++) array[i] = random.nextInt();
double tMath = new MaxPerformance(Math::max, array).averageTime(100);
double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);
System.out.println("Java Math: " + tMath);
System.out.println("Alt 1: " + tAlt1);
System.out.println("Alt 2: " + tAlt2);
}
public static int max1(final int a, final int b) {
if (a >= b) return a;
return b;
}
public static int max2(final int a, final int b) {
return (a >= b) ? a : b; // same as JDK implementation
}
}
And I got the following results (average nanoseconds taken for each call to max):
Java Math: 15.443555810000003
Alt 1: 14.968298919999997
Alt 2: 16.442204045
So over a long run it looks like the if-based implementation (Alt 1) is the fastest, although by a relatively small margin.
In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent from the previous one. This can be done by using two randomized arrays instead of one as in this benchmark:
import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
private final BiFunction<Integer, Integer, Integer> max;
private final int[] array1, array2;
public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
this.max = max;
this.array1 = array1;
this.array2 = array2;
if (array1.length != array2.length) throw new IllegalArgumentException();
}
public double time() {
long start = System.nanoTime();
int m = Integer.MIN_VALUE;
for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
m += m; // to avoid optimizations!
return ((double) (System.nanoTime() - start)) / (double) array1.length;
}
public double averageTime(int repeats) {
// warm up rounds:
double tmp = 0;
for (int i = 0; i < 10; i++) tmp += time();
tmp *= 2.0;
double cumulativeTime = 0;
for (int i = 0; i < repeats; i++)
cumulativeTime += time();
return cumulativeTime / (double) repeats;
}
public static void main(String[] args) {
int size = 1000000;
Random random = new Random(123123123L);
int[] array1 = new int[size];
int[] array2 = new int[size];
for (int i = 0; i < size; i++) {
array1[i] = random.nextInt();
array2[i] = random.nextInt();
}
double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);
System.out.println("Java Math: " + tMath);
System.out.println("Alt 1: " + tAlt1);
System.out.println("Alt 2: " + tAlt2);
}
public static int max1(final int a, final int b) {
if (a >= b) return a;
return b;
}
public static int max2(final int a, final int b) {
return (a >= b) ? a : b; // same as JDK implementation
}
}
Which gave me:
Java Math: 15.346468170000005
Alt 1: 16.378737519999998
Alt 2: 20.506475350000006
The way your test is set up makes a huge difference on the results. The JDK version seems to be the fastest in this scenario. This time by a relatively large margin compared to the previous case.
Somebody mentioned Caliper. Well, if you read the wiki, one of the first things they say about micro-benchmarking is not to do it: this is because it's hard to get accurate results in general. I think this is a clear example of that.
Here's a branchless min operation, max can be implemented by replacing diff=a-b with diff=b-a.
public static final long min(final long a, final long b) {
final long diff = a - b;
// All zeroes if a>=b, all ones if a<b because the sign bit is propagated
final long mask = diff >> 63;
return (a & mask) | (b & (~mask));
}
It should be as fast as streaming the memory because the CPU operations should be hidden by the sequential memory read latency.
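Spelled out, the max variant looks like the sketch below (the class name is mine). One caveat that is easy to miss: the trick relies on the sign of the difference, so it only gives the right answer when the subtraction does not overflow. For example, min(Long.MAX_VALUE, -1) wraps around and returns Long.MAX_VALUE.

```java
final class BranchlessMinMax {

    // Branchless min as in the snippet above; only valid when a - b does not overflow.
    static long min(long a, long b) {
        long diff = a - b;
        long mask = diff >> 63;          // all ones if a < b, else all zeroes
        return (a & mask) | (b & ~mask);
    }

    // Max variant: same masking with the subtraction reversed.
    static long max(long a, long b) {
        long diff = b - a;
        long mask = diff >> 63;          // all ones if b < a, else all zeroes
        return (a & mask) | (b & ~mask);
    }
}
```

If the inputs can span the full long range, Math.min and Math.max remain the safe choice.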

Is bitwise operation faster than modulo/remainder operator in Java?

I read in a couple of blogs that in Java the modulo/remainder operator is slower than bitwise-AND. So, I wrote the following program to test.
public class ModuloTest {
public static void main(String[] args) {
final int size = 1024;
int index = 0;
long start = System.nanoTime();
for(int i = 0; i < Integer.MAX_VALUE; i++) {
getNextIndex(size, i);
}
long end = System.nanoTime();
System.out.println("Time taken by Modulo (%) operator --> " + (end - start) + "ns.");
start = System.nanoTime();
final int shiftFactor = size - 1;
for(int i = 0; i < Integer.MAX_VALUE; i++) {
getNextIndexBitwise(shiftFactor, i);
}
end = System.nanoTime();
System.out.println("Time taken by bitwise AND --> " + (end - start) + "ns.");
}
private static int getNextIndex(int size, int nextInt) {
return nextInt % size;
}
private static int getNextIndexBitwise(int size, int nextInt) {
return nextInt & size;
}
}
But in my runtime environment (MacBook Pro 2.9GHz i7, 8GB RAM, JDK 1.7.0_51) I am seeing otherwise. The bitwise-AND is significantly slower, in fact twice as slow as the remainder operator.
I would appreciate it if someone can help me understand if this is intended behavior or I am doing something wrong?
Thanks,
Niranjan
Your code reports bitwise-and being much faster on each Mac I've tried it on, both with Java 6 and Java 7. I suspect the first portion of the test on your machine happened to coincide with other activity on the system. You should try running the test multiple times to verify you aren't seeing distortions based on that. (I would have left this as a 'comment' rather than an 'answer', but apparently you need 50 reputation to do that -- quite silly, if you ask me.)
For starters, the logical conjunction trick only works with natural-number (non-negative) dividends and power-of-2 divisors. So, if you need negative dividends, floats, or non-powers of 2, stick with the default % operator.
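A small sketch of where the two operations agree and where they part ways (the class and method names are made up for this example). For non-negative values and a power-of-two size the mask and % give the same result; for a negative dividend % keeps the sign of the dividend, while the mask yields the same non-negative value as Math.floorMod.

```java
final class PowerOfTwoRemainder {

    // Remainder via masking; only meaningful when size is a power of two.
    static int maskIndex(int value, int size) {
        return value & (size - 1);
    }

    // True if size is a positive power of two.
    static boolean isPowerOfTwo(int size) {
        return size > 0 && (size & (size - 1)) == 0;
    }

    public static void main(String[] args) {
        int size = 1024;
        System.out.println(1500 % size);              // 476
        System.out.println(maskIndex(1500, size));    // 476
        System.out.println(-5 % size);                // -5
        System.out.println(maskIndex(-5, size));      // 1019
        System.out.println(Math.floorMod(-5, size));  // 1019
    }
}
```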
My tests (with JIT warmup and 1M random iterations), on an i7 with a ton of cores and a busload of RAM, show about 20% better performance from the bitwise operation. This can vary per run, depending on how the process scheduler runs the code.
using Scala 2.11.8 on JDK 1.8.91
4Ghz i7-4790K, 8 core AMD, 32GB PC3 19200 ram, SSD
This example in particular will always give you a wrong result. Moreover, I believe that any program which is calculating the modulo by a power of 2 will be faster than bitwise AND.
REASON: When you use N % X where X is the kth power of 2, only the last k bits are considered for the modulo, whereas in the case of the bitwise AND operator the runtime actually has to visit each bit of the number in question.
Also, I would like to point out that the HotSpot JVM optimizes repetitive calculations of a similar nature (branch prediction is one example). In your case, the method that uses the modulo just returns the last 10 bits of the number because 1024 is the 10th power of 2.
Try using some prime number value for size and check the same result.
Disclaimer: Micro benchmarking is not considered good.
Is this method missing something?
public static void oddVSmod(){
float tests = 100000000;
oddbit(tests);
modbit(tests);
}
public static void oddbit(float tests){
for(int i=0; i<tests; i++)
if((i&1)==1) {System.out.print(" "+i);}
System.out.println();
}
public static void modbit(float tests){
for(int i=0; i<tests; i++)
if((i%2)==1) {System.out.print(" "+i);}
System.out.println();
}
With that, I used the NetBeans built-in profiler (advanced mode) to run this. I set the variable tests up to 10X10^8, and every time it showed that modulo is faster than bitwise.
Thank you all for valuable inputs.
@pamphlet: Thank you very much for the concern, but negative comments are fine with me. I confess that I did not do proper testing as suggested by AndyG. AndyG could have used a softer tone, but it's okay; sometimes negatives help in seeing the positive. :)
That said, I changed my code (as shown below) in a way that I can run that test multiple times.
public class ModuloTest {
public static final int SIZE = 1024;
public int usingModuloOperator(final int operand1, final int operand2) {
return operand1 % operand2;
}
public int usingBitwiseAnd(final int operand1, final int operand2) {
return operand1 & operand2;
}
public void doCalculationUsingModulo(final int size) {
for(int i = 0; i < Integer.MAX_VALUE; i++) {
usingModuloOperator(1, size);
}
}
public void doCalculationUsingBitwise(final int size) {
for(int i = 0; i < Integer.MAX_VALUE; i++) {
usingBitwiseAnd(i, size);
}
}
public static void main(String[] args) {
final ModuloTest moduloTest = new ModuloTest();
final int invocationCount = 100;
// testModuloOperator(moduloTest, invocationCount);
testBitwiseOperator(moduloTest, invocationCount);
}
private static void testModuloOperator(final ModuloTest moduloTest, final int invocationCount) {
for(int i = 0; i < invocationCount; i++) {
final long startTime = System.nanoTime();
moduloTest.doCalculationUsingModulo(SIZE);
final long timeTaken = System.nanoTime() - startTime;
System.out.println("Using modulo operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
}
}
private static void testBitwiseOperator(final ModuloTest moduloTest, final int invocationCount) {
for(int i = 0; i < invocationCount; i++) {
final long startTime = System.nanoTime();
moduloTest.doCalculationUsingBitwise(SIZE);
final long timeTaken = System.nanoTime() - startTime;
System.out.println("Using bitwise operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
}
}
}
I called testModuloOperator() and testBitwiseOperator() in a mutually exclusive way. The results were consistent with the idea that bitwise AND is faster than the modulo operator. I ran each calculation 100 times and recorded the execution times, then removed the first five and last five recordings and used the rest to calculate the average time. Below are my test results.
Using modulo operator, the avg. time for 90 runs: 8388.89ns.
Using bitwise-AND operator, the avg. time for 90 runs: 722.22ns.
Please suggest if my approach is correct or not.
Thanks again.
Niranjan
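One thing worth checking in the revised test: the modulo loop passes the constant 1 as the dividend while the bitwise loop passes i, the bitwise loop masks with SIZE rather than SIZE - 1, and neither loop uses its result, so the JIT is free to discard the work. The following is only a sketch of a more symmetric comparison, not a rigorous benchmark; the class name, method names and iteration count are assumptions made for illustration.

```java
class ModuloVsMaskCheck {
    static final int SIZE = 1024; // power of two, as in the question

    // Accumulates i % SIZE so the result is actually used.
    static long sumWithModulo(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i % SIZE;
        }
        return sum;
    }

    // Accumulates i & (SIZE - 1); equal to the modulo sum for non-negative i.
    static long sumWithMask(int iterations) {
        final int mask = SIZE - 1;
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i & mask;
        }
        return sum;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000; // hypothetical count, chosen for illustration
        // Several rounds in the same JVM so later rounds run compiled code.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long a = sumWithModulo(iterations);
            long t1 = System.nanoTime();
            long b = sumWithMask(iterations);
            long t2 = System.nanoTime();
            System.out.println("round " + round + ": modulo " + (t1 - t0)
                    + " ns, mask " + (t2 - t1) + " ns, same result: " + (a == b));
        }
    }
}
```

Printing whether both sums match doubles as a sanity check that the two loops compute the same thing before their timings are compared.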

In java, is it more efficient to use byte or short instead of int and float instead of double?

I've noticed I've always used int and doubles no matter how small or big the number needs to be. So in java, is it more efficient to use byte or short instead of int and float instead of double?
So assume I have a program with plenty of ints and doubles. Would it be worth going through and changing my ints to bytes or shorts if I knew the number would fit?
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
By efficient I mostly mean processing. I'd assume the garbage collector would be a lot faster if all the variables would be half size and that calculations would probably be somewhat faster too.
( I guess since I am working on android I need to somewhat worry about ram too)
(I'd assume the garbage collector only deals with Objects and not primitive but still deletes all the primitives in abandoned objects right? )
I tried it with a small android app I have but didn't really notice a difference at all. (Though I didn't "scientifically" measure anything.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Would it be worth doing from the beginning when I start a new project? (I mean I think every little bit would help but then again if so, why doesn't it seem like anyone does it.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Short answer
Yes, you are wrong. In most cases, it makes little difference in terms of space used.
It is not worth trying to optimize this ... unless you have clear evidence that optimization is needed. And if you do need to optimize memory usage of object fields in particular, you will probably need to take other (more effective) measures.
Longer answer
The Java Virtual Machine models stacks and object fields using offsets that are (in effect) multiples of a 32 bit primitive cell size. So when you declare a local variable or object field as (say) a byte, the variable / field will be stored in a 32 bit cell, just like an int.
There are two exceptions to this:
long and double values require 2 primitive 32-bit cells
arrays of primitive types are represented in packed form, so that (for example) an array of bytes holds 4 bytes per 32-bit word.
So it might be worth optimizing use of long and double ... and large arrays of primitives. But in general no.
In theory, a JIT might be able to optimize this, but in practice I've never heard of a JIT that does. One impediment is that the JIT typically cannot run until after instances of the class being compiled have been created. If the JIT optimized the memory layout, you could have two (or more) "flavors" of object of the same class ... and that would present huge difficulties.
Revisitation
Looking at the benchmark results in @meriton's answer, it appears that using short and byte instead of int incurs a performance penalty for multiplication. Indeed, if you consider the operations in isolation, the penalty is significant. (You shouldn't consider them in isolation ... but that's another topic.)
I think the explanation is that JIT is probably doing the multiplications using 32bit multiply instructions in each case. But in the byte and short case, it executes extra instructions to convert the intermediate 32 bit value to a byte or short in each loop iteration. (In theory, that conversion could be done once at the end of the loop ... but I doubt that the optimizer would be able to figure that out.)
Anyway, this does point to another problem with switching to short and byte as an optimization. It could make performance worse ... in an algorithm that is arithmetic and compute intensive.
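To make the extra conversion concrete, here is a small sketch (names are illustrative, not from the benchmark) of what a byte multiplication means at the language level: the operands are promoted to int, the multiplication happens in int, and the result is narrowed back to byte. The compound assignment `*=` performs the same narrowing implicitly.

```java
class NarrowingDemo {
    // value * 3 is computed as an int; the cast narrows it back to 8 bits.
    static byte timesThree(byte value) {
        return (byte) (value * 3);
    }

    public static void main(String[] args) {
        byte b = 50;
        int wide = b * 3;            // promoted to int: 150
        byte narrow = timesThree(b); // 150 does not fit in a byte: wraps to -106
        byte compound = 50;
        compound *= 3;               // same as compound = (byte) (compound * 3)
        System.out.println(wide);     // 150
        System.out.println(narrow);   // -106
        System.out.println(compound); // -106
    }
}
```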
Secondary questions
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
No. Not in terms of performance anyway. (There are some methods in Integer, Long, etc for dealing with int, long, etc as unsigned. But these don't give any performance advantage. That is not their purpose.)
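For reference, a brief sketch of the kind of methods meant here (available since Java 8). They reinterpret the same 32 or 8 bits as an unsigned quantity; the class name is just a placeholder.

```java
class UnsignedHelpers {
    public static void main(String[] args) {
        int allBits = -1; // bit pattern 0xFFFFFFFF
        System.out.println(Integer.toUnsignedString(allBits));       // 4294967295
        System.out.println(Integer.toUnsignedLong(allBits));         // 4294967295
        System.out.println(Integer.divideUnsigned(allBits, 2));      // 2147483647
        System.out.println(Integer.compareUnsigned(allBits, 1) > 0); // true
        System.out.println(Byte.toUnsignedInt((byte) 0xFF));         // 255
    }
}
```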
(I'd assume the garbage collector only deals with Objects and not primitive but still deletes all the primitives in abandoned objects right? )
Correct. A field of an object is part of the object. It goes away when the object is garbage collected. Likewise the cells of an array go away when the array is collected. When the field or cell type is a primitive type, then the value is stored in the field / cell ... which is part of the object / array ... and that has been deleted.
That depends on the implementation of the JVM, as well as the underlying hardware. Most modern hardware will not fetch single bytes from memory (or even from the first level cache), i.e. using the smaller primitive types generally does not reduce memory bandwidth consumption. Likewise, modern CPUs have a word size of 64 bits. They can perform operations on fewer bits, but that works by discarding the extra bits, which isn't faster either.
The only benefit is that smaller primitive types can result in a more compact memory layout, most notably when using arrays. This saves memory, which can improve locality of reference (thus reducing the number of cache misses) and reduce garbage collection overhead.
Generally speaking however, using the smaller primitive types is not faster.
To demonstrate that, behold the following benchmark:
import java.math.BigDecimal;
public class Benchmark {
public static void benchmark(String label, Code code) {
print(25, label);
try {
for (int iterations = 1; ; iterations *= 2) { // detect reasonable iteration count and warm up the code under test
System.gc(); // clean up previous runs, so we don't benchmark their cleanup
long previouslyUsedMemory = usedMemory();
long start = System.nanoTime();
code.execute(iterations);
long duration = System.nanoTime() - start;
long memoryUsed = usedMemory() - previouslyUsedMemory;
if (iterations > 1E8 || duration > 1E9) {
print(25, new BigDecimal(duration * 1000 / iterations).movePointLeft(3) + " ns / iteration");
print(30, new BigDecimal(memoryUsed * 1000 / iterations).movePointLeft(3) + " bytes / iteration\n");
return;
}
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
private static void print(int desiredLength, String message) {
System.out.print(" ".repeat(Math.max(1, desiredLength - message.length())) + message);
}
private static long usedMemory() {
return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
}
@FunctionalInterface
interface Code {
/**
* Executes the code under test.
*
* @param iterations
* number of iterations to perform
* @return any value that requires the entire code to be executed (to
* prevent dead code elimination by the just in time compiler)
* @throws Throwable
* if the test could not complete successfully
*/
Object execute(int iterations);
}
}
public static void main(String[] args) {
benchmark("long[] traversal", (iterations) -> {
long[] array = new long[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = i;
}
return array;
});
benchmark("int[] traversal", (iterations) -> {
int[] array = new int[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = i;
}
return array;
});
benchmark("short[] traversal", (iterations) -> {
short[] array = new short[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = (short) i;
}
return array;
});
benchmark("byte[] traversal", (iterations) -> {
byte[] array = new byte[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = (byte) i;
}
return array;
});
benchmark("long fields", (iterations) -> {
class C {
long a = 1;
long b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("int fields", (iterations) -> {
class C {
int a = 1;
int b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("short fields", (iterations) -> {
class C {
short a = 1;
short b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("byte fields", (iterations) -> {
class C {
byte a = 1;
byte b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("long multiplication", (iterations) -> {
long result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("int multiplication", (iterations) -> {
int result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("short multiplication", (iterations) -> {
short result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("byte multiplication", (iterations) -> {
byte result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
}
}
Run with OpenJDK 14 on my Intel Core i7 CPU @ 3.5 GHz, this prints:
long[] traversal          3.206 ns / iteration     8.007 bytes / iteration
int[] traversal           1.557 ns / iteration     4.007 bytes / iteration
short[] traversal         0.881 ns / iteration     2.007 bytes / iteration
byte[] traversal          0.584 ns / iteration     1.007 bytes / iteration
long fields              25.485 ns / iteration    36.359 bytes / iteration
int fields               23.126 ns / iteration    28.304 bytes / iteration
short fields             21.717 ns / iteration    20.296 bytes / iteration
byte fields              21.767 ns / iteration    20.273 bytes / iteration
long multiplication       0.538 ns / iteration     0.000 bytes / iteration
int multiplication        0.526 ns / iteration     0.000 bytes / iteration
short multiplication      0.786 ns / iteration     0.000 bytes / iteration
byte multiplication       0.784 ns / iteration     0.000 bytes / iteration
As you can see, the only significant speed savings occur when traversing large arrays; using smaller object fields yields negligible benefit, and computations are actually slightly slower on the small datatypes.
Overall, the performance differences are quite minor. Optimizing algorithms is far more important than the choice of primitive type.
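The bytes / iteration column for the array rows tracks the element width of each type almost exactly. As a rough sketch of that arithmetic (the helper name and the element count are assumptions for illustration, and the small fixed array header is ignored):

```java
class ArrayPayload {
    // Payload bytes of a primitive array, ignoring the fixed array header.
    static long payloadBytes(long elements, int bytesPerElement) {
        return elements * bytesPerElement;
    }

    public static void main(String[] args) {
        long elements = 100_000_000L; // hypothetical element count
        System.out.println("long[]  : " + payloadBytes(elements, Long.BYTES) + " bytes");
        System.out.println("int[]   : " + payloadBytes(elements, Integer.BYTES) + " bytes");
        System.out.println("short[] : " + payloadBytes(elements, Short.BYTES) + " bytes");
        System.out.println("byte[]  : " + payloadBytes(elements, Byte.BYTES) + " bytes");
    }
}
```

A byte[] therefore touches one eighth of the memory a long[] of the same length does, which fits the traversal times above shrinking along with the element size.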
Using byte instead of int can increase performance if you use them in huge numbers. Here is an experiment:
import java.lang.management.*;
public class SpeedTest {
/** Get CPU time in nanoseconds. */
public static long getCpuTime() {
ThreadMXBean bean = ManagementFactory.getThreadMXBean();
return bean.isCurrentThreadCpuTimeSupported() ? bean
.getCurrentThreadCpuTime() : 0L;
}
public static void main(String[] args) {
long durationTotal = 0;
int numberOfTests=0;
for (int j = 1; j < 51; j++) {
long beforeTask = getCpuTime();
// MEASURES THIS AREA------------------------------------------
long x = 20000000;// 20 millions
for (long i = 0; i < x; i++) {
TestClass s = new TestClass();
}
// MEASURES THIS AREA------------------------------------------
long duration = getCpuTime() - beforeTask;
System.out.println("TEST " + j + ": duration = " + duration + "ns = "
+ (int) duration / 1000000);
durationTotal += duration;
numberOfTests++;
}
double average = durationTotal/numberOfTests;
System.out.println("-----------------------------------");
System.out.println("Average Duration = " + average + " ns = "
+ (int)average / 1000000 +" ms (Approximately)");
}
}
This class tests the speed of creating a new TestClass. Each test does it 20 million times and there are 50 tests.
Here is the TestClass:
public class TestClass {
int a1= 5;
int a2= 5;
int a3= 5;
int a4= 5;
int a5= 5;
int a6= 5;
int a7= 5;
int a8= 5;
int a9= 5;
int a10= 5;
int a11= 5;
int a12=5;
int a13= 5;
int a14= 5;
}
I've run the SpeedTest class and in the end got this:
Average Duration = 8.9625E8 ns = 896 ms (Approximately)
Now I'm changing the ints into bytes in the TestClass and running it again. Here is the result:
Average Duration = 6.94375E8 ns = 694 ms (Approximately)
I believe this experiment shows that if you are instantiating a huge number of variables, using byte instead of int can increase efficiency.
byte is generally considered to be 8 bits.
short is generally considered to be 16 bits.
In a "pure" environment, byte makes better use of space. Java isn't one, since the implementation of bytes, shorts, longs, and other fun things is generally hidden from you.
However, your computer is probably not 8-bit, and it is probably not 16-bit. This means that to obtain 16 or 8 bits in particular, it would need to resort to "trickery", which wastes time, in order to pretend that it can access those types when needed.
At this point, it depends on how the hardware is implemented. However, from what I've been taught, the best speed is achieved by storing things in chunks that are comfortable for your CPU to use. A 64-bit processor likes dealing with 64-bit elements, and anything less than that often requires "engineering magic" to pretend that it likes dealing with them.
One reason short/byte/char are less performant is the lack of direct support for these data types. By direct support, I mean that the JVM specification does not define instructions for these data types: instructions such as store, load, and add have versions for the int data type, but not for short/byte/char. For example, consider the Java code below:
void spin() {
int i;
for (i = 0; i < 100; i++) {
; // Loop body is empty
}
}
This gets compiled into bytecode as below.
0 iconst_0 // Push int constant 0
1 istore_1 // Store into local variable 1 (i=0)
2 goto 8 // First time through don't increment
5 iinc 1 1 // Increment local variable 1 by 1 (i++)
8 iload_1 // Push local variable 1 (i)
9 bipush 100 // Push int constant 100
11 if_icmplt 5 // Compare and loop if less than (i < 100)
14 return // Return void when done
Now, consider changing int to short as below.
void sspin() {
short i;
for (i = 0; i < 100; i++) {
; // Loop body is empty
}
}
The corresponding bytecode changes as follows:
0 iconst_0
1 istore_1
2 goto 10
5 iload_1 // The short is treated as though an int
6 iconst_1
7 iadd
8 i2s // Truncate int to short
9 istore_1
10 iload_1
11 bipush 100
13 if_icmplt 5
16 return
As you can observe, to manipulate the short data type, it still uses the int versions of the instructions and explicitly converts int to short when required. This reduces performance.
The reason cited for not giving direct support is as follows:
The Java Virtual Machine provides the most direct support for data of
type int. This is partly in anticipation of efficient implementations
of the Java Virtual Machine's operand stacks and local variable
arrays. It is also motivated by the frequency of int data in typical
programs. Other integral types have less direct support. There are no
byte, char, or short versions of the store, load, or add instructions,
for instance.
Quoted from the JVM specification (page 58).
I would say that the accepted answer is somewhat wrong in saying "it makes little difference in terms of space used". Here is an example showing that the difference in some cases is substantial:
Baseline usage 4.90MB, java: 11.0.12
Mem usage - bytes : +202.60 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
The code to verify:
static class Bytes {
public byte f1;
public byte f2;
public byte f3;
public byte f4;
}
static class Shorts {
public short f1;
public short f2;
public short f3;
public short f4;
}
static class Ints {
public int f1;
public int f2;
public int f3;
public int f4;
}
@Test
public void memUsageTest() throws Exception {
int countOfItems = 10 * 1024 * 1024;
float MB = 1024*1024;
Runtime rt = Runtime.getRuntime();
System.gc();
Thread.sleep(1000);
long baseLineUsage = rt.totalMemory() - rt.freeMemory();
trace("Baseline usage %.2fMB, java: %s", (baseLineUsage / MB), System.getProperty("java.version"));
for( int j = 0; j < 3; j++ ) {
Bytes[] bytes = new Bytes[countOfItems];
for( int i = 0; i < bytes.length; i++ ) {
bytes[i] = new Bytes();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - bytes : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
bytes = null;
Shorts[] shorts = new Shorts[countOfItems];
for( int i = 0; i < shorts.length; i++ ) {
shorts[i] = new Shorts();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - shorts: +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
shorts = null;
Ints[] ints = new Ints[countOfItems];
for( int i = 0; i < ints.length; i++ ) {
ints[i] = new Ints();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - ints : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
ints = null;
}
}
private static void trace(String message, Object... args) {
String line = String.format(US, message, args);
System.out.println(line);
}
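The measured figures can be approximated with some back-of-the-envelope arithmetic. The sketch below is an estimate under assumptions that the answer does not state: a 12-byte object header, 8-byte object alignment, and 4-byte (compressed) references in the holding array, which are values commonly cited for 64-bit HotSpot. All names are illustrative.

```java
class ObjectSizeEstimate {
    // Assumed layout parameters (not taken from the answer, not measured here).
    static final int HEADER_BYTES = 12;
    static final int ALIGNMENT = 8;
    static final int REFERENCE_BYTES = 4;

    // Header plus fields, rounded up to the next multiple of ALIGNMENT.
    static long alignedObjectSize(int fieldBytes) {
        long raw = HEADER_BYTES + fieldBytes;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Estimated MB for `items` objects plus one array slot per object.
    static double estimateMb(int fieldCount, int fieldWidth, long items) {
        long perItem = alignedObjectSize(fieldCount * fieldWidth) + REFERENCE_BYTES;
        return perItem * items / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long items = 10L * 1024 * 1024; // countOfItems from the test above
        System.out.println("bytes : " + estimateMb(4, Byte.BYTES, items) + " MB");    // 200.0 MB
        System.out.println("shorts: " + estimateMb(4, Short.BYTES, items) + " MB");   // 280.0 MB
        System.out.println("ints  : " + estimateMb(4, Integer.BYTES, items) + " MB"); // 360.0 MB
    }
}
```

Under these assumptions the estimates (200, 280 and 360 MB) sit about 3 MB below each measured value, so the differences between the three classes are consistent with field width plus alignment padding.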
The difference is hardly noticeable! It's more a question of design, appropriateness, uniformity, habit, etc. Sometimes it's just a matter of taste. When all you care about is that your program gets up and running, and substituting a float for an int would not harm correctness, I see no advantage in going for one or the other unless you can demonstrate that using either type alters performance. Tuning performance based on types that differ by 2 or 3 bytes is really the last thing you should care about; Donald Knuth once said: "Premature optimization is the root of all evil" (not sure it was him, edit if you have the answer).

Two operations in one loop vs two loops performing the same operations one per loop

This question is identical to "Two loop bodies or one (result identical)", but in my case I use Java.
I have two loops that run a billion times.
int a = 188, b = 144, aMax = 0, bMax = 0;
for (int i = 0; i < 1000000000; i++) {
int t = a ^ i;
if (t > aMax)
aMax = t;
}
for (int i = 0; i < 1000000000; i++) {
int t = b ^ i;
if (t > bMax)
bMax = t;
}
The time it takes to run these two loops on my machine is approximately 4 seconds. When I fuse these two loops into a single loop and perform all the operations in that single loop, it runs in 2 seconds. As you can see, trivial operations make up the loop contents, so each iteration takes constant time.
My question is where am I getting this performance improvement?
I am guessing that the only possible place where performance gets affected in the two separate loops is that it increments i and checks if i < 1000000000 2 billion times vs only 1 billion times if I fuse the loops together. Is anything else going on in there?
Thanks!
If you don't run a warm-up phase, it is possible that the first loop gets optimised and compiled but not the second one, whereas when you merge them the whole merged loop gets compiled. Also, using the server option and your code, most gets optimised away as you don't use the results.
I have run the test below, putting each loop as well as the merged loop in its own method and warming up the JVM to make sure everything gets compiled.
Results (JVM options: -server -XX:+PrintCompilation):
loop 1 = 500 ms
loop 2 = 900 ms
merged loop = 1,300 ms
So the merged loop is slightly faster, but not that much.
public static void main(String[] args) throws InterruptedException {
for (int i = 0; i < 3; i++) {
loop1();
loop2();
loopBoth();
}
long start = System.nanoTime();
loop1();
long end = System.nanoTime();
System.out.println((end - start) / 1000000);
start = System.nanoTime();
loop2();
end = System.nanoTime();
System.out.println((end - start) / 1000000);
start = System.nanoTime();
loopBoth();
end = System.nanoTime();
System.out.println((end - start) / 1000000);
}
public static void loop1() {
int a = 188, aMax = 0;
for (int i = 0; i < 1000000000; i++) {
int t = a ^ i;
if (t > aMax) {
aMax = t;
}
}
System.out.println(aMax);
}
public static void loop2() {
int b = 144, bMax = 0;
for (int i = 0; i < 1000000000; i++) {
int t = b ^ i;
if (t > bMax) {
bMax = t;
}
}
System.out.println(bMax);
}
public static void loopBoth() {
int a = 188, b = 144, aMax = 0, bMax = 0;
for (int i = 0; i < 1000000000; i++) {
int t = a ^ i;
if (t > aMax) {
aMax = t;
}
int u = b ^ i;
if (u > bMax) {
bMax = u;
}
}
System.out.println(aMax);
System.out.println(bMax);
}
In short, the CPU can execute the instructions in the merged loop in parallel, doubling performance.
It's also possible the second loop is not optimised efficiently. This is because the first loop will trigger the whole method to be compiled and the second loop will be compiled without any metrics, which can upset the timing of the second loop. I would place each loop in a separate method to make sure this is not the case.
The CPU can perform a large number of independent operations in parallel (depth 10 on Pentium III and 20 in the Xeon). One operation it attempts to do in parallel is a branch, using branch prediction, but that only pays off if it takes the same branch almost every time.
I suspect with loop unrolling your loop looks more like following (possibly more loop unrolling in this case)
for (int i = 0; i < 1000000000; i += 2) {
// this first block is run almost in parallel
int t1 = a ^ i;
int t2 = b ^ i;
int t3 = a ^ (i+1);
int t4 = b ^ (i+1);
// this block run in parallel
if (t1 > aMax) aMax = t1;
if (t2 > bMax) bMax = t2;
if (t3 > aMax) aMax = t3;
if (t4 > bMax) bMax = t4;
}
Seems to me that in the case of a single loop the JIT may opt to do loop unrolling, and as a result the performance is slightly better.
Did you use -server? If not, you should: the client JIT is neither as predictable nor as good. If you are really interested in what exactly is going on, you can use -XX:+UnlockDiagnosticVMOptions with -XX:+LogCompilation to check which optimizations are being applied in both cases (all the way down to the generated assembly).
Also, from the code you provided I can't see whether you do a warm-up, whether you run your test once or multiple times in the same JVM, or whether you did a couple of runs in different JVMs. Are you taking the best, the average, or the median time? Do you throw out outliers?
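A minimal sketch of what those questions amount to in code, assuming a hand-rolled harness rather than a benchmarking framework: run the task a few times untimed, collect several timed samples in the same JVM, and report the median so that a single outlier does not dominate. The class, the helper names and the run counts are all made up for illustration.

```java
import java.util.Arrays;

class TimingStats {
    // Median of the samples (expects at least one); for an even count,
    // the midpoint of the two middle values.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    // Runs the task warmupRuns times untimed, then returns measuredRuns
    // timings in nanoseconds.
    static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    static volatile int sink; // the result is stored here so the loop is not dead code

    public static void main(String[] args) {
        long[] samples = measure(() -> {
            int a = 188, aMax = 0;
            for (int i = 0; i < 100_000_000; i++) {
                int t = a ^ i;
                if (t > aMax) aMax = t;
            }
            sink = aMax;
        }, 5, 20);
        System.out.println("median = " + median(samples) / 1_000_000 + " ms");
    }
}
```

This still leaves on-stack replacement and other JIT effects uncontrolled, so it is a starting point for sanity checks rather than a substitute for a proper benchmarking tool.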
Here is a good link on the subject of writing Java micro-benchmarks: http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html
Edit: One more microbenchmarking tip, beware of on-the-stack replacement: http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good
