Why is this C++ code execution so slow compared to Java?

I recently wrote a computation-intensive algorithm in Java and then translated it to C++. To my surprise, the C++ version executed considerably slower. I have now written a much shorter Java test program and a corresponding C++ program (see below). My original code featured a lot of array access, as does the test code. The C++ version takes 5.5 times longer to execute (see the comment at the end of each program).
Conclusions after the first 21 comments below...
Test code:
g++ -o ... Java 5.5 times faster
g++ -O3 -o ... Java 2.9 times faster
g++ -fprofile-generate -march=native -O3 -o ... (run it, then recompile with g++ -fprofile-use -march=native -O3) Java 1.07 times faster.
My original project (much more complex than test code):
g++ -o ... Java 1.8 times faster
g++ -O3 -o ... C++ 1.9 times faster
g++ -fprofile-generate / -fprofile-use ... C++ 2 times faster
Software environment:
Ubuntu 16.04 (64 bit).
NetBeans 8.2 / JDK 8u121 (Java code executed inside NetBeans)
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Compilation: g++ -o cpp_test cpp_test.cpp
Java code:
public class JavaTest {
    public static void main(String[] args) {
        final int ARRAY_LENGTH = 100;
        final int FINISH_TRIGGER = 100000000;
        int[] intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        boolean finished = false;
        long loopCount = 0;
        System.out.println("Start");
        long startTime = System.nanoTime();
        while (!finished) {
            loopCount++;
            intArray[i]++;
            if (intArray[i] >= FINISH_TRIGGER) finished = true;
            else if (i < (ARRAY_LENGTH - 1)) i++;
            else i = 0;
        }
        System.out.println("Finish: " + loopCount + " loops; " +
                ((System.nanoTime() - startTime) / 1e9) + " secs");
        // 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
    }
}
C++ code:
//cpp_test.cpp:
#include <iostream>
#include <sys/time.h>

int main() {
    const int ARRAY_LENGTH = 100;
    const int FINISH_TRIGGER = 100000000;
    int *intArray = new int[ARRAY_LENGTH];
    for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
    int i = 0;
    bool finished = false;
    long long loopCount = 0;
    std::cout << "Start\n";
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    // LL suffix avoids int overflow in the multiplication on 32-bit targets
    long long startTime = (1000000000LL * ts.tv_sec) + ts.tv_nsec;
    while (!finished) {
        loopCount++;
        intArray[i]++;
        if (intArray[i] >= FINISH_TRIGGER) finished = true;
        else if (i < (ARRAY_LENGTH - 1)) i++;
        else i = 0;
    }
    clock_gettime(CLOCK_REALTIME, &ts);
    double elapsedTime =
            ((1000000000LL * ts.tv_sec) + ts.tv_nsec - startTime) / 1e9;
    std::cout << "Finish: " << loopCount << " loops; ";
    std::cout << elapsedTime << " secs\n";
    // 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}

The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.
There's not much going on in your program apart from a non-trivial if statement; without analysing the entire program, it's hard to predict which branch is most likely. This leads me to believe that this is a branch misprediction issue. Modern CPUs do instruction pipelining, which allows for higher CPU throughput. However, this requires a prediction of what the next instructions to execute are. If the guess is wrong, the instruction pipeline must be cleared out and the correct instructions loaded in (which takes time).
At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do a bit of branch prediction on their own as well, but static prediction is generally along the lines of "loops loop" and "ifs take the if branch rather than the else".
Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
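To see branch (mis)prediction at work in Java directly, here is an illustrative sketch (not part of the original answer) of the classic sorted-versus-unsorted experiment: the loop body is identical in both passes, only the predictability of the >= 128 branch changes.
import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    public static void main(String[] args) {
        int[] data = new int[1 << 24];
        Random r = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = r.nextInt(256);
        long sum = 0;
        // Pass 1: unsorted data, the >= 128 branch is taken essentially at random.
        long t0 = System.nanoTime();
        for (int v : data) if (v >= 128) sum += v;
        long unsortedNs = System.nanoTime() - t0;
        // Pass 2: same data sorted, the same branch becomes almost perfectly predictable.
        Arrays.sort(data);
        t0 = System.nanoTime();
        for (int v : data) if (v >= 128) sum += v;
        long sortedNs = System.nanoTime() - t0;
        System.out.println("sum=" + sum + "; unsorted " + unsortedNs / 1e6
                + " ms, sorted " + sortedNs / 1e6 + " ms");
    }
}
On most hardware the sorted pass is several times faster, though this crude one-shot measurement is itself subject to the JIT warm-up caveats discussed in the questions below.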

Somehow both GCC and Clang fail to unroll this loop and pull out the invariants, even with -O3 and -Os, but Java does.
Java's final JITted assembly code is similar to this (in reality repeated twice):
while (true) {
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
    if (i >= ARRAY_LENGTH) i = 0;
}
With this loop I'm getting the exact same timings (6.4s) for C++ and Java.
Why is this legal to do? Because ARRAY_LENGTH is 100, which is a multiple of 4, so i can only reach ARRAY_LENGTH (and need to be reset to 0) once every 4 iterations.
This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops for which the total number of iterations is unknown, but even if unrolling is forced, they fail to recognize parts of the loop that apply to only certain iterations.
Regarding your findings in more complex code (a.k.a. real life): Java's optimizer is exceptionally good for small loops (a lot of thought has been put into that), but Java loses a lot of time on virtual calls and GC.
In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing": look at the generated code, profile, repeat.
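To look at the generated code: g++ -S emits the compiler's assembly for a translation unit, and HotSpot can print its JIT-compiled code with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (this requires the hsdis disassembler plugin to be installed).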
For example, if you restructure your loop just a bit:
while (!finished) {
    for (i = 0; i < ARRAY_LENGTH; ++i) {
        loopCount++;
        if (++intArray[i] >= FINISH_TRIGGER) {
            finished = true;
            break;
        }
    }
}
Then C++ will outperform Java (5.9s vs 6.4s); see the revised C++ assembly.
And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):
while (!finished) {
    for (int i = 0; i < ARRAY_LENGTH; ++i) {
        ++intArray[i];
    }
    loopCount += ARRAY_LENGTH;
    for (int i = 0; i < ARRAY_LENGTH; ++i) {
        if (intArray[i] >= FINISH_TRIGGER) {
            loopCount -= ARRAY_LENGTH - i - 1;
            finished = true;
            break;
        }
    }
}
Now Clang is able to vectorize the loop and reaches a speed of 3.5s vs. Java's 4.8s (GCC is unfortunately still not able to vectorize it).

Related

Time how long a function runs (short duration)

I'm relatively new to Java programming, and I'm running into an issue calculating the amount of time it takes for a function to run.
First some background - I've got a lot of experience with Python, and I'm trying to recreate the functionality of the Jupyter Notebook/Lab %%timeit function, if you're familiar with that. Here's a pic of it in action (sorry, not enough karma to embed yet):
Snip of Jupyter %%timeit
What it does is run the contents of the cell (in this case a recursive function) either 1k, 10k, or 100k times, and give you the average run time of the function, and the standard deviation.
My first implementation (using the same recursive function) used System.nanoTime():
public static void main(String[] args) {
    long t1, t2, diff;
    long[] times = new long[1000];
    int t;
    for (int i = 0; i < 1000; i++) {
        t1 = System.nanoTime();
        t = triangle(20);
        t2 = System.nanoTime();
        diff = t2 - t1;
        System.out.println(diff);
        times[i] = diff;
    }
    long total = 0;
    for (int j = 0; j < times.length; j++) {
        total += times[j];
    }
    System.out.println("Mean = " + total / 1000.0);
}
But the mean is wildly thrown off -- for some reason, the first iteration of the function (on many runs) takes upwards of a million nanoseconds:
Pic of initial terminal output
Every iteration after the first dozen or so takes either 395 nanos or 0 -- so there could be a problem there too... not sure what's going on!
Also -- the code of the recursive function I'm timing:
static int triangle(int n) {
    if (n == 1) {
        return n;
    } else {
        return n + triangle(n - 1);
    }
}
Initially I had the line n = Math.abs(n) on the first line of the function, but then I removed it because... meh. I'm the only one using this.
I tried a number of different suggestions brought up in this SO post, but they each have their own problems... which I can go into if you need.
Anyway, thank you in advance for your help and expertise!
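A minimal %%timeit-style harness, sketched here on the usual assumption that the fix is a JIT warm-up phase whose timings are discarded, plus consuming each result so the call cannot be optimized away (the class name, the sink field, and the iteration counts are illustrative, not from the original post):
public class TimeIt {
    static int sink; // consumed results keep the JIT from eliminating the call

    static int triangle(int n) {
        return n == 1 ? n : n + triangle(n - 1);
    }

    public static void main(String[] args) {
        final int WARMUP = 10_000, RUNS = 100_000;
        // Warm-up phase: let the JIT compile triangle(), discard these timings.
        for (int i = 0; i < WARMUP; i++) sink += triangle(20);
        long[] times = new long[RUNS];
        for (int i = 0; i < RUNS; i++) {
            long t1 = System.nanoTime();
            sink += triangle(20);
            times[i] = System.nanoTime() - t1;
        }
        double mean = 0;
        for (long t : times) mean += t;
        mean /= RUNS;
        double var = 0;
        for (long t : times) var += (t - mean) * (t - mean);
        System.out.printf("%.1f ns +/- %.1f ns per call (%d runs)%n",
                mean, Math.sqrt(var / RUNS), RUNS);
    }
}
Per-call timestamps still run into System.nanoTime()'s granularity (hence the 395-or-0 readings above), so in practice you would time a batch of calls and divide; the JMH example further down does all of this properly.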

Performance optimization: C++ vs Java not performing as expected

I have written two programs implementing a simple algorithm for matrix multiplication, one in C++ and one in Java. Contrary to my expectations, the Java program runs about 2.5x faster than the C++ program. I am a novice at C++, and would like suggestions on what I can change in the C++ program to make it run faster.
My programs borrow code and data from this blog post http://martin-thoma.com/matrix-multiplication-python-java-cpp .
Here are the current compilation flags I am using:
g++ -O3 main.cc
javac Main.java
Here are the current compiler/runtime versions:
$ g++ --version
g++.exe (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
My computer is a ~2012-era Core i3 laptop running Windows with MinGW. Here are the current performance results:
$ time ./a.exe < ../Testing/2000.in
507584919
real 0m36.469s
user 0m0.031s
sys 0m0.030s
$ time java Main < ../Testing/2000.in
507584919
real 0m14.299s
user 0m0.031s
sys 0m0.015s
Here is the C++ program:
#include <iostream>
#include <cstdio>
using namespace std;

int *A;
int *B;
int height;
int width;

int *matMult(int A[], int B[]) {
    // The trailing () value-initializes C to zero; plain new int[n] leaves
    // the elements uninitialized, unlike a Java array.
    int *C = new int[height * width]();
    int n = height;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[width * i + j] += A[width * i + k] * B[width * k + j];
            }
        }
    }
    return C;
}

int main() {
    std::ios::sync_with_stdio(false);
    cin >> height;
    cin >> width;
    A = new int[width * height];
    B = new int[width * height];
    for (int i = 0; i < width * height; i++) {
        cin >> A[i];
    }
    for (int i = 0; i < width * height; i++) {
        cin >> B[i];
    }
    int *result = matMult(A, B);
    cout << result[2];
}
Here is the Java program:
import java.util.*;
import java.io.*;

public class Main {
    static int[] A;
    static int[] B;
    static int height;
    static int width;

    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            height = Integer.parseInt(reader.readLine());
            width = Integer.parseInt(reader.readLine());
            A = new int[width * height];
            B = new int[width * height];
            int index = 0;
            String thisLine;
            while ((thisLine = reader.readLine()) != null) {
                if (thisLine.trim().equals("")) {
                    break;
                } else {
                    String[] lineArray = thisLine.split("\t");
                    for (String number : lineArray) {
                        A[index] = Integer.parseInt(number);
                        index++;
                    }
                }
            }
            index = 0;
            while ((thisLine = reader.readLine()) != null) {
                if (thisLine.trim().equals("")) {
                    break;
                } else {
                    String[] lineArray = thisLine.split("\t");
                    for (String number : lineArray) {
                        B[index] = Integer.parseInt(number);
                        index++;
                    }
                }
            }
            int[] result = matMult(A, B);
            System.out.println(result[2]);
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static int[] matMult(int[] A, int[] B) {
        int[] C = new int[height * width];
        int n = height;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    C[width * i + j] += A[width * i + k] * B[width * k + j];
                }
            }
        }
        return C;
    }
}
Here is a link to a 2000x2000 test case: https://mega.nz/#!sglWxZqb!HBts_UlZnR4X9gZR7bG-ej3xf2A5vUv0wTDUW-kqFMA
Here is a link to a 2x2 test case: https://mega.nz/#!QwkV2SII!AtfGuxPV5bQeZtt9eHNNn36rnV4sGq0_sJzitjiFE8s
Any advice explaining what I am doing wrong in C++, or why my C++ implementation is running so much slower than Java here, would be much appreciated!
EDIT: As suggested, I modified the programs so that they do not actually perform a multiplication, but just read the arrays in and print out one number from each. Here are the performance results for that. The C++ program has slower IO; that only accounts for part of the difference, however.
$ time ./IOonly.exe < ../Testing/2000.in
7
944
real 0m8.158s
user 0m0.000s
sys 0m0.046s
$ time java IOOnly < ../Testing/2000.in
7
944
real 0m1.461s
user 0m0.000s
sys 0m0.047s
I'm not able to analyze the Java execution, since it creates a temporary executable module that disappears after it's been "used". However, I assume that it executes SSE instructions to get that speed [or that it unrolls the loop, which clang++ does if you disable SSE instructions].
But compiling with g++ (4.9.2) and clang++, I can clearly see that Clang optimises the loop to use SSE instructions, where GCC doesn't. The resulting code is thus exactly 4 times slower. Changing the code so that it uses a constant value of 2000 in each dimension [so the compiler "knows" the height and width], the GCC-compiled code also gets down to around 8s (on my machine!), compared to 27s with the "variable" value [the Clang-compiled code is marginally faster here as well, but within the noise, I'd say].
Overall conclusion: the quality/cleverness of the compiler strongly affects the performance of tight loops. The more complex and varied the code is, the more likely it is that the C++ solution will generate better code; simple, easy-to-compile problems are quite likely to be better in Java [as a rule, but not guaranteed]. I expect the Java compiler uses profiling to determine the number of loop iterations, for example.
Edit:
The result of time can be used to determine whether the reading of the file is taking a long time, but you need some kind of profiling tool to determine whether the actual input parsing is using a lot of CPU time and such.
The Java engine uses a "just-in-time compiler", which uses profiling to determine the number of times a particular piece of code is hit (you can do that for C++ too, and big projects often do!), and which allows it, for example, to unroll a loop or determine at runtime the number of iterations in a loop. Given that this code does 2000 * 2000 * 2000 iterations, and that the C++ compiler actually does a BETTER job when it KNOWS the dimensions, this tells us that the Java runtime isn't actually doing better (at least not initially), just that it manages to improve the performance over time.
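(For C++, profile-guided optimization recovers some of the same runtime information ahead of time; in the first question above, the g++ -fprofile-generate / -fprofile-use build was the one that came closest to Java.)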
Unfortunately, due to the way the Java runtime works, it doesn't leave the binary code behind, so I can't really analyze what it does.
The key here is that the actual operations you are doing are simple, and the logic is simple; there is just an awful lot of them, and you are doing them using a trivial implementation. Both Java and C++ will benefit from manually unrolling the loop, for example, as in the sketch below.
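As an illustration only (this is not from the original answer), manual unrolling of the matMult inner loop could look like the following: the invariant load is hoisted, the j loop is unrolled by 4, and a remainder loop covers widths that are not a multiple of 4. The dimensions are passed as parameters so the sketch stands alone.
public static int[] matMultUnrolled(int[] A, int[] B, int height, int width) {
    int[] C = new int[height * width];
    int n = height;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            int a = A[width * i + k]; // loop-invariant, hoisted out of the j loop
            int j = 0;
            for (; j + 4 <= n; j += 4) { // unrolled by 4
                C[width * i + j] += a * B[width * k + j];
                C[width * i + j + 1] += a * B[width * k + j + 1];
                C[width * i + j + 2] += a * B[width * k + j + 2];
                C[width * i + j + 3] += a * B[width * k + j + 3];
            }
            for (; j < n; j++) { // remainder iterations
                C[width * i + j] += a * B[width * k + j];
            }
        }
    }
    return C;
}
Whether this beats what the JIT or g++ -O3 already produce has to be measured; the point is that both toolchains reward a simpler, more regular loop body.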
C++ is not faster than Java by default
C++ is fast as a language, but as soon as you incorporate libraries into the mix, you are bound to those libraries' speed.
The standard library is hardly built for raw performance; it is written with design and correctness in mind.
C++ gives you the opportunity to optimize!
If you are unhappy with the standard library's performance, you can, and you should, use your own optimized version.
For example, standard C++ IO objects are beautiful when it comes to design (streams, locales, facets, inner buffers) but that makes them terrible at performance.
If you are writing for Windows OS, you can use ReadFile and WriteConsole as your mechanism for IO.
If you switch to these functions instead of the standard streams, your program can outperform Java by a few orders of magnitude.

Why slowdown in this identical code?

I was trying to measure the time to execute this loop:
for (boolean t : test) {
    if (!t)
        ++count;
}
And was getting inconsistent results. Eventually I managed to get consistent results with the following code:
public class Test {
    public static void main(String[] args) {
        int size = 100;
        boolean[] test = new boolean[10_000_000];
        java.util.Random r = new java.util.Random();
        for (int n = 0; n < 10_000_000; ++n)
            test[n] = !r.nextBoolean();
        int expected = 0;
        long acumulated = 0;
        for (int repeat = -1; repeat < size; ++repeat) {
            int count = 0;
            long start = System.currentTimeMillis();
            for (boolean t : test) {
                if (!t)
                    ++count;
            }
            long end = System.currentTimeMillis();
            if (repeat != -1) // First run does not count, VM warming up
                acumulated += end - start;
            else // Use count to avoid compiler or JVM
                expected = count; // optimization of inner loop
            if (count != expected)
                throw new Error("Tests don't run same amount of times");
        }
        float average = (float) acumulated / size;
        System.out.println("1st test : " + average);
        int expectedBis = 0;
        acumulated = 0;
        if ("reassign".equals(args[0])) {
            for (int n = 0; n < 10_000_000; ++n)
                test[n] = test[n];
        }
        for (int repeat = -1; repeat < size; ++repeat) {
            int count = 0;
            long start = System.currentTimeMillis();
            for (boolean t : test) {
                if (!t)
                    ++count;
            }
            long end = System.currentTimeMillis();
            if (repeat != -1) // First run does not count, VM warming up
                acumulated += end - start;
            else // Use count to avoid compiler or JVM
                expectedBis = count; // optimization of inner loop
            if (count != expected || count != expectedBis)
                throw new Error("Tests don't run same amount of times");
        }
        average = (float) acumulated / size;
        System.out.println("2nd test : " + average);
    }
}
The results I get are:
$ java -jar Test.jar noreassign
1st test : 23.98
2nd test : 23.97
$ java -jar Test.jar reassign
1st test : 23.98
2nd test : 40.86
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (Gentoo package icedtea-7.2.5.5)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
The difference is whether or not this loop executes before the 2nd test:
for (int n = 0; n < 10_000_000; ++n)
    test[n] = test[n];
Why? Why does doing that reassignment cause the loops after it to take twice the time?
Getting profiling right is hard...
"As for why the JIT compiler causes such behaviour... that is beyond my skill and knowledge."
Three basic facts:
Code runs faster after JIT compilation.
JIT compilation is triggered after a chunk of code has run for a bit. (How long "a bit" is depends on the JVM platform and command-line options.)
JIT compilation takes time.
In your case, when you insert the big assignment loop between test 1 and test 2, you are most likely moving the time point at which JIT compilation is triggered ... from during test 2 to between the 2 tests.
The simple way to address this in this case is to put the body of main into a loop and run it repeatedly, then discard the anomalous results from the first few runs.
(Turning off JIT compilation is not a good answer. Normally, it is the performance characteristics of code after JIT compilation that is going to be indicative of how a real application performs ...)
By setting the compiler to NONE, you are disabling JIT compilation, taking it out of the equation.
This kind of anomaly is common when people attempt to write micro-benchmarks by hand. Read this Q&A:
How do I write a correct micro-benchmark in Java?
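For comparison, a JMH version of the counting loop might look like the sketch below (illustrative only; JMH is what that Q&A recommends, and the warm-up/measurement iteration counts here are arbitrary). JMH handles warm-up, forking, and dead-code elimination, which is exactly what the hand-rolled harness above struggles with.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
public class CountBench {
    boolean[] test;

    @Setup
    public void setup() {
        test = new boolean[10_000_000];
        java.util.Random r = new java.util.Random();
        for (int n = 0; n < test.length; ++n)
            test[n] = !r.nextBoolean();
    }

    @Benchmark
    public int countFalse() {
        int count = 0;
        for (boolean t : test)
            if (!t)
                ++count;
        return count; // returned values are consumed by JMH, defeating dead-code elimination
    }
}
Run under the JMH runner (org.openjdk.jmh.Main), this reports a mean and error measured in a forked, properly warmed-up JVM.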
I would add this as a comment, but my reputation is too low, so it must be added as an answer.
I've created a jar with your exact code and run it several times. I also copied the code to C# and ran it in the .NET runtime as well.
Both Java and C# show exactly the same time, with and without the 'reassign' loop.
What timing are you getting if you change the loop to
if ( "reassign".equals(args[0])) {
for (int n = 0; n < 5_000_000; ++n)
test[n] = test[n];
}
?
Marko Topolnik's and rossum's comments put me on the right track.
It is a JIT compiler issue.
If I disable the JIT compiler I get these results:
$ java -jar Test.jar -Djava.compiler=NONE noreassign
1st test : 19.23
2nd test : 19.33
$ java -jar Test.jar -Djava.compiler=NONE reassign
1st test : 19.23
2nd test : 19.32
The strange slowdown disappears once the JIT compiler is deactivated.
As for why the JIT compiler causes such behaviour... that is beyond my skill and knowledge.
But it does not happen in all JVMs, as Marius Dornean's tests show.

No speedup in multithreaded program

I was playing with Go language concurrency and found something which is kind of opaque to me.
I wrote a parallel matrix multiplication, that is, each task computes a single line of the product matrix, multiplying corresponding rows and columns of the source matrices.
Here is the Java program
public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
    final int n = m1.length, m = m1[0].length, l = m2[0].length;
    assert m1[0].length == m2.length;
    double[][] r = new double[n][];
    ExecutorService e = Executors.newFixedThreadPool(nthreads);
    List<Future<double[]>> results = new LinkedList<Future<double[]>>();
    for (int ii = 0; ii < n; ++ii) {
        final int i = ii;
        Future<double[]> result = e.submit(new Callable<double[]>() {
            public double[] call() throws Exception {
                double[] row = new double[l];
                for (int j = 0; j < l; ++j) {
                    for (int k = 0; k < m; ++k) {
                        row[j] += m1[i][k] * m2[k][j];
                    }
                }
                return row;
            }
        });
        results.add(result);
    }
    try {
        e.shutdown();
        e.awaitTermination(1, TimeUnit.HOURS);
        int i = 0;
        for (Future<double[]> result : results) {
            r[i] = result.get();
            ++i;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
    return r;
}
and this is the Go program
type Matrix struct {
    n, m int
    data [][]float64
}

func New(n, m int) *Matrix {
    data := make([][]float64, n)
    for i := range data {
        data[i] = make([]float64, m)
    }
    return &Matrix{n, m, data}
}

func (m *Matrix) Get(i, j int) float64 {
    return m.data[i][j]
}

func (m *Matrix) Set(i, j int, v float64) {
    m.data[i][j] = v
}

func MultiplyParallel(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    c := make(chan interface{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            innerLoop(r, m1, m2, i)
            c <- nil
        }(i)
    }
    for i := 0; i < m1.n; i++ {
        <-c
    }
    return r
}

func innerLoop(r, m1, m2 *Matrix, i int) {
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s = s + m1.Get(i, k)*m2.Get(k, j)
        }
        r.Set(i, j, s)
    }
}
When I use the Java program with nthreads=1 and nthreads=2 there is a nearly double speedup on my dual-core N450 Atom netbook.
When I use the Go program with GOMAXPROCS=1 and GOMAXPROCS=2 there is no speedup at all!
Even though the Java code uses additional storage for Futures and then collects their values into the result matrix instead of updating the array directly in the worker code (which is what the Go version does), it still performs much faster on several cores than the Go version.
Especially funny is that the Go version with GOMAXPROCS=2 loads both cores (htop displays 100% load on both processors while the program works), yet the computation time is the same as with GOMAXPROCS=1 (htop displays 100% load on only one core in that case).
Another concern is that the Java program is faster than the Go one even in a simple single-threaded multiplication, but that is not exactly unexpected (taking the benchmarks from here into account) and should not affect the multicore performance multiplier.
What am I doing incorrectly here? Is there a way to speed up the Go program?
UPD:
It seems I found what I was doing incorrectly. I was checking the time of the Java program using System.currentTimeMillis() and the Go program using the time shell command, and I mistakenly took the 'user' time from the zsh output as the program's working time instead of the 'total' one. Now I have double-checked the computation speed and it gives me a nearly double speedup too (though it is slightly less than Java's):
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q 22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p 24,09s user 0,10s system 184% cpu 13,080 total
Seems I have to be more attentive.
Still, the Java program is about five times faster in the same case. But that is a matter for another question, I think.
You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.
This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.
The fact that in the Java program you have thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging from your own words, the goroutines are writing directly to the common array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning does exactly the opposite in a multi-threaded environment!
For more in-depth discussion on the topic, I warmly recommend Herb Sutter's article: Eliminate False Sharing, or a lecture: Machine Architecture: Things Your Programming Language Never Told You (and associated PDF slides).
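To make the effect concrete, here is an illustrative Java sketch (not from the original answer; the same experiment could be written in Go): two threads increment two counters, first in adjacent slots of an AtomicLongArray, which typically share a 64-byte cache line, then padded one line apart. The PAD constant assumes 64-byte lines.
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final int PAD = 8; // 8 longs = 64 bytes, one typical cache line
    static final long ITERS = 50_000_000L;

    static void run(int stride) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(2 * PAD);
        Thread t1 = new Thread(() -> { for (long i = 0; i < ITERS; i++) counters.incrementAndGet(0); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < ITERS; i++) counters.incrementAndGet(stride); });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.printf("stride %d: %.2f s%n", stride, (System.nanoTime() - start) / 1e9);
    }

    public static void main(String[] args) throws InterruptedException {
        run(1);   // adjacent slots: same cache line, coherency ping-pong
        run(PAD); // padded apart: separate cache lines, markedly faster
    }
}
On a multicore machine the padded run is typically several times faster even though both runs execute exactly the same number of increments.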
If you are able to run this code in a Linux environment, you can use perf to measure the false sharing effect.
For Linux, and for 32- and 64-bit Windows, there are also AMD's CodeXL and CodeAnalyst. They will profile an application running on an AMD processor in much greater detail than one from Intel, since the applicable performance registers are different.

Loop counter in Java API

All,
While going through some of the files in the Java API, I noticed many instances where the loop counter is decremented rather than incremented, e.g. in for and while loops in the String class. Though this might be trivial, is there any significance to decrementing the counter rather than incrementing it?
I've compiled two simple loops with Eclipse 3.6 (Java 6) and looked at the bytecode to see whether there are any differences. Here's the code:
for(int i = 2; i >= 0; i--){}
for(int i = 0; i <= 2; i++){}
And this is the bytecode:
// 1st for loop - decrement 2 -> 0
0 iconst_2
1 istore_1 // i := 2
2 goto 8
5 iinc 1 -1 // i += (-1)
8 iload_1
9 ifge 5 // if (i >= 0) goto 5
// 2nd for loop - increment 0 -> 2
12 iconst_0
13 istore_1 // i := 0
14 goto 20
17 iinc 1 1 // i += 1
20 iload_1
21 iconst_2
22 if_icmple 17 // if (i <= 2) goto 17
The increment/decrement operation makes no difference; it's either +1 or +(-1). The main difference in this typical(!) example is that in the first loop we compare against 0 (ifge), while in the second we compare against a value (if_icmple), and the comparison is done in each iteration. So if there is any (slight) performance gain, I think it's because it's less costly to compare with 0 than with other values. So I guess it's not incrementing/decrementing that makes the difference but the stop criterion.
So if you need to do some micro-optimization at the source-code level, try to write your loops in a way that compares with zero; otherwise keep them as readable as possible (and incrementing is much easier to understand):
for (int i = 0; i <= 2; i++) {} // readable
for (int i = -2; i <= 0; i++) {} // micro-optimized and "faster" (hopefully)
Addition
Yesterday I did a very basic test: just created a 2000x2000 array and populated the cells based on calculations with the cell indices, once counting up from 0 -> 1999 for both rows and columns, another time backwards from 1999 -> 0. I wasn't surprised that both scenarios had similar performance (185..210 ms on my machine).
So yes, there is a difference at the bytecode level (Eclipse 3.6) but, hey, we're in 2010 now; it doesn't seem to make a significant difference nowadays. So again, using Stephen's words, "don't waste your time" with this kind of optimization. Keep the code readable and understandable.
When in doubt, benchmark.
public class IncDecTest
{
    public static void main(String[] av)
    {
        long up = 0;
        long down = 0;
        long upStart, upStop;
        long downStart, downStop;
        long upStart2, upStop2;
        long downStart2, downStop2;

        upStart = System.currentTimeMillis();
        for (long i = 0; i < 100000000; i++)
        {
            up++;
        }
        upStop = System.currentTimeMillis();

        downStart = System.currentTimeMillis();
        for (long j = 100000000; j > 0; j--)
        {
            down++;
        }
        downStop = System.currentTimeMillis();

        upStart2 = System.currentTimeMillis();
        for (long k = 0; k < 100000000; k++)
        {
            up++;
        }
        upStop2 = System.currentTimeMillis();

        downStart2 = System.currentTimeMillis();
        for (long l = 100000000; l > 0; l--)
        {
            down++;
        }
        downStop2 = System.currentTimeMillis();

        assert (up == down);
        System.out.println("Up: " + (upStop - upStart));
        System.out.println("Down: " + (downStop - downStart));
        System.out.println("Up2: " + (upStop2 - upStart2));
        System.out.println("Down2: " + (downStop2 - downStart2));
    }
}
With the following JVM:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
it has the following output (I ran it multiple times to make sure the JVM was loaded and to let the numbers settle down a little).
$ java -ea IncDecTest
Up: 86
Down: 84
Up2: 83
Down2: 84
These all come extremely close to one another, and I have a feeling that any discrepancy is down to the JVM loading some code at some points and not others, a background task happening, or simply the timing falling on a millisecond boundary and getting rounded down.
While at one point (early days of Java) there might have been some performance voodoo to be had, it seems to me that that is no longer the case.
Feel free to try running/modifying the code to see for yourself.
It is possible that this is a result of Sun engineers doing a whole lot of profiling and micro-optimization, and that the examples you found are the result of that. It is also possible that they are the result of Sun engineers "optimizing" based on deep knowledge of the JIT compilers ... or based on shallow / incorrect knowledge / voodoo thinking.
It is possible that these sequences:
are faster than the increment loops,
are no faster or slower than increment loops, or
are slower than increment loops for the latest JVMs, and the code is no longer optimal.
Either way, you should not emulate this practice in your code, unless thorough profiling with the latest JVMs demonstrates that:
your code really will benefit from optimization, and
the decrementing loop really is faster than the incrementing loop for your particular application.
And even then, you may find that your carefully hand optimized code is less than optimal on other platforms ... and that you need to repeat the process all over again.
These days, it is generally recognized that the best first strategy is to write simple code and leave optimization to the JIT compiler. Writing complicated code (such as loops that run in reverse) may actually foil the JIT compiler's attempts to optimize.
