All,
While going through some of the files in the Java API, I noticed many instances where the loop counter is decremented rather than incremented, e.g. in for and while loops in the String class. Though this might be trivial, is there any significance to decrementing the counter rather than incrementing it?
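For illustration, the pattern in question looks something like this (a made-up example in the style of such loops, not actual JDK source):

// Made-up illustration (not actual JDK source) of the decrementing pattern:
public class CountDownLoop {
    public static void main(String[] args) {
        char[] value = "hello".toCharArray();
        int hash = 0;
        for (int i = value.length - 1; i >= 0; i--) { // counts down, compares to 0
            hash = 31 * hash + value[i];
        }
        System.out.println(hash);
    }
}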
I've compiled two simple loops with Eclipse 3.6 (Java 6) and looked at the bytecode to see whether there are any differences. Here's the code:
for(int i = 2; i >= 0; i--){}
for(int i = 0; i <= 2; i++){}
And this is the bytecode:
// 1st for loop - decrement 2 -> 0
0 iconst_2
1 istore_1 // i:=2
2 goto 8
5 iinc 1 -1 // i += (-1)
8 iload_1
9 ifge 5 // if (i >= 0) goto 5
// 2nd for loop - increment 0 -> 2
12 iconst_0
13 istore_1 // i:=0
14 goto 20
17 iinc 1 1 // i += 1
20 iload_1
21 iconst_2
22 if_icmple 17 // if (i <= 2) goto 17
The increment/decrement operation itself should make no difference; it's either +1 or +(-1). The main difference in this typical(!) example is that in the first loop we compare against 0 (ifge), while in the second we compare against a value (if_icmple), and the comparison is done in each iteration. So if there is any (slight) performance gain, I think it's because comparing with 0 is less costly than comparing with another value. So I guess it's not incrementing/decrementing that makes the difference but the stop criterion.
So if you need to do some micro-optimization at the source-code level, try to write your loops in a way that compares with zero; otherwise keep them as readable as possible (and incrementing is much easier to understand):
for (int i = 0; i <= 2; i++) {} // readable
for (int i = -2; i <= 0; i++) {} // micro-optimized and "faster" (hopefully)
Addition
Yesterday I did a very basic test - just created a 2000x2000 array and populated the cells based on calculations with the cell indices, once counting up from 0->1999 for both rows and columns, and once backwards from 1999->0. I wasn't surprised that both scenarios had similar performance (185..210 ms on my machine).
So yes, there is a difference at the bytecode level (Eclipse 3.6), but, hey, we're in 2010 now; it doesn't seem to make a significant difference nowadays. So again, and using Stephen's words, "don't waste your time" with this kind of optimization. Keep the code readable and understandable.
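For reference, a minimal sketch of the array test described above; the exact cell calculation is my assumption, since the original code was not posted:

// Sketch of the test described above. The cell formula (i * j) is an
// assumption; the original post did not include the code.
public class ArrayFillTest {
    public static void main(String[] args) {
        int n = 2000;
        long[][] cells = new long[n][n];

        long t0 = System.currentTimeMillis();
        for (int i = 0; i < n; i++)             // counting up
            for (int j = 0; j < n; j++)
                cells[i][j] = (long) i * j;
        System.out.println("up:   " + (System.currentTimeMillis() - t0) + " ms");

        t0 = System.currentTimeMillis();
        for (int i = n - 1; i >= 0; i--)        // counting down
            for (int j = n - 1; j >= 0; j--)
                cells[i][j] = (long) i * j;
        System.out.println("down: " + (System.currentTimeMillis() - t0) + " ms");
    }
}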
When in doubt, benchmark.
public class IncDecTest
{
    public static void main(String[] av)
    {
        long up = 0;
        long down = 0;
        long upStart, upStop;
        long downStart, downStop;
        long upStart2, upStop2;
        long downStart2, downStop2;

        upStart = System.currentTimeMillis();
        for( long i = 0; i < 100000000; i++ )
        {
            up++;
        }
        upStop = System.currentTimeMillis();

        downStart = System.currentTimeMillis();
        for( long j = 100000000; j > 0; j-- )
        {
            down++;
        }
        downStop = System.currentTimeMillis();

        upStart2 = System.currentTimeMillis();
        for( long k = 0; k < 100000000; k++ )
        {
            up++;
        }
        upStop2 = System.currentTimeMillis();

        downStart2 = System.currentTimeMillis();
        for( long l = 100000000; l > 0; l-- )
        {
            down++;
        }
        downStop2 = System.currentTimeMillis();

        assert (up == down);

        System.out.println( "Up: " + (upStop - upStart));
        System.out.println( "Down: " + (downStop - downStart));
        System.out.println( "Up2: " + (upStop2 - upStart2));
        System.out.println( "Down2: " + (downStop2 - downStart2));
    }
}
With the following JVM:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
Has the following output (I ran it multiple times to make sure the JVM was loaded and to let the numbers settle down a little):
$ java -ea IncDecTest
Up: 86
Down: 84
Up2: 83
Down2: 84
These all come extremely close to one another, and I have a feeling that any discrepancy is due to the JVM loading some code at some points and not others, a background task happening, or the timing simply falling on a millisecond boundary and being rounded down.
While at one point (early days of Java) there might have been some performance voodoo to be had, it seems to me that that is no longer the case.
Feel free to try running/modifying the code to see for yourself.
It is possible that this is a result of Sun engineers doing a whole lot of profiling and micro-optimization, and those examples that you found are the result of that. It is also possible that they are the result of Sun engineers "optimizing" based on deep knowledge of the JIT compilers ... or based on shallow / incorrect knowledge / voodoo thinking.
It is possible that these sequences:
are faster than the increment loops,
are no faster or slower than increment loops, or
are slower than increment loops for the latest JVMs, and the code is no longer optimal.
Either way, you should not emulate this practice in your code, unless thorough profiling with the latest JVMs demonstrates that:
your code really will benefit from optimization, and
the decrementing loop really is faster than the incrementing loop for your particular application.
And even then, you may find that your carefully hand optimized code is less than optimal on other platforms ... and that you need to repeat the process all over again.
These days, it is generally recognized that the best first strategy is to write simple code and leave optimization to the JIT compiler. Writing complicated code (such as loops that run in reverse) may actually foil the JIT compiler's attempts to optimize.
Related
I was confused by the following code:
public static void test(){
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 10000000;
    final int jBound = 100;
    for(int i = 1; i <= iBound; i++){
        int a = 1;
        int tot = 10;
        for(int j = 1; j <= jBound; j++){
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
That's the first version; it costs 443 ms on my computer.
public static void test(){
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 100;
    final int jBound = 10000000;
    for(int i = 1; i <= iBound; i++){
        int a = 1;
        int tot = 10;
        for(int j = 1; j <= jBound; j++){
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
The second version costs 832 ms.
The only difference is that I simply swapped the i and j bounds.
This result is incredible; I tested the same code in C, and the difference there is not nearly as large.
Why do these two similar pieces of code perform so differently in Java?
My JDK version is OpenJDK 14.0.2.
TL;DR - This is just a bad benchmark.
I did the following:
Create a Main class with a main method.
Copy in the two versions of the test as test1() and test2().
In the main method do this:
while(true) {
    test1();
    test2();
}
Here is the output I got (Java 8).
i:10000000 j:100
It costs 35 ms
i:100 j:10000000
It costs 33 ms
i:10000000 j:100
It costs 33 ms
i:100 j:10000000
It costs 25 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
....
So as you can see, when I run two versions of the same method alternately in the same JVM, the times for each method are roughly the same.
But more importantly, after a small number of iterations the time drops to ... zero! What has happened is that the JIT compiler has compiled the two methods and (probably) deduced that their loops can be optimized away.
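If you actually want to measure the loops rather than the JIT's ability to delete them, you have to make their result observable. A minimal sketch (my modification, not the original code):

// Sketch: accumulating into an observable value defeats dead-code elimination.
public static long test1Live() {
    final int iBound = 10000000;
    final int jBound = 100;
    long sink = 0;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
        sink += tot;                // use tot so it cannot be discarded
    }
    return sink;                    // the caller should print or otherwise consume this
}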
It is not entirely clear why people are getting different times when the two versions are run separately. One possible explanation is that on the first run the JVM executable is being read from disk, and on the second it is already cached in RAM. Or something like that.
Another possible explanation is that JIT compilation kicks in earlier1 with one version of test(), so the proportion of time spent in the slower interpreted (pre-JIT) phase is different between the two versions. (It may be possible to tease this out using JIT logging options.)
But it is immaterial really ... because the performance of a Java application while the JVM is warming up (loading code, JIT compiling, growing the heap to its working size, loading caches, etc) is generally speaking not important. And for the cases where it is important, look for a JVM that can do AOT compilation; e.g. GraalVM.
1 - This could be because of the way that the interpreter gathers stats. The general idea is that the bytecode interpreter accumulates statistics on things like branches until it has "enough". Then the JVM triggers the JIT compiler to compile the bytecodes to native code. When that is done, the code typically runs 10 or more times faster. The different looping patterns might make it reach "enough" earlier in one version compared to the other. NB: I am speculating here. I offer zero evidence ...
The bottom line is that you have to be careful when writing Java benchmarks because the timings can be distorted by various JVM warmup effects.
For more information read: How do I write a correct micro-benchmark in Java?
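To make the advice concrete, here is roughly what the comparison could look like under JMH, the harness the linked question recommends (a sketch; the class and method names are mine). JMH runs warm-up iterations for you, and returning the accumulated value stops the JIT from deleting the loops:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// Sketch only: the two loop orders as JMH benchmarks.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class LoopOrderBenchmark {

    @Benchmark
    public long bigOuterSmallInner() {        // i: 10000000, j: 100
        long sum = 0;
        for (int i = 1; i <= 10000000; i++) {
            int a = 1;
            int tot = 10;
            for (int j = 1; j <= 100; j++) tot *= a;
            sum += tot;               // consuming tot prevents dead-code elimination
        }
        return sum;
    }

    @Benchmark
    public long smallOuterBigInner() {        // i: 100, j: 10000000
        long sum = 0;
        for (int i = 1; i <= 100; i++) {
            int a = 1;
            int tot = 10;
            for (int j = 1; j <= 10000000; j++) tot *= a;
            sum += tot;
        }
        return sum;
    }
}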
I tested it myself and I get the same kind of difference (around 16 ms and 4 ms).
After testing, I found that declaring a variable 1M times takes less time than multiplying by 1 1M times.
How?
I wrote one loop containing 100 multiplications:
final int nb = 100000000;
for(int i = 1;i<=nb;i++){
i *= 1;
i *= 1;
[... written 20 times]
i *= 1;
i *= 1;
}
And one containing 100 declarations:
final int nb = 100000000;
for(int i = 1; i <= nb; i++){
    int a = 0;
    int aa = 0;
    [... written 20 times]
    int aaaaaaaaaaaaaaaaaaaaaa = 0;
    int aaaaaaaaaaaaaaaaaaaaaaa = 0;
}
And I get 8 ms and 3 ms respectively, which seems to correspond to what you get.
You may get different results on a different processor.
You'll find the answer in the first chapter of algorithm books: the cost of producing and assigning a value is 1. So in the first version you perform the two declarations and assignments 10000000 times, while in the second version you do them only 100 times, which reduces the time...
In the first version:
5 operations in the main loop and 3 in the inner loop -> the inner loop costs 3 * 100 = 300, then (300 + 5) * 10000000 = 3050000000.
In the second version:
3 * 10000000 = 30000000 -> (30000000 + 5) * 100 = 3000000500.
So the second one should be faster in theory, but I think it comes back to the CPU's parallelism: it can run 10000000 small independent jobs in the first version but only 100 in the second, so the first one comes out faster.
I recently wrote a computation-intensive algorithm in Java, and then translated it to C++. To my surprise the C++ executed considerably slower. I have now written a much shorter Java test program, and a corresponding C++ program - see below. My original code featured a lot of array access, as does the test code. The C++ takes 5.5 times longer to execute (see comment at end of each program).
Conclusions after the first 21 comments below ...
Test code:
g++ -o ... Java 5.5 times faster
g++ -O3 -o ... Java 2.9 times faster
g++ -fprofile-generate -march=native -O3 -o ... (run, then g++ -fprofile-use etc) Java 1.07 times faster.
My original project (much more complex than test code):
Java 1.8 times faster
C++ 1.9 times faster
C++ 2 times faster
Software environment:
Ubuntu 16.04 (64 bit).
NetBeans 8.2 / JDK 8u121 (Java code executed inside NetBeans)
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Compilation: g++ -o cpp_test cpp_test.cpp
Java code:
public class JavaTest {
    public static void main(String[] args) {
        final int ARRAY_LENGTH = 100;
        final int FINISH_TRIGGER = 100000000;
        int[] intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        boolean finished = false;
        long loopCount = 0;
        System.out.println("Start");
        long startTime = System.nanoTime();
        while (!finished) {
            loopCount++;
            intArray[i]++;
            if (intArray[i] >= FINISH_TRIGGER) finished = true;
            else if (i < (ARRAY_LENGTH - 1)) i++;
            else i = 0;
        }
        System.out.println("Finish: " + loopCount + " loops; " +
                ((System.nanoTime() - startTime)/1e9) + " secs");
        // 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
    }
}
C++ code:
//cpp_test.cpp:
#include <iostream>
#include <sys/time.h>

int main() {
    const int ARRAY_LENGTH = 100;
    const int FINISH_TRIGGER = 100000000;
    int *intArray = new int[ARRAY_LENGTH];
    for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
    int i = 0;
    bool finished = false;
    long long loopCount = 0;
    std::cout << "Start\n";
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    long long startTime = (1000000000*ts.tv_sec) + ts.tv_nsec;
    while (!finished) {
        loopCount++;
        intArray[i]++;
        if (intArray[i] >= FINISH_TRIGGER) finished = true;
        else if (i < (ARRAY_LENGTH - 1)) i++;
        else i = 0;
    }
    clock_gettime(CLOCK_REALTIME, &ts);
    double elapsedTime =
        ((1000000000*ts.tv_sec) + ts.tv_nsec - startTime)/1e9;
    std::cout << "Finish: " << loopCount << " loops; ";
    std::cout << elapsedTime << " secs\n";
    // 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}
The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.
There's not much going on in your program apart from a non-trivial if statement. That is, without analysing the entire program, it's hard to predict which branch is most likely. This leads me to believe that this is a branch misprediction issue. Modern CPUs do instruction pipelining which allows for higher CPU throughput. However, this requires a prediction of what the next instructions to execute are. If the guess is wrong, the instruction pipeline must be cleared out, and the correct instructions loaded in (which takes time).
At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do a bit of branch prediction on their own as well, but this is generally along the lines of loops repeating and ifs taking the if branch (rather than the else).
Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
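A classic way to observe this effect from plain Java (my illustration, separate from the question's code) is to time the same data-dependent branch over sorted versus shuffled data; the predictor learns the sorted pattern:

import java.util.Arrays;
import java.util.Random;

// Illustration of branch (mis)prediction: the same data-dependent branch is
// typically much faster over sorted data, because the predictor can learn
// the pattern. Timings are machine-dependent; the shape of the result is not.
public class BranchDemo {
    public static void main(String[] args) {
        int[] data = new int[1 << 20];
        Random rnd = new Random(42);
        for (int j = 0; j < data.length; j++) data[j] = rnd.nextInt(256);

        time("shuffled", data.clone());
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        time("sorted", sorted);
    }

    static void time(String label, int[] data) {
        long sum = 0;
        long t0 = System.nanoTime();
        for (int pass = 0; pass < 100; pass++) {
            for (int v : data) {
                if (v >= 128) sum += v;   // hard to predict on shuffled data
            }
        }
        System.out.println(label + ": " + (System.nanoTime() - t0) / 1e6
                + " ms (sum=" + sum + ")");
    }
}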
Somehow both GCC and clang fail to unroll this loop and pull out the invariants even in -O3 and -Os, but Java does.
Java's final JITted assembly code is similar to this (in reality repeated twice):
while (true) {
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) break;
    loopCount++;
    if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
    if (i >= ARRAY_LENGTH) i = 0;
}
With this loop I'm getting exactly the same timings (6.4s) between C++ and Java.
Why is this legal to do? Because ARRAY_LENGTH is 100, which is a multiple of 4, so i can only reach the bound, and need resetting to 0, once every 4 iterations.
This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops for which the total number of iterations is unknown, but even if unrolling is forced, they fail to recognize parts of the loop that apply to only certain iterations.
Regarding your findings in a more complex code (a.k.a. real life): Java's optimizer is exceptionally good for small loops, a lot of thought has been put into that, but Java loses a lot of time on virtual calls and GC.
In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing"; look at the generated code, profile, repeat.
For example, if you restructure your loop just a bit:
while (!finished) {
    for (i = 0; i < ARRAY_LENGTH; ++i) {
        loopCount++;
        if (++intArray[i] >= FINISH_TRIGGER) {
            finished = true;
            break;
        }
    }
}
Then C++ will outperform Java (5.9s vs 6.4s). (revised C++ assembly)
And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):
while (!finished) {
    for (int i = 0; i < ARRAY_LENGTH; ++i) {
        ++intArray[i];
    }
    loopCount += ARRAY_LENGTH;
    for (int i = 0; i < ARRAY_LENGTH; ++i) {
        if (intArray[i] >= FINISH_TRIGGER) {
            loopCount -= ARRAY_LENGTH - i - 1;
            finished = true;
            break;
        }
    }
}
Now clang is able to vectorize the loop and reaches the speed of 3.5s vs. Java's 4.8s (GCC is unfortunately still not able to vectorize it).
The following two versions of the same function (which basically tries to recover a password by brute force) do not give the same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while(true)
        {
            i = length - 1;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while(true)
        {
            i = refi;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
I would expect version 2 to be faster, since the only difference is that it does:
i = refi;
...as compared to version 1, which does:
i = length - 1;
However, it's the opposite: version 1 is faster by over 3%!
Does anyone know why? Is it due to some optimization done by the compiler?
Thank you all for your answers so far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, and even then the CPU can do plenty of tricks to optimise it further, e.g. turning the x86 code into RISC-style micro-instructions before actually executing it.
A computation takes as little as one cycle, and the CPU can perform up to three of them at once. An access to the L1 cache takes 4 cycles; L2, L3 and main memory take about 11, 40-75 and 200 cycles respectively.
Storing values to avoid a simple calculation is actually slower in many cases. BTW, using division and modulus is quite expensive, and caching those values can be worth it when micro-tuning your code.
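As a small illustration of that last point (entirely made-up numbers, and note that HotSpot can often perform this transformation itself):

// Illustrative sketch: reusing one division instead of issuing a division
// and a separate modulus. The JIT may already do this for you.
public class DivModCache {
    public static void main(String[] args) {
        int totalSeconds = 3725;

        // Naive form: potentially two expensive divide instructions.
        int m1 = totalSeconds / 60;
        int s1 = totalSeconds % 60;

        // Cached form: one divide, remainder derived from the quotient.
        int m2 = totalSeconds / 60;
        int s2 = totalSeconds - m2 * 60;

        System.out.println(m1 + ":" + s1 + " == " + m2 + ":" + s2);
    }
}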
The correct answer should be retrievable with a disassembler (or a .class -> .java decompiler), but my guess is that the compiler might have decided to get rid of refi altogether and to store length - 1 in an auxiliary register.
I'm more of a C++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;
Comparing running times of code does not give exact or guaranteed results
First of all, this is not the way to compare performance; a running-time analysis is needed here. Both pieces of code have the same loop structure, so their theoretical running times are the same. The running times you observe may differ between runs, but mostly because of cache hits, I/O times, and thread & process scheduling. There is no guarantee that code always completes in an exact time.
However, there are still differences in your code; to understand them you should look at your CPU architecture. I can explain it in terms of the x86 architecture.
What happens behind the scenes?
i = refi;
The CPU fetches refi and i from RAM into its registers; that is 2 RAM accesses if the values are not in the cache, and the value of i is then written back to RAM. However, this always takes a different amount of time depending on thread & process scheduling. Furthermore, if the values are in virtual memory it will take even longer.
i = length -1;
The CPU likewise accesses i and length from RAM or the cache; there is the same number of accesses. In addition, there is a subtraction here, which means extra CPU cycles. That is why you would expect this one to take longer to complete; but the issues I mentioned above explain why it may not.
Summary
As I explained, this is not the way to compare performance. I think there is no real difference between these pieces of code; there are lots of optimizations inside the CPU and in the compiler. You can see the optimized code if you decompile the .class files.
My advice is that it is better to focus on Big-O running-time analysis; finding better algorithms is the best way to optimize code. If you still have bottlenecks in your code, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code.
If you really care about the performance, create an appropriate micro-benchmark (a sketch follows).
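For the second point, a JMH-style sketch that isolates just the difference between i = refi and i = length - 1 (class and method names are mine, the password logic is stripped away, and the values live in @State fields so that JMH does not constant-fold them; an assumption-laden reduction, since isValid() was not shown):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Sketch: measuring only the assignment difference, nothing else.
@State(Scope.Thread)
public class RefiBenchmark {
    int length = 8;
    int refi = length - 1;

    @Benchmark
    public int viaRefi() {
        return refi;            // read a precomputed field
    }

    @Benchmark
    public int viaSubtraction() {
        return length - 1;      // read a field and subtract
    }
}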
I'm playing with some code that measures the time needed to execute some Java, to get a feeling for the efficiency or inefficiency of some of Java's functionality. In doing so, I'm now stuck on a really strange effect that I just can't explain to myself. Maybe one of you can help me understand it.
import java.util.LinkedList;
import java.util.List;

public class PerformanceCheck {
    public static void main(String[] args) {
        List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
        int maxTimes = 1000000000;
        for (int i = 0; i < 10; i++) {
            long time = System.currentTimeMillis();
            for (int times = 0; times < maxTimes; times++) {
                // PERFORMANCE CHECK BLOCK START
                if (removeList.size() > 0) {
                    testFunc(3);
                }
                // PERFORMANCE CHECK BLOCK END
            }
            long timeNow = System.currentTimeMillis();
            System.out.println("time: " + (timeNow - time));
        }
    }

    private static boolean testFunc(int test) {
        return 5 > test;
    }
}
Starting this results in a relatively long computation time (remember removeList is empty, so testFunc is not even called):
time: 2328
time: 2223
...
Replacing any part of the combination of removeList.size() > 0 and testFunc(3) with something else gives better results. For example:
...
if (removeList.size() == 0) {
    testFunc(3);
}
...
Results in (testFunc is called every single time):
time: 8
time: 7
time: 0
time: 0
Even calling both functions independent from each other results in the lower computation time:
...
if (removeList.size() == 0);
testFunc(3);
...
Result:
time: 6
time: 5
time: 0
time: 0
...
Only this particular combination in my initial example takes so long. This is irritating me and I'd really like to understand it. What's so special about it?
Thanks.
Addition:
Changing testFunc() in the first example
if (removeList.size() > 0) {
    testFunc(times);
}
to something else, like
private static int testFunc2(int test) {
    return 5 * test;
}
will make it fast again.
That is really surprising. The generated bytecode is identical except for the conditional, which is ifle vs ifne.
The results are much more sensible if you turn off the JIT with -Xint: the second version is 2x slower. So it has to do with the JIT optimization.
I assume that it can optimize out the check in the second case but not the first (for whatever reason). Even though it means doing the work of the function, missing that conditional makes things much faster; it avoids pipeline stalls and all that.
While not directly related to this question, this is how you would correctly micro benchmark the code using Caliper. Below is a modified version of your code so that it will run with Caliper. The inner loops had to be modified some so that the VM will not optimize them out. It is surprisingly smart at realizing nothing was happening.
There are also a lot of nuances when benchmarking Java code. I wrote about some of the issues I ran into at Java Matrix Benchmark, such as how past history can affect current results. You will avoid many of those issues by using Caliper.
http://code.google.com/p/caliper/
Benchmarking issues with Java Matrix Benchmark
import java.util.LinkedList;
import java.util.List;

import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class PerformanceCheck extends SimpleBenchmark {

    public int timeFirstCase(int reps) {
        List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
        removeList.add(new PerformanceCheck());
        int ret = 0;
        for (int i = 0; i < reps; i++) {
            if (removeList.size() > 0) {
                if (testFunc(i))
                    ret++;
            }
        }
        return ret;
    }

    public int timeSecondCase(int reps) {
        List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
        removeList.add(new PerformanceCheck());
        int ret = 0;
        for (int i = 0; i < reps; i++) {
            if (removeList.size() == 0) {
                if (testFunc(i))
                    ret++;
            }
        }
        return ret;
    }

    private static boolean testFunc(int test) {
        return 5 > test;
    }

    public static void main(String[] args) {
        Runner.main(PerformanceCheck.class, args);
    }
}
OUTPUT:
0% Scenario{vm=java, trial=0, benchmark=FirstCase} 0.60 ns; σ=0.00 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=SecondCase} 1.92 ns; σ=0.22 ns @ 10 trials
benchmark ns linear runtime
FirstCase 0.598 =========
SecondCase 1.925 ==============================
vm: java
trial: 0
Well, I am glad not to have to deal with Java performance optimizations. I tried it myself with Java JDK 7 64-bit. The results are arbitrary ;). It makes no difference which lists I am using or whether I cache the result of size() before entering the loop. Also, entirely wiping out the test function makes almost no difference (so it can't be a branch prediction effect either).
Optimization flags improve performance, but the results are just as arbitrary.
The only logical conclusion here is that the JIT compiler is sometimes able to optimize away the statement (which is not hard to believe), but it seems rather arbitrary. This is one of the many reasons why I prefer languages like C++, where the behaviour is at least deterministic, even if it is sometimes arbitrary.
BTW, in the latest Eclipse, as it always was on Windows, running this code via the IDE "Run" (no debug) is 10 times slower than running it from the console; so much for that...
When the runtime compiler can figure out testFunc evaluates to a constant, I believe it does not evaluate the loop, which explains the speedup.
When the condition is removeList.size() == 0, the call testFunc(3) gets evaluated to a constant. When the condition is removeList.size() != 0, the inner code never gets evaluated, so it can't be sped up. You can modify your code as follows:
for (int times = 0; times < maxTimes; times++) {
    testFunc(); // Removing this call makes the code slow again!
    if (removeList.size() != 0) {
        testFunc();
    }
}

private static boolean testFunc() {
    return testFunc(3);
}
When testFunc() is not initially called, the runtime compiler does not realize that testFunc() evaluates to a constant, so it cannot optimize the loop.
Certain functions like
private static int testFunc2(int test) {
    return 5 * test;
}
the compiler likely tries to pre-optimize (before execution), but apparently not in the case where a parameter is passed in as an integer and evaluated in a conditional.
Your benchmark returns times like
time: 107
time: 106
time: 0
time: 0
...
suggesting that it takes 2 iterations of the outer loop for the runtime compiler to finish optimizing. Running with the -server flag would probably return all 0's in the benchmark.
The times are unrealistically fast per iteration. This means the JIT has detected that your code doesn't do anything and has eliminated it. Subtle changes can confuse the JIT into not being able to determine that the code does nothing, in which case it takes some time.
If you change the test to do something marginally useful, the difference will disappear.
These benchmarks are tough since compilers are so darned smart. One guess: Since the result of testFunc() is ignored, the compiler might be completely optimizing it out. Add a counter, something like
if (testFunc(3))
    counter++;
And, just for thoroughness, do a System.out.println(counter) at the end.
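Applied to the original loop, the suggestion looks like this (sketch):

// Sketch: consuming testFunc's result so the JIT cannot discard the call.
int counter = 0;
for (int times = 0; times < maxTimes; times++) {
    if (removeList.size() > 0) {
        if (testFunc(3))
            counter++;
    }
}
System.out.println(counter); // making the work observable keeps the JIT honest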
Is Java's System.arraycopy() efficient for small arrays, or does the fact that it's a native method make it likely to be substantially less efficient than a simple loop and a function call?
Do native methods incur additional performance overhead for crossing some kind of Java-system bridge?
Expanding a little on what Sid has written, it's very likely that System.arraycopy is just a JIT intrinsic; meaning that when code calls System.arraycopy, it will most probably be calling a JIT-specific implementation (once the JIT tags System.arraycopy as being "hot") that is not executed through the JNI interface, so it doesn't incur the normal overhead of native methods.
In general, executing native methods does have some overhead (going through the JNI interface, also some internal JVM operations cannot happen when native methods are being executed). But it's not because a method is marked as "native" that you're actually executing it using JNI. The JIT can do some crazy things.
Easiest way to check is, as has been suggested, writing a small benchmark, being careful with the normal caveats of Java microbenchmarks (warm up the code first, avoid code with no side-effects since the JIT just optimizes it as a no-op, etc).
Here is my benchmark code:
public void test(int copySize, int copyCount, int testRep) {
    System.out.println("Copy size = " + copySize);
    System.out.println("Copy count = " + copyCount);
    System.out.println();

    for (int i = testRep; i > 0; --i) {
        copy(copySize, copyCount);
        loop(copySize, copyCount);
    }
    System.out.println();
}

public void copy(int copySize, int copyCount) {
    int[] src = newSrc(copySize + 1);
    int[] dst = new int[copySize + 1];

    long begin = System.nanoTime();
    for (int count = copyCount; count > 0; --count) {
        System.arraycopy(src, 1, dst, 0, copySize);
        dst[copySize] = src[copySize] + 1;
        System.arraycopy(dst, 0, src, 0, copySize);
        src[copySize] = dst[copySize];
    }
    long end = System.nanoTime();

    System.out.println("Arraycopy: " + (end - begin) / 1e9 + " s");
}

public void loop(int copySize, int copyCount) {
    int[] src = newSrc(copySize + 1);
    int[] dst = new int[copySize + 1];

    long begin = System.nanoTime();
    for (int count = copyCount; count > 0; --count) {
        for (int i = copySize - 1; i >= 0; --i) {
            dst[i] = src[i + 1];
        }
        dst[copySize] = src[copySize] + 1;
        for (int i = copySize - 1; i >= 0; --i) {
            src[i] = dst[i];
        }
        src[copySize] = dst[copySize];
    }
    long end = System.nanoTime();

    System.out.println("Man. loop: " + (end - begin) / 1e9 + " s");
}

public int[] newSrc(int arraySize) {
    int[] src = new int[arraySize];
    for (int i = arraySize - 1; i >= 0; --i) {
        src[i] = i;
    }
    return src;
}
From my tests, calling test() with copyCount = 10000000 (1e7) or greater allows the warm-up to be achieved during the first copy/loop call, so using testRep = 5 is enough. With copyCount = 1000000 (1e6) the warm-up needs at least 2 or 3 iterations, so testRep should be increased in order to obtain usable results.
With my configuration (CPU Intel Core 2 Duo E8500 @ 3.16GHz, Java SE 1.6.0_35-b10 and Eclipse 3.7.2), it appears from the benchmark that:
When copySize = 24, System.arraycopy() and the manual loop take almost the same time (sometimes one is very slightly faster than the other, other times it’s the contrary),
When copySize < 24, the manual loop is faster than System.arraycopy() (slightly faster with copySize = 23, really faster with copySize < 5),
When copySize > 24, System.arraycopy() is faster than the manual loop (slightly faster with copySize = 25, the ratio loop-time/arraycopy-time increasing as copySize increases).
Note: I'm not a native English speaker; please excuse my grammar/vocabulary errors.
This is a valid concern. For example, in java.nio.DirectByteBuffer.put(byte[]), the author tries to avoid a JNI copy for a small number of elements:
// These numbers represent the point at which we have empirically
// determined that the average cost of a JNI call exceeds the expense
// of an element by element copy. These numbers may change over time.
static final int JNI_COPY_TO_ARRAY_THRESHOLD = 6;
static final int JNI_COPY_FROM_ARRAY_THRESHOLD = 6;
For System.arraycopy(), we can examine how the JDK itself uses it. For example, in ArrayList, System.arraycopy() is always used, never an "element by element copy", regardless of length (even if it's 0). Since ArrayList is very performance conscious, we can infer that System.arraycopy() is the most efficient way of copying arrays regardless of length.
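To illustrate, here is a simplified sketch (not the actual JDK source) of how an ArrayList-style class leans on System.arraycopy whenever it grows its backing array:

// Simplified sketch of an ArrayList-like structure; the real JDK code uses
// Arrays.copyOf, which delegates to System.arraycopy internally.
class GrowableIntArray {
    private int[] data = new int[10];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            int[] bigger = new int[data.length + (data.length >> 1)]; // grow ~1.5x
            System.arraycopy(data, 0, bigger, 0, size);               // bulk copy, any length
            data = bigger;
        }
        data[size++] = value;
    }
}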
System.arraycopy uses a memmove operation for moving words, and hand-written assembly for moving other primitive types, behind the scenes. So it makes a best effort to copy as efficiently as it can.
Instead of relying on speculation and possibly outdated information, I ran some benchmarks using caliper. In fact, Caliper comes with some examples, including a CopyArrayBenchmark that measures exactly this question! All you have to do is run
mvn exec:java -Dexec.mainClass=com.google.caliper.runner.CaliperMain -Dexec.args=examples.CopyArrayBenchmark
My results are based on Oracle's Java HotSpot(TM) 64-Bit Server VM, 1.8.0_31-b13, running on a mid-2010 MacBook Pro (macOS 10.11.6 with an Intel Arrandale i7, 8 GiB RAM). I don't believe that it's useful to post the raw timing data. Rather, I'll summarize the conclusions with the supporting visualizations.
In summary:
Writing a manual for loop to copy each element into a newly instantiated array is never advantageous, even for arrays as short as 5 elements.
Arrays.copyOf(array, array.length) and array.clone() are both consistently fast. These two techniques are nearly identical in performance; which one you choose is a matter of taste.
System.arraycopy(src, 0, dest, 0, src.length) is almost as fast as Arrays.copyOf(array, array.length) and array.clone(), but not quite consistently so. (See the case for 50000 ints.) Because of that, and the verbosity of the call, I would recommend System.arraycopy() if you need fine control over which elements get copied where.
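For reference, the three fast techniques from the summary look like this side by side (a plain illustration):

import java.util.Arrays;

// The three bulk-copy techniques discussed above, side by side.
public class CopyStyles {
    public static void main(String[] args) {
        int[] array = {1, 2, 3, 4, 5};

        int[] a = Arrays.copyOf(array, array.length);   // library copy
        int[] b = array.clone();                        // clone
        int[] c = new int[array.length];
        System.arraycopy(array, 0, c, 0, array.length); // explicit, most control

        System.out.println(Arrays.toString(a) + " " + Arrays.toString(b)
                + " " + Arrays.toString(c));
    }
}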
Bytecodes are executed natively anyway, so it's likely that performance would be better than a loop. In the case of a loop, the JVM has to execute bytecodes, which incurs an overhead, while an array copy should be a straight memcopy.
Native functions should be faster than JVM functions, since there is no VM overhead. However, for a lot of (>1000) very small (len<10) arrays it might be slower.