Java iterative vs recursive

Java iterative vs recursive - java

Can anyone explain why the following recursive method is faster than the iterative one (Both are doing it string concatenation) ? Isn't the iterative approach suppose to beat up the recursive one ? plus each recursive call adds a new layer on top of the stack which can be very space inefficient.
private static void string_concat(StringBuilder sb, int count){
if(count >= 9999) return;
string_concat(sb.append(count), count+1);
}
public static void main(String [] arg){
long s = System.currentTimeMillis();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < 9999; i++){
sb.append(i);
}
System.out.println(System.currentTimeMillis()-s);
s = System.currentTimeMillis();
string_concat(new StringBuilder(),0);
System.out.println(System.currentTimeMillis()-s);
}
I ran the program multiple time, and the recursive one always ends up 3-4 times faster than the iterative one. What could be the main reason there that is causing the iterative one slower ?

See my comments.
Make sure you learn how to properly microbenchmark. You should be timing many iterations of both and averaging these for your times. Aside from that, you should make sure the VM isn't giving the second an unfair advantage by not compiling the first.
In fact, the default HotSpot compilation threshold (configurable via -XX:CompileThreshold) is 10,000 invokes, which might explain the results you see here. HotSpot doesn't really do any tail optimizations so it's quite strange that the recursive solution is faster. It's quite plausible that StringBuilder.append is compiled to native code primarily for the recursive solution.
I decided to rewrite the benchmark and see the results for myself.
public final class AppendMicrobenchmark {
static void recursive(final StringBuilder builder, final int n) {
if (n > 0) {
recursive(builder.append(n), n - 1);
}
}
static void iterative(final StringBuilder builder) {
for (int i = 10000; i >= 0; --i) {
builder.append(i);
}
}
public static void main(final String[] argv) {
/* warm-up */
for (int i = 200000; i >= 0; --i) {
new StringBuilder().append(i);
}
/* recursive benchmark */
long start = System.nanoTime();
for (int i = 1000; i >= 0; --i) {
recursive(new StringBuilder(), 10000);
}
System.out.printf("recursive: %.2fus\n", (System.nanoTime() - start) / 1000000D);
/* iterative benchmark */
start = System.nanoTime();
for (int i = 1000; i >= 0; --i) {
iterative(new StringBuilder());
}
System.out.printf("iterative: %.2fus\n", (System.nanoTime() - start) / 1000000D);
}
}
Here are my results...
C:\dev\scrap>java AppendMicrobenchmark
recursive: 405.41us
iterative: 313.20us
C:\dev\scrap>java -server AppendMicrobenchmark
recursive: 397.43us
iterative: 312.14us
These are times for each approach averaged over 1000 trials.
Essentially, the problems with your benchmark are that it doesn't average over many trials (law of large numbers), and that it is highly dependent on the ordering of the individual benchmarks. The original result I was given for yours:
C:\dev\scrap>java StringBuilderBenchmark
80
41
This made very little sense to me. Recursion on the HotSpot VM is more than likely not going to be as fast as iteration because as of yet it does not implement any sort of tail optimization that you might find used for functional languages.
Now, the funny thing that happens here is that the default HotSpot JIT compilation threshold is 10,000 invokes. Your iterative benchmark will more than likely be executing for the most part before append is compiled. On the other hand, your recursive approach should be comparatively fast since it will more than likely enjoy append after it is compiled. To eliminate this from influencing the results, I passed -XX:CompileThreshold=0 and found...
C:\dev\scrap>java -XX:CompileThreshold=0 StringBuilderBenchmark
8
8
So, when it comes down to it, they're both roughly equal in speed. Note however that the iterative appears to be a bit faster if you average with higher precision. Order might still make a difference in my benchmark, too, as the latter benchmark will have the advantage of the VM having collected more statistics for its dynamic optimizations.

Related

Why is my java program becoming gradually slower?

I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using the System.nanoTime() to keep track of the time it takes for my program to print 10000 Fibonacci number. It started out good with less than a second but gradually became slower and now it takes more than 4 seconds. Can someone explain why this might be happening. The code is down here-
import java.util.*;
import java.math.*;
public class FibonacciGeneratorUnlimited {
static int numFibCalls = 0;
static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
static Scanner fibNumber = new Scanner(System.in);
static BigInteger ans = new BigInteger("0");
public static void main(String[] args){
d.put(0 , new BigInteger("0"));
d.put(1 , new BigInteger("1"));
System.out.print("Enter the term:\t");
int n = fibNumber.nextInt();
long startTime = System.nanoTime();
for (int i = 0; i <= n; i++) {
System.out.println(i + " : " + fib_efficient(i, d));
}
System.out.println((double)(System.nanoTime() - startTime) / 1000000000);
}
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
numFibCalls += 1;
if (d.containsKey(n)) {
return (d.get(n));
} else {
ans = (fib_efficient(n-1, d).add(fib_efficient(n-2, d)));
d.put(n, ans);
return ans;
}
}
}

If you are restarting the program every time you make a new fibonacci sequence, then your program most likely isn't the problem. It might just be the your processor got hot after running the program a few times, or a background process in your computer suddenly started, causing your program to slow down.

More memory java -Xmx=... or less caching
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
numFibCalls++;
if ((n & 3) <= 1) { // Every second is cached.
BigInteger cached = d.get(n);
if (cached != null) {
return cached;
} else {
BigInteger ans = fib_efficient(n-1, d).add(fib_efficient(n-2, d));
d.put(n, ans);
return ans;
}
} else {
return fib_efficient(n-1, d).add(fib_efficient(n-2, d));
}
}
Two subsequent numbers are cached out of four in order to stop the
recursion on both branches for:
fib(n) = fib(n-1) + fib(n-2)
BigInteger isn't the nicest class where performance and memory is concerned.

It started out good with less than a second but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm...
then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously, though, switch out "System.out.println(i + " : " + fib_efficient(i, d))" with "fib_efficient(i,d)" and it'll execute over 100x faster.
Concatenating strings and printing to console are very expensive processes.

It happens because the complexity for Fibonacci is Big-O(n^2). This means that, the larger the input the time increases exponentially, as you can see in the graph for Big-O(n^2) in this link. Check this answer to see a complete explanation about it´s complexity.
Now, the complexity of your algorithm increases because you are using a HashMap to search and insert elements each time that function is invoked. Consider remove this HashMap.

What happens inside the JVM so that a method invocation in Java becomes slower when you call it somewhere else in your code?

The short code below isolates the problem. Basically I'm timing the method addToStorage. I start by executing it one million times and I'm able to get its time down to around 723 nanoseconds. Then I do a short pause (using a busy spinning method not to release the cpu core) and time the method again N times, on a different code location. For my surprise I find that the smaller the N the bigger is the addToStorage latency.
For example:
If N = 1 then I get 3.6 micros
If N = 2 then I get 3.1 and 2.5 micros
if N = 5 then I get 3.7, 1.8, 1.7, 1.5 and 1.5 micros
Does anyone know why this is happening and how to fix it? I would like my method to consistently perform at the fastest time possible, no matter where I call it.
Note: I would not think it is thread related since I'm not using Thread.sleep. I've also tested using taskset to pin my thread to a cpu core with the same results.
import java.util.ArrayList;
import java.util.List;
public class JvmOdd {
private final StringBuilder sBuilder = new StringBuilder(1024);
private final List<String> storage = new ArrayList<String>(1024 * 1024);
public void addToStorage() {
sBuilder.setLength(0);
sBuilder.append("Blah1: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah2: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah3: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah4: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah5: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah6: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah7: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah8: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah9: ").append(System.nanoTime()).append('\n');
sBuilder.append("Blah10: ").append(System.nanoTime()).append('\n');
storage.add(sBuilder.toString());
}
public static long mySleep(long t) {
long x = 0;
for(int i = 0; i < t * 10000; i++) {
x += System.currentTimeMillis() / System.nanoTime();
}
return x;
}
public static void main(String[] args) throws Exception {
int warmup = Integer.parseInt(args[0]);
int mod = Integer.parseInt(args[1]);
int passes = Integer.parseInt(args[2]);
int sleep = Integer.parseInt(args[3]);
JvmOdd jo = new JvmOdd();
// first warm up
for(int i = 0; i < warmup; i++) {
long time = System.nanoTime();
jo.addToStorage();
time = System.nanoTime() - time;
if (i % mod == 0) System.out.println(time);
}
// now see how fast the method is:
while(true) {
System.out.println();
// Thread.sleep(sleep);
mySleep(sleep);
long minTime = Long.MAX_VALUE;
for(int i = 0; i < passes; i++) {
long time = System.nanoTime();
jo.addToStorage();
time = System.nanoTime() - time;
if (i > 0) System.out.print(',');
System.out.print(time);
minTime = Math.min(time, minTime);
}
System.out.println("\nMinTime: " + minTime);
}
}
}
Executing:
$ java -server -cp . JvmOdd 1000000 100000 1 5000
59103
820
727
772
734
767
730
726
840
736
3404
MinTime: 3404

There is so much going on in here that I don't know where to start. But lets start here....
long time = System.nanoTime();
jo.addToStorage();
time = System.nanoTime() - time;
The latency of addToStoarge() cannot be measured using this technique. It simply runs for too quickly meaning you're likely below the resolution of the clock. Without running this, my guess is that your measures are dominated by clock edge counts. You'll need to bulk up the unit of work to get a measure with lower levels of noise in it.
As for what is happening? There are a number of call site optimizations the most important being inlining. Inlining would totally eliminate the call site but it's a path specific optimization. If you call the method from a different place, that would follow the slow path of performing a virtual method lookup followed by a jump to that code. So to see the benefits of inlining from a different path, that path would also have to be "warmed up".
I would strongly recommend that you look at both JMH (delivered with the JDK). There are facilities in there such as blackhole which will help with the effects of CPU clocks winding down. You might also want to evaluate the quality of the bench with the help of tools like JITWatch (Adopt OpenJDK project) which will take logs produced by the JIT and help you interrupt them.

There is so much to this subject, but the bottom line is that you can't write a simplistic benchmark like this and expect it to tell you anything useful. You will need to use JMH.
I suggest watching this: https://www.infoq.com/presentations/jmh about microbenchmarking and JMH
There's also a chapter on microbenchmarking & JMH in my book: http://shop.oreilly.com/product/0636920042983.do

Java internally uses JIT(Just in Compiler). Based on the number of times the same method executes it optimizes the instruction and perform better. For lesser values, the usage of method would be normal which may not fall under optimization that shows the execution time more. When the same method called more time, it uses JIT and executes in lesser time because of the optimized instruction for the same method execution.

Multiplication time in BigInteger

My mini benchmark:
import java.math.*;
import java.util.*;
import java.io.*;
public class c
{
static Random rnd = new Random();
public static String addDigits(String a, int n)
{
if(a==null) return null;
if(n<=0) return a;
for(int i=0; i<n; i++)
a+=rnd.nextInt(10);
return a;
}
public static void main(String[] args) throws IOException
{
int n = 10000; \\number of iterations
int k = 10; \\number of digits added at each iteration
BigInteger a;
BigInteger b;
String as = "";
String bs = "";
as += rnd.nextInt(9)+1;
bs += rnd.nextInt(9)+1;
a = new BigInteger(as);
b = new BigInteger(bs);
FileWriter fw = new FileWriter("c.txt");
long t1 = System.nanoTime();
a.multiply(b);
long t2 = System.nanoTime();
//fw.write("1,"+(t2-t1)+"\n");
if(k>0) {
as = addDigits(as, k-1);
bs = addDigits(as, k-1);
}
for(int i=0; i<n; i++)
{
a = new BigInteger(as);
b = new BigInteger(bs);
t1 = System.nanoTime();
a.multiply(b);
t2 = System.nanoTime();
fw.write(((i+1)*k)+","+(t2-t1)+"\n");
if(i < n-1)
{
as = addDigits(as, k);
bs = addDigits(as, k);
}
System.out.println((i+1)*k);
}
fw.close();
}
}
It measures multiplication time of n-digit BigInteger
Result:
You can easily see the trend but why there is so big noise above 50000 digits?
It is because of garbage collector or is there something else that affects my results?
When performing the test, there were no other applications running.
Result from test with only odd digits. The test was shorter (n=1000, k=100)
Odd digits (n=10000, k=10)
As you can see there is a huge noise between 65000 and 70000. I wonder why...
Odd digits (n=10000, k=10), System.gc() every 1000 iterations
Results in noise between 50000-70000

I also suspect this is a JVM warmup effect. Not warmup involving classloading or the JIT compiler, but warmup of the heap.
Put a (java) loop around the whole benchmark, and run it a number of times. (If this gives you the same graphs as before ... you will have evidence that this is not a warmup effect. Currently you don't have any empirical evidence one way or the other.)
Another possibility is that the noise is caused by your benchmark's interactions with the OS and/or other stuff running on the machine.
You are writing your timing data to an unbuffered stream. That means LOTS of syscalls, and (potentially) lots of fine-grained disc writes.
You are making LOTS of calls to nanoTime(), and that might introduce noise.
If something else is running on your machine (e.g. you are web browsing) that will slow down your benchmark for a bit and introduce noise.
There could be competition over physical memory ... if you've got too much running on your machine for the amount of RAM.
Finally, a certain amount of noise is inevitable, because each of those multiply calls generates garbage, and the garbage collector is going to need to work to deal with it.
Finally finally, if you manually run the garbage collector (or increase the heap size) to "smooth out" the data points, what you are actually doing is concealing one of the costs of multiply calls. The resulting graphs looks nice, but it is misleading:
The noisiness reflects what will happen in real life.
The true cost of the multiply actually includes the amortized cost of running the GC to deal with the garbage generated by the call.
To get a measurements that reflect the way that BigInteger behaves in real life, you need to run the test a large number of times, calculate average times and fit a curve to the average data-points.
Remember, the real aim of the game is to get scientifically valid results ... not a smooth curve.

If you do a microbenchmark, you must "warm up" the JVM first to let the JIT optimize the code, and then you can measure the performance. Otherwise you are measuring the work done by the JIT and that can change the result on each run.
The "noise" happens probably because the cache of the CPU is exceeded and the performance starts degrading.

Java efficiency

I'm playing with some piece of code calculating the time needed to compute some Java code to get a feeling of the efficiency or inefficiency of some of Java's functionality. Doing so I'm stuck now with some really strange effect I just can't explain myself. Maybe someone of you can help me understand it.
public class PerformanceCheck {
public static void main(String[] args) {
List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
int maxTimes = 1000000000;
for (int i=0;i<10;i++) {
long time = System.currentTimeMillis();
for (int times=0;times<maxTimes;times++) {
// PERFORMANCE CHECK BLOCK START
if (removeList.size() > 0) {
testFunc(3);
}
// PERFORMANCE CHECK BLOCK END
}
long timeNow = System.currentTimeMillis();
System.out.println("time: " + (timeNow - time));
}
}
private static boolean testFunc(int test) {
return 5 > test;
}
}
Starting this results in a relatively long computation time (remember removeList is empty, so testFunc is not even called):
time: 2328
time: 2223
...
While replacing anything of the combination of removeList.size() > 0 and testFunc(3) with anything else has better results. For example:
...
if (removeList.size() == 0) {
testFunc(3);
}
...
Results in (testFunc is called every single time):
time: 8
time: 7
time: 0
time: 0
Even calling both functions independent from each other results in the lower computation time:
...
if (removeList.size() == 0);
testFunc(3);
...
Result:
time: 6
time: 5
time: 0
time: 0
...
Only this particular combination in my initial example takes so long. This is irritating me and I'd really like to understand it. What's so special about it?
Thanks.
Addition:
Changing testFunc() in the first example
if (removeList.size() > 0) {
testFunc(times);
}
to something else, like
private static int testFunc2(int test) {
return 5*test;
}
Will result in being fast again.

That is really surprising. The generated bytecode is identical except for the conditional, which is ifle vs ifne.
The results are much more sensible if you turn off the JIT with -Xint. The second version is 2x slower. So it's to do with what the JIT optimization.
I assume that it can optimize out the check in the second case but not the first (for whatever reason). Even though it means it does the work of the function, missing that conditional makes things much faster. It avoids pipeline stalls and all that.

While not directly related to this question, this is how you would correctly micro benchmark the code using Caliper. Below is a modified version of your code so that it will run with Caliper. The inner loops had to be modified some so that the VM will not optimize them out. It is surprisingly smart at realizing nothing was happening.
There are also a lot of nuances when benchmarking Java code. I wrote about some of the issues I ran into at Java Matrix Benchmark, such as how past history can effect current results. You will avoid many of those issues by using Caliper.
http://code.google.com/p/caliper/
Benchmarking issues with Java Matrix Benchmark
public class PerformanceCheck extends SimpleBenchmark {
public int timeFirstCase(int reps) {
List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
removeList.add( new PerformanceCheck());
int ret = 0;
for( int i = 0; i < reps; i++ ) {
if (removeList.size() > 0) {
if( testFunc(i) )
ret++;
}
}
return ret;
}
public int timeSecondCase(int reps) {
List<PerformanceCheck> removeList = new LinkedList<PerformanceCheck>();
removeList.add( new PerformanceCheck());
int ret = 0;
for( int i = 0; i < reps; i++ ) {
if (removeList.size() == 0) {
if( testFunc(i) )
ret++;
}
}
return ret;
}
private static boolean testFunc(int test) {
return 5 > test;
}
public static void main(String[] args) {
Runner.main(PerformanceCheck.class, args);
}
}
OUTPUT:
0% Scenario{vm=java, trial=0, benchmark=FirstCase} 0.60 ns; σ=0.00 ns # 3 trials
50% Scenario{vm=java, trial=0, benchmark=SecondCase} 1.92 ns; σ=0.22 ns # 10 trials
benchmark ns linear runtime
FirstCase 0.598 =========
SecondCase 1.925 ==============================
vm: java
trial: 0

Well, I am glad not having to deal with Java performance optimizations. I tried it myself with Java JDK 7 64-Bit. The results are arbitrary ;). It makes no difference which lists I am using or if I cache the result of size() before entering the loop. Also entirely wiping out the test function makes almost no difference (so it can't be a branch prediction hit either).
Optimization flags improve performance but are as arbitrary.
The only logical consequence here is that the JIT compiler sometimes is able to optimize away the statement (which is not that hard to be true), but it seems rather arbitrary. One of the many reasons why I prefer languages like C++, where the behaviour is at least deterministic, even if it is sometimes arbitrary.
BTW in the latest Eclipse, like it always was on Windows, running this code via IDE "Run" (no debug) is 10 times slower than running it from console, so much about that...

When the runtime compiler can figure out testFunc evaluates to a constant, I believe it does not evaluate the loop, which explains the speedup.
When the condition is removeList.size() == 0 the function testFunc(3) gets evaluated to a constant. When the condition is removeList.size() != 0 the inner code never gets evaluated so it can't be sped-up. You can modify your code as follows:
for (int times = 0; times < maxTimes; times++) {
testFunc(); // Removing this call makes the code slow again!
if (removeList.size() != 0) {
testFunc();
}
}
private static boolean testFunc() {
return testFunc(3);
}
When testFunc() is not initially called, the runtime compiler does not realize that testFunc() evaluates to a constant, so it cannot optimize the loop.
Certain functions like
private static int testFunc2(int test) {
return 5*test;
}
the compiler likely tries to pre-optimize (before execution), but apparently not for the case of an parameter is passed in as an integer and evaluated in a conditional.
Your benchmark returns times like
time: 107
time: 106
time: 0
time: 0
...
suggesting that it takes 2 iterations of the outer-loop for the runtime compiler to finish optimizing. Compiling with the -server flag would probably return all 0's in the benchmark.

The times are unrealistically fast per iteration. This means the JIT has detected that your code doesn't do anything and has eliminated it. Subtle changes can confuse the JIT and it can't determine the code doesn't do anything and it takes some time.
If you change the test to do something marginally useful, the difference will disappear.

These benchmarks are tough since compilers are so darned smart. One guess: Since the result of testFunc() is ignored, the compiler might be completely optimizing it out. Add a counter, something like
if (testFunc(3))
counter++;
And, just for thoroughness, do a System.out.println(counter) at the end.

Is Java's System.arraycopy() efficient for small arrays?

Is Java's System.arraycopy() efficient for small arrays, or does the fact that it's a native method make it likely to be substantially less efficient than a simple loop and a function call?
Do native methods incur additional performance overhead for crossing some kind of Java-system bridge?

Expanding a little on what Sid has written, it's very likely that System.arraycopy is just a JIT intrinsic; meaning that when code calls System.arraycopy, it will most probably be calling a JIT-specific implementation (once the JIT tags System.arraycopy as being "hot") that is not executed through the JNI interface, so it doesn't incur the normal overhead of native methods.
In general, executing native methods does have some overhead (going through the JNI interface, also some internal JVM operations cannot happen when native methods are being executed). But it's not because a method is marked as "native" that you're actually executing it using JNI. The JIT can do some crazy things.
Easiest way to check is, as has been suggested, writing a small benchmark, being careful with the normal caveats of Java microbenchmarks (warm up the code first, avoid code with no side-effects since the JIT just optimizes it as a no-op, etc).

Here is my benchmark code:
public void test(int copySize, int copyCount, int testRep) {
System.out.println("Copy size = " + copySize);
System.out.println("Copy count = " + copyCount);
System.out.println();
for (int i = testRep; i > 0; --i) {
copy(copySize, copyCount);
loop(copySize, copyCount);
}
System.out.println();
}
public void copy(int copySize, int copyCount) {
int[] src = newSrc(copySize + 1);
int[] dst = new int[copySize + 1];
long begin = System.nanoTime();
for (int count = copyCount; count > 0; --count) {
System.arraycopy(src, 1, dst, 0, copySize);
dst[copySize] = src[copySize] + 1;
System.arraycopy(dst, 0, src, 0, copySize);
src[copySize] = dst[copySize];
}
long end = System.nanoTime();
System.out.println("Arraycopy: " + (end - begin) / 1e9 + " s");
}
public void loop(int copySize, int copyCount) {
int[] src = newSrc(copySize + 1);
int[] dst = new int[copySize + 1];
long begin = System.nanoTime();
for (int count = copyCount; count > 0; --count) {
for (int i = copySize - 1; i >= 0; --i) {
dst[i] = src[i + 1];
}
dst[copySize] = src[copySize] + 1;
for (int i = copySize - 1; i >= 0; --i) {
src[i] = dst[i];
}
src[copySize] = dst[copySize];
}
long end = System.nanoTime();
System.out.println("Man. loop: " + (end - begin) / 1e9 + " s");
}
public int[] newSrc(int arraySize) {
int[] src = new int[arraySize];
for (int i = arraySize - 1; i >= 0; --i) {
src[i] = i;
}
return src;
}
From my tests, calling test() with copyCount = 10000000 (1e7) or greater allows the warm-up to be achieved during the first copy/loop call, so using testRep = 5 is enough; With copyCount = 1000000 (1e6) the warm-up need at least 2 or 3 iterations so testRep shall be increased in order to obtain usable results.
With my configuration (CPU Intel Core 2 Duo E8500 # 3.16GHz, Java SE 1.6.0_35-b10 and Eclipse 3.7.2) it appears from the benchmark that:
When copySize = 24, System.arraycopy() and the manual loop take almost the same time (sometimes one is very slightly faster than the other, other times it’s the contrary),
When copySize < 24, the manual loop is faster than System.arraycopy() (slightly faster with copySize = 23, really faster with copySize < 5),
When copySize > 24, System.arraycopy() is faster than the manual loop (slightly faster with copySize = 25, the ratio loop-time/arraycopy-time increasing as copySize increases).
Note: I’m not English native speaker, please excuse all my grammar/vocabulary errors.

This is a valid concern. For example, in java.nio.DirectByteBuffer.put(byte[]), the author tries to avoid a JNI copy for small number of elements
// These numbers represent the point at which we have empirically
// determined that the average cost of a JNI call exceeds the expense
// of an element by element copy. These numbers may change over time.
static final int JNI_COPY_TO_ARRAY_THRESHOLD = 6;
static final int JNI_COPY_FROM_ARRAY_THRESHOLD = 6;
For System.arraycopy(), we can examine how JDK uses it. For example, in ArrayList, System.arraycopy() is always used, never "element by element copy", regardless of length (even if it's 0). Since ArrayList is very performance conscious, we can derive that System.arraycopy() is the most effecient way of array copying regardless of length.

System.arraycopy use a memmove operation for moving words and assembly for moving other primitive types in C behind the scene. So it makes its best effort to move as much as efficient it can reach.

Instead of relying on speculation and possibly outdated information, I ran some benchmarks using caliper. In fact, Caliper comes with some examples, including a CopyArrayBenchmark that measures exactly this question! All you have to do is run
mvn exec:java -Dexec.mainClass=com.google.caliper.runner.CaliperMain -Dexec.args=examples.CopyArrayBenchmark
My results are based on Oracle's Java HotSpot(TM) 64-Bit Server VM, 1.8.0_31-b13, running on a mid-2010 MacBook Pro (macOS 10.11.6 with an Intel Arrandale i7, 8 GiB RAM). I don't believe that it's useful to post the raw timing data. Rather, I'll summarize the conclusions with the supporting visualizations.
In summary:
Writing a manual for loop to copy each element into a newly instantiated array is never advantageous, even for arrays as short as 5 elements.
Arrays.copyOf(array, array.length) and array.clone() are both consistently fast. These two techniques are nearly identical in performance; which one you choose is a matter of taste.
System.arraycopy(src, 0, dest, 0, src.length) is almost as fast as Arrays.copyOf(array, array.length) and array.clone(), but not quite consistently so. (See the case for 50000 ints.) Because of that, and the verbosity of the call, I would recommend System.arraycopy() if you need fine control over which elements get copied where.
Here are the timing plots:

Byte codes are executed natively anyways so it's likely that performance would be better than a loop.
So in case of a loop it would have to execute byte codes which will incur an overhead. While array copy should be straight memcopy.

Native functions should be faster than JVM functions, since there is no VM overhead. However for a lot of(>1000) very small(len<10) arrays it might be slower.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.