Is Arrays.stream(array_name).sum() slower than iterative approach? - java

I was coding a leetcode problem : https://oj.leetcode.com/problems/gas-station/ using Java 8.
My solution got TLE when I used Arrays.stream(integer_array).sum() to compute the sum, while the same solution was accepted when using iteration to calculate the sum of the elements in the array. The best possible time complexity for this problem is O(n), and I was surprised to get TLE when using the streaming APIs from Java 8. My implementation is O(n) in both cases.
import java.util.Arrays;

public class GasStation {
    public int canCompleteCircuit(int[] gas, int[] cost) {
        int start = 0, i = 0, runningCost = 0, totalGas = 0, totalCost = 0;
        totalGas = Arrays.stream(gas).sum();
        totalCost = Arrays.stream(cost).sum();
        // for (int item : gas) totalGas += item;
        // for (int item : cost) totalCost += item;
        if (totalGas < totalCost)
            return -1;
        while (start > i || (start == 0 && i < gas.length)) {
            runningCost += gas[i];
            if (runningCost >= cost[i]) {
                runningCost -= cost[i++];
            } else {
                runningCost -= gas[i];
                if (--start < 0)
                    start = gas.length - 1;
                runningCost += (gas[start] - cost[start]);
            }
        }
        return start;
    }

    public static void main(String[] args) {
        GasStation sol = new GasStation();
        int[] gas = new int[] { 10, 5, 7, 14, 9 };
        int[] cost = new int[] { 8, 5, 14, 3, 1 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
        gas = new int[] { 10 };
        cost = new int[] { 8 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
    }
}
The solution gets accepted when I comment out the following two lines (calculating the sum using streams):
totalGas = Arrays.stream(gas).sum();
totalCost = Arrays.stream(cost).sum();
and uncomment the following two lines (calculating sum using iteration):
//for (int item : gas) totalGas += item;
//for (int item : cost) totalCost += item;
Now the solution gets accepted. Why is the Java 8 streaming API slower than iteration over primitives for large inputs?

The first step in dealing with problems like this is to bring the code into a controlled environment. That means running it in the JVM you control (and can invoke) and running tests inside a good benchmark harness like JMH. Analyze, don't speculate.
Here's a benchmark I whipped up using JMH to do some analysis on this:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySum {
    static final long SEED = -897234L;

    @Param({"1000000"})
    int sz;

    int[] array;

    @Setup
    public void setup() {
        Random random = new Random(SEED);
        array = new int[sz];
        Arrays.setAll(array, i -> random.nextInt());
    }

    @Benchmark
    public int sumForLoop() {
        int sum = 0;
        for (int a : array)
            sum += a;
        return sum;
    }

    @Benchmark
    public int sumStream() {
        return Arrays.stream(array).sum();
    }
}
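(For completeness: to actually launch the benchmark you can use JMH's Runner API from a plain main method, or the uberjar JMH generates. A minimal sketch of the former, assuming the standard JMH options:)

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ArraySumRunner {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(ArraySum.class.getSimpleName()) // match the benchmark class
                .warmupIterations(3)                     // matches the first run below
                .measurementIterations(3)
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}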
Basically this creates an array of a million ints and sums them twice: once using a for-loop and once using streams. Running the benchmark produces a bunch of output (elided for brevity and for dramatic effect) but the summary results are below:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 3 514.473 398.512 us/op
ArraySum.sumStream 1000000 avgt 3 7355.971 3170.697 us/op
Whoa! That Java 8 streams stuff is teh SUXX0R! It's 14 times slower than a for-loop, don't use it!!!1!
Well, no. First let's go over these results, and then look more closely to see if we can figure out what's going on.
The summary shows the two benchmark methods, with the "sz" parameter of a million. It's possible to vary this parameter but it doesn't turn out to make a difference in this case. I also only ran the benchmark methods 3 times, as you can see from the "samples" column. (There were also only 3 warmup iterations, not visible here.) The score is in microseconds per operation, and clearly the stream code is much, much slower than the for-loop code. But note also the score error: that's the amount of variability in the different runs. JMH helpfully prints out the standard deviation of the results (not shown here) but you can easily see that the score error is a significant fraction of reported score. This reduces our confidence in the score.
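(As an aside, JMH lets you set these iteration counts either from the command line or with annotations on the benchmark class; a small sketch, assuming the standard JMH annotations:)

@Warmup(iterations = 10)       // or -wi 10 on the JMH command line
@Measurement(iterations = 10)  // or -i 10
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySum {
    // ... same benchmark methods as above ...
}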
Running more iterations should help. More warmup iterations will let the JIT do more work and settle down before running the benchmarks, and running more benchmark iterations will smooth out any errors from transient activity elsewhere on my system. So let's try 10 warmup iterations and 10 benchmark iterations:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 504.803 34.010 us/op
ArraySum.sumStream 1000000 avgt 10 7128.942 178.688 us/op
Performance is overall a little faster, and the measurement error is also quite a bit smaller, so running more iterations has had the desired effect. But the streams code is still considerably slower than the for-loop code. What's going on?
A large clue can be obtained by looking at the individual timings of the streams method:
# Warmup Iteration 1: 570.490 us/op
# Warmup Iteration 2: 491.765 us/op
# Warmup Iteration 3: 756.951 us/op
# Warmup Iteration 4: 7033.500 us/op
# Warmup Iteration 5: 7350.080 us/op
# Warmup Iteration 6: 7425.829 us/op
# Warmup Iteration 7: 7029.441 us/op
# Warmup Iteration 8: 7208.584 us/op
# Warmup Iteration 9: 7104.160 us/op
# Warmup Iteration 10: 7372.298 us/op
What happened? The first few iterations were reasonably fast, but then the 4th and subsequent iterations (and all the benchmark iterations that follow) were suddenly much slower.
I've seen this before. It was in this question and this answer elsewhere on SO. I recommend reading that answer; it explains how the JVM's inlining decisions in this case result in poorer performance.
A bit of background here: a for-loop compiles to a very simple increment-and-test loop, and can easily be handled by usual optimization techniques like loop peeling and unrolling. The streams code, while not very complex in this case, is actually quite complex compared to the for-loop code; there's a fair bit of setup, and each loop requires at least one method call. Thus, the JIT optimizations, particularly its inlining decisions, are critical to making the streams code go fast. And it's possible for it to go wrong.
Another background point is that integer summation is about the simplest possible operation you can think of to do in a loop or stream. This will tend to make the fixed overhead of stream setup look relatively more expensive. It is also so simple that it can trigger pathologies in the inlining policy.
The suggestion from the other answer was to add the JVM option -XX:MaxInlineLevel=12 to increase the amount of code that can be inlined. Rerunning the benchmark with that option gives:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 502.379 27.859 us/op
ArraySum.sumStream 1000000 avgt 10 498.572 24.195 us/op
Ah, much nicer. Disabling tiered compilation using -XX:-TieredCompilation also had the effect of avoiding the pathological behavior. I also found that making the loop computation even a bit more expensive, e.g. summing squares of integers -- that is, adding a single multiply -- also avoids the pathological behavior.
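(For the curious, the sum-of-squares variant would look something like this as an additional benchmark method; the method name is mine, not part of the original run:)

@Benchmark
public int sumSquaresStream() {
    // the extra multiply per element is enough to avoid the
    // inlining pathology described above
    return Arrays.stream(array).map(x -> x * x).sum();
}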
Now, your question is about running in the context of the leetcode environment, which seems to run the code in a JVM that you don't have any control over, so you can't change the inlining or compilation options. And you probably don't want to make your computation more complex to avoid the pathology either. So for this case, you might as well just stick to the good old for-loop. But don't be afraid to use streams, even for dealing with primitive arrays. It can perform quite well, aside from some narrow edge cases.

The normal iteration approach is going to be pretty much as fast as anything can be, but streams have a variety of overheads: even though the data is coming directly from an array, there's going to be a primitive Spliterator involved and lots of other objects being generated behind the scenes.
In general, you should expect the "normal approach" to usually be faster than streams unless you're both using parallelization and your data is very large.
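To make that concrete: Arrays.stream(array).sum() expands to roughly the following (a sketch of what the JDK does internally, not its exact source):

import java.util.Arrays;
import java.util.stream.StreamSupport;

static int streamSum(int[] array) {
    // a Spliterator.OfInt over the array, wrapped into a sequential
    // IntStream pipeline ('false' = not parallel), then reduced
    return StreamSupport.intStream(Arrays.spliterator(array), false).sum();
}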

My benchmark (see code below) shows that the streaming approach is about 10-15% slower than the iterative one. Interestingly enough, the parallel stream results vary greatly on my 4-core (i7) MacBook Pro: while I have seen them a few times being about 30% faster than iterative, the most common result is almost three times slower than sequential.
Here is the benchmark code:
import java.util.*;
import java.util.function.*;

public class StreamingBenchmark {
    private static void benchmark(String name, LongSupplier f) {
        long start = System.currentTimeMillis(), sum = 0;
        for (int count = 0; count < 1000; count++) sum += f.getAsLong();
        System.out.println(String.format(
            "%10s in %d millis. Sum = %d",
            name, System.currentTimeMillis() - start, sum
        ));
    }

    public static void main(String argv[]) {
        int data[] = new int[1000000];
        Random randy = new Random();
        for (int i = 0; i < data.length; i++) data[i] = randy.nextInt();

        benchmark("iterative", () -> { int s = 0; for (int n : data) s += n; return s; });
        benchmark("stream", () -> Arrays.stream(data).sum());
        benchmark("parallel", () -> Arrays.stream(data).parallel().sum());
    }
}
Here is the output from a few runs:
iterative in 350 millis. Sum = 564821058000
stream in 394 millis. Sum = 564821058000
parallel in 883 millis. Sum = 564821058000
iterative in 340 millis. Sum = -295411382000
stream in 376 millis. Sum = -295411382000
parallel in 1031 millis. Sum = -295411382000
iterative in 365 millis. Sum = 1205763898000
stream in 379 millis. Sum = 1205763898000
parallel in 1053 millis. Sum = 1205763898000
etc.
This got me curious, and I also tried running equivalent logic in Scala:
object Scarr {
  def main(argv: Array[String]) = {
    val randy = new java.util.Random
    val data = (1 to 1000000).map { _ => randy.nextInt }.toArray
    val start = System.currentTimeMillis
    var sum = 0L
    for (_ <- 1 to 1000) sum += data.sum
    println(sum + " in " + (System.currentTimeMillis - start) + " millis.")
  }
}
This took 14 seconds! About 40 times(!) longer than streaming in Java. Ouch!

The sum() method is equivalent to return reduce(0, Integer::sum);. On a large array, there will be more overhead from all the method calls than from basic by-hand for-loop iteration. The bytecode for the for (int i : numbers) iteration is only very slightly more complicated than that generated by the by-hand for-loop. The stream operation is possibly faster in parallel-friendly environments (though maybe not for primitive sums), but we don't know that the environment is parallel-friendly; indeed, leetcode itself is probably designed to favor the low-level over the abstract, since it measures efficiency rather than legibility.
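Written out, that equivalence (which is stated in the IntStream.sum() documentation) looks like this:

import java.util.Arrays;

static int sumViaReduce(int[] numbers) {
    // IntStream.sum() is documented as equivalent to this reduction
    return Arrays.stream(numbers).reduce(0, Integer::sum);
}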
The sum operation done in any of the three ways (Arrays.stream(ints).sum(), for (int i : ints) { total += i; }, and for (int i = 0; i < ints.length; i++) { total += ints[i]; }) should be relatively similar in efficiency. I used the following test class, which sums a hundred million integers between 0 and 4096 a hundred times per method and records the average times. All of them returned in very similar timeframes. It even attempts to limit parallel processing by occupying all but one of the available cores in while(true) loops, and I still found no particular difference:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.ToLongFunction;
import java.util.stream.IntStream;

public class SumTester {
    private static final int ARRAY_SIZE = 100_000_000;
    private static final int ITERATION_LIMIT = 100;
    private static final int INT_VALUE_LIMIT = 4096;

    public static void main(String[] args) {
        Random random = new Random();
        int[] numbers = new int[ARRAY_SIZE];
        IntStream.range(0, ARRAY_SIZE).forEach(i -> numbers[i] = random.nextInt(INT_VALUE_LIMIT));

        Map<String, ToLongFunction<int[]>> inputs = new HashMap<String, ToLongFunction<int[]>>();
        NanoTimer initializer = NanoTimer.start();
        System.out.println("initialized NanoTimer in " + initializer.microEnd() + " microseconds");

        inputs.put("sumByStream", SumTester::sumByStream);
        inputs.put("sumByIteration", SumTester::sumByIteration);
        inputs.put("sumByForLoop", SumTester::sumByForLoop);

        System.out.println("Parallelables: ");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        int cores = Runtime.getRuntime().availableProcessors();
        List<CancelableThreadEater> threadEaters = new ArrayList<CancelableThreadEater>();
        if (cores > 1) {
            threadEaters = occupyThreads(cores - 1);
        }
        // Only one core should be left to our class
        System.out.println("\nSingleCore (" + threadEaters.size() + " of " + cores + " cores occupied)");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        for (CancelableThreadEater cte : threadEaters) {
            cte.end();
        }
        System.out.println("Complete!");
    }

    public static long sumByStream(int[] numbers) {
        return Arrays.stream(numbers).sum();
    }

    public static long sumByIteration(int[] numbers) {
        int total = 0;
        for (int i : numbers) {
            total += i;
        }
        return total;
    }

    public static long sumByForLoop(int[] numbers) {
        int total = 0;
        for (int i = 0; i < numbers.length; i++) {
            total += numbers[i];
        }
        return total;
    }

    public static void averageTimeFor(int iterations, Map<String, ToLongFunction<int[]>> testMap, int[] numbers) {
        Map<String, Long> durationMap = new HashMap<String, Long>();
        Map<String, Long> sumMap = new HashMap<String, Long>();
        for (String methodName : testMap.keySet()) {
            durationMap.put(methodName, 0L);
            sumMap.put(methodName, 0L);
        }
        for (int i = 0; i < iterations; i++) {
            for (String methodName : testMap.keySet()) {
                int[] newNumbers = Arrays.copyOf(numbers, ARRAY_SIZE);
                ToLongFunction<int[]> function = testMap.get(methodName);
                NanoTimer nt = NanoTimer.start();
                long sum = function.applyAsLong(newNumbers);
                long duration = nt.microEnd();
                sumMap.put(methodName, sum);
                durationMap.put(methodName, durationMap.get(methodName) + duration);
            }
        }
        for (String methodName : testMap.keySet()) {
            long duration = durationMap.get(methodName) / iterations;
            long sum = sumMap.get(methodName);
            System.out.println(methodName + ": result '" + sum + "', elapsed time: " + duration
                    + " microseconds on average over " + iterations + " iterations");
        }
    }

    private static List<CancelableThreadEater> occupyThreads(int numThreads) {
        List<CancelableThreadEater> result = new ArrayList<CancelableThreadEater>();
        for (int i = 0; i < numThreads; i++) {
            CancelableThreadEater cte = new CancelableThreadEater();
            result.add(cte);
            new Thread(cte).start();
        }
        return result;
    }

    private static class CancelableThreadEater implements Runnable {
        // A volatile flag instead of the original synchronized-on-a-Boolean-field
        // pattern, which locked on a different object once the field was reassigned.
        private volatile boolean stop = false;

        public void run() {
            while (!stop) {
                // spin to keep the core busy
            }
        }

        public void end() {
            stop = true;
        }
    }

    // Minimal stand-in for the NanoTimer helper used above (not shown in the
    // original post); assumed to simply wrap System.nanoTime().
    private static class NanoTimer {
        private final long begin = System.nanoTime();

        public static NanoTimer start() {
            return new NanoTimer();
        }

        public long microEnd() {
            return (System.nanoTime() - begin) / 1000;
        }
    }
}
which returned
initialized NanoTimer in 22 microseconds
Parallelables:
sumByIteration: result '-1413860413', elapsed time: 35844 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 35414 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 35218 microseconds on average over 100 iterations
SingleCore (3 of 4 cores occupied)
sumByIteration: result '-1413860413', elapsed time: 37010 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 38375 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 37990 microseconds on average over 100 iterations
Complete!
That said, there's no real reason to do the sum() operation in this case. You iterate through each array once to compute the sums, and then again in the main loop: three passes in total, the last of which may be longer than a normal pass. The problem can be solved correctly with one full simultaneous pass over the arrays plus one short-circuiting pass. It may be possible to do it even more efficiently, but I couldn't figure out a better way than I did. My solution ended up being one of the fastest Java solutions on the chart: it ran in 223ms, which put it in amongst the middle pack of Python solutions.
I'll add my solution to the problem if you care to see it, but I hope the actual question is answered here.
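For reference, a common single-pass formulation of that idea (a sketch of the general technique, not the answerer's posted solution) looks like this:

public int canCompleteCircuit(int[] gas, int[] cost) {
    // If total gas >= total cost, a valid start exists and is unique.
    // Whenever the running tank goes negative, no station from the
    // current start up to i can be the answer, so restart at i + 1.
    int total = 0, tank = 0, start = 0;
    for (int i = 0; i < gas.length; i++) {
        int diff = gas[i] - cost[i];
        total += diff;
        tank += diff;
        if (tank < 0) {
            start = i + 1;
            tank = 0;
        }
    }
    return total >= 0 ? start : -1;
}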

Stream operations are relatively slow, so during a leetcode contest or any algorithms contest, always prefer classic loops over stream operations: large inputs are prone to TLE, which in turn incurs a penalty that affects your final ranking.
A detailed explanation can be found here: https://stackoverflow.com/a/27994074/6185191

I also came across this issue while doing a pretty basic LeetCode problem. The first code that I submitted used the Java Stream API's Arrays.stream().sum() to compute the array sum, which gave a time of 6ms, while the classic for loop took just 1ms to iterate through the same array. Now that's insane! The Stream API method takes at least 6x the time of your simple for loop. So yeah, always go with the simpler and classic method.

Related

Why 2 similar loop codes costs different time in java

I was confused by the following code:
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 10000000;
    final int jBound = 100;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
That's the first version; it costs 443ms on my computer.
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 100;
    final int jBound = 10000000;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
The second version costs 832ms.
The only difference is that I simply swapped i and j.
This result is incredible. I tested the same code in C, and the difference there is not nearly as large.
Why are these two similar pieces of code so different in Java?
My jdk version is openjdk-14.0.2
TL;DR - This is just a bad benchmark.
I did the following:
Create a Main class with a main method.
Copy in the two versions of the test as test1() and test2().
In the main method do this:
while (true) {
    test1();
    test2();
}
Here is the output I got (Java 8).
i:10000000 j:100
It costs 35 ms
i:100 j:10000000
It costs 33 ms
i:10000000 j:100
It costs 33 ms
i:100 j:10000000
It costs 25 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
....
So as you can see, when I run two versions of the same method alternately in the same JVM, the times for each method are roughly the same.
But more importantly, after a small number of iterations the time drops to ... zero! What has happened is that the JIT compiler has compiled the two methods and (probably) deduced that their loops can be optimized away.
It is not entirely clear why people are getting different times when the two versions are run separately. One possible explanation is that the first time run, the JVM executable is being read from disk, and the second time is already cached in RAM. Or something like that.
Another possible explanation is that JIT compilation kicks in earlier [1] with one version of test(), so the proportion of time spent in the slower interpreted (pre-JIT) phase is different between the two versions. (It may be possible to tease this out using JIT logging options.)
But it is immaterial really ... because the performance of a Java application while the JVM is warming up (loading code, JIT compiling, growing the heap to its working size, loading caches, etc) is generally speaking not important. And for the cases where it is important, look for a JVM that can do AOT compilation; e.g. GraalVM.
[1] This could be because of the way that the interpreter gathers stats. The general idea is that the bytecode interpreter accumulates statistics on things like branches until it has "enough". Then the JVM triggers the JIT compiler to compile the bytecodes to native code. When that is done, the code typically runs 10 or more times faster. The different looping patterns might make it reach "enough" earlier in one version compared to the other. NB: I am speculating here. I offer zero evidence ...
The bottom line is that you have to be careful when writing Java benchmarks because the timings can be distorted by various JVM warmup effects.
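One concrete way to keep such loops from being eliminated is to make their result observable; a sketch (the sink field and the tot += i * j loop body are my modifications, not part of the original test):

public class LoopBench {
    static long sink; // written after the loop, so the JIT must keep the work

    public static void test() {
        long t0 = System.currentTimeMillis();
        long tot = 0;
        for (int i = 1; i <= 10000000; i++) {
            for (int j = 1; j <= 100; j++) {
                tot += i * j; // the result now actually matters
            }
        }
        sink = tot; // observable side effect defeats dead-code elimination
        System.out.println("It costs " + (System.currentTimeMillis() - t0) + " ms");
    }

    public static void main(String[] args) {
        for (int k = 0; k < 10; k++) {
            test(); // repeat so post-warmup timings are visible
        }
    }
}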
For more information read: How do I write a correct micro-benchmark in Java?
I tested it myself, and I get the same difference (around 16ms and 4ms).
After testing, I found that:
Declaring a variable 1M times takes less time than multiplying by 1 1M times.
How?
I made one loop whose body is 100 multiplications by 1:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    i *= 1;
    i *= 1;
    [... written 20 times]
    i *= 1;
    i *= 1;
}
And one whose body is 100 declarations:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    int a = 0;
    int aa = 0;
    [... written 20 times]
    int aaaaaaaaaaaaaaaaaaaaaa = 0;
    int aaaaaaaaaaaaaaaaaaaaaaa = 0;
}
And I get 8ms and 3ms respectively, which seems to correspond to what you get.
You may see different results on a different processor.
You can find the answer in the first chapter of any algorithms book: the cost of producing and assigning a value is 1. In the first version you perform the two declarations and assignments 10000000 times, while in the second version you perform them only 100 times, so the second should reduce the time...
In the first: 5 operations in the main loop and 3 in the inner loop -> the inner loop costs 3*100 = 300, then (300 + 5) * 10000000 = 3050000000.
In the second: 3*10000000 = 30000000 -> (30000000 + 5) * 100 = 3000000500.
So the second one should be faster in theory, but I think it comes back to multi-CPU effects: the first version can spread 10000000 outer jobs while the second has only 100... so the first one became faster.

Why is my java program becoming gradually slower?

I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using System.nanoTime() to keep track of the time it takes for my program to print 10000 Fibonacci numbers. It started out good at less than a second, but gradually became slower, and now it takes more than 4 seconds. Can someone explain why this might be happening? The code is below:
import java.util.*;
import java.math.*;

public class FibonacciGeneratorUnlimited {
    static int numFibCalls = 0;
    static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
    static Scanner fibNumber = new Scanner(System.in);
    static BigInteger ans = new BigInteger("0");

    public static void main(String[] args) {
        d.put(0, new BigInteger("0"));
        d.put(1, new BigInteger("1"));
        System.out.print("Enter the term:\t");
        int n = fibNumber.nextInt();
        long startTime = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + fib_efficient(i, d));
        }
        System.out.println((double) (System.nanoTime() - startTime) / 1000000000);
    }

    public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
        numFibCalls += 1;
        if (d.containsKey(n)) {
            return (d.get(n));
        } else {
            ans = (fib_efficient(n - 1, d).add(fib_efficient(n - 2, d)));
            d.put(n, ans);
            return ans;
        }
    }
}
If you are restarting the program every time you generate a new Fibonacci sequence, then your program most likely isn't the problem. It might just be that your processor got hot after running the program a few times, or that a background process on your computer suddenly started, causing your program to slow down.
Give the JVM more memory (java -Xmx...) or cache fewer values:
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
    numFibCalls++;
    if ((n & 3) <= 1) { // two of every four consecutive values are cached
        BigInteger cached = d.get(n);
        if (cached != null) {
            return cached;
        } else {
            BigInteger ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    } else {
        return fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
    }
}
Two consecutive numbers out of every four are cached, in order to stop the recursion on both branches of fib(n) = fib(n-1) + fib(n-2).
BigInteger isn't the nicest class where performance and memory is concerned.
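As an illustration of that point: if the sequence only needs to go up to the 92nd term (fib(92) is the largest Fibonacci number that fits in a signed long), a primitive table avoids BigInteger entirely. A sketch:

// Iterative fib with primitive longs; overflows past fib(92).
static long[] fibTable(int n) {
    long[] f = new long[Math.max(2, n + 1)];
    f[0] = 0;
    f[1] = 1;
    for (int i = 2; i <= n; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }
    return f;
}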
It started out good with less than a second but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm...
then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously, though, switch out "System.out.println(i + " : " + fib_efficient(i, d))" with "fib_efficient(i,d)" and it'll execute over 100x faster.
Concatenating strings and printing to console are very expensive processes.
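If the output itself is required, buffering it helps a lot; a sketch reusing the question's names (fib_efficient, d, n):

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Buffer console output and flush once at the end,
// instead of flushing on every println.
PrintWriter out = new PrintWriter(
        new BufferedWriter(new OutputStreamWriter(System.out)));
for (int i = 0; i <= n; i++) {
    out.println(i + " : " + fib_efficient(i, d));
}
out.flush();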
It happens because the complexity here is O(n^2). This means that the time grows quadratically with the input size, as you can see in the graph for O(n^2) at this link. Check this answer to see a complete explanation of its complexity.
Also, the cost of your algorithm grows because you are using a HashMap to search for and insert elements each time the function is invoked. Consider removing this HashMap.

Running time of the same code blocks is different in java. why is that? [duplicate]

This question already has answers here:
Java benchmarking - why is the second loop faster?
(6 answers)
Closed 9 years ago.
I had the code below. I just wanted to check the running time of a code block, and by mistake I copied and pasted the same code several times and got an interesting result. Though the code blocks are the same, their running times are different, and code block 1 takes more time than the others. If I switch the code blocks around (say, move code block 4 to the top), then code block 4 takes more time than the others.
I used two different types of arrays in my code blocks to check whether it depends on that. The result is the same: among code blocks with the same type of array, the topmost one takes more time. See the code and the output below.
import java.util.Arrays;

public class ABBYtest {
    public static void main(String[] args) {
        long startTime;
        long endTime;

        //code block 1
        startTime = System.nanoTime();
        Long a[] = new Long[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 12l;
        }
        Arrays.sort(a);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 1 = " + (endTime - startTime));

        //code block 6
        startTime = System.nanoTime();
        Long aa[] = new Long[10];
        for (int i = 0; i < aa.length; i++) {
            aa[i] = 12l;
        }
        Arrays.sort(aa);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 6 = " + (endTime - startTime));

        //code block 7
        startTime = System.nanoTime();
        Long aaa[] = new Long[10];
        for (int i = 0; i < aaa.length; i++) {
            aaa[i] = 12l;
        }
        Arrays.sort(aaa);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 7 = " + (endTime - startTime));

        //code block 2
        startTime = System.nanoTime();
        long c[] = new long[10];
        for (int i = 0; i < c.length; i++) {
            c[i] = 12l;
        }
        Arrays.sort(c);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 2 = " + (endTime - startTime));

        //code block 3
        startTime = System.nanoTime();
        long d[] = new long[10];
        for (int i = 0; i < d.length; i++) {
            d[i] = 12l;
        }
        Arrays.sort(d);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 3 = " + (endTime - startTime));

        //code block 4
        startTime = System.nanoTime();
        long b[] = new long[10];
        for (int i = 0; i < b.length; i++) {
            b[i] = 12l;
        }
        Arrays.sort(b);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 4 = " + (endTime - startTime));

        //code block 5
        startTime = System.nanoTime();
        Long e[] = new Long[10];
        for (int i = 0; i < e.length; i++) {
            e[i] = 12l;
        }
        Arrays.sort(e);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 5 = " + (endTime - startTime));
    }
}
The running times:
code block (has Long array) 1 = 802565
code block (has Long array) 6 = 6158
code block (has Long array) 7 = 4619
code block (has long array) 2 = 171906
code block (has long array) 3 = 4105
code block (has long array) 4 = 3079
code block (has Long array) 5 = 8210
As we can see, the first code block containing a Long array takes more time than the other blocks containing Long arrays, and the same holds for the first block containing a long array.
Can anyone explain this behavior, or am I making a mistake here?
Faulty benchmarking. A non-exhaustive list of what is wrong:
No warmup: single-shot measurements are almost always wrong;
Mixing several codepaths in a single method: we probably start compiling the method with the execution data available only for the first loop in the method;
Sources are predictable: should the loop compile, we can actually predict the result;
Results are dead-code eliminated: should the loop compile, we can throw the loop away.
Here is how you do it arguably right with jmh:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(10)
@State(Scope.Thread)
public class Longs {
    public static final int COUNT = 10;

    private Long[] refLongs;
    private long[] primLongs;

    /*
     * Implementation notes:
     *   - copying the array from the field keeps the constant
     *     optimizations away, but we are implicitly counting the
     *     costs of arraycopy() in;
     *   - two additional baseline experiments quantify the
     *     scale of arraycopy effects (note you can't directly
     *     subtract the baseline scores from the tests, because
     *     the code is mixed together);
     *   - the resulting arrays are always fed back into JMH
     *     to prevent dead-code elimination
     */

    @Setup
    public void setup() {
        primLongs = new long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            primLongs[i] = 12l;
        }
        refLongs = new Long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            refLongs[i] = 12l;
        }
    }

    @GenerateMicroBenchmark
    public long[] prim_baseline() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public long[] prim_sort() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_baseline() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_sort() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }
}
...this yields:
Benchmark Mode Samples Mean Mean error Units
o.s.Longs.prim_baseline avgt 30 19.604 0.327 ns/op
o.s.Longs.prim_sort avgt 30 51.217 1.873 ns/op
o.s.Longs.ref_baseline avgt 30 16.935 0.087 ns/op
o.s.Longs.ref_sort avgt 30 25.199 0.430 ns/op
At this point you may start to wonder why sorting Long[] and sorting long[] takes different time. The answer lies in the Array.sort() overloads: OpenJDK sorts primitive and reference arrays via different algos (references with TimSort, primitives with dual-pivot quicksort). Here's the highlight of choosing another algo with -Djava.util.Arrays.useLegacyMergeSort=true, which falls back to merge sort for references:
Benchmark Mode Samples Mean Mean error Units
o.s.Longs.prim_baseline avgt 30 19.675 0.291 ns/op
o.s.Longs.prim_sort avgt 30 50.882 1.550 ns/op
o.s.Longs.ref_baseline avgt 30 16.742 0.089 ns/op
o.s.Longs.ref_sort avgt 30 64.207 1.047 ns/op
Hope that helps to explain the difference.
The explanation above barely scratches the surface of sorting performance. The performance differs greatly when presented with different source data (including pre-sorted subsequences, their patterns and run lengths, and the size of the data itself).
Can anyone explain this behavior. or Am i doing some mistake here ??
Your problem is a badly written benchmark. You do not take account of JVM warmup effects. Things like the overheads of loading code, initial expansion of the heap, and JIT compilation. In addition, startup of an application always generates extra garbage that needs to be collected.
In addition, if your application itself generates garbage (and I expect that sort and / or println are doing that) then you need to take account of possible GC runs during the "steady state" phase of your benchmark application's run.
See this Q&A for hints on how to write valid Java benchmarks:
How do I write a correct micro-benchmark in Java?
There are numerous other articles on this. Google for "how to write a java benchmark".
In this example, I suspect that the first code block takes so much longer than the rest because of (initially) bytecode interpretation followed by the overhead of JIT compilation. You may well be garbage collecting to deal with temporary objects created during loading and JIT compilation. The high value for the 4th measurement is most likely due to another garbage collection cycle.
However, one would need to turn on some JVM logging to figure out the real cause.
Just to add to what everyone else is saying: Java will not necessarily compile everything. When it analyses the code for optimization, the JVM will quite often choose to keep interpreting code that is not used extensively. If you look at the bytecode, your Long arrays should always take more time (and certainly more space) than your long arrays, but as has been pointed out, warmup effects also play a role.
This could be due to a few things:
As noted by syrion, Java's virtual machine is allowed to perform optimizations on your code as it is running. Your first block is likely taking longer because Java hasn't yet optimized your code fully. As the first block runs, the JVM is applying changes which can then be utilized in the other blocks.
Your processor could be caching the results of your code, speeding up future blocks. This is similar to the previous point, but can vary even between identical JVM implementations.
While your program is running, your computer is also performing other tasks. These include handling the OS's UI, checking for program updates, etc. For this reason, some blocks of code can be slower than others, because your computer isn't concentrating as much resources towards its execution.
Java's virtual machine is garbage collected. That is to say, at unspecified points during your program's execution, the JVM takes some time to clean up any objects that are no longer used.
Points 1 and 2 are likely the cause for the large difference in the first block's execution time. Point 3 could be the reason for the smaller fluctuations, and point 4, as noted by Stephen, probably caused the large stall in block 3.
Another thing that I didn't notice is your use of both long and Long. The object form contains a larger memory overhead, and both are subject to different optimizations.

Java iterative vs recursive

Can anyone explain why the following recursive method is faster than the iterative one (both do string concatenation)? Isn't the iterative approach supposed to beat the recursive one? Plus, each recursive call adds a new layer on top of the stack, which can be very space inefficient.
private static void string_concat(StringBuilder sb, int count) {
    if (count >= 9999) return;
    string_concat(sb.append(count), count + 1);
}

public static void main(String[] arg) {
    long s = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 9999; i++) {
        sb.append(i);
    }
    System.out.println(System.currentTimeMillis() - s);
    s = System.currentTimeMillis();
    string_concat(new StringBuilder(), 0);
    System.out.println(System.currentTimeMillis() - s);
}
I ran the program multiple times, and the recursive one always ends up 3-4 times faster than the iterative one. What could be the main reason making the iterative one slower?
See my comments.
Make sure you learn how to properly microbenchmark. You should be timing many iterations of both and averaging these for your times. Aside from that, you should make sure the VM isn't giving the second an unfair advantage by not compiling the first.
In fact, the default HotSpot compilation threshold (configurable via -XX:CompileThreshold) is 10,000 invokes, which might explain the results you see here. HotSpot doesn't really do any tail optimizations so it's quite strange that the recursive solution is faster. It's quite plausible that StringBuilder.append is compiled to native code primarily for the recursive solution.
I decided to rewrite the benchmark and see the results for myself.
public final class AppendMicrobenchmark {
    static void recursive(final StringBuilder builder, final int n) {
        if (n > 0) {
            recursive(builder.append(n), n - 1);
        }
    }

    static void iterative(final StringBuilder builder) {
        for (int i = 10000; i >= 0; --i) {
            builder.append(i);
        }
    }

    public static void main(final String[] argv) {
        /* warm-up */
        for (int i = 200000; i >= 0; --i) {
            new StringBuilder().append(i);
        }

        /* recursive benchmark */
        long start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            recursive(new StringBuilder(), 10000);
        }
        System.out.printf("recursive: %.2fus\n", (System.nanoTime() - start) / 1000000D);

        /* iterative benchmark */
        start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            iterative(new StringBuilder());
        }
        System.out.printf("iterative: %.2fus\n", (System.nanoTime() - start) / 1000000D);
    }
}
Here are my results...
C:\dev\scrap>java AppendMicrobenchmark
recursive: 405.41us
iterative: 313.20us
C:\dev\scrap>java -server AppendMicrobenchmark
recursive: 397.43us
iterative: 312.14us
These are times for each approach averaged over 1000 trials.
Essentially, the problems with your benchmark are that it doesn't average over many trials (law of large numbers), and that it is highly dependent on the ordering of the individual benchmarks. The original result I was given for yours:
C:\dev\scrap>java StringBuilderBenchmark
80
41
This made very little sense to me. Recursion on the HotSpot VM is more than likely not going to be as fast as iteration because as of yet it does not implement any sort of tail optimization that you might find used for functional languages.
Now, the funny thing that happens here is that the default HotSpot JIT compilation threshold is 10,000 invokes. Your iterative benchmark will more than likely be executing for the most part before append is compiled. On the other hand, your recursive approach should be comparatively fast since it will more than likely enjoy append after it is compiled. To eliminate this from influencing the results, I passed -XX:CompileThreshold=0 and found...
C:\dev\scrap>java -XX:CompileThreshold=0 StringBuilderBenchmark
8
8
So, when it comes down to it, they're both roughly equal in speed. Note, however, that the iterative approach appears to be a bit faster if you average with higher precision. Order might still make a difference in my benchmark, too, as the latter benchmark has the advantage of the VM having collected more statistics for its dynamic optimizations.

Performance test independent of the number of iterations

Trying to answer this question: What is the difference between instanceof and Class.isAssignableFrom(...)?
I made a performance test:
class A {}
class B extends A {}

A b = new B();

void execute() {
    boolean test = A.class.isAssignableFrom(b.getClass());
    // boolean test = A.class.isInstance(b);
    // boolean test = b instanceof A;
}
@Test
public void testPerf() {
    // Warmup the code
    for (int i = 0; i < 100; ++i)
        execute();
    // Time it
    int count = 100000;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        execute();
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms.");
}
Which gave me :
A.class.isAssignableFrom(b.getClass()) : 100000 iterations took 15ms
A.class.isInstance(b) : 100000 iterations took 12ms
b instanceof A : 100000 iterations took 6ms
But playing with the number of iterations, I can see the performance is constant. For Integer.MAX_VALUE :
A.class.isAssignableFrom(b.getClass()) : 2147483647 iterations took 15ms
A.class.isInstance(b) : 2147483647 iterations took 12ms
b instanceof A : 2147483647 iterations took 6ms
Thinking it was a compiler optimization (I ran this test with JUnit), I changed it into this :
@Test
public void testPerf() {
    boolean test = false;
    // Warmup the code
    for (int i = 0; i < 100; ++i)
        test |= b instanceof A;
    // Time it
    int count = Integer.MAX_VALUE;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        test |= b instanceof A;
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms. AVG= " + TimeUnit.NANOSECONDS.toMillis(elapsed / count));
    System.out.println(test);
}
But the performance is still "independent" of the number of iterations.
Could someone explain this behavior?
A hundred iterations is not nearly enough for warmup. The default compile threshold is 10000 iterations (a hundred times more), so best go at least a bit over that threshold.
Once the compilation has been triggered, the world is not stopped; the compilation takes place in the background. That means that its effect will start being observable only after a slight delay.
There is ample space for optimization of your test in such a way that the entire loop is collapsed into its final result. That would explain the constant numbers.
Anyway, I always do the benchmarks by having an outer method call the inner method something like 10 times. The inner method does a big number of iterations, say 10,000 or more, as needed to make its runtime rise into at least tens of milliseconds. I don't even bother with nanoTime since if microsecond precision is important to you, it is just a sign of measuring too short a time interval.
When you do it like this, you are making it easy for the JIT to execute a compiled version of the inner method after it was substituted for the interpreted version. Another benefit is that you get assurance that the times of the inner method are stabilizing.
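A sketch of that outer/inner structure (class and method names, and the iteration counts, are illustrative):

public class InstanceOfBench {
    static class A {}
    static class B extends A {}

    static final A b = new B();

    // Inner method: enough iterations to run for at least tens of
    // milliseconds, returning a value so the loop can't be discarded.
    static boolean inner(int iterations) {
        boolean test = false;
        for (int i = 0; i < iterations; i++) {
            test |= b instanceof A;
        }
        return test;
    }

    public static void main(String[] args) {
        // The outer method calls the inner method ~10 times, so later
        // runs execute the compiled version and the timings stabilize.
        for (int run = 0; run < 10; run++) {
            long start = System.currentTimeMillis();
            boolean result = inner(100_000_000);
            System.out.println("run " + run + ": "
                    + (System.currentTimeMillis() - start) + " ms (" + result + ")");
        }
    }
}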
If you want to make a real benchmark of a simple function, you should use a micro-benchmarking tool, like Caliper. It will be much simpler that trying to make your own benchmark.
The JIT compiler can eliminate loops which don't do anything. This can be triggered after 10,000 iterations.
What I suspect you are timing is how long it takes for the JIT to detect that the loop doesn't do anything and remove it. This will be a little longer than it takes to do 10,000 iterations.
