I was confused by the following code:
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 10000000;
    final int jBound = 100;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
That's the first version; it costs 443 ms on my computer.
public static void test() {
    long currentTime1 = System.currentTimeMillis();
    final int iBound = 100;
    final int jBound = 10000000;
    for (int i = 1; i <= iBound; i++) {
        int a = 1;
        int tot = 10;
        for (int j = 1; j <= jBound; j++) {
            tot *= a;
        }
    }
    long updateTime1 = System.currentTimeMillis();
    System.out.println("i:" + iBound + " j:" + jBound + "\nIt costs " + (updateTime1 - currentTime1) + " ms");
}
The second version costs 832 ms.
The only difference is that I simply swapped i and j.
This result is incredible. I tested the same code in C, and the difference there is not nearly as large.
Why are these two similar pieces of code so different in Java?
My JDK version is openjdk-14.0.2.
TL;DR - This is just a bad benchmark.
I did the following:
Create a Main class with a main method.
Copy in the two versions of the test as test1() and test2().
In the main method do this:
while (true) {
    test1();
    test2();
}
Here is the output I got (Java 8).
i:10000000 j:100
It costs 35 ms
i:100 j:10000000
It costs 33 ms
i:10000000 j:100
It costs 33 ms
i:100 j:10000000
It costs 25 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
....
So as you can see, when I run two versions of the same method alternately in the same JVM, the times for each method are roughly the same.
But more importantly, after a small number of iterations the time drops to ... zero! What has happened is that the JIT compiler has compiled the two methods and (probably) deduced that their loops can be optimized away.
It is not entirely clear why people are getting different times when the two versions are run separately. One possible explanation is that the first time run, the JVM executable is being read from disk, and the second time is already cached in RAM. Or something like that.
Another possible explanation is that JIT compilation kicks in earlier¹ with one version of test(), so the proportion of time spent in the slower interpreted (pre-JIT) phase differs between the two versions. (It may be possible to tease this out using JIT logging options.)
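For example, on a HotSpot JVM you could rerun the Main class with compilation logging switched on and compare when each version of test() gets compiled (these are standard HotSpot flags; the exact output format varies by JVM version):

$ java -XX:+PrintCompilation Main
$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Main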
But it is immaterial really ... because the performance of a Java application while the JVM is warming up (loading code, JIT compiling, growing the heap to its working size, loading caches, etc) is generally speaking not important. And for the cases where it is important, look for a JVM that can do AOT compilation; e.g. GraalVM.
¹ - This could be because of the way that the interpreter gathers stats. The general idea is that the bytecode interpreter accumulates statistics on things like branches until it has "enough". Then the JVM triggers the JIT compiler to compile the bytecodes to native code. When that is done, the code typically runs 10 or more times faster. The different looping patterns might make one version reach "enough" earlier than the other. NB: I am speculating here. I offer zero evidence ...
The bottom line is that you have to be careful when writing Java benchmarks because the timings can be distorted by various JVM warmup effects.
For more information read: How do I write a correct micro-benchmark in Java?
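For a rough idea of what that looks like in practice, here is a minimal sketch (my code, not from the question) of the first version rewritten as a JMH benchmark; it assumes the JMH annotations and runner are on the classpath, and the class and method names are made up:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class LoopBench {

    @Benchmark
    public int outerBigInnerSmall() {  // iBound = 10000000, jBound = 100
        int last = 0;
        for (int i = 1; i <= 10000000; i++) {
            int a = 1;
            int tot = 10;
            for (int j = 1; j <= 100; j++) {
                tot *= a;
            }
            last = tot;
        }
        return last;  // returning the result keeps the JIT from deleting the loops outright
    }
}

JMH handles warmup iterations and dead-code elimination concerns for you; note that a loop body this trivial may still be constant-folded, which is itself a sign that the benchmark is not measuring real work.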
I tested it myself and I get the same kind of difference (around 16 ms and 4 ms).
After testing, I found that:
Declaring a variable 1M times takes less time than multiplying by 1 1M times.
How?
I made one loop with 100 multiplications:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    i *= 1;
    i *= 1;
    [... written 20 times]
    i *= 1;
    i *= 1;
}
And one with 100 declarations:
final int nb = 100000000;
for (int i = 1; i <= nb; i++) {
    int a = 0;
    int aa = 0;
    [... written 20 times]
    int aaaaaaaaaaaaaaaaaaaaaa = 0;
    int aaaaaaaaaaaaaaaaaaaaaaa = 0;
}
And I get 8 ms and 3 ms respectively, which seems to correspond to what you get.
You may get different results if you have a different processor.
You can find the answer in the first chapter of algorithm books: the cost of producing and assigning a value is 1. In the first version you perform the two declarations and assignments 10000000 times, while in the second version you only perform them 100 times, so you reduce the time...
In the first version:
5 operations in the outer loop and 3 in the inner loop -> the inner loop is 3 * 100 = 300
then (300 + 5) * 10000000 = 3050000000
In the second version:
3 * 10000000 = 30000000 -> (30000000 + 5) * 100 = 3000000500
So the second one is faster in theory, but I think it comes back to the CPU's parallelism: it can overlap the 10000000 short inner loops of the first version but only the 100 long ones of the second, so the first one ends up faster.
Related
The short code below isolates the problem. Basically I'm timing the method addToStorage. I start by executing it one million times and I'm able to get its time down to around 723 nanoseconds. Then I do a short pause (using a busy-spinning method so as not to release the CPU core) and time the method again N times, at a different code location. To my surprise I find that the smaller N is, the bigger the addToStorage latency.
For example:
If N = 1 then I get 3.6 micros
If N = 2 then I get 3.1 and 2.5 micros
If N = 5 then I get 3.7, 1.8, 1.7, 1.5 and 1.5 micros
Does anyone know why this is happening and how to fix it? I would like my method to consistently perform at the fastest time possible, no matter where I call it.
Note: I don't think it is thread-related, since I'm not using Thread.sleep. I've also tested using taskset to pin my thread to a CPU core, with the same results.
import java.util.ArrayList;
import java.util.List;

public class JvmOdd {

    private final StringBuilder sBuilder = new StringBuilder(1024);
    private final List<String> storage = new ArrayList<String>(1024 * 1024);

    public void addToStorage() {
        sBuilder.setLength(0);
        sBuilder.append("Blah1: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah2: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah3: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah4: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah5: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah6: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah7: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah8: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah9: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah10: ").append(System.nanoTime()).append('\n');
        storage.add(sBuilder.toString());
    }

    public static long mySleep(long t) {
        long x = 0;
        for (int i = 0; i < t * 10000; i++) {
            x += System.currentTimeMillis() / System.nanoTime();
        }
        return x;
    }

    public static void main(String[] args) throws Exception {
        int warmup = Integer.parseInt(args[0]);
        int mod = Integer.parseInt(args[1]);
        int passes = Integer.parseInt(args[2]);
        int sleep = Integer.parseInt(args[3]);
        JvmOdd jo = new JvmOdd();

        // first warm up
        for (int i = 0; i < warmup; i++) {
            long time = System.nanoTime();
            jo.addToStorage();
            time = System.nanoTime() - time;
            if (i % mod == 0) System.out.println(time);
        }

        // now see how fast the method is:
        while (true) {
            System.out.println();
            // Thread.sleep(sleep);
            mySleep(sleep);
            long minTime = Long.MAX_VALUE;
            for (int i = 0; i < passes; i++) {
                long time = System.nanoTime();
                jo.addToStorage();
                time = System.nanoTime() - time;
                if (i > 0) System.out.print(',');
                System.out.print(time);
                minTime = Math.min(time, minTime);
            }
            System.out.println("\nMinTime: " + minTime);
        }
    }
}
Executing:
$ java -server -cp . JvmOdd 1000000 100000 1 5000
59103
820
727
772
734
767
730
726
840
736
3404
MinTime: 3404
There is so much going on in here that I don't know where to start. But let's start here...
long time = System.nanoTime();
jo.addToStorage();
time = System.nanoTime() - time;
The latency of addToStorage() cannot be measured using this technique. It simply runs too quickly, meaning you're likely below the resolution of the clock. Without running this, my guess is that your measurements are dominated by clock-edge counts. You'll need to bulk up the unit of work to get a measure with lower levels of noise in it.
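For instance, here is a minimal sketch (mine, reusing the jo instance from your main method) of what bulking up could look like: time a large batch of calls as one unit and report the average, so the measured interval sits well above the timer's resolution:

final int batch = 100_000;
long start = System.nanoTime();
for (int i = 0; i < batch; i++) {
    jo.addToStorage();  // many calls measured as a single unit of work
}
long avgNanos = (System.nanoTime() - start) / batch;
System.out.println("average: " + avgNanos + " ns/call");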
As for what is happening? There are a number of call-site optimizations, the most important being inlining. Inlining would totally eliminate the call site, but it's a path-specific optimization. If you call the method from a different place, that call will follow the slow path of performing a virtual method lookup followed by a jump to that code. So to see the benefits of inlining from a different path, that path would also have to be "warmed up".
I would strongly recommend that you look at both JMH (an OpenJDK project) and JITWatch (an Adopt OpenJDK project). JMH has facilities such as Blackhole which will help with the effects of dead-code elimination and of CPU clocks winding down. JITWatch will take the logs produced by the JIT and help you interpret them, so you can evaluate the quality of the benchmark.
There is so much to this subject, but the bottom line is that you can't write a simplistic benchmark like this and expect it to tell you anything useful. You will need to use JMH.
I suggest watching this: https://www.infoq.com/presentations/jmh about microbenchmarking and JMH
There's also a chapter on microbenchmarking & JMH in my book: http://shop.oreilly.com/product/0636920042983.do
Java internally uses a JIT (Just-In-Time) compiler. Based on the number of times the same method executes, it optimizes the instructions so that the method performs better. For low call counts the method executes the normal, interpreted way, which does not benefit from optimization and therefore shows a longer execution time. When the same method is called more times, the JIT compiles it and it executes in less time because of the optimized instructions.
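A small sketch (mine; the class name and numbers are made up) of the effect being described, where early calls run interpreted and later calls are much cheaper once the JIT has compiled the method:

public class WarmupDemo {

    static long work() {
        long s = 0;
        for (int i = 0; i < 100_000; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        for (int call = 0; call < 10_000; call++) {
            long start = System.nanoTime();
            long r = work();
            long ns = System.nanoTime() - start;
            if (call < 3 || call % 2000 == 0) {
                // early calls are interpreted; later ones should be much faster
                System.out.println("call " + call + ": " + ns + " ns (r=" + r + ")");
            }
        }
    }
}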
I was coding a leetcode problem : https://oj.leetcode.com/problems/gas-station/ using Java 8.
My solution got TLE when I used Arrays.stream(integer_array).sum() to compute sum while the same solution got accepted using iteration to calculate the sum of elements in array. The best possible time complexity for this problem is O(n) and I am surprised to get TLE when using streaming API's from Java 8. I have implemented the solution in O(n) only.
import java.util.Arrays;

public class GasStation {

    public int canCompleteCircuit(int[] gas, int[] cost) {
        int start = 0, i = 0, runningCost = 0, totalGas = 0, totalCost = 0;
        totalGas = Arrays.stream(gas).sum();
        totalCost = Arrays.stream(cost).sum();
        // for (int item : gas) totalGas += item;
        // for (int item : cost) totalCost += item;
        if (totalGas < totalCost)
            return -1;
        while (start > i || (start == 0 && i < gas.length)) {
            runningCost += gas[i];
            if (runningCost >= cost[i]) {
                runningCost -= cost[i++];
            } else {
                runningCost -= gas[i];
                if (--start < 0)
                    start = gas.length - 1;
                runningCost += (gas[start] - cost[start]);
            }
        }
        return start;
    }

    public static void main(String[] args) {
        GasStation sol = new GasStation();
        int[] gas = new int[] { 10, 5, 7, 14, 9 };
        int[] cost = new int[] { 8, 5, 14, 3, 1 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
        gas = new int[] { 10 };
        cost = new int[] { 8 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
    }
}
The solution gets accepted when,
I comment the following two lines: (calculating sum using streaming)
totalGas = Arrays.stream(gas).sum();
totalCost = Arrays.stream(cost).sum();
and uncomment the following two lines (calculating sum using iteration):
//for (int item : gas) totalGas += item;
//for (int item : cost) totalCost += item;
Now the solution gets accepted. Why is the streaming API in Java 8 slower than plain iteration over primitives for large inputs?
The first step in dealing with problems like this is to bring the code into a controlled environment. That means running it in the JVM you control (and can invoke) and running tests inside a good benchmark harness like JMH. Analyze, don't speculate.
Here's a benchmark I whipped up using JMH to do some analysis on this:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySum {
    static final long SEED = -897234L;

    @Param({"1000000"})
    int sz;

    int[] array;

    @Setup
    public void setup() {
        Random random = new Random(SEED);
        array = new int[sz];
        Arrays.setAll(array, i -> random.nextInt());
    }

    @Benchmark
    public int sumForLoop() {
        int sum = 0;
        for (int a : array)
            sum += a;
        return sum;
    }

    @Benchmark
    public int sumStream() {
        return Arrays.stream(array).sum();
    }
}
Basically this creates an array of a million ints and sums them twice: once using a for-loop and once using streams. Running the benchmark produces a bunch of output (elided for brevity and for dramatic effect) but the summary results are below:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 3 514.473 398.512 us/op
ArraySum.sumStream 1000000 avgt 3 7355.971 3170.697 us/op
Whoa! That Java 8 streams stuff is teh SUXX0R! It's 14 times slower than a for-loop, don't use it!!!1!
Well, no. First let's go over these results, and then look more closely to see if we can figure out what's going on.
The summary shows the two benchmark methods, with the "sz" parameter of a million. It's possible to vary this parameter but it doesn't turn out to make a difference in this case. I also only ran the benchmark methods 3 times, as you can see from the "samples" column. (There were also only 3 warmup iterations, not visible here.) The score is in microseconds per operation, and clearly the stream code is much, much slower than the for-loop code. But note also the score error: that's the amount of variability in the different runs. JMH helpfully prints out the standard deviation of the results (not shown here) but you can easily see that the score error is a significant fraction of reported score. This reduces our confidence in the score.
Running more iterations should help. More warmup iterations will let the JIT do more work and settle down before running the benchmarks, and running more benchmark iterations will smooth out any errors from transient activity elsewhere on my system. So let's try 10 warmup iterations and 10 benchmark iterations:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 504.803 34.010 us/op
ArraySum.sumStream 1000000 avgt 10 7128.942 178.688 us/op
Performance is overall a little faster, and the measurement error is also quite a bit smaller, so running more iterations has had the desired effect. But the streams code is still considerably slower than the for-loop code. What's going on?
A large clue can be obtained by looking at the individual timings of the streams method:
# Warmup Iteration 1: 570.490 us/op
# Warmup Iteration 2: 491.765 us/op
# Warmup Iteration 3: 756.951 us/op
# Warmup Iteration 4: 7033.500 us/op
# Warmup Iteration 5: 7350.080 us/op
# Warmup Iteration 6: 7425.829 us/op
# Warmup Iteration 7: 7029.441 us/op
# Warmup Iteration 8: 7208.584 us/op
# Warmup Iteration 9: 7104.160 us/op
# Warmup Iteration 10: 7372.298 us/op
What happened? The first few iterations were reasonably fast, but then the 4th and subsequent iterations (and all the benchmark iterations that follow) were suddenly much slower.
I've seen this before. It was in this question and this answer elsewhere on SO. I recommend reading that answer; it explains how the JVM's inlining decisions in this case result in poorer performance.
A bit of background here: a for-loop compiles to a very simple increment-and-test loop, and can easily be handled by usual optimization techniques like loop peeling and unrolling. The streams code, while not very complex in this case, is actually quite complex compared to the for-loop code; there's a fair bit of setup, and each loop requires at least one method call. Thus, the JIT optimizations, particularly its inlining decisions, are critical to making the streams code go fast. And it's possible for it to go wrong.
Another background point is that integer summation is about the simplest possible operation you can think of to do in a loop or stream. This will tend to make the fixed overhead of stream setup look relatively more expensive. It is also so simple that it can trigger pathologies in the inlining policy.
The suggestion from the other answer was to add the JVM option -XX:MaxInlineLevel=12 to increase the amount of code that can be inlined. Rerunning the benchmark with that option gives:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 502.379 27.859 us/op
ArraySum.sumStream 1000000 avgt 10 498.572 24.195 us/op
Ah, much nicer. Disabling tiered compilation using -XX:-TieredCompilation also had the effect of avoiding the pathological behavior. I also found that making the loop computation even a bit more expensive, e.g. summing squares of integers -- that is, adding a single multiply -- also avoids the pathological behavior.
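For illustration, this is roughly what the "slightly more expensive" variant would look like as extra methods in the ArraySum class above (my sketch, not benchmarks that were actually run here):

@Benchmark
public int sumSquaresStream() {
    return Arrays.stream(array).map(a -> a * a).sum();  // one extra multiply per element
}

@Benchmark
public int sumSquaresForLoop() {
    int sum = 0;
    for (int a : array)
        sum += a * a;
    return sum;
}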
Now, your question is about running in the context of the leetcode environment, which seems to run the code in a JVM that you don't have any control over, so you can't change the inlining or compilation options. And you probably don't want to make your computation more complex to avoid the pathology either. So for this case, you might as well just stick to the good old for-loop. But don't be afraid to use streams, even for dealing with primitive arrays. It can perform quite well, aside from some narrow edge cases.
The normal iteration approach is going to be pretty much as fast as anything can be, but streams have a variety of overheads: even though it's coming directly from a stream, there's probably going to be a primitive Spliterator involved and lots of other objects being generated.
In general, you should expect the "normal approach" to usually be faster than streams unless you're both using parallelization and your data is very large.
My benchmark (see code below) shows that the streaming approach is about 10-15% slower than the iterative one. Interestingly enough, parallel stream results vary greatly on my 4-core (i7) MacBook Pro: while I have seen them a few times being about 30% faster than iterative, the most common result is almost three times slower than sequential.
Here is the benchmark code:
import java.util.*;
import java.util.function.*;
public class StreamingBenchmark {
private static void benchmark(String name, LongSupplier f) {
long start = System.currentTimeMillis(), sum = 0;
for(int count = 0; count < 1000; count ++) sum += f.getAsLong();
System.out.println(String.format(
"%10s in %d millis. Sum = %d",
name, System.currentTimeMillis() - start, sum
));
}
public static void main(String argv[]) {
int data[] = new int[1000000];
Random randy = new Random();
for(int i = 0; i < data.length; i++) data[i] = randy.nextInt();
benchmark("iterative", () -> { int s = 0; for(int n: data) s+=n; return s; });
benchmark("stream", () -> Arrays.stream(data).sum());
benchmark("parallel", () -> Arrays.stream(data).parallel().sum());
}
}
Here is the output from a few runs:
iterative in 350 millis. Sum = 564821058000
stream in 394 millis. Sum = 564821058000
parallel in 883 millis. Sum = 564821058000
iterative in 340 millis. Sum = -295411382000
stream in 376 millis. Sum = -295411382000
parallel in 1031 millis. Sum = -295411382000
iterative in 365 millis. Sum = 1205763898000
stream in 379 millis. Sum = 1205763898000
parallel in 1053 millis. Sum = 1205763898000
etc.
This got me curious, and I also tried running equivalent logic in scala:
object Scarr {
def main(argv: Array[String]) = {
val randy = new java.util.Random
val data = (1 to 1000000).map { _ => randy.nextInt }.toArray
val start = System.currentTimeMillis
var sum = 0l;
for ( _ <- 1 to 1000 ) sum += data.sum
println(sum + " in " + (System.currentTimeMillis - start) + " millis.")
}
}
This took 14 seconds! About 40 times(!) longer than streaming in java. Ouch!
The sum() method is syntactically equivalent to return reduce(0, Integer::sum);. In a large list, there will be more overhead in making all the method calls than in the basic by-hand for-loop iteration. The bytecode for the for (int i : numbers) iteration is only very slightly more complicated than that generated by the by-hand for-loop. The stream operation is possibly faster in parallel-friendly environments (though maybe not for primitive methods), but unless we know that the environment is parallel-friendly, we shouldn't count on it; leetcode itself is probably designed to favor low-level over abstract code, since it's measuring efficiency rather than legibility.
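A tiny sketch of that equivalence (my example, with a made-up array; all three produce the same value, and the difference is purely in call overhead):

import java.util.Arrays;

public class SumEquivalence {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4, 5 };
        int viaSum    = Arrays.stream(numbers).sum();                    // convenience method
        int viaReduce = Arrays.stream(numbers).reduce(0, Integer::sum);  // what sum() delegates to
        int viaLoop   = 0;
        for (int n : numbers) viaLoop += n;                              // by-hand iteration
        System.out.println(viaSum + " " + viaReduce + " " + viaLoop);    // prints "15 15 15"
    }
}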
The sum operation done in any of the three ways (Arrays.stream(int[]).sum(), for (int i : ints) { total += i; }, and for (int i = 0; i < ints.length; i++) { total += i; }) should be relatively similar in efficiency. I used the following test class, which sums a hundred million integers between 0 and 4096 a hundred times each and records the average times. All of them returned in very similar timeframes. It even attempts to limit parallel processing by occupying all but one of the available cores in busy loops, and I still found no particular difference:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.ToLongFunction;
import java.util.stream.IntStream;

public class SumTester {

    private static final int ARRAY_SIZE = 100_000_000;
    private static final int ITERATION_LIMIT = 100;
    private static final int INT_VALUE_LIMIT = 4096;

    public static void main(String[] args) {
        Random random = new Random();
        int[] numbers = new int[ARRAY_SIZE];
        IntStream.range(0, ARRAY_SIZE).forEach(i -> numbers[i] = random.nextInt(INT_VALUE_LIMIT));

        Map<String, ToLongFunction<int[]>> inputs = new HashMap<String, ToLongFunction<int[]>>();

        // NanoTimer is my helper class (not shown): start() records System.nanoTime()
        // and microEnd() returns the elapsed time in microseconds.
        NanoTimer initializer = NanoTimer.start();
        System.out.println("initialized NanoTimer in " + initializer.microEnd() + " microseconds");

        inputs.put("sumByStream", SumTester::sumByStream);
        inputs.put("sumByIteration", SumTester::sumByIteration);
        inputs.put("sumByForLoop", SumTester::sumByForLoop);

        System.out.println("Parallelables: ");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        int cores = Runtime.getRuntime().availableProcessors();
        List<CancelableThreadEater> threadEaters = new ArrayList<CancelableThreadEater>();
        if (cores > 1) {
            threadEaters = occupyThreads(cores - 1);
        }
        // Only one core should be left to our class
        System.out.println("\nSingleCore (" + threadEaters.size() + " of " + cores + " cores occupied)");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        for (CancelableThreadEater cte : threadEaters) {
            cte.end();
        }
        System.out.println("Complete!");
    }

    public static long sumByStream(int[] numbers) {
        return Arrays.stream(numbers).sum();
    }

    public static long sumByIteration(int[] numbers) {
        int total = 0;
        for (int i : numbers) {
            total += i;
        }
        return total;
    }

    public static long sumByForLoop(int[] numbers) {
        int total = 0;
        for (int i = 0; i < numbers.length; i++) {
            total += numbers[i];
        }
        return total;
    }

    public static void averageTimeFor(int iterations, Map<String, ToLongFunction<int[]>> testMap, int[] numbers) {
        Map<String, Long> durationMap = new HashMap<String, Long>();
        Map<String, Long> sumMap = new HashMap<String, Long>();
        for (String methodName : testMap.keySet()) {
            durationMap.put(methodName, 0L);
            sumMap.put(methodName, 0L);
        }
        for (int i = 0; i < iterations; i++) {
            for (String methodName : testMap.keySet()) {
                int[] newNumbers = Arrays.copyOf(numbers, ARRAY_SIZE);
                ToLongFunction<int[]> function = testMap.get(methodName);
                NanoTimer nt = NanoTimer.start();
                long sum = function.applyAsLong(newNumbers);
                long duration = nt.microEnd();
                sumMap.put(methodName, sum);
                durationMap.put(methodName, durationMap.get(methodName) + duration);
            }
        }
        for (String methodName : testMap.keySet()) {
            long duration = durationMap.get(methodName) / iterations;
            long sum = sumMap.get(methodName);
            System.out.println(methodName + ": result '" + sum + "', elapsed time: " + duration + " microseconds on average over " + iterations + " iterations");
        }
    }

    private static List<CancelableThreadEater> occupyThreads(int numThreads) {
        List<CancelableThreadEater> result = new ArrayList<CancelableThreadEater>();
        for (int i = 0; i < numThreads; i++) {
            CancelableThreadEater cte = new CancelableThreadEater();
            result.add(cte);
            new Thread(cte).start();
        }
        return result;
    }

    private static class CancelableThreadEater implements Runnable {
        // volatile boolean instead of synchronizing on a mutable Boolean field:
        // reassigning that field would swap out the monitor object under the
        // spinning thread, so the synchronized version was broken.
        private volatile boolean stop = false;

        public void run() {
            while (!stop) {
                // busy-spin to keep the core occupied
            }
        }

        public void end() {
            stop = true;
        }
    }
}
which returned
initialized NanoTimer in 22 microseconds
Parallelables:
sumByIteration: result '-1413860413', elapsed time: 35844 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 35414 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 35218 microseconds on average over 100 iterations
SingleCore (3 of 4 cores occupied)
sumByIteration: result '-1413860413', elapsed time: 37010 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 38375 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 37990 microseconds on average over 100 iterations
Complete!
That said, there's no real reason to do the sum() operation in this case. You are iterating through each array, for a total of three iterations, and the last one may be a longer-than-normal iteration. It's possible to calculate correctly with one full simultaneous iteration of the arrays and one short-circuiting iteration. It may be possible to do it even more efficiently, but I couldn't figure out a better way than I did. My solution ended up being one of the fastest Java solutions on the chart: it ran in 223 ms, which was in amongst the middle pack of Python solutions.
I'll add my solution to the problem if you care to see it, but I hope the actual question is answered here.
Stream functions are relatively slow, so during a LeetCode contest (or any algorithms contest) always prefer classic loops over stream functions, as large inputs are prone to TLE. This can in turn cause a penalty, which would affect your final ranking.
A detailed explanation is mentioned here https://stackoverflow.com/a/27994074/6185191
I also came across this issue while doing a pretty basic LeetCode problem. The first code that I submitted used the Java Stream API's Arrays.stream().sum() to compute the array sum, which gave a time of 6 ms.
The classic for loop took just 1 ms to iterate through the same array. Now that's insane! The Stream API method takes at least 6x the time of your simple for loop. So yeah: always go with the simpler and classic method.
The two following versions of the same function (which basically tries to recover a password by brute force) do not give the same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true)
        {
            i = length - 1;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true)
        {
            i = refi;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compared to version 1:
i = length -1;
However, it's the opposite: version 1 is faster, by over 3%!
Does anyone know why? Is it due to some optimization done by the compiler?
Thank you all for your answers that far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the machine code generated, and even then the CPU can do plenty of tricks to optimise it further, e.g. it turns the x86 code into RISC-style instructions to actually execute.
A computation takes as little as one cycle, and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles, and for L2, L3, and main memory it takes 11, 40-75, and 200 cycles respectively.
Storing values to avoid a simple calculation is actually slower in many cases. BTW, using division and modulus is quite expensive, and caching these values can be worth it when micro-tuning your code.
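As a small illustration of that last point (my sketch, with made-up values), caching the quotient also gives you the modulus without paying for a second division instruction:

int value = 12345, divisor = 7;
int quotient  = value / divisor;             // one expensive division
int remainder = value - quotient * divisor;  // same result as value % divisor, no second division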
The correct answer should be retrievable with a decompiler (I mean a .class -> .java converter), but my guess is that the compiler might have decided to get rid of refi altogether and store length - 1 in an auxiliary register.
I'm more of a C++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;
Comparing the running times of code does not give exact or guaranteed results
First of all, this is not the way to compare performance; a running-time analysis is needed here. Both pieces of code have the same loop structure, and their theoretical running times are the same. You may get different running times when you run the code, but they mostly differ because of cache hits, I/O times, and thread and process scheduling. There is no guarantee that code always completes in an exact time.
However, there is still a difference in your code; to understand it, you should look into your CPU architecture. I can explain it in basic terms of the x86 architecture.
What happens behind the scenes?
i = refi;
The CPU takes refi and i into its registers from RAM; there are two accesses to RAM if the values are not in the cache, and the value of i will be written back to RAM. However, this always takes a different amount of time according to thread and process scheduling. Furthermore, if the values are in virtual memory it will take even longer.
i = length -1;
The CPU also accesses i and length from RAM or cache; there is the same number of accesses. In addition, there is a subtraction here, which means extra CPU cycles. That is why you would expect this one to take longer to complete. That expectation is reasonable, but the issues mentioned above explain why the measurement can come out the other way.
Summary
As I explained, this is not the way to compare performance. I think there is no real difference between these pieces of code. There are lots of optimizations inside the CPU and in the compiler; you can see the optimized code if you decompile the .class files.
My advice is that it is better to rely on big-O running-time analysis: if you find a better algorithm, that is the best way to optimize your code. If you still have bottlenecks, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code (see the example below).
If you really care about the performance, create an appropriate micro-benchmark.
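For the first point, HotSpot can print the JIT-generated assembly with diagnostic flags (this additionally requires the hsdis disassembler plugin to be installed; Main stands in for whatever class hosts your test):

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Main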
Trying to answer this question: What is the difference between instanceof and Class.isAssignableFrom(...)?
I made a performance test:
class A {}
class B extends A {}

A b = new B();

void execute() {
    boolean test = A.class.isAssignableFrom(b.getClass());
    // boolean test = A.class.isInstance(b);
    // boolean test = b instanceof A;
}
@Test
public void testPerf() {
    // Warm up the code
    for (int i = 0; i < 100; ++i)
        execute();
    // Time it
    int count = 100000;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        execute();
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms.");
}
Which gave me:
A.class.isAssignableFrom(b.getClass()) : 100000 iterations took 15ms
A.class.isInstance(b) : 100000 iterations took 12ms
b instanceof A : 100000 iterations took 6ms
But playing with the number of iterations, I can see the performance is constant. For Integer.MAX_VALUE:
A.class.isAssignableFrom(b.getClass()) : 2147483647 iterations took 15ms
A.class.isInstance(b) : 2147483647 iterations took 12ms
b instanceof A : 2147483647 iterations took 6ms
Thinking it was a compiler optimization (I ran this test with JUnit), I changed it into this :
@Test
public void testPerf() {
    boolean test = false;
    // Warm up the code
    for (int i = 0; i < 100; ++i)
        test |= b instanceof A;
    // Time it
    int count = Integer.MAX_VALUE;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        test |= b instanceof A;
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms. AVG= " + TimeUnit.NANOSECONDS.toMillis(elapsed / count));
    System.out.println(test);
}
But the performance is still "independent" of the number of iterations.
Could someone explain that behavior?
A hundred iterations is not nearly enough for warmup. The default compile threshold is 10000 iterations (a hundred times more), so best go at least a bit over that threshold.
Once the compilation has been triggered, the world is not stopped; the compilation takes place in the background. That means that its effect will start being observable only after a slight delay.
There is ample space for optimization of your test in such a way that the entire loop is collapsed into its final result. That would explain the constant numbers.
Anyway, I always do the benchmarks by having an outer method call the inner method something like 10 times. The inner method does a big number of iterations, say 10,000 or more, as needed to make its runtime rise into at least tens of milliseconds. I don't even bother with nanoTime since if microsecond precision is important to you, it is just a sign of measuring too short a time interval.
When you do it like this, you are making it easy for the JIT to execute a compiled version of the inner method after it was substituted for the interpreted version. Another benefit is that you get assurance that the times of the inner method are stabilizing.
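Concretely, a sketch of that structure applied to your instanceof test might look like this (my code; the class layout and iteration counts are illustrative):

public class InstanceOfBench {

    static class A {}
    static class B extends A {}
    static final A b = new B();

    // inner method: enough iterations to run for at least tens of milliseconds
    static int inner() {
        int hits = 0;
        for (int i = 0; i < 10_000_000; i++) {
            if (b instanceof A) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // outer method calls the inner method ~10 times; early calls run
        // interpreted, later ones should execute the JIT-compiled version
        for (int run = 0; run < 10; run++) {
            long start = System.currentTimeMillis();
            int result = inner();
            System.out.println("run " + run + ": "
                + (System.currentTimeMillis() - start) + " ms (hits=" + result + ")");
        }
    }
}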
If you want to make a real benchmark of a simple function, you should use a micro-benchmarking tool like Caliper. It will be much simpler than trying to make your own benchmark.
The JIT compiler can eliminate loops which don't do anything. This can be triggered after 10,000 iterations.
What I suspect you are timing is how long it takes for the JIT to detect that the loop doesn't do anything and remove it. This will be a little longer than it takes to do 10,000 iterations.
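A small sketch (mine) that makes the effect visible: the inner loop's result is never used, so once the method is compiled the JIT can remove the loop, and the per-run times collapse:

public class DeadLoop {
    public static void main(String[] args) {
        for (int run = 0; run < 20; run++) {
            long start = System.nanoTime();
            int x = 0;
            for (int i = 0; i < 10_000_000; i++) {
                x += i;  // x is never read afterwards, so this is dead code to the JIT
            }
            long elapsed = System.nanoTime() - start;
            System.out.println("run " + run + ": " + elapsed + " ns");
        }
    }
}

The first few runs take milliseconds; once compilation kicks in, the times can drop by orders of magnitude.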
All,
While going through some of the files in the Java API, I noticed many instances where the looping counter is being decremented rather than incremented, e.g. in for and while loops in the String class. Though this might be trivial, is there any significance to decrementing the counter rather than incrementing it?
I've compiled two simple loops with Eclipse 3.6 (Java 6) and looked at the bytecode to see whether there are differences. Here's the code:
for (int i = 2; i >= 0; i--) {}
for (int i = 0; i <= 2; i++) {}
And this is the bytecode:
// 1st for loop - decrement 2 -> 0
 0 iconst_2
 1 istore_1       // i := 2
 2 goto 8
 5 iinc 1 -1      // i += (-1)
 8 iload_1
 9 ifge 5         // if (i >= 0) goto 5

// 2nd for loop - increment 0 -> 2
12 iconst_0
13 istore_1       // i := 0
14 goto 20
17 iinc 1 1       // i += 1
20 iload_1
21 iconst_2
22 if_icmple 17   // if (i <= 2) goto 17
The increment/decrement operation should make no difference: it's either +1 or +(-1). The main difference in this typical(!) example is that in the first example we compare against 0 (ifge), while in the second we compare against a value (if_icmple 2). And the comparison is done in each iteration. So if there is any (slight) performance gain, I think it's because it's less costly to compare with 0 than to compare with other values. So I guess it's not incrementing/decrementing that makes the difference but the stop criterion.
So if you need to do some micro-optimization at the source-code level, try to write your loops in a way that compares with zero; otherwise keep it as readable as possible (incrementing is much easier to understand):
for (int i = 0; i <= 2; i++) {} // readable
for (int i = -2; i <= 0; i++) {} // micro-optimized and "faster" (hopefully)
Addition
Yesterday I did a very basic test: I created a 2000x2000 array and populated the cells based on calculations with the cell indices, once counting from 0->1999 for both rows and cells, and another time backwards from 1999->0. I wasn't surprised that both scenarios had similar performance (185..210 ms on my machine).
So yes, there is a difference at the bytecode level (Eclipse 3.6) but, hey, we're in 2010 now; it doesn't seem to make a significant difference nowadays. So again, using Stephen's words, "don't waste your time" with this kind of optimization. Keep the code readable and understandable.
When in doubt, benchmark.
public class IncDecTest
{
    public static void main(String[] av)
    {
        long up = 0;
        long down = 0;
        long upStart, upStop;
        long downStart, downStop;
        long upStart2, upStop2;
        long downStart2, downStop2;

        upStart = System.currentTimeMillis();
        for (long i = 0; i < 100000000; i++)
        {
            up++;
        }
        upStop = System.currentTimeMillis();

        downStart = System.currentTimeMillis();
        for (long j = 100000000; j > 0; j--)
        {
            down++;
        }
        downStop = System.currentTimeMillis();

        upStart2 = System.currentTimeMillis();
        for (long k = 0; k < 100000000; k++)
        {
            up++;
        }
        upStop2 = System.currentTimeMillis();

        downStart2 = System.currentTimeMillis();
        for (long l = 100000000; l > 0; l--)
        {
            down++;
        }
        downStop2 = System.currentTimeMillis();

        assert (up == down);

        System.out.println("Up: " + (upStop - upStart));
        System.out.println("Down: " + (downStop - downStart));
        System.out.println("Up2: " + (upStop2 - upStart2));
        System.out.println("Down2: " + (downStop2 - downStart2));
    }
}
With the following JVM:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
Has the following output (ran it multiple times to make sure the JVM was loaded and to make sure the numbers settled down a little).
$ java -ea IncDecTest
Up: 86
Down: 84
Up2: 83
Down2: 84
These all come extremely close to one another and I have a feeling that any discrepancy is a fault of the JVM loading some code at some points and not others, or a background task happening, or simply falling over and getting rounded down on a millisecond boundary.
While at one point (early days of Java) there might have been some performance voodoo to be had, it seems to me that that is no longer the case.
Feel free to try running/modifying the code to see for yourself.
It is possible that this is a result of Sun engineers doing a whole lot of profiling and micro-optimization, and those examples that you found are the result of that. It is also possible that they are the result of Sun engineers "optimizing" based on deep knowledge of the JIT compilers ... or based on shallow / incorrect knowledge / voodoo thinking.
It is possible that these sequences:
are faster than the increment loops,
are no faster or slower than increment loops, or
are slower than increment loops for the latest JVMs, and the code is no longer optimal.
Either way, you should not emulate this practice in your code, unless thorough profiling with the latest JVMs demonstrates that:
your code really will benefit from optimization, and
the decrementing loop really is faster than the incrementing loop for your particular application.
And even then, you may find that your carefully hand optimized code is less than optimal on other platforms ... and that you need to repeat the process all over again.
These days, it is generally recognized that the best first strategy is to write simple code and leave optimization to the JIT compiler. Writing complicated code (such as loops that run in reverse) may actually foil the JIT compiler's attempts to optimize.