I am running the simple program below. I know this is not the best way to measure performance, but the results surprised me, so I wanted to ask about them here.
public class findFirstTest {
    public static void main(String[] args) {
        for (int q = 0; q < 10; q++) {
            long start2 = System.currentTimeMillis();
            int k = 0;
            for (int j = 0; j < 5000000; j++) {
                if (j > 4500000) {
                    k = j;
                    break;
                }
            }
            System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
        }
    }
}
The results, after running the code several times, look like this:
for value 4500001 with time 3
for value 4500001 with time 25 (surprised that the 2nd iteration took 25 ms)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
So I don't understand why the 2nd iteration took 25 ms while the 1st took 3 ms and the later ones 0 ms, and why it is always the 2nd iteration whenever I run the code.
If I move the start/end time measurement outside of the outer for loop, then the result I get is:
for value 4500001 with time 10
In the first iteration, the code runs interpreted.
In the second iteration, the JIT kicks in, slowing it down a bit while it compiles the loop to native code.
In the remaining iterations, the native code runs very fast.
Because your winamp needed to decode another few frames of your mp3 to queue it into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in east Croydon farted and your computer is subscribed to the 'smells from London' twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple hierarchies of caches. Any given core can only interact with its own caches, and because of this, if a core runs an instruction that operates on memory which is not currently in cache, the core stalls for a while: it sends the memory controller a request to load the page of memory you need into a cache line and then waits until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds of thousands of processes and threads, many of them internal to the kernel, pre-empting like there is no tomorrow, and trying to give extra precedence to processes that are time sensitive, such as the aforementioned winamp, which must get a chance to decode some more mp3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: on ye olde Windows you just couldn't get this done, which is why ye olde winamp was a magical marvel of engineering, more or less hacking into Windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself, which is doing all sorts of borderline voodoo magic: it has both a HotSpot engine (which does bookkeeping on your code so that it can eventually conclude that it is worth spending considerable CPU resources to analyse the heck out of a method and rewrite it in optimized machine code, because that method seems to be taking a lot of CPU time) and a garbage collector.
The solution is to forget entirely about measuring time with mere banalities like currentTimeMillis or nanoTime and a few handwritten loops. It's just way too complicated for that to actually work.
No. Use JMH.
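For reference, a minimal JMH sketch of the loop from the question might look like the following. This is an illustration, not the original poster's code, and it assumes the JMH annotations and runtime are on the classpath; JMH then takes care of forking and warm-up iterations so that JIT effects are measured rather than mistaken for noise.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class FindFirstBenchmark {

    @Benchmark
    public int findFirstAboveThreshold() {
        // Same loop as in the question; returning k keeps the JIT from eliminating it.
        int k = 0;
        for (int j = 0; j < 5000000; j++) {
            if (j > 4500000) {
                k = j;
                break;
            }
        }
        return k;
    }
}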
A simple question I've been wondering about: of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be reasonably accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
    if (time - timestamp > 600000L) {
        // Do something
    }
}
Or this (no caching):
for (long timestamp : times) {
    if (System.currentTimeMillis() - timestamp > 600000L) {
        // Do something
    }
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?
A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's a loop invariant, it can even optimize your loop condition to -timestamp < 600000L - time or timestamp > time - 600000L. i.e. the loop condition becomes a trivial compare between the iterator and a loop-invariant constant in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result (see footnote 1). So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, some other simple calculation, or anything else that out-of-order execution can do a good job with, the timing call could account for most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Whether that difference is small or huge depends on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.
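As a rough, self-contained illustration (not from any of the answers, and subject to all the benchmarking caveats discussed earlier), the per-call overhead of System.currentTimeMillis() can be estimated with a nanoTime-based loop; the absolute numbers will vary with JVM, OS, and hardware.

public class CurrentTimeMillisCost {
    public static void main(String[] args) {
        final int calls = 10000000;
        long sink = 0; // accumulate the results so the JIT cannot drop the calls
        // Warm-up pass so we time JIT-compiled code rather than the interpreter.
        for (int i = 0; i < calls; i++) {
            sink += System.currentTimeMillis();
        }
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.currentTimeMillis();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("approx ns per call: " + ((double) elapsed / calls) + " (sink=" + sink + ")");
    }
}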
In your case, a loop-invariant method call like this should be hoisted out of the loop.
On some platforms System.currentTimeMillis() simply reads a value the OS keeps updated in memory, so it is very cheap (a few nanoseconds), whereas System.nanoTime() may have to query a higher-resolution timer and can be considerably slower.
I have the following code:
import java.util.stream.IntStream;

public class BenchMark {

    public static void main(String args[]) {
        doLinear();
        doLinear();
        doLinear();
        doLinear();
    }

    private static void doParallel() {
        IntStream range = IntStream.range(1, 6).parallel();
        long startTime = System.nanoTime();
        int reduce = range
                .reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("parallel: " + reduce + " -- Time: " + (endTime - startTime));
    }

    private static void doLinear() {
        IntStream range = IntStream.range(1, 6);
        long startTime = System.nanoTime();
        int reduce = range
                .reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("linear: " + reduce + " -- Time: " + (endTime - startTime));
    }
}
I was trying to benchmark streams, but I came across this: the execution time steadily decreases when the same function is called again and again.
Output:
linear: 120 -- Time: 57008226
linear: 120 -- Time: 23202
linear: 120 -- Time: 17192
linear: 120 -- Time: 17802
Process finished with exit code 0
There is a huge difference between the first and second execution times.
I'm sure the JVM is doing some tricks behind the scenes, but can anybody help me understand what's really going on there?
Is there any way to avoid this optimization so I can benchmark the true execution time?
I'm sure the JVM is doing some tricks behind the scenes, but can anybody help me understand what's really going on there?
The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.
The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).
After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.
Is there any way to avoid this optimization so I can benchmark the true execution time?
You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.
For the sake of exploration you can see how your code behaves with JIT compilation disabled: pass -Xint to the java command. There are many more flags which disable various aspects of optimization.
UPDATE: Refer to Marko's answer for an explanation of the initial latency due to lambda linkage.
The higher execution time for the first call is probably a result of the JIT effect. In short, JIT compilation of the bytecode into native machine code occurs the first time your method is called. The JVM then attempts further optimization by identifying frequently called ("hot") methods and regenerating their code for higher performance.
Is there any way to avoid this optimization so I can benchmark the true execution time?
You can certainly account for the JVM's initial warm-up by excluding the first few results. Then increase the number of repeated calls to your method, in a loop of tens of thousands of iterations, and average the results (see the sketch below).
There are a few more options that you might want to consider adding to your execution to help reduce noise, as discussed in this post. There are also some good tips in this post, too.
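As a sketch of the warm-up-and-average approach just described (an illustration, not code from any of the posts; the iteration counts are arbitrary choices):

public class CrudeBenchmark {
    // Warm up first so the JIT has compiled the task, then average over many measured runs.
    public static void run(Runnable task) {
        final int warmup = 10000;
        final int measured = 50000;
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            task.run();
        }
        System.out.println("average ns per call: " + (System.nanoTime() - start) / measured);
    }
}

It could be invoked as, for example, CrudeBenchmark.run(() -> doLinear()) from inside the BenchMark class (note that doLinear() itself prints, so in practice you would benchmark a quieter task). A framework such as JMH still does all of this, plus forking and dead-code protection, far more rigorously.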
true execution time
There's no such thing as "true execution time". If you need to solve this task only once, the true execution time would be the time of the first run (along with the time to start up the JVM itself). In general, the time spent executing a given piece of code depends on many things:
Whether this piece of code is interpreted, or JIT-compiled by the C1 or C2 compiler. Note that there are not just three options: if you call one method from another, one of them might be interpreted while the other is C2-compiled.
For the C2 compiler: how this code was executed previously, i.e. what is in the branch and type profiles. A polluted type profile can drastically reduce performance.
Garbage collector state: whether it interrupts the execution or not.
Compilation queue: whether the JIT compiler is compiling other code at the same time (which may slow down the execution of the current code).
The memory layout: how objects are located in memory and how many cache lines must be loaded to access all the necessary data.
The CPU branch predictor state, which depends on previous code executions and may increase or decrease the number of branch mispredictions.
And so on and so forth. So even if you measure something in an isolated benchmark, this does not mean the speed of the same code in production will be the same; it may differ by an order of magnitude. So before measuring something, ask yourself why you want to measure it. Usually you don't care how long some part of your program takes to execute; what you usually care about is the latency and throughput of the whole program. So profile the whole program and optimize the slowest parts. The thing you are measuring is probably not the slowest.
The Java VM loads a class into memory the first time the class is used.
So the difference between the 1st and 2nd runs may be caused by class loading.
I have a requirement to capture the execution time of some code over a number of iterations. I've decided to use a Map<Integer,Long> to capture this data, where Integer (the key) is the iteration number and Long (the value) is the time consumed by that iteration in milliseconds.
I've written the Java code below to compute the time taken for each iteration. I want to verify that the time taken by all iterations is zero before plugging in the actual code. Surprisingly, the code behaves differently on every execution.
Sometimes I get the desired output (zero milliseconds for all iterations), but at other times I get positive and even negative values for some random iterations.
I've tried replacing System.currentTimeMillis() with the following:
new java.util.Date().getTime();
System.nanoTime();
org.apache.commons.lang.time.StopWatch
but still no luck.
Any suggestions as to why some iterations take additional time, and how to eliminate it?
package com.stackoverflow.programmer;

import java.util.HashMap;
import java.util.Map;

public class TestTimeConsumption {

    public static void main(String[] args) {
        Integer totalIterations = 100000;
        Integer nonZeroMilliSecondsCounter = 0;
        Map<Integer, Long> timeTakenMap = new HashMap<>();
        for (Integer iteration = 1; iteration <= totalIterations; iteration++) {
            timeTakenMap.put(iteration, getTimeConsumed(iteration));
            if (timeTakenMap.get(iteration) != 0) {
                nonZeroMilliSecondsCounter++;
                System.out.format("Iteration %6d has taken %d millisecond(s).\n", iteration,
                        timeTakenMap.get(iteration));
            }
        }
        System.out.format("Total non zero entries : %d", nonZeroMilliSecondsCounter);
    }

    private static Long getTimeConsumed(Integer iteration) {
        long startTime = System.currentTimeMillis();
        // Execute code for which execution time needs to be captured
        long endTime = System.currentTimeMillis();
        return (endTime - startTime);
    }
}
Here's the sample output from 5 different executions of the same code:
Execution #1 (NOT OK)
Iteration 42970 has taken 1 millisecond(s).
Total non zero entries : 1
Execution #2 (OK)
Total non zero entries : 0
Execution #3 (OK)
Total non zero entries : 0
Execution #4 (NOT OK)
Iteration 65769 has taken -1 millisecond(s).
Total non zero entries : 1
Execution #5 (NOT OK)
Iteration 424 has taken 1 millisecond(s).
Iteration 33053 has taken 1 millisecond(s).
Iteration 76755 has taken -1 millisecond(s).
Total non zero entries : 3
I am looking for a Java based solution that ensures that all
iterations consume zero milliseconds consistently. I prefer to
accomplish this using pure Java code without using a profiler.
Note: I was also able to accomplish this through C code.
Your HashMap performance may be dropping if it is resizing. The default capacity is 16, which you are exceeding. If you know the expected capacity up front, create the HashMap with the appropriate size, taking into account the default load factor of 0.75.
If you rerun iterations without creating a new map, and the Integer key does not start again from zero, you will need to size the map to take into account the total of all possible iterations.
int capacity = (int) ((100000/0.75)+1);
Map<Integer, Long> timeTakenMap = new HashMap<>(capacity);
As you are starting to learn here, writing microbenchmarks in Java is not as easy as one would first assume. Everybody gets bitten at some point, even the hardened performance experts who have been doing it for years.
A lot is going on within the JVM and the OS that skews the results, such as GC, HotSpot's on-the-fly optimisations, recompilations, clock corrections, thread contention/scheduling, memory contention and cache misses, to name just a few. And sadly these skews are not consistent, and they can very easily dominate a microbenchmark.
To answer your immediate question of why the timings can sometimes go negative: it is because currentTimeMillis is designed to capture wall-clock time and not elapsed time. No wall clock is accurate on a computer, and there are times when the clock will be adjusted, very possibly backwards. More detail on Java's clocks can be read in the Oracle blog post Inside the Oracle Hotspot VM clocks.
Further details on, and arguments for, nanoTime versus currentTimeMillis can be read here.
Before continuing with your own benchmark, I strongly recommend that you read up on how to write a correct microbenchmark in Java. The quick synopsis is to 1) warm up the JVM before taking results, 2) jump through hoops to avoid dead code elimination, 3) ensure that nothing else is running on the same machine, but accept that there will be thread scheduling going on (you may even want to pin threads to cores, depending on how far you want to take this), and 4) use a framework specifically designed for microbenchmarking, such as JMH, or for quick lightweight spikes, JUnitMosaic gives good results.
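To illustrate point 2 above (avoiding dead code elimination), JMH benchmarks typically either return the computed value or sink it into a Blackhole. A minimal sketch, assuming JMH is on the classpath:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeExample {

    @Benchmark
    public void sunkWork(Blackhole bh) {
        long sum = 0;
        for (int i = 0; i < 1000; i++) {
            sum += i;
        }
        // Without consuming (or returning) sum, the JIT is free to remove the loop entirely.
        bh.consume(sum);
    }
}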
I'm not sure if I understand your question.
You're trying to execute a certain set of statements S, and expect the execution time to be zero. You then test this premise by executing it a number of times and verifying the result.
That is a strange expectation to have: anything consumes some time, possibly even a noticeable amount. Hence, although the test could succeed, that does not prove that no time was used, since your program is save_time(); execute(S); compare_time(). Even if execute(S) does nothing, your timing is discrete, so it is possible that the 'tick' of your wall clock happens to fall between save_time and compare_time, making some time appear to have visibly passed.
As such, I'd expect your C program to behave exactly the same. Have you run it multiple times? What happens when you increase the iterations into the millions? If it still does not occur, then apparently your C compiler has optimized the code in such a way that no time is measured, and apparently, Java doesn't.
Or am I understanding you wrong?
You hinted at it yourself: System.currentTimeMillis() is the way to go in this case.
There is no guarantee on any system that incrementing an integer variable corresponds to a millisecond, or even to a fixed number of CPU cycles.
You should take System.currentTimeMillis() readings and calculate the elapsed time.
Example:
public static void main(String[] args) {
    long start = System.currentTimeMillis();
    doFoo();
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("Time: " + elapsed);
}
I'm also not sure I understand exactly: you're trying to execute certain code and capture the execution time of each iteration.
If I've understood correctly, then I would suggest using System.nanoTime() instead of System.currentTimeMillis(), because if your block of statements does very little work, you will always get zero at millisecond resolution.
A simple example could be:
public static void main(String[] args) {
    long start = System.nanoTime();
    // do your stuff here
    long elapsed = System.nanoTime() - start;
    System.out.println("Time taken: " + elapsed);
}
Functionally there is not much difference between System.nanoTime() and System.currentTimeMillis(); it is just a question of how precise a result you need. At millisecond resolution you may well get zero if the statements in each iteration do very little work.
Hi guys, I'm trying to make a load generator, and my goal is to compare how much of my system's resources are consumed when spawning Erlang processes versus spawning Java threads. I do this by having the program count to 1000000000 ten times. Java takes roughly 35 seconds to finish the whole thing with 10 threads; Erlang takes ages with 10 processes, and I grew impatient with it after it had spent over 4 minutes counting. If I just make Erlang and Java count to 1000000000 without spawning threads/processes, Erlang takes 1 minute and 32 seconds and Java takes a good 3 or so seconds. I know Erlang is not made for crunching numbers, but that much of a difference is alarming. Why is there such a big difference? Both use 100% of my CPU but there is no spike in RAM. I am not sure what other methods can be used to make this comparison; I am open to any suggestions as well.
Here is the code for both versions:
-module(loop).
-compile(export_all).

start(NumberOfProcesses) ->
    loop(0, NumberOfProcesses).

%% Processes to spawn
loop(A, NumberOfProcesses) ->
    if A < NumberOfProcesses ->
           spawn(loop, outerCount, [0]),
           loop(A+1, NumberOfProcesses);
       true -> ok
    end.

%% outer loop
outerCount(A) ->
    if A < 10 ->
           innerCount(0),
           outerCount(A + 1);
       true -> ok
    end.

%% inner loop
innerCount(A) ->
    if A < 1000000000 ->
           innerCount(A+1);
       true -> ok
    end.
And the Java:
import java.util.Scanner;

class Loop implements Runnable
{
    public static void main(String[] args)
    {
        System.out.println("Input number of processes");
        Scanner scan = new Scanner(System.in);
        String theNumber = scan.nextLine();
        for (int t = 0; t < Integer.parseInt(theNumber); t++)
        {
            new Thread(new Loop()).start();
        }
    }

    public void run()
    {
        int i;
        for (i = 0; i < 10; i++)
        {
            for (int j = 0; j < 1000000000; j++);
        }
    }
}
Are you running a 32- or 64-bit version of Erlang? If it's 32 bit, then the inner loop limit 1000000000 won't fit in a single-word fixnum (max 28 bits incl. sign), and the loop will start to do bignum arithmetic on the heap which is way way more expensive than just incrementing a word and looping (it will also cause garbage collection to happen now and then, to get rid of old unused numbers from the heap). Changing the outer loop from 10 to 1000 and removing 2 zeros correspondingly from the inner loop should make it use fixnum arithmetic only even on a 32-bit BEAM.
Then, it's also a question of whether the Java version is actually doing any work at all, or if the loop gets optimized away to a no-op at some point. (The Erlang compiler doesn't do that sort of trick - at least not yet.)
RichardC's answer gives some clues to understanding the difference in execution time. I will also add that if your Java code is JIT-compiled, it may benefit a lot from the processor's branch prediction, and thus make better use of the cache memories.
But more important, in my opinion, is that you are not choosing the right ratio of processes to processing to evaluate the cost of process spawning.
The test uses 10 processes that each do a significant amount of work. I would have chosen a test where many processes are spawned (some thousands? I don't know how many threads the JVM can manage), each process doing very little. For example, the code below spawns at each step twice the number of processes and waits for the deepest processes to send back the message done. With a depth of 17, which means 262143 processes in total and 131072 returned messages, it takes less than 0.5 s on my very slow PC; that is less than 2 µs per process (of course both cores, with both hardware threads, should be used).
-module(cascade).
-compile([export_all]).

test() ->
    timer:tc(?MODULE, start, []).

start() ->
    spawn(?MODULE, child, [self(), 17]),
    loop(1024*128).

loop(0) -> done;
loop(N) ->
    receive
        done -> loop(N-1)
    end.

child(P, 0) -> P ! done;
child(P, N) ->
    spawn(?MODULE, child, [P, N-1]),
    spawn(?MODULE, child, [P, N-1]).
There are a few problems here.
I don't know how you can evaluate what the Java compiler is doing, but I'd wager it's optimizing the loop out of existence. I think you'd have to have the loop do something meaningful to make any sort of comparison.
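For instance, a sketch of the Java side with the loop doing work that cannot be discarded (an illustration, not the original poster's code):

class CountingLoop implements Runnable {
    // Writing to a volatile sink means the JIT cannot prove the result is unused,
    // so the loops must actually execute.
    static volatile long sink;

    public void run() {
        long sum = 0;
        for (int i = 0; i < 10; i++) {
            for (int j = 0; j < 1000000000; j++) {
                sum += j;
            }
        }
        sink = sum;
    }
}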
More importantly, the Erlang code is not doing what you think it's doing, as best as I can tell. It appears that each process is counting up to 1000000000, and then doing it again for a total of 10 times.
Perhaps worse, your functions are not tail recursive, so your functions keep accumulating in memory waiting for the last one to execute. (Edit: I may be wrong about that. Unaccustomed to the if statement.)
Here's Erlang that does what you want it to do. It's still very slow.
-module(realloop).
-compile(export_all).

start(N) ->
    loop(0, N).

loop(N, N) ->
    io:format("Spawned ~B processes~n", [N]);
loop(A, N) ->
    spawn(realloop, count, [0, 1000000000]),
    loop(A+1, N).

count(Upper, Upper) ->
    io:format("Reached ~B~n", [Upper]);
count(Lower, Upper) ->
    count(Lower+1, Upper).
I made a toy program to test Java's concurrency performance. I put it here:
https://docs.google.com/open?id=0B4e6u_s5iHT6MTNkZGM5ODQtNjZmYi00NTMwLWJlMjUtYzViOWZlMDM5NGVi
It accepts an integer as the argument that indicates how many threads to use. The program just finds prime numbers in a range. A generic version is obtained by commenting out lines 44~53, and it shows nearly perfect scalability.
However, when I uncomment lines 44~53, which do a simple computation locally, and set the variable s to a big enough value, the scalability disappears.
My question is whether my toy program uses shared data that could degrade concurrent performance, and how to explain the lost scalability (I suspect low-level overhead, such as garbage collection, causes it). Is there any solution for problems like this?
The code in question is:
int s = 32343;
ArrayList<Integer> al = new ArrayList<Integer>(s);
for (int c = 0; c < s; c++) {
    al.add(c);
}
Iterator<Integer> it = al.iterator();
if (it.hasNext()) {
    int c = it.next();
    c = c++;
}
Of course this will degrade performance if you increase the value of s, since s controls how many things you put into the list. But that has very little to do with concurrency or scalability. If you write code telling the computer to waste time doing thousands or millions of throw-away computations, then of course your performance will degrade.
In more technical terms, the time-complexity of this section of code is O(2n) (it takes n operations to build the list, and then n operations to iterate it and increment each value), where n is equal to s. So the bigger you make s, the longer it will take to execute this code.
In terms of why this would seem to make the benefits of concurrency smaller, have you considered the memory implications as s becomes larger? For instance, are you sure the Java heap is large enough to hold everything in memory without anything getting swapped out to disk? And even if nothing is getting swapped out, by making the length of the ArrayList larger you are giving the garbage collector more work to do when it runs (and possibly increasing the frequency at which it runs). Note that depending upon the implementation, the garbage collector may be pausing all of your threads each time it runs.
I wonder, if you allocate a single ArrayList instance per thread, at the time the thread is created, and then reuse that in the call to isPrime() instead of creating a new list each time, does that improve things?
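Something along these lines, as a hypothetical sketch (the original program is only available via the link above, so the class and method names here are assumptions, not the poster's actual code):

import java.util.ArrayList;

class Worker implements Runnable {
    // Allocated once per worker and reused on every call, instead of building a
    // fresh ArrayList inside each isPrime() invocation.
    private final ArrayList<Integer> scratch = new ArrayList<>(32343);
    private final int from, to;

    Worker(int from, int to) { this.from = from; this.to = to; }

    public void run() {
        for (int n = from; n < to; n++) {
            scratch.clear(); // the list object and its backing array are reused
            for (int c = 0; c < 32343; c++) {
                scratch.add(c); // note: Integer autoboxing still allocates for values above 127
            }
            // ... call the real isPrime(n) here ...
        }
    }
}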
Edit: Here's a fixed up version: http://pastebin.com/6vR7Uhez
It gives the following output on my machine:
------------------start------------------
1 threads' runtimes:
1 3766.0
maximum: 3766.0
main time: 3766.0
------------------end------------------
------------------start------------------
2 threads' runtimes:
1 897.0
2 2483.0
maximum: 2483.0
main time: 2483.0
------------------end------------------
------------------start------------------
4 threads' runtimes:
1 576.0
2 1473.0
3 568.0
4 1569.0
maximum: 1569.0
main time: 1569.0
------------------end------------------
------------------start------------------
8 threads' runtimes:
1 389.0
2 965.0
3 396.0
4 956.0
5 398.0
6 976.0
7 386.0
8 933.0
maximum: 976.0
main time: 978.0
------------------end------------------
...which shows nearly linear scaling as the number of threads is ramped up. The problems that I fixed were a combination of points raised above and in John Vint's (now deleted) answer, as well as incorrect/unnecessary use of ConcurrentLinkedQueue structures and some questionable timing logic.
If we enable GC logging and profile both versions, we can see that the original version spends about 10x as much time running garbage-collection than the modified version:
Original: [ParNew: 17401K->750K(19136K), 0.0040010 secs] 38915K->22264K(172188K), 0.0040227 secs]
Modified: [ParNew: 17024K->0K(19136K), 0.0002879 secs] 28180K->11156K(83008K), 0.0003094 secs]
Which implies to me that, between the constant list allocations and the Integer autoboxing, the original implementation was simply churning through too many objects, which placed too much load on the GC and degraded the performance of your threads to the point where there was no benefit (or even a negative benefit) to creating more threads.
So all this says to me is that if you want to get good scaling out of concurrency in Java, whether your task is large or small, you have to pay attention to how you are using memory, be aware of potentially hidden pitfalls and inefficiencies, and optimize away the inefficient bits.
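For example (an illustrative sketch, separate from the fixed version linked above), replacing the boxed-list busy-work with a reused primitive array removes both the per-call allocation and the Integer autoboxing:

class PrimitiveBusyWork {
    private final int[] scratch = new int[32343]; // allocated once, no boxing

    void doBusyWork() {
        for (int c = 0; c < scratch.length; c++) {
            scratch[c] = c; // plain int stores: nothing for the GC to collect
        }
    }
}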