Hi guys I'm trying to make a load generator and my goal is to compare how much of my system's resources are consumed when spawning Erlang processes as compared to spawning threads (Java). I am doing this by having the program count to 1000000000 10 times. Java takes roughly 35 seconds to finish the whole process with 10 threads created, Erlang takes ages with 10 processes, I grew impatient with it because it spent over 4 minutes counting. If I just make Erlang and Java count to 1000000000 without spawning threads/processes, Erlang takes 1 minute and 32 seconds and Java takes a good 3 or so seconds. I know Erlang is not made for crunching numbers but that much of a difference is alarming, why is there such a big difference ? Both use my CPU to 100% but no spike in RAM. I am not sure what other methods can be used to make this comparison, I am open to any suggestions as well.
here is the code for both versions
-module(loop).
-compile(export_all).
start(NumberOfProcesses) ->
loop(0, NumberOfProcesses).
%%Processes to spawn
loop(A, NumberOfProcesses) ->
if A < NumberOfProcesses ->
spawn(loop, outerCount, [0]),
loop(A+1, NumberOfProcesses);
true -> ok
end.
%%outer loop
outerCount(A) ->
if A < 10 ->
innerCount(0),
outerCount(A + 1);
true -> ok
end.
%%inner loop
innerCount(A) ->
if A < 1000000000 ->
innerCount(A+1);
true -> ok
end.
and java
import java.util.Scanner;
class Loop implements Runnable
{
public static void main(String[] args)
{
System.out.println("Input number of processes");
Scanner scan = new Scanner(System.in);
String theNumber = scan.nextLine();
for (int t = 0; t < Integer.parseInt(theNumber); t++)
{
new Thread(new Loop()).start();
}
}
public void run()
{
int i;
for (i = 0; i < 10; i++)
{
for (int j = 0; j < 1000000000; j++);
}
}
}
Are you running a 32- or 64-bit version of Erlang? If it's 32 bit, then the inner loop limit 1000000000 won't fit in a single-word fixnum (max 28 bits incl. sign), and the loop will start to do bignum arithmetic on the heap which is way way more expensive than just incrementing a word and looping (it will also cause garbage collection to happen now and then, to get rid of old unused numbers from the heap). Changing the outer loop from 10 to 1000 and removing 2 zeros correspondingly from the inner loop should make it use fixnum arithmetic only even on a 32-bit BEAM.
Then, it's also a question of whether the Java version is actually doing any work at all, or if the loop gets optimized away to a no-op at some point. (The Erlang compiler doesn't do that sort of trick - at least not yet.)
RichardC answer gives some clue to understand the difference of execution time. I will add also that if your java code is compiled, it may benefits a lot from the predictive branching of the microprocessor, and thus make a better use of the cache memories.
But the more important in my opinion is that you are not choosing the right ratio of Process/processing to evaluate the cost of process spawning.
The test use 10 processes that does some significant work. I would have chosen a test where many processes are spawned (some thousands? I don't know how much threads the JVM can manage) each process doing very few things, for example this code which spawn at each step twice the number of process and wait for the deepest processes to send back the message done. With a depth of 17, which means 262143 processes in total and 131072 returned messages, it takes less than 0.5 s on my very slow PC, that is less than 2µs per process (of course the dual core dual thread should be used)
-module (cascade).
-compile([export_all]).
test() ->
timer:tc(?MODULE,start,[]).
start() ->
spawn(?MODULE,child,[self(),17]),
loop(1024*128).
loop(0) -> done;
loop(N) ->
receive
done -> loop(N-1)
end.
child(P,0) -> P ! done;
child(P,N) ->
spawn(?MODULE,child,[P,N-1]),
spawn(?MODULE,child,[P,N-1]).
There are a few problems here.
I don't know how you can evaluate what the Java compiler is doing, but I'd wager it's optimizing the loop out of existence. I think you'd have to have the loop do something meaningful to make any sort of comparison.
More importantly, the Erlang code is not doing what you think it's doing, as best as I can tell. It appears that each process is counting up to 1000000000, and then doing it again for a total of 10 times.
Perhaps worse, your functions are not tail recursive, so your functions keep accumulating in memory waiting for the last one to execute. (Edit: I may be wrong about that. Unaccustomed to the if statement.)
Here's Erlang that does what you want it to do. It's still very slow.
-module(realloop).
-compile(export_all).
start(N) ->
loop(0, N).
loop(N, N) ->
io:format("Spawned ~B processes~n", [N]);
loop(A, N) ->
spawn(realloop, count, [0, 1000000000]),
loop(A+1, N).
count(Upper, Upper) ->
io:format("Reached ~B~n", [Upper]);
count(Lower, Upper) ->
count(Lower+1, Upper).
Related
I am running below simple program , I know this is not best way to measure performance but the results are surprising to me , hence wanted to post question here.
public class findFirstTest {
public static void main(String[] args) {
for(int q=0;q<10;q++) {
long start2 = System.currentTimeMillis();
int k = 0;
for (int j = 0; j < 5000000; j++) {
if (j > 4500000) {
k = j;
break;
}
}
System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
}
}
}
results are like below after multiple times running code.
for value 4500001 with time 3
for value 4500001 with time 25 ( surprised as it took 25 ms in 2nd iteration)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
so I am not understanding why 2nd iteration took 25ms but 1st 3ms and later 0 ms and also why always for 2nd iteration when I am running code.
if I move start and endtime printing outside of outer forloop then results I am having is like
for value 4500001 with time 10
In first iteration, the code is running interpreted.
In second iteration, JIT kicks in, slowing it down a bit while it compiles to native code.
In remaining iterations, native code runs very fast.
Because your winamp needed to decode another few frames of your mp3 to queue it into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in east Croydon farted and your computer is subscribed to the 'smells from London' twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple hierarchies of caches. Any given core can only interact with one of its caches, and because of this, if a core runs an instruction that operates on memory which is not currently in cache, then the core will shut down for a while: It sends to the memory controller a request to load the page of memory with the memory you need to access into a given cachepage, and will then wait until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds of thousands of processes and threads, many of them internal to the kernel, per-empting like there is no tomorrow, and trying to give extra precedence to processes that are time sensitive, such as the aforementioned winamp which must get a chance to decode some more mp3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: On ye olde windows you just couldn't get this done which is why ye olde winamp was a magical marvel of engineering, more or less hacking into windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself which is doing all sorts of borderline voodoo magic, as it has both a hotspot engine (which is doing bookkeeping on your code so that it can eventually conclude that it is worth spending considerable CPU resources to analyse the heck out of a method to rewrite it in optimized machinecode because that method seems to be taking a lot of CPU time), and a garbage collector.
The solution is to forget entirely about trying to measure time using such mere banalities as measuring currentTimeMillis or nanoTime and writing a few loops. It's just way too complicated for that to actually work.
No. Use JMH.
I am trying to learn concurrency in Java, but whatever I do, 2 threads run in serial, not parallel, so I am not able to replicate common concurrency issues explained in tutorials (like thread interference and memory consistency errors). Sample code:
public class Synchronization {
static int v;
public static void main(String[] args) {
Runnable r0 = () -> {
for (int i = 0; i < 10; i++) {
Synchronization.v++;
System.out.println(v);
}
};
Runnable r1 = () -> {
for (int i = 0; i < 10; i++) {
Synchronization.v--;
System.out.println(v);
}
};
Thread t0 = new Thread(r0);
Thread t1 = new Thread(r1);
t0.start();
t1.start();
}
}
This always give me a result starting from 1 and ending with 0 (whatever the loop length is). For example, the code above gives me every time:
1
2
3
4
5
6
7
8
9
10
9
8
7
6
5
4
3
2
1
0
Sometimes, the second thread starts first and the results are the same but negative, so it is still running in serial.
Tried in both Intellij and Eclipse with identical results. CPU has 2 cores if it matters.
UPDATE: it finally became reproducible with huge loops (starting from 1_000_000), though still not every time and just with small amount of final discrepancy. Also seems like making operations in loops "heavier", like printing thread name makes it more reproducible as well. Manually adding sleep to thread also works, but it makes experiment less cleaner, so to say. The reason doesn't seems to be that first loop finishes before the second starts, because I see both loops printing to console while continuing operating and still giving me 0 at the end. The reasons seems more like a thread race for same variable. I will dig deeper into that, thanks.
Seems like first started thread just never give a chance to second in Thread Race to take a variable/second one just never have a time to even start (couldn't say for sure), so the second almost* always will be waiting until first loop will be finished.
Some heavy operation will mix the result:
TimeUnit.MILLISECONDS.sleep(100);
*it is not always true, but you are was lucky in your tests
Starting a thread is heavyweight operation, meaning that it will take some time to perform. Due that fact, by the time you start second thread, first is finished.
The reasoning why sometimes it is in "revert order" is due how thread scheduler works. By the specs there are not guarantees about thread execution order - having that in mind, we know that it is possible for second thread to run first (and finish)
Increase iteration count to something meaningful like 10000 and see what will happen then.
This is called lucky timing as per Brian Goetz (Author of Java Concurrency In Practice). Since there is no synchronization to the static variable v it is clear that this class is not thread-safe.
I've a requirement to capture the execution time of some code in iterations. I've decided to use a Map<Integer,Long> for capturing this data where Integer(key) is the iteration number and Long(value) is the time consumed by that iteration in milliseconds.
I've written the below java code to compute the time taken for each iteration. I want to ensure that the time taken by all iterations is zero before invoking actual code. Surprisingly, the below code behaves differently for every execution.
Sometimes, I get the desired output(zero millisecond for all iterations), but at times I do get positive and even negative values for some random iterations.
I've tried replacing System.currentTimeMillis(); with below code:
new java.util.Date().getTime();
System.nanoTime();
org.apache.commons.lang.time.StopWatch
but still no luck.
Any suggestions as why some iterations take additional time and how to eliminate it?
package com.stackoverflow.programmer;
import java.util.HashMap;
import java.util.Map;
public class TestTimeConsumption {
public static void main(String[] args) {
Integer totalIterations = 100000;
Integer nonZeroMilliSecondsCounter = 0;
Map<Integer, Long> timeTakenMap = new HashMap<>();
for (Integer iteration = 1; iteration <= totalIterations; iteration++) {
timeTakenMap.put(iteration, getTimeConsumed(iteration));
if (timeTakenMap.get(iteration) != 0) {
nonZeroMilliSecondsCounter++;
System.out.format("Iteration %6d has taken %d millisecond(s).\n", iteration,
timeTakenMap.get(iteration));
}
}
System.out.format("Total non zero entries : %d", nonZeroMilliSecondsCounter);
}
private static Long getTimeConsumed(Integer iteration) {
long startTime = System.currentTimeMillis();
// Execute code for which execution time needs to be captured
long endTime = System.currentTimeMillis();
return (endTime - startTime);
}
}
Here's the sample output from 5 different executions of the same code:
Execution #1 (NOT OK)
Iteration 42970 has taken 1 millisecond(s).
Total non zero entries : 1
Execution #2 (OK)
Total non zero entries : 0
Execution #3 (OK)
Total non zero entries : 0
Execution #4 (NOT OK)
Iteration 65769 has taken -1 millisecond(s).
Total non zero entries : 1
Execution #5 (NOT OK)
Iteration 424 has taken 1 millisecond(s).
Iteration 33053 has taken 1 millisecond(s).
Iteration 76755 has taken -1 millisecond(s).
Total non zero entries : 3
I am looking for a Java based solution that ensures that all
iterations consume zero milliseconds consistently. I prefer to
accomplish this using pure Java code without using a profiler.
Note: I was also able to accomplish this through C code.
Your HashMap performance may be dropping if it is resizing. The default capacity is 16 which you are exceeding. If you know the expected capacity up front, create the HashMap with the appropriate size taking into account the default load factor of 0.75
If you rerun iterations without defining a new map and the Integer key does not start again from zero, you will need to resize the map taking into account the total of all possible iterations.
int capacity = (int) ((100000/0.75)+1);
Map<Integer, Long> timeTakenMap = new HashMap<>(capacity);
As you are starting to learn here, writing microbenchmarks in Java is not as easy as one would first assume. Everybody gets bitten at some point, even the hardened performance experts who have been doing it for years.
A lot is going on within the JVM and the OS that skews the results, such as GC, hotspot on the fly optimisations, recompilations, clock corrections, thread contention/scheduling, memory contention and cache misses. To name just a few. And sadly these skews are not consistent, and they can very easily dominate a microbenchmark.
To answer your immediate question of why the timings can some times go negative, it is because currentTimeMillis is designed to capture wall clock time and not elapsed time. No wall clock is accurate on a computer and there are times when the clock will be adjusted.. very possibly backwards. More detail on Java's clocks can be read on the following Oracle Blog Inside the Oracle Hotspot VM clocks.
Further details and support of nanoTime verses currentTimeMillis can be read here.
Before continuing with your own benchmark, I strongly recommend that you read how do I write a currect micro benchmark in java. The quick synopses is to 1) warm up the JVM before taking results, 2) jump through hoops to avoid dead code elimination, 3) ensure that nothing else is running on the same machine but accept that there will be thread scheduling going on.. you may even want to pin threads to cores, depends on how far you want to take this, 4) use a framework specifically designed for microbenchmarking such as JMH or for quick light weight spikes JUnitMosaic gives good results.
I'm not sure if I understand your question.
You're trying to execute a certain set of statements S, and expect the execution time to be zero. You then test this premise by executing it a number of times and verifying the result.
That is a strange expectation to have: anything consumes some time, and possibly even more. Hence, although it would be possible to test successfully, that does not prove that no time has been used, since your program is save_time();execute(S);compare_time(). Even if execute(S) is nothing, your timing is discrete, and as such, it is possible that the 'tick' of your wallclock just happens to happen just between save_time and compare_time, leading to some time having been visibly past.
As such, I'd expect your C program to behave exactly the same. Have you run that multiple times? What happens when you increase the iterations to over millions? If it still does not occur, then apparently your C compiler has optimized the code in such a way that no time is measured, and apparently, Java doesn't.
Or am I understanding you wrong?
You hint it right... System.currentTimeMillis(); is the way to go in this case.
There is no warranty that increasing the value of the integer object i represent either a millisecond or a Cycle-Time in no system...
you should take the System.currentTimeMillis() and calculated the elapsed time
Example:
public static void main(String[] args) {
long lapsedTime = System.currentTimeMillis();
doFoo();
lapsedTime -= System.currentTimeMillis();
System.out.println("Time:" + -lapsedTime);
}
I am also not sure exactly, You're trying to execute a certain code, and try to get the execution for each iteration of execution.
I hope I understand correct, if that so than i would suggest please use
System.nanoTime() instead of System.currentTimeMillis(); because if your statement of block has very small enough you always get Zero in Millisecond.
Simple Ex could be:
public static void main(String[] args) {
long lapsedTime = System.nanoTime();
//do your stuff here.
lapsedTime -= System.nanoTime();
System.out.println("Time Taken" + -lapsedTime);
}
If System.nanoTime() and System.currentTimeMillis(); are nothing much difference. But its just how much accurate result you need and some time difference in millisecond you may get Zero in case if you your set of statement are not more in each iteration.
Greetings noble community,
I want to have the following loop:
for(i = 0; i < MAX; i++)
A[i] = B[i] + C[i];
This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.
(for simplicity, assume MAX is a multiple of 4)
Option 1:
for(i = tid; i < MAX; i += 4)
A[i] = B[i] + C[i];
Option 2:
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
A[i] = B[i] + C[i];
My question is if there's one that is more efficient then the other and why?
The second one is better than the first one. Simple answer: the second one minimize false sharing
Modern CPU doesn't not load byte one by one to the cache. It read once in a batch called cache line. When two threads trying to modify different variables on the same cache line, one must reload the cache after one modify it.
When would this happen?
Basically, elements nearby in memory will be in the same cache line. So, neighbor elements in array will be in the same cache line since array is just a chunk of memory. And foo1 and foo2 might be in the same cache line as well since they are defined close in the same class.
class Foo {
private int foo1;
private int foo2;
}
How bad is false sharing?
I refer Example 6 from the Gallery of Processor Cache Effects
private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
for (int j = 0; j < 100000000; j++)
{
s_counter[position] = s_counter[position] + 3;
}
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
How to detect false sharing?
Linux Perf could be used to detect cache misses and therefore help you analysis such problem.
refer to the analysis from CPU Cache Effects and Linux Perf, use perf to find out L1 cache miss from almost the same code example above:
Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
36,992 L1-dcache-load-misses # 0.01% of all L1-dcache hits [50.51%]
It shows here that the total L1 caches hits will drop from 10,055,747 to 36,992 without false sharing. And the performance overhead is not here, it's in the series of loading L2, L3 cache, loading memory after false sharing.
Is there some good practice in industry?
LMAX Disruptor is a High Performance Inter-Thread Messaging Library and it's the default messaging system for Intra-worker communication in Apache Storm
The underlying data structure is a simple ring buffer. But to make it fast, it use a lot of tricks to reduce false sharing.
For example, it defines the super class RingBufferPad to create pad between elements in RingBuffer:
abstract class RingBufferPad
{
protected long p1, p2, p3, p4, p5, p6, p7;
}
Also, when it allocate memory for the buffer it create pad both in front and in tail so that it won't be affected by data in adjacent memory space:
this.entries = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
source
You probably want to learn more about all the magic tricks. Take a look at one of the author's post: Dissecting the Disruptor: Why it's so fast
There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in #qqibrow's answer; I won't explain that here as there's already a good answer explaining it.
The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple different data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, without the vector unit, the processor has an instruction to do an addition:
A = B + C;
and the corresponding instruction in the vector unit will do multiple additions at the same time:
A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;
(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)
Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i+=4) {
A[i+0] = B[i+0] + C[i+0];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];
}
However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).
It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
I am wondering how to reach a compromise between fast-cancel-responsiveness and performance with my threads which body look similar to this loop:
for(int i=0; i<HUGE_NUMBER; ++i) {
//some easy computation like adding numbers
//which are result of previous iteration of this loop
}
If a computation in loop body is quite easy then adding simple check-reaction to each iteration:
if (Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Cancelled");
}
may slow down execution of the code.
Even if I change the above condition to:
if (i % 100 && Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Cancelled");
}
Then compilator cannot just precompute values of i and check condition only in some specific situations since HUGE_NUMBER is variable and can have different values.
So I'd like to ask if there's any smart way of adding such check to a presented code knowing that:
HUGE_NUMBER is variable and can have different values
loop body consists of some easy-to-compute, but relying on prevoius computations code.
What I want to say is that one iteration of a loop is quite fast, but HUGE_NUMBER of iterations can take a little more time and this is what I want to avoid.
First of all, use Thread.interrupted() instead of Thread.currentThread().isInterrupted() in that case.
You should think about if checking the interruption flag really slows down your calculation too much! One the one hand, if the loop body is VERY simple, even a huge number of iterations (the upper limit is Integer.MAX_VALUE) will run in a few seconds. Even when checking the interruption flag will result in an overhead of 20 or 30%, this will not add very much to the total runtime of your algorithm.
On the other hand, if the loop body is not that simple and so it will run longer, testing the interruption flag will not be a remarkable overhead I think.
Don't do tricks like if (i % 10000 == 0), as this will slow down calculation much more than a 'short' Thread.interrupted().
There is one small trick that you could use - but think twice because it makes your code more complex and less readable:
Whenever you have a loop like that:
for (int i = 0; i < max; i++) {
// loop-body using i
}
you can split up the total range of i into several intervals of size INTERVAL_SIZE:
int start = 0;
while (start < max) {
final int next = Math.min(start + INTERVAL_SIZE, max);
for(int i = start; i < next; i++) {
// loop-body using i
}
start = next;
}
Now you can add your interruption check right before or after the inner loop!
I've done some tests on my system (JDK 7) using the following loop-body
if (i % 2 == 0) x++;
and Integer.MAX_VALUE / 2 iterations. The results are as follows (after warm-up):
Simple loop without any interruption checks: 1,949 ms
Simple loop with check per iteration: 2,219 ms (+14%)
Simple loop with check per 1 million-th iteration using modulo: 3,166 ms (+62%)
Simple loop with check per 1 million-th iteration using bit-mask: 2,653 ms (+36%)
Interval-loop as described above with check in outer loop: 1,972 ms (+1.1%)
So even if the loop-body is as simple as above, the overhead for a per-iteration check is only 14%! So it's recommended to not do any tricks but simply check the interruption flag via Thread.interrupted() in every iteration!
Make your calculation an Iterator.
Although this does not sound terribly useful the benefit here is that you can then quite easily write filter iterators that can be surprisingly flexible. They can be added and removed simply - even through configuration if you wish. There are a number of benefits - try it.
You can then add a filtering Iterator that watches the time and checks for interrupt on a regular basis - or something even more flexible.
You can even add further filtering without compromising the original calculation by interspersing it with brittle status checks.