I'm rewriting an application that deals with objects on the order of 10 million using Java 8, and I noticed that streams can slow the application down by up to 25%. Interestingly, this happens even when my collections are empty, so it must be the constant initialization cost of the stream. To reproduce the problem, consider the following code:
long start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
long end = System.nanoTime();
System.out.println((end - start) / 1_000_000);
start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
end = System.nanoTime();
System.out.println((end - start) / 1_000_000);
The result is as follows: 224 vs. 5 ms.
If I use forEach on the set directly, i.e. set.forEach(), the result is 12 vs 5 ms.
Finally, if I create the closure once outside the loop as
Consumer<? super String> consumer = s -> System.out.println(s);
and use set.forEach(consumer), the result is 7 vs 5 ms.
Of course, the numbers are small and my benchmarking is very primitive, but does this example show that there is an overhead in initializing streams and closures?
(Actually, since the set is empty, the initialization cost of closures should not matter in this case, but nevertheless, should I consider creating closures beforehand instead of on the fly?)
The cost you see here is not associated with the "closures" at all but with the cost of Stream initialization.
Let's take your three sample codes:
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
This one creates a new Stream instance on each loop iteration, at least for the first 10k iterations; see below. After those 10k iterations, the JIT is probably smart enough to see that it's a no-op anyway.
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
Here the JIT kicks in again: empty set? Well, that's a no-op, end of story.
set.forEach(System.out::println);
An Iterator is created for the set, which is always empty? Same story, the JIT kicks in.
The problem with your code, to start with, is that you fail to account for the JIT; for realistic measurements, run at least 10k loop iterations before measuring, since 10k executions is what the JIT requires to kick in (at least, HotSpot behaves this way).
Now, lambdas: they are call sites, and they are linked only once; but the cost of the initial linkage is still there, of course, and your loops include this cost. Run one loop before taking your measurements so that this cost is out of the way.
All in all, this is not a valid microbenchmark. Use Caliper or JMH to really measure the performance.
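As a starting point, a minimal JMH benchmark for this comparison might look like the sketch below (assuming the JMH dependency is on the classpath; the Blackhole parameter keeps the JIT from eliminating the loop body as dead code):

import java.util.Collections;
import java.util.Set;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class EmptySetBenchmark {

    private final Set<String> set = Collections.emptySet();

    @Benchmark
    public void streamForEach(Blackhole bh) {
        // JMH handles warm-up and measurement iterations for us
        set.stream().forEach(bh::consume);
    }

    @Benchmark
    public void enhancedFor(Blackhole bh) {
        for (String s : set) {
            bh.consume(s);
        }
    }
}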
An excellent video on how lambdas work is available here. It is a little old now, and the JVM handles lambdas much better than it did at that time.
If you want to know more, look for literature about invokedynamic.
Related
I was playing around with infinite streams and made this program for benchmarking. Basically, the bigger the number you provide, the faster it will finish. However, I was amazed to find that using a parallel stream resulted in drastically worse performance compared to a sequential stream. Intuitively, one would expect an infinite stream of random numbers to be generated and evaluated much faster in a multi-threaded environment, but this appears not to be the case. Why is this?
final int target = Integer.parseInt(args[0]);
if (target <= 0) {
    System.err.println("Target must be between 1 and 2147483647");
    return;
}
final long startTime, endTime;
startTime = System.currentTimeMillis();
System.out.println(
        IntStream.generate(() -> new Double(Math.random() * 2147483647).intValue())
                //.parallel()
                .filter(i -> i <= target)
                .findFirst()
                .getAsInt()
);
endTime = System.currentTimeMillis();
System.out.println("Execution time: " + (endTime - startTime) + " ms");
I totally agree with the other comments and answers, but your test indeed behaves strangely when the target is very low. On my modest laptop the parallel version is on average about 60x slower for very low targets. This extreme difference cannot be explained by the overhead of parallelization in the stream APIs, so I was amazed too :-). IMO the culprit lies here:
Math.random()
Internally this call relies on a global instance of java.util.Random. In the documentation of Random it is written:
Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs.
So I think the really poor performance of the parallel execution compared to the sequential one is explained by thread contention in Random rather than by any other overhead. If you use ThreadLocalRandom instead (as the documentation recommends), the performance difference will not be so dramatic. Another option would be to implement a more advanced number supplier.
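As a sketch, the question's pipeline rewritten with ThreadLocalRandom might look like this (findAny is used here since a parallel search has no reason to respect encounter order):

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

class ThreadLocalRandomVariant {
    // Each worker thread draws from its own ThreadLocalRandom, so there
    // is no contention on a shared Random instance as with Math.random().
    static int findBelow(int target) {
        return IntStream
                .generate(() -> ThreadLocalRandom.current().nextInt(Integer.MAX_VALUE))
                .parallel()
                .filter(i -> i <= target)
                .findAny()
                .getAsInt();
    }
}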
The cost of handing work to multiple threads is expensive, especially the first time you do it. This cost is fairly fixed, so even if your task is trivial, the overhead is relatively high.
One of the problems you have is that highly inefficient code is a very poor way to determine how well a solution performs. Also, how code runs the first time and how it runs after a few seconds can easily differ by 100x (it can be much more). I suggest starting from an example which is already optimal, and only then attempting to use multiple threads.
e.g.
long start = System.nanoTime();
int value = (int) (Math.random() * (target + 1L));
long time = System.nanoTime() - start;
// don't time the IO as it is sooo much slower
System.out.println(value);
Note: this will not be efficient until the code has warmed up and been compiled, i.e. ignore the first 2-5 seconds this code is run.
Following suggestions from various answers, I think I've fixed it. I'm not sure what the exact bottleneck was but on an i5-4590T the parallel version with the following code performs faster than the sequential variant. For brevity, I've included only the relevant parts of the (refactored) code:
static IntStream getComputation() {
    return IntStream
            .generate(() -> ThreadLocalRandom.current().nextInt(2147483647));
}

static void computeSequential(int target) {
    for (int loop = 0; loop < target; loop++) {
        final int result = getComputation()
                .filter(i -> i <= target)
                .findAny()
                .getAsInt();
        System.out.println(result);
    }
}

static void computeParallel(int target) {
    IntStream.range(0, target)
            .parallel()
            .forEach(loop -> {
                final int result = getComputation()
                        .parallel()
                        .filter(i -> i <= target)
                        .findAny()
                        .getAsInt();
                System.out.println(result);
            });
}
EDIT: I should also note that I put it all in a loop to get longer running times.
Greetings noble community,
I want to have the following loop:
for (i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];
This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.
(for simplicity, assume MAX is a multiple of 4)
Option 1:
for (i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];
Option 2:
for (i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];
My question is: is one of these more efficient than the other, and why?
The second one is better than the first one. Simple answer: the second one minimizes false sharing.
Modern CPUs don't load memory byte by byte into the cache; they read it in batches called cache lines. When two threads modify different variables that live on the same cache line, each must reload the line after the other modifies it.
When would this happen?
Basically, elements that are nearby in memory end up in the same cache line. So neighboring elements of an array will share a cache line, since an array is just a contiguous chunk of memory. And foo1 and foo2 might share a cache line as well, since they are declared close together in the same class:
class Foo {
    private int foo1;
    private int foo2;
}
How bad is false sharing?
I refer to Example 6 from the Gallery of Processor Cache Effects:
private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16, 32, 48 and 64 (those positions are 64 bytes apart, so each counter sits on its own cache line), the operation will be done in 0.28 seconds!
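That example is C#; a rough Java port (my own sketch, not from the article) reproduces the same effect:

// Positions 0,1,2,3 share one cache line; positions 16,32,48,64 are
// 64 bytes apart, so each counter gets its own line.
public class FalseSharingDemo {
    static final int[] counters = new int[1024];

    static void updateCounter(int position) {
        for (int j = 0; j < 100_000_000; j++) {
            counters[position] = counters[position] + 3;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] positions = {0, 1, 2, 3}; // compare with {16, 32, 48, 64}
        Thread[] threads = new Thread[positions.length];
        long start = System.nanoTime();
        for (int t = 0; t < positions.length; t++) {
            final int pos = positions[t];
            threads[t] = new Thread(() -> updateCounter(pos));
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}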
How to detect false sharing?
Linux perf can be used to detect cache misses and thereby help you analyze such problems.
Referring to the analysis from CPU Cache Effects and Linux Perf, perf reports the L1 cache misses for almost the same code example as above:
Performance counter stats for './cache_line_test 0 1 2 3':
    10,055,747 L1-dcache-load-misses   # 1.54% of all L1-dcache hits [51.24%]

Performance counter stats for './cache_line_test 16 32 48 64':
        36,992 L1-dcache-load-misses   # 0.01% of all L1-dcache hits [50.51%]
It shows that the number of L1 cache misses drops from 10,055,747 to 36,992 once false sharing is removed. And the performance overhead is not in the L1 misses themselves; it is in the subsequent loads from L2, L3 cache and main memory that false sharing causes.
Is there some good practice in industry?
LMAX Disruptor is a high-performance inter-thread messaging library, and it's the default messaging system for intra-worker communication in Apache Storm.
The underlying data structure is a simple ring buffer, but to make it fast, it uses a lot of tricks to reduce false sharing.
For example, it defines the superclass RingBufferPad to create padding between elements in the RingBuffer:
abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}
Also, when it allocates memory for the buffer, it creates padding both at the front and at the tail, so that the buffer won't be affected by data in adjacent memory:
this.entries = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
source
You probably want to learn more about all the magic tricks. Take a look at one of the author's post: Dissecting the Disruptor: Why it's so fast
There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in #qqibrow's answer; I won't explain that here as there's already a good answer explaining it.
The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple different data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, without the vector unit, the processor has an instruction to do an addition:
A = B + C;
and the corresponding instruction in the vector unit will do multiple additions at the same time:
A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;
(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)
Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into
for (i = tid*(MAX/4); i < (tid+1)*(MAX/4); i += 4) {
    A[i+0] = B[i+0] + C[i+0];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
}
However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).
It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
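To make the two options concrete as Java, here is a runnable sketch of both partitioning schemes (MAX, NUM_THREADS, and the array sizes are my own choices, not from the question):

public class PartitionDemo {
    static final int MAX = 1 << 22;        // a multiple of 4, as assumed
    static final int NUM_THREADS = 4;
    static final int[] A = new int[MAX], B = new int[MAX], C = new int[MAX];

    // Option 1: strided partitioning; threads write neighbouring elements,
    // so they contend on the same cache lines (false sharing).
    static void strided(int tid) {
        for (int i = tid; i < MAX; i += NUM_THREADS)
            A[i] = B[i] + C[i];
    }

    // Option 2: block partitioning; each thread owns one contiguous quarter,
    // which is also the layout a vectorising compiler prefers.
    static void blocked(int tid) {
        for (int i = tid * (MAX / NUM_THREADS); i < (tid + 1) * (MAX / NUM_THREADS); i++)
            A[i] = B[i] + C[i];
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        long start = System.nanoTime();
        for (int t = 0; t < NUM_THREADS; t++) {
            final int tid = t;
            threads[t] = new Thread(() -> blocked(tid)); // swap in strided(tid) to compare
            threads[t].start();
        }
        for (Thread thread : threads) thread.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}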
I am wondering how to reach a compromise between fast cancel responsiveness and performance for threads whose body looks similar to this loop:
for (int i = 0; i < HUGE_NUMBER; ++i) {
    // some easy computation, like adding numbers
    // which are the result of the previous iteration of this loop
}
If the computation in the loop body is quite easy, then adding a simple check-and-react to each iteration:
if (Thread.currentThread().isInterrupted()) {
    throw new InterruptedException("Cancelled");
}
may slow down execution of the code.
Even if I change the above condition to:
if (i % 100 == 0 && Thread.currentThread().isInterrupted()) {
    throw new InterruptedException("Cancelled");
}
the compiler cannot just precompute the values of i and check the condition only in specific situations, since HUGE_NUMBER is a variable and can take different values.
So I'd like to ask if there's any smart way of adding such check to a presented code knowing that:
HUGE_NUMBER is variable and can have different values
the loop body consists of some easy-to-compute code that relies on previous computations.
What I want to say is that one iteration of the loop is quite fast, but a HUGE_NUMBER of iterations can take quite a bit more time, and this is what I want to avoid.
First of all, use Thread.interrupted() instead of Thread.currentThread().isInterrupted() in that case.
You should think about whether checking the interruption flag really slows down your calculation too much! On the one hand, if the loop body is VERY simple, even a huge number of iterations (the upper limit is Integer.MAX_VALUE) will run in a few seconds. Even when checking the interruption flag results in an overhead of 20 or 30%, this will not add much to the total runtime of your algorithm.
On the other hand, if the loop body is not that simple and so runs longer, testing the interruption flag will not be a noticeable overhead, I think.
Don't do tricks like if (i % 10000 == 0), as this will slow down calculation much more than a 'short' Thread.interrupted().
There is one small trick that you could use - but think twice because it makes your code more complex and less readable:
Whenever you have a loop like that:
for (int i = 0; i < max; i++) {
    // loop-body using i
}
you can split up the total range of i into several intervals of size INTERVAL_SIZE:
int start = 0;
while (start < max) {
    final int next = Math.min(start + INTERVAL_SIZE, max);
    for (int i = start; i < next; i++) {
        // loop-body using i
    }
    start = next;
}
Now you can add your interruption check right before or after the inner loop!
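Putting the check in place, the outer loop might look like this (a sketch; INTERVAL_SIZE and max as above):

int start = 0;
while (start < max) {
    // one cheap check per INTERVAL_SIZE iterations instead of per iteration
    if (Thread.interrupted()) {
        throw new InterruptedException("Cancelled");
    }
    final int next = Math.min(start + INTERVAL_SIZE, max);
    for (int i = start; i < next; i++) {
        // loop-body using i
    }
    start = next;
}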
I've done some tests on my system (JDK 7) using the following loop-body
if (i % 2 == 0) x++;
and Integer.MAX_VALUE / 2 iterations. The results are as follows (after warm-up):
Simple loop without any interruption checks: 1,949 ms
Simple loop with check per iteration: 2,219 ms (+14%)
Simple loop with check per 1 million-th iteration using modulo: 3,166 ms (+62%)
Simple loop with check per 1 million-th iteration using bit-mask: 2,653 ms (+36%)
Interval-loop as described above with check in outer loop: 1,972 ms (+1.1%)
So even if the loop body is as simple as above, the overhead of a per-iteration check is only 14%! The recommendation, then, is not to do any tricks but simply to check the interruption flag via Thread.interrupted() in every iteration.
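For completeness, the modulo and bit-mask variants measured above might look like this (the exact mask value is my choice; 0xFFFFF means a check roughly once per million iterations):

// modulo variant: an integer division/remainder on every iteration
if (i % 1_000_000 == 0 && Thread.interrupted()) {
    throw new InterruptedException("Cancelled");
}

// bit-mask variant: a single AND instead of a division, yet still slower
// than the plain per-iteration check in the measurements above
if ((i & 0xFFFFF) == 0 && Thread.interrupted()) {
    throw new InterruptedException("Cancelled");
}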
Make your calculation an Iterator.
Although this does not sound terribly useful the benefit here is that you can then quite easily write filter iterators that can be surprisingly flexible. They can be added and removed simply - even through configuration if you wish. There are a number of benefits - try it.
You can then add a filtering Iterator that watches the time and checks for interrupt on a regular basis - or something even more flexible.
You can even add further filtering later without compromising the original calculation or littering it with brittle status checks; a sketch follows.
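A minimal sketch of such a filtering iterator, with hypothetical names, might look like this:

import java.util.Iterator;

// Wraps a source iterator and polls the interrupt flag every CHECK_EVERY
// elements; a sketch of the idea, not a complete library.
class InterruptibleIterator<T> implements Iterator<T> {
    private static final int CHECK_EVERY = 1024;
    private final Iterator<T> source;
    private int count;

    InterruptibleIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (++count % CHECK_EVERY == 0 && Thread.currentThread().isInterrupted()) {
            throw new RuntimeException(new InterruptedException("Cancelled"));
        }
        return source.hasNext();
    }

    @Override
    public T next() {
        return source.next();
    }
}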
(A question for those who know well the JVM compilation and optimization tricks... :-)
Is either of the "for" and "foreach" patterns clearly superior to the other?
Consider the following two examples:
public void forLoop(String[] text)
{
    if (text != null)
    {
        for (int i = 0; i < text.length; i++)
        {
            // Do something with text[i]
        }
    }
}

public void foreachLoop(String[] text)
{
    if (text != null)
    {
        for (String s : text)
        {
            // Do something with s, exactly as with text[i]
        }
    }
}
Is forLoop faster or slower than foreachLoop?
Assuming that in both cases the text array does not need any sanity checks, is there a clear winner, or is it still too close to call?
EDIT: As noted in some of the answers, the performance should be identical for arrays, whereas the "foreach" pattern could be slightly better for Abstract Data Types like a List. See also this answer which discusses the subject.
From section 14.14.2 of the JLS:
Otherwise, the Expression necessarily has an array type, T[]. Let L1 ... Lm be the (possibly empty) sequence of labels immediately preceding the enhanced for statement. Then the meaning of the enhanced for statement is given by the following basic for statement:
T[] a = Expression;
L1: L2: ... Lm:
for (int i = 0; i < a.length; i++) {
    VariableModifiers_opt Type Identifier = a[i];
    Statement
}
In other words, I'd expect them to end up being compiled to the same code.
There's definitely a clear winner: the enhanced for loop is more readable. That should be your primary concern - you should only even consider micro-optimizing this sort of thing when you've proved that the most readable form doesn't perform as well as you want.
You can write your own simple test which measures the execution time:
long start = System.currentTimeMillis();
forLoop(text);
long end = System.currentTimeMillis();
long result = end - start;
result is the execution time in milliseconds.
Since you are using an array type, the performance difference wouldn't matter. They would end up giving the same performance after going through the optimization funnel.
But if you are using ADTs like List, then the foreach loop is obviously the better choice compared to making repeated get(i) calls.
You should choose the option which is more readable almost every time, unless you know you have a performance issue.
In this case, I would say they are guaranteed to be the same.
The only difference is your extra check for text.length, which is more likely to make it slower than faster.
I would also ensure text is never null statically, e.g. using a @NotNull annotation. It's better to catch these issues at compile/build time (and it would be faster).
There is no performance penalty for using the for-each loop, even for arrays. In fact, it may offer a slight performance advantage over an ordinary for loop in some circumstances, as it computes the limit of the array index only once. For details follow this post.
Is there a penalty for calling newInstance() or is it the same mechanism underneath?
How much overhead if any does newInstance() have over the new keyword* ?
*: Discounting the fact that newInstance() implies using reflection.
In a real-world test, creating 18,129 instances of a class via Constructor.newInstance passing in 10 arguments, versus creating the instances via new, showed no measurable difference in time.
This wasn't any sort of micro benchmark.
This is with JDK 1.6.0_12 on Windows 7 x86 beta.
Given that Constructor.newInstance is going to be very similar to Class.forName().newInstance(), I would say that the overhead is hardly anything, given the functionality you get from newInstance over new.
As always you should test it yourself to see.
Be wary of microbenchmarks, but I found this blog entry where someone found that new was about twice as fast as newInstance with JDK 1.4. It sounds like there is a difference, and as expected, reflection is slower. However, it sounds like the difference may not be a dealbreaker, depending on the frequency of new object instances vs how much computation is being done.
There's a difference of about 10 times. The absolute difference is tiny, but if you write a low-latency application, the cumulative CPU time lost might be significant.
long start, end;
int X = 10000000;
ArrayList l = new ArrayList(X);

start = System.nanoTime();
for (int i = 0; i < X; i++) {
    // Class.newInstance() declares checked exceptions, so the enclosing
    // method needs a "throws Exception" clause
    String.class.newInstance();
}
end = System.nanoTime();
log("T: ", (end - start) / X); // log is a simple print helper

l.clear();
start = System.nanoTime();
for (int i = 0; i < X; i++) {
    new String();
}
end = System.nanoTime();
log("T: ", (end - start) / X);
Output:
T: 105
T: 11
Tested on a Xeon W3565 @ 3.2 GHz, Java 1.6.
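If you do need reflective instantiation on a hot path, a common mitigation (a sketch; whether and how much it helps should be measured on your JVM) is to look up the Constructor once and reuse it, rather than going through Class.newInstance on every call:

import java.lang.reflect.Constructor;

public class CachedConstructorBench {
    public static void main(String[] args) throws Exception {
        final int x = 10_000_000;

        // one-time lookup, hoisted out of the loop
        Constructor<String> ctor = String.class.getConstructor();

        long start = System.nanoTime();
        for (int i = 0; i < x; i++) {
            ctor.newInstance();
        }
        long end = System.nanoTime();
        System.out.println("cached ctor.newInstance(): " + (end - start) / x + " ns/op");
    }
}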