Concatenating parallel streams in Java

Suppose that I have two int[] arrays input1 and input2. I want to take only positive numbers from the first one, take distinct numbers from the second one, merge them together, sort and store into the resulting array. This can be performed using streams:
int[] result = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
                                Arrays.stream(input2).distinct()).sorted().toArray();
I want to speed up the task, so I am considering making the stream parallel. Usually this just means that I can insert .parallel() anywhere between the stream construction and the terminal operation, and the result will be the same. The JavaDoc for IntStream.concat says that the resulting stream will be parallel if either of the input streams is parallel. So I thought that calling parallel() on the input1 stream, the input2 stream, or the concatenated stream would produce the same result.
Actually I was wrong: if I add .parallel() to the resulting stream, it seems that the input streams remain sequential. Moreover, I can mark the input streams (either or both) as .parallel(), then turn the resulting stream to .sequential(), but the inputs remain parallel. So there are actually 8 possibilities: each of input1, input2 and the concatenated stream can be parallel or not:
int[] sss = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
                             Arrays.stream(input2).distinct()).sorted().toArray();
int[] ssp = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
                             Arrays.stream(input2).distinct()).parallel().sorted().toArray();
int[] sps = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
                             Arrays.stream(input2).parallel().distinct()).sequential().sorted().toArray();
int[] spp = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
                             Arrays.stream(input2).parallel().distinct()).sorted().toArray();
int[] pss = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
                             Arrays.stream(input2).distinct()).sequential().sorted().toArray();
int[] psp = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
                             Arrays.stream(input2).distinct()).sorted().toArray();
int[] pps = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
                             Arrays.stream(input2).parallel().distinct()).sequential().sorted().toArray();
int[] ppp = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
                             Arrays.stream(input2).parallel().distinct()).sorted().toArray();
I benchmarked all the versions for different input sizes (using JDK 8u45 64bit on Core i5 4xCPU, Win7) and got different results for every case:
Benchmark           (n)     Mode  Cnt       Score       Error  Units
ConcatTest.SSS          100  avgt   20       7.094 ±     0.069  us/op
ConcatTest.SSS        10000  avgt   20    1542.820 ±    22.194  us/op
ConcatTest.SSS      1000000  avgt   20  350173.723 ±  7140.406  us/op
ConcatTest.SSP          100  avgt   20       6.176 ±     0.043  us/op
ConcatTest.SSP        10000  avgt   20     907.855 ±     8.448  us/op
ConcatTest.SSP      1000000  avgt   20  264193.679 ±  6744.169  us/op
ConcatTest.SPS          100  avgt   20      16.548 ±     0.175  us/op
ConcatTest.SPS        10000  avgt   20    1831.569 ±    13.582  us/op
ConcatTest.SPS      1000000  avgt   20  500736.204 ± 37932.197  us/op
ConcatTest.SPP          100  avgt   20      23.871 ±     0.285  us/op
ConcatTest.SPP        10000  avgt   20    1141.273 ±     9.310  us/op
ConcatTest.SPP      1000000  avgt   20  400582.847 ± 27330.492  us/op
ConcatTest.PSS          100  avgt   20       7.162 ±     0.241  us/op
ConcatTest.PSS        10000  avgt   20    1593.332 ±     7.961  us/op
ConcatTest.PSS      1000000  avgt   20  383920.286 ±  6650.890  us/op
ConcatTest.PSP          100  avgt   20       9.877 ±     0.382  us/op
ConcatTest.PSP        10000  avgt   20     883.639 ±    13.596  us/op
ConcatTest.PSP      1000000  avgt   20  257921.422 ±  7649.434  us/op
ConcatTest.PPS          100  avgt   20      16.412 ±     0.129  us/op
ConcatTest.PPS        10000  avgt   20    1816.782 ±    10.875  us/op
ConcatTest.PPS      1000000  avgt   20  476311.713 ± 19154.558  us/op
ConcatTest.PPP          100  avgt   20      23.078 ±     0.622  us/op
ConcatTest.PPP        10000  avgt   20    1128.889 ±     7.964  us/op
ConcatTest.PPP      1000000  avgt   20  393699.222 ± 56397.445  us/op
From these results I can only conclude that parallelizing the distinct() step reduces the overall performance (at least in my tests).
So I have the following questions:
Are there any official guidelines on how to better use the parallelization with concatenated streams? It's not always feasible to test all possible combinations (especially when concatenating more than two streams), so having some "rules of thumb" would be nice.
It seems that if I concatenate streams created directly from a collection/array (without intermediate operations performed before concatenation), then the results do not depend as much on the location of parallel(). Is this true?
Are there any other cases besides concatenation where the result depends on at which point the stream pipeline is parallelized?

The specification precisely describes what you get—when you consider that, unlike other operations, we are not talking about a single pipeline but three distinct Streams which retain their properties independent of the others.
The specification says: “The resulting stream is […] parallel if either of the input streams is parallel.” and that’s what you get; if either input stream is parallel, the resulting stream is parallel (but you can turn it to sequential afterwards). But changing the resulting stream to parallel or sequential does not change the nature of the input streams nor does feeding a parallel and a sequential stream into concat.
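This behavior can be observed directly (a minimal sketch with made-up input arrays): the concatenated stream reports itself parallel because one of its inputs is, and it can still be turned sequential afterwards.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ConcatParallelDemo {
    public static void main(String[] args) {
        IntStream first = Arrays.stream(new int[]{1, -2, 3}).filter(x -> x > 0);
        IntStream second = Arrays.stream(new int[]{2, 2, 3}).parallel().distinct();
        IntStream merged = IntStream.concat(first, second);
        // per the spec: parallel, because one input stream is parallel
        System.out.println(merged.isParallel());
        // ...but the result can still be switched back to sequential
        System.out.println(merged.sequential().isParallel());
    }
}
```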
Regarding the performance consequences, consult the documentation, paragraph “Stream operations and pipelines”:
Intermediate operations are further divided into stateless and stateful operations. Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream. As a result, under parallel computation, some pipelines containing stateful intermediate operations may require multiple passes on the data or may need to buffer significant data. Pipelines containing exclusively stateless intermediate operations can be processed in a single pass, whether sequential or parallel, with minimal data buffering.
You have combined the very two stateful operations named there. So the .sorted() operation of the resulting stream requires buffering of the entire content before it can start sorting, which implies completion of the distinct operation. The distinct operation is obviously hard to parallelize, as the threads have to synchronize on the already-seen values.
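The distinction can be sketched with toy data: a pipeline of only stateless operations runs in one parallel pass, while distinct() and sorted() must see (and buffer) every element before emitting anything.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class StatefulDemo {
    public static void main(String[] args) {
        // stateless only: each element is processed independently, in one pass
        int[] stateless = IntStream.range(0, 10).parallel()
                .map(x -> x * 2)
                .filter(x -> x % 3 == 0)
                .toArray();
        System.out.println(Arrays.toString(stateless));

        // stateful: distinct() and sorted() must observe all elements first,
        // forcing buffering and cross-thread coordination
        int[] stateful = IntStream.of(3, 1, 2, 3, 1).parallel()
                .distinct()
                .sorted()
                .toArray();
        System.out.println(Arrays.toString(stateful)); // [1, 2, 3]
    }
}
```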
So to answer your first question, it’s not about concat but simply that distinct doesn’t benefit from parallel execution.
This also renders your second question obsolete, as you are performing entirely different operations in the two concatenated streams, so you can’t do the same with a pre-concatenated collection/array. Concatenating the arrays and running distinct on the resulting array is unlikely to yield better results.
Regarding your third question, flatMap’s behavior regarding parallel streams may be a source of surprises…
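One such surprise, sketched with toy data: a .parallel() call on the inner streams passed through flatMap is effectively ignored, since each mapped stream is traversed sequentially within its outer task (the result contents are the same either way).

```java
import java.util.stream.IntStream;

public class FlatMapDemo {
    public static void main(String[] args) {
        // the inner .parallel() below has no effect on how flatMap
        // traverses each mapped stream
        int sum = IntStream.rangeClosed(1, 3)
                .parallel()
                .flatMap(i -> IntStream.rangeClosed(1, i).parallel())
                .sum();
        System.out.println(sum); // 1 + (1+2) + (1+2+3) = 10
    }
}
```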

Related

Confusing branches and L1-dcache-loads in the output of JMH's LinuxPerfNormProfiler

I'm measuring costs of busy-waiting implemented with two approaches:
// 1: plain busy-wait, looping on a volatile flag
while (run) {
}

// 2: the same loop with a spin-wait hint
while (run) {
    Thread.onSpinWait();
}
Complete code of the examples is available on GitHub via link1 and link2.
I'm running the benchmarks with -prof perfnorm and they yield:
Benchmark                                                  Mode  Cnt      Score      Error  Units
WhileTrueBenchmark.whileTrue                               avgt   20   6460.700 ±  109.333  ns/op
WhileTrueBenchmark.whileTrue:CPI                           avgt    4      0.444 ±    0.005  clks/insn
WhileTrueBenchmark.whileTrue:IPC                           avgt    4      2.252 ±    0.026  insns/clk
WhileTrueBenchmark.whileTrue:L1-dcache-loads               avgt    4  51523.529 ± 3556.009  #/op
WhileTrueBenchmark.whileTrue:branches                      avgt    4  13981.285 ±  958.249  #/op
WhileTrueBenchmark.whileTrue:cycles                        avgt    4  36407.576 ± 2434.292  #/op
WhileTrueBenchmark.whileTrue:instructions                  avgt    4  81985.523 ± 6300.983  #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait                  avgt   20   6463.334 ±   49.922  ns/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:CPI              avgt    4      2.143 ±    0.056  clks/insn
ThreadOnSpinWaitPlainBenchmark.onSpinWait:IPC              avgt    4      0.467 ±    0.012  insns/clk
ThreadOnSpinWaitPlainBenchmark.onSpinWait:L1-dcache-loads  avgt    4   7262.587 ±  324.600  #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:branches         avgt    4   2951.111 ±  162.867  #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:cycles           avgt    4  36307.064 ± 1516.787  #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:instructions     avgt    4  16943.396 ±  820.446  #/op
So from the output we see that, for the same elapsed time, we have the same cycle count, but Thread.onSpinWait() executed almost 5 times fewer instructions. This is understandable and expected behavior.
What is unexpected to me is that it produced far fewer branches and L1-dcache-loads. The benchmarked code doesn't have much branching and reads the same flag from memory.
So why are these two metrics (branches and L1-dcache-loads) so different?
On x86, the Thread.onSpinWait() intrinsic is translated to the PAUSE instruction. PAUSE delays the execution of the next instruction for an implementation-specific amount of time. Because of this delay, the second loop executes fewer iterations than the loop without onSpinWait.
Extra delay per loop iteration => fewer iterations => fewer retired instructions (including cmp and jne) => fewer branches and memory loads.
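The benchmarked pattern can be made into a runnable sketch (Thread.onSpinWait requires Java 9+; the iteration counter is added here only to make the effect observable):

```java
public class SpinWaitDemo {
    static volatile boolean run = true;

    static long spin() {
        long iterations = 0;
        while (run) {
            Thread.onSpinWait(); // PAUSE on x86: delays each iteration
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        Thread spinner = new Thread(() -> System.out.println("iterations: " + spin()));
        spinner.start();
        Thread.sleep(10);
        run = false;   // the volatile write is observed by the spinner, ending the loop
        spinner.join();
    }
}
```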

Why does performance of java stream fall off with relatively large work compared to "for" loop?

I had an earlier question about interpreting JMH output, which was mostly answered. I then updated that question with a related follow-up, but it is better asked as a separate question.
This is the original question: Verify JMH measurements of simple for/lambda comparisons .
My question has to do with performance of streams at particular levels of "work". The following excerpted results from the previous question illustrates what I'm wondering about:
Benchmark                                    Mode  Cnt          Score         Error  Units
MyBenchmark.shortLengthConstantSizeFor      thrpt  200  132278188.475 ± 1132184.820  ops/s
MyBenchmark.shortLengthConstantSizeLambda   thrpt  200   18750818.019 ±  171239.562  ops/s
MyBenchmark.mediumLengthConstantSizeFor     thrpt  200   55447999.297 ±  277442.812  ops/s
MyBenchmark.mediumLengthConstantSizeLambda  thrpt  200   15925281.039 ±   65707.093  ops/s
MyBenchmark.longerLengthConstantSizeFor     thrpt  200    3551842.518 ±   42612.744  ops/s
MyBenchmark.longerLengthConstantSizeLambda  thrpt  200    2791292.093 ±   12207.302  ops/s
MyBenchmark.longLengthConstantSizeFor       thrpt  200       2984.554 ±      57.557  ops/s
MyBenchmark.longLengthConstantSizeLambda    thrpt  200        331.741 ±       2.196  ops/s
I was expecting, as the tests moved from shorter lists to longer lists, that the performance of the stream test should approach the performance of the "for" test.
I saw that in the "short" list, the stream performance was 14% of the "for" performance. For the medium list, it was 29%. For the longer list, it was 78%. So far, the trend was what I was expecting. However, for the long list, it is 11%. For some reason, a list size of 300k, as opposed to 300, caused the performance of the stream to drop off, compared to the "for".
I was wondering if anyone could corroborate results like this, and whether they had any thoughts about why it might be happening.
I'm running this on a Win7 laptop with Java 8.
Well, streams are a relatively new addition to Java compared to the "for loop", and the JIT compiler does not yet perform the sophisticated optimizations for them that it does for loops over arrays or collections.
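The kind of comparison in question has roughly this shape (a hypothetical reconstruction; the actual benchmarked code is in the linked question):

```java
import java.util.stream.IntStream;

public class ForVsStream {
    // plain loop: JIT-friendly, bounds checks hoisted, easily unrolled
    static long forSum(int[] a) {
        long sum = 0;
        for (int x : a) {
            sum += x;
        }
        return sum;
    }

    // stream version: same result, but goes through the spliterator/pipeline
    // machinery that the JIT optimizes less aggressively
    static long streamSum(int[] a) {
        return IntStream.of(a).asLongStream().sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.range(0, 300_000).toArray();
        System.out.println(forSum(data) == streamSum(data)); // true
    }
}
```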

How does the time it takes Math.random() to run compare to that of simple arithmetic operation?

In Java, how long relative to a simple arithmetic operation does it take Math.random() to generate a number? I am trying to randomly distribute objects in an ArrayList that already contains values such that a somewhat even, but not completely even distribution is created, and I am not sure if choosing a random index for every insertion point by using Math.random() is the best approach to take.
Clarification: the distribution of inserted objects is meant to be even enough that the values are not all concentrated in one area, but also uneven enough that the distribution is not predictable (if someone were to go through the values one by one, they would not be able to determine if the next value was going to be a newly inserted value by detecting a constant pattern).
Do not use Math.random. It relies on a global instance of java.util.Random that uses AtomicLong under the hood. Though the PRNG algorithm used in java.util.Random is pretty simple, the performance is mostly affected by the atomic CAS and the related cache-coherence traffic.
This can be particularly bad for multithreaded applications (like in this example), but it also carries a penalty even in the single-threaded case.
ThreadLocalRandom is always preferable to Math.random. It does not rely on atomic operations and does not suffer from contention. It only updates a thread-local state and uses a couple of arithmetic and bitwise operations.
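Applied to the question's insertion problem, this could look like the following (a sketch; the list contents are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomInsertDemo {
    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>(Arrays.asList(10, 20, 30, 40, 50));
        // nextInt(bound) is exclusive, so size() + 1 allows inserting at the end
        int index = ThreadLocalRandom.current().nextInt(values.size() + 1);
        values.add(index, 99);
        System.out.println(values.size()); // 6
    }
}
```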
Here is a JMH benchmark to compare the performance of Math.random() and ThreadLocalRandom.current().nextDouble() to a simple arithmetic operation.
package bench;

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Thread)
public class RandomBench {
    double x = 1;

    @Benchmark
    public double multiply() {
        return x * Math.PI;
    }

    @Benchmark
    public double mathRandom() {
        return Math.random();
    }

    @Benchmark
    public double threadLocalRandom() {
        return ThreadLocalRandom.current().nextDouble();
    }
}
The results show that ThreadLocalRandom works in just a few nanoseconds; its performance is comparable to a simple arithmetic operation, and it scales perfectly in a multithreaded environment, unlike Math.random.
Benchmark                      Threads     Score    Error  Units
RandomBench.mathRandom               1    34.265 ±  1.709  ns/op
RandomBench.multiply                 1     4.531 ±  0.108  ns/op
RandomBench.threadLocalRandom        1     8.322 ±  0.047  ns/op
RandomBench.mathRandom               2   366.589 ± 63.899  ns/op
RandomBench.multiply                 2     4.627 ±  0.118  ns/op
RandomBench.threadLocalRandom        2     8.342 ±  0.079  ns/op
RandomBench.mathRandom               4  1328.472 ± 177.216 ns/op
RandomBench.multiply                 4     4.592 ±  0.091  ns/op
RandomBench.threadLocalRandom        4     8.474 ±  0.157  ns/op
The Java documentation reports this as the implementation of Random.nextDouble(), which is what Math.random() ends up invoking.
public double nextDouble() {
    return (((long) next(26) << 27) + next(27))
            / (double) (1L << 53);
}
Where next updates the seed to (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1) and returns (int)(seed >>> (48 - bits)).
As you can see it uses a simple algorithm to generate pseudo-random values. It requires just a few cheap operations and so I wouldn't worry about using it.
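Written out directly, the update step described above looks like this (a sketch with an arbitrary starting seed of 42):

```java
public class LcgStep {
    public static void main(String[] args) {
        long seed = 42L;
        // one step of java.util.Random's linear congruential generator:
        // a multiply, an add, and a mask down to 48 bits
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        // next(26) then returns the top 26 bits of the 48-bit state
        int bits = (int) (seed >>> (48 - 26));
        System.out.println(seed + " " + bits);
    }
}
```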
Creating a random number is a simple operation; you shouldn't worry about it.
But you should keep several things in mind:
It is better to reuse a Random instance; creating a new Random() instance every time you need a random value is usually a bad decision.
Do not use the same Random instance across several threads simultaneously; to avoid contention, use ThreadLocalRandom.current() instead.
If you do some cryptography, use SecureRandom instead.

Why don't primitive Stream have collect(Collector)?

I'm writing a library for novice programmers so I'm trying to keep the API as clean as possible.
One of the things my library needs to do is perform some complex computations on a large collection of ints or longs. There are lots of scenarios and business objects my users need to compute these values from, so I thought the best way would be to use streams: let users map business objects to an IntStream or LongStream and then perform the computations inside a collector.
However, IntStream and LongStream only have the three-parameter collect method:
collect(Supplier<R> supplier, ObjIntConsumer<R> accumulator, BiConsumer<R,R> combiner)
and don't have the simpler collect(Collector) method that Stream<T> has.
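For reference, the three-argument form does work with ordinary JDK containers; a toy sketch, boxing into a List:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class ThreeArgCollect {
    public static void main(String[] args) {
        // supplier / accumulator / combiner, spelled out with an ArrayList
        List<Integer> list = IntStream.rangeClosed(1, 5)
                .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
        System.out.println(list); // [1, 2, 3, 4, 5]
    }
}
```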
So instead of being able to do
Collection<T> businessObjs = ...
MyResult result = businessObjs.stream()
        .mapToInt( ... )
        .collect(new MyComplexComputation(...));
I have to provide suppliers, accumulators and combiners like this:
MyResult result = businessObjs.stream()
        .mapToInt( ... )
        .collect(
            () -> new MyComplexComputationBuilder(...),
            (builder, v) -> builder.add(v),
            (a, b) -> a.merge(b))
        .build(); // prev collect returns Builder object
This is way too complicated for my novice users and is very error prone.
My workaround is to provide static methods that take an IntStream or LongStream as input and hide the collector creation and execution for you:
public static MyResult compute(IntStream stream, ...) {
    return stream.collect(
                () -> new MyComplexComputationBuilder(...),
                (builder, v) -> builder.add(v),
                (a, b) -> a.merge(b))
            .build();
}
But that doesn't follow the normal conventions of working with Streams:
IntStream tmpStream = businessObjs.stream()
        .mapToInt( ... );
MyResult result = MyUtil.compute(tmpStream, ...);
Because you have to either save a temp variable and pass that to the static method, or create the stream inside the static call, which may be confusing when it is mixed in with the other parameters to my computation.
Is there a cleaner way to do this while still working with IntStream or LongStream ?
We did in fact prototype some Collector.OfXxx specializations. What we found -- in addition to the obvious annoyance of more specialized types -- was that this was not really very useful without having a full complement of primitive-specialized collections (like Trove does, or GS-Collections, but which the JDK does not have). Without an IntArrayList, for example, a Collector.OfInt merely pushes the boxing somewhere else -- from the collector to the container -- which is no big win, and lots more API surface.
Perhaps if method references are used instead of lambdas, the code needed for the primitive stream collect will not seem as complicated.
MyResult result = businessObjs.stream()
        .mapToInt( ... )
        .collect(
            MyComplexComputationBuilder::new,
            MyComplexComputationBuilder::add,
            MyComplexComputationBuilder::merge)
        .build(); // prev collect returns Builder object
In Brian's definitive answer to this question, he mentions two other Java collection frameworks that do have primitive collections that actually can be used with the collect method on primitive streams. I thought it might be useful to illustrate some examples of how to use the primitive containers in these frameworks with primitive streams. The code below will also work with a parallel stream.
// Eclipse Collections
List<Integer> integers = Interval.oneTo(5).toList();
Assert.assertEquals(
    IntInterval.oneTo(5),
    integers.stream()
        .mapToInt(Integer::intValue)
        .collect(IntArrayList::new, IntArrayList::add, IntArrayList::addAll));

// Trove Collections
Assert.assertEquals(
    new TIntArrayList(IntStream.range(1, 6).toArray()),
    integers.stream()
        .mapToInt(Integer::intValue)
        .collect(TIntArrayList::new, TIntArrayList::add, TIntArrayList::addAll));
Note: I am a committer for Eclipse Collections.
I've implemented primitive collectors in my library StreamEx (since version 0.3.0). There are interfaces IntCollector, LongCollector and DoubleCollector which extend the Collector interface and are specialized to work with primitives. There's an additional minor difference in the combining procedure, as methods like IntStream.collect accept a BiConsumer instead of a BinaryOperator.
There are a bunch of predefined collectors to join numbers into a string, store them into a primitive array or a BitSet, find min, max or sum, calculate summary statistics, and perform group-by and partition-by operations. Of course, you can define your own collectors. Here are several usage examples (assuming you have an int[] input array with input data).
Join numbers as a string with a separator:
String nums = IntStreamEx.of(input).collect(IntCollector.joining(","));
Grouping by last digit:
Map<Integer, int[]> groups = IntStreamEx.of(input)
        .collect(IntCollector.groupingBy(i -> i % 10));
Sum positive and negative numbers separately:
Map<Boolean, Integer> sums = IntStreamEx.of(input)
        .collect(IntCollector.partitioningBy(i -> i > 0, IntCollector.summing()));
Here's a simple benchmark which compares these collectors and usual object collectors.
Note that my library does not provide (and will not provide in future) any user-visible data structures like maps on primitives, so grouping is performed into usual HashMap. However if you are using Trove/GS/HFTC/whatever, it's not so difficult to write additional primitive collectors for the data structures defined in these libraries to gain more performance.
Convert the primitive streams to boxed object streams if there are methods you're missing.
MyResult result = businessObjs.stream()
        .mapToInt( ... )
        .boxed()
        .collect(new MyComplexComputation(...));
Or don't use the primitive streams in the first place and work with Integers the whole time.
MyResult result = businessObjs.stream()
        .map( ... ) // map to Integer not int
        .collect(new MyComplexComputation(...));
Mr. Goetz provided the definitive answer for why the decision was made not to include specialized Collectors; however, I wanted to further investigate how much this decision affects performance.
I thought I would post my results as an answer.
I used the jmh microbenchmark framework to time how long it takes to compute calculations using both kinds of Collectors over collections of sizes 1, 100, 1000, 100,000 and 1 million:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class MyBenchmark {

    @Param({"1", "100", "1000", "100000", "1000000"})
    public int size;

    List<BusinessObj> seqs;

    @Setup
    public void setup(){
        seqs = new ArrayList<BusinessObj>(size);
        Random rand = new Random();
        for(int i=0; i< size; i++){
            //these lengths are random but over 128 so no caching of Longs
            seqs.add(BusinessObjFactory.createOfRandomLength());
        }
    }

    @Benchmark
    public double objectCollector() {
        return seqs.stream()
                .map(BusinessObj::getLength)
                .collect(MyUtil.myCalcLongCollector())
                .getAsDouble();
    }

    @Benchmark
    public double primitiveCollector() {
        LongStream stream = seqs.stream()
                .mapToLong(BusinessObj::getLength);
        return MyUtil.myCalc(stream)
                .getAsDouble();
    }

    public static void main(String[] args) throws RunnerException{
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}
Here are the results:
# JMH 1.9.3 (released 4 days ago)
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.sample.MyBenchmark.objectCollector
# Run complete. Total time: 01:30:31
Benchmark                        (size)  Mode  Cnt          Score        Error  Units
MyBenchmark.objectCollector           1  avgt  200        140.803 ±      1.425  ns/op
MyBenchmark.objectCollector         100  avgt  200       5775.294 ±     67.871  ns/op
MyBenchmark.objectCollector        1000  avgt  200      70440.488 ±   1023.177  ns/op
MyBenchmark.objectCollector      100000  avgt  200   10292595.233 ± 101036.563  ns/op
MyBenchmark.objectCollector     1000000  avgt  200  100147057.376 ± 979662.707  ns/op
MyBenchmark.primitiveCollector        1  avgt  200        140.971 ±      1.382  ns/op
MyBenchmark.primitiveCollector      100  avgt  200       4654.527 ±     87.101  ns/op
MyBenchmark.primitiveCollector     1000  avgt  200      60929.398 ±   1127.517  ns/op
MyBenchmark.primitiveCollector   100000  avgt  200    9784655.013 ± 113339.448  ns/op
MyBenchmark.primitiveCollector  1000000  avgt  200   94822089.334 ± 1031475.051 ns/op
As you can see, the primitive Stream version is slightly faster, but even when there are 1 million elements in the collection, it is only 0.05 seconds faster (on average).
For my API I would rather keep to the cleaner Object Stream conventions and use the Boxed version since it is such a minor performance penalty.
Thanks to everyone who shed insight into this issue.

Why is returning a Java object reference so much slower than returning a primitive

We are working on a latency sensitive application and have been microbenchmarking all kinds of methods (using jmh). After microbenchmarking a lookup method and being satisfied with the results, I implemented the final version, only to find that the final version was 3 times slower than what I had just benchmarked.
The culprit was that the implemented method was returning an enum object instead of an int. Here is a simplified version of the benchmark code:
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class ReturnEnumObjectVersusPrimitiveBenchmark {

    enum Category {
        CATEGORY1,
        CATEGORY2,
    }

    @Param({"3", "2", "1"})
    String value;

    int param;

    @Setup
    public void setUp() {
        param = Integer.parseInt(value);
    }

    @Benchmark
    public int benchmarkReturnOrdinal() {
        if (param < 2) {
            return Category.CATEGORY1.ordinal();
        }
        return Category.CATEGORY2.ordinal();
    }

    @Benchmark
    public Category benchmarkReturnReference() {
        if (param < 2) {
            return Category.CATEGORY1;
        }
        return Category.CATEGORY2;
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ReturnEnumObjectVersusPrimitiveBenchmark.class.getName())
                .warmupIterations(5)
                .measurementIterations(4)
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}
The benchmark results for above:
# VM invoker: C:\Program Files\Java\jdk1.7.0_40\jre\bin\java.exe
# VM options: -Dfile.encoding=UTF-8
Benchmark                 (value)  Mode  Samples     Score    Error  Units
benchmarkReturnOrdinal          3  thrpt       4  1059.898 ±  71.749 ops/us
benchmarkReturnOrdinal          2  thrpt       4  1051.122 ±  61.238 ops/us
benchmarkReturnOrdinal          1  thrpt       4  1064.067 ±  90.057 ops/us
benchmarkReturnReference        3  thrpt       4   353.197 ±  25.946 ops/us
benchmarkReturnReference        2  thrpt       4   350.902 ±  19.487 ops/us
benchmarkReturnReference        1  thrpt       4   339.578 ± 144.093 ops/us
Just changing the return type of the function changed the performance by a factor of almost 3.
I thought that the sole difference between returning an enum object versus an integer is that one returns a 64 bit value (reference) and the other returns a 32 bit value. One of my colleagues was guessing that returning the enum added additional overhead because of the need to track the reference for potential GC. (But given that enum objects are static final references, it seems strange that it would need to do that).
What is the explanation for the performance difference?
UPDATE
I shared the maven project here so that anyone can clone it and run the benchmark. If anyone has the time/interest, it would be helpful to see if others can replicate the same results. (I've replicated on 2 different machines, Windows 64 and Linux 64, both using flavors of Oracle Java 1.7 JVMs). @ZhekaKozlov says he did not see any difference between the methods.
To run: (after cloning repository)
mvn clean install
java -jar .\target\microbenchmarks.jar function.ReturnEnumObjectVersusPrimitiveBenchmark -i 5 -wi 5 -f 1
TL;DR: You should not put BLIND trust into anything.
First things first: it is important to verify the experimental data before jumping to the conclusions from them. Just claiming something is 3x faster/slower is odd, because you really need to follow up on the reason for the performance difference, not just trust the numbers. This is especially important for nano-benchmarks like you have.
Second, the experimenters should clearly understand what they control and what they don't. In your particular example, you are returning the value from @Benchmark methods, but can you be reasonably sure the callers outside will do the same thing for the primitive and the reference? If you ask yourself this question, then you'll realize you are basically measuring the test infrastructure.
Down to the point. On my machine (i5-4210U, Linux x86_64, JDK 8u40), the test yields:
Benchmark                    (value)  Mode  Samples  Score   Error  Units
...benchmarkReturnOrdinal          3  thrpt        5  0.876 ± 0.023  ops/ns
...benchmarkReturnOrdinal          2  thrpt        5  0.876 ± 0.009  ops/ns
...benchmarkReturnOrdinal          1  thrpt        5  0.832 ± 0.048  ops/ns
...benchmarkReturnReference        3  thrpt        5  0.292 ± 0.006  ops/ns
...benchmarkReturnReference        2  thrpt        5  0.286 ± 0.024  ops/ns
...benchmarkReturnReference        1  thrpt        5  0.293 ± 0.008  ops/ns
Okay, so reference tests appear 3x slower. But wait, it uses an old JMH (1.1.1), let's update to current latest (1.7.1):
Benchmark                    (value)  Mode  Cnt  Score   Error  Units
...benchmarkReturnOrdinal          3  thrpt    5  0.326 ± 0.010  ops/ns
...benchmarkReturnOrdinal          2  thrpt    5  0.329 ± 0.004  ops/ns
...benchmarkReturnOrdinal          1  thrpt    5  0.329 ± 0.004  ops/ns
...benchmarkReturnReference        3  thrpt    5  0.288 ± 0.005  ops/ns
...benchmarkReturnReference        2  thrpt    5  0.288 ± 0.005  ops/ns
...benchmarkReturnReference        1  thrpt    5  0.288 ± 0.002  ops/ns
Oops, now they are only barely slower. BTW, this also tells us the test is infrastructure-bound. Okay, can we see what really happens?
If you build the benchmarks, and look around what exactly calls your @Benchmark methods, then you'll see something like:
public void benchmarkReturnOrdinal_thrpt_jmhStub(InfraControl control, RawResults result,
        ReturnEnumObjectVersusPrimitiveBenchmark_jmh l_returnenumobjectversusprimitivebenchmark0_0,
        Blackhole_jmh l_blackhole1_1) throws Throwable {
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        l_blackhole1_1.consume(l_returnenumobjectversusprimitivebenchmark0_0.benchmarkReturnOrdinal());
        operations++;
    } while (!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}
That l_blackhole1_1 has a consume method, which "consumes" the values (see Blackhole for rationale). Blackhole.consume has overloads for references and primitives, and that alone is enough to justify the performance difference.
There is a rationale why these methods look different: they are trying to be as fast as possible for their types of argument. They do not necessarily exhibit the same performance characteristics, even though we try to match them, hence the more symmetric result with newer JMH. Now, you can even go to -prof perfasm to see the generated code for your tests and see why the performance is different, but that's beside the point here.
If you really want to understand how returning the primitive and/or reference differs performance-wise, you would need to enter a big scary grey zone of nuanced performance benchmarking. E.g. something like this test:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
public class PrimVsRef {

    @Benchmark
    public void prim() {
        doPrim();
    }

    @Benchmark
    public void ref() {
        doRef();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private int doPrim() {
        return 42;
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private Object doRef() {
        return this;
    }
}
...which yields the same result for primitives and references:
Benchmark Mode Cnt Score Error Units
PrimVsRef.prim avgt 25 2.637 ± 0.017 ns/op
PrimVsRef.ref avgt 25 2.634 ± 0.005 ns/op
As I said above, these tests require following up on the reasons for the results. In this case, the generated code for both is almost the same, and that explains the result.
prim:
[Verified Entry Point]
12.69% 1.81% 0x00007f5724aec100: mov %eax,-0x14000(%rsp)
0.90% 0.74% 0x00007f5724aec107: push %rbp
0.01% 0.01% 0x00007f5724aec108: sub $0x30,%rsp
12.23% 16.00% 0x00007f5724aec10c: mov $0x2a,%eax ; load "42"
0.95% 0.97% 0x00007f5724aec111: add $0x30,%rsp
0.02% 0x00007f5724aec115: pop %rbp
37.94% 54.70% 0x00007f5724aec116: test %eax,0x10d1aee4(%rip)
0.04% 0.02% 0x00007f5724aec11c: retq
ref:
[Verified Entry Point]
13.52% 1.45% 0x00007f1887e66700: mov %eax,-0x14000(%rsp)
0.60% 0.37% 0x00007f1887e66707: push %rbp
0.02% 0x00007f1887e66708: sub $0x30,%rsp
13.63% 16.91% 0x00007f1887e6670c: mov %rsi,%rax ; load "this"
0.50% 0.49% 0x00007f1887e6670f: add $0x30,%rsp
0.01% 0x00007f1887e66713: pop %rbp
39.18% 57.65% 0x00007f1887e66714: test %eax,0xe3e78e6(%rip)
0.02% 0x00007f1887e6671a: retq
[sarcasm] See how easy it is! [/sarcasm]
The pattern is: the simpler the question, the more you have to work out to make a plausible and reliable answer.
To clear the misconception of reference and memory some have fallen into (@Mzf), let's dive into the Java Virtual Machine Specification.
But before going there, one thing must be clarified: an object can never be retrieved from memory, only its fields can. In fact, there is no opcode that would perform such an extensive operation.
This document defines reference as a stack type (so that it may be a result or an argument to instructions performing operations on the stack) of the 1st category - the category of types taking a single stack word (32 bits). See table 2.3.
Furthermore, if the method invocation completes normally according to the specification, a value popped from the top of the stack is pushed onto the stack of the method's invoker (section 2.6.4).
Your question is what causes the difference of execution times. Chapter 2 foreword answers:
Implementation details that are not part of the Java Virtual Machine's specification
would unnecessarily constrain the creativity of implementors. For example, the
memory layout of run-time data areas, the garbage-collection algorithm used, and
any internal optimization of the Java Virtual Machine instructions (for example,
translating them into machine code) are left to the discretion of the implementor.
In other words, because no such thing as a performance penalty for using references is stated in the document for logical reasons (it is eventually just a stack word, as int or float are), you're left with searching the source code of your implementation, or never finding out at all.
That said, we shouldn't always blame the implementation; there are some clues you can take when looking for your answers. Java defines separate instructions for manipulating numbers and references. Reference-manipulating instructions start with a (e.g. astore, aload or areturn) and are the only instructions allowed to work with references. In particular you may be interested in looking at areturn's implementation.
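A concrete view of this split (a sketch; the comments reflect what javap shows for such methods): the two return kinds compile to different return opcode families, ireturn for ints and areturn for references.

```java
public class Returns {
    static int prim() {
        return 42;      // bytecode: bipush 42; ireturn
    }

    static Object ref() {
        return "x";     // bytecode: ldc "x"; areturn
    }

    public static void main(String[] args) {
        System.out.println(prim());
        System.out.println(ref());
    }
}
```

Running `javap -c Returns` on the compiled class shows the two opcode sequences side by side.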
