Is array shallow copy cheap, regardless of size?

Is array shallow copy cheap, regardless of size? - java

I'm testing System.arraycopy, which does a shallow copy that I assumed is cheap, no matter how many elements I want to copy.
private static int[] randomInts = new int[800];
#Setup
public void setup() {
Random rand = new Random();
for (int i = 0; i < 800; i++) {
randomInts[i] = rand.nextInt(10000) + 1;
}
}
// Copy first 20 elements from randomInts
#Benchmark
public int[] testFirst20() {
final int[] toArray = new int[20];
System.arraycopy(randomInts, 0, toArray, 0, 20);
return toArray;
}
// Copy the whole randomInts
#Benchmark
public int[] testFull() {
final int[] toArray = new int[800];
System.arraycopy(randomInts, 0, toArray, 0, 800);
return toArray;
}
Results:
Benchmark Mode Cnt Score Error Units
BenchAddToArray.testFirst20 thrpt 5 116484.035 ± 9793.886 ops/ms
BenchAddToArray.testFull thrpt 5 3348.843 ± 429.225 ops/ms
So I guess we should be careful with the performance even if it's shallow copy.
I knew taking a taking a full copy should be more expansive than taking a smaller copy, but didn't expect the difference to be so big.
Could anyone give me some insights as to which part of System.arraycopy resulted in the difference between copying arrays of different lengths?
Thanks to #Sweeper's comment, I also tested another version where the array initialisation is done in #Setup, so that the benchmark is purely on arrayCopy. Got following results:
Benchmark Mode Cnt Score Error Units
BenchAddToArray.testFirst20 thrpt 5 30061.031 ± 4521.763 ops/ms
BenchAddToArray.testFull thrpt 5 2753.594 ± 117.719 ops/ms
So array init does account for a large portion of the difference.

The short answer is No. The cost of arraycopy is proportional to the size of the array that you are copying.
Indeed, your last benchmark demonstrates this to be true.
Frankly, it does not make physical (electrical) sense for copying an N element array to be better than O(N). That's not how computer memory hardware works.
At a low level (beneath all of the pipelining, caching, etc) copying an array in memory involves RAM reads from one series of locations and RAM writes to another series of locations. The RAM hardware is only capable of reading / writing 8 or 16 (or something) bytes (lets call the number W) per operation because the memory data bus is only 64 or 128 (or something) bits wide. Therefore copying N bytes takes N / W memory operations which makes the copy take O(N) clock cycles.

Related

How does Java's JIT compiler make bad code faster than good code?

Background
This is a research that started from a Data Structures and Algorithms lecture. I apologize for the long background, but it is necessary to understand the question.
The partitioning algorithms Lomuto and Hoare were benchmarked to see if their running time corresponds to what the theory predicts. Analysis shows that they perform the following number of swaps, depending on the length of the input array n, are approximately:
Lomuto: n/2
Hoare: n/4; if big and small numbers are equally distributed in the array
Hoare: n/6; if big and small numbers are not equally distributed in the array
Swapping is the most expensive operation of these algorithms, so it is a good estimate for how they compare in terms of running time. This means that if the array is random, having big and small numbers equally dstributed, and Hoare has a running time of 20 micros, Lomuto should take approximately 40 micros, which is 2 times slower.
The microbenchmark:
public static double averageRuntimeTests(Function<int[], Integer> algorithm, int size,
int warmupIterations, int measureIterations, int repeatedExecs) {
long start, end;
int result = -1;
double[] runningTimes = new double[measureIterations];
// Warmup
for (int i = 0; i < warmupIterations; i++) {
int[] A = randomArray(size);
for (int j = 0; j < repeatedExecs; j++)
result = algorithm.apply(A);
System.out.print(result);
}
// Measure
for (int i = 0; i < measureIterations; i++) {
int[] A = randomArray(size);
start = System.nanoTime();
for (int j = 0; j < repeatedExecs; j++)
result = algorithm.apply(A);
end = System.nanoTime();
System.out.print(result);
runningTimes[i] = (end - start) / (repeatedExecs * 1000.0);
}
return average(runningTimes);
}
The microbenchmark corresponds with the theory, if the algorithms are run sufficiently many times on the same input array such that the JIT compiler "has enough time" to optimize the code. See the following code, where the algorithms are run 30 times for each different array.
>>> case 1
public static void main(String[] args) {
int size = 10_000, warmupIterations = 10_000, measureIterations = 2_000, repeatedExecs = 30;
System.out.printf("input size = %d%n", size);
System.out.printf("#warmup iterations = %d%n", warmupIterations);
System.out.printf("#measure iterations = %d%n", measureIterations);
System.out.printf("#repeated executions = %d%n", repeatedExecs);
double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
measureIterations, repeatedExecs);
double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
measureIterations, repeatedExecs);
System.out.printf("%nHoare: %f us/op%n", timeHoare);
System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result:
Lomuto: 7.94 us/op
Hoare: 3.7685 us/op
Lomuto is approximately 2 times slower than Hoare, as expected for a uniformly distributed array.
Result when Lomuto runs before Hoare:
Lomuto: 13.513 us/op
Hoare: 3.5865 us/op
Lomuto is 4 times slower than Hoare, more than it should. For some reason it takes almost double the time if it runs before Hoare.
Problem
However, if the algorithms run just one single time for each different input array, the JIT compiler behaves unexpectedly.
>>> case 2
public static void main(String[] args) {
int size = 10_000, warmupIterations = 300_000, measureIterations = 60_000, repeatedExecs = 1;
System.out.printf("input size = %d%n", size);
System.out.printf("#warmup iterations = %d%n", warmupIterations);
System.out.printf("#measure iterations = %d%n", measureIterations);
System.out.printf("#repeated executions = %d%n", repeatedExecs);
double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
measureIterations, repeatedExecs);
double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
measureIterations, repeatedExecs);
System.out.printf("%nHoare: %f us/op%n", timeHoare);
System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result (whether Hoare runs before Lomuto or not):
Lomuto: 26.676133 us/op
Hoare: 31.8233 us/op
It is shocking to see that Lomuto is even faster than Hoare! What is the JIT compiler doing here?
I am constantly saying that the JIT compiler is to be blamed, because if I disable it completely and run in interpreter-only mode (using the -Djava.compiler=NONE flag) the benchmark is again as expected. Running algorithms one single time for each different array...
>>> case 3
Result (whether Hoare runs before Lomuto or not):
Lomuto: 597.76 us/op
Hoare: 254.0455 us/op
As you can see Lomuto is again approximately 2 times slower, as expected.
Could someone please explain what is going on with the JIT compiler in case 2? It looks like the JIT compiler just partially optimizes the code. But then, why is Lomuto as fast as Hoare? Shouldn't Hoare be still faster?
Please note:
I know that there is the JMH library to reliably run microbenchmarks in Java. I am just trying to understand the underlying mechanics of the JIT compiler.
micros is a shorthand for microseconds and is the same as us.

How does Java's JIT compiler run bad code faster than good code?
The JIT compiler doesn't run code. It compiles code.
What you are seeing is not (valid) evidence of bad code running faster than good code.
I know that there is the JMH library to reliably run microbenchmarks in Java. I am just trying to understand the underlying mechanics of the JIT compiler.
It is probably not what the JIT compiler does that matters. It is probably when it does it that is causing the problems.
The JIT compiler uses CPU time to optimize. If you implement a micro-benchmark in the naive way, part or all of the JIT compiler time will be included in one (or more) of the iterations of your benchmark. That will make the benchmark iteration (or iterations) to appear to take longer than an earlier or later iteration. This distorts the apparent execution time.
When you think about it, your Java benchmark runs at three or more apparent1 speeds:
It starts out running slowly because the JVM is interpreting bytecodes and gathering statistics.
Once it has gathered enough stats, the JVM (apparently) runs really slowly while the JIT compiler is compiling and optimizing bytecodes.
Once the JIT compiler has done its job, the JVM starts executing the compiled native code, and the code runs much faster.
If you use jmh (or similar) it should compensate for the effects of JIT compilation and other anomalies. If you want information, please read How do I write a correct micro-benchmark in Java? ... which helps to explain the common pitfalls.
1 - I am referring to apparent speed obtained using before and after system clock measurements.

Should I create a full-size array for toArray(T[] a)? [duplicate]

Assuming I have an ArrayList
ArrayList<MyClass> myList;
And I want to call toArray, is there a performance reason to use
MyClass[] arr = myList.toArray(new MyClass[myList.size()]);
over
MyClass[] arr = myList.toArray(new MyClass[0]);
?
I prefer the second style, since it's less verbose, and I assumed that the compiler will make sure the empty array doesn't really get created, but I've been wondering if that's true.
Of course, in 99% of the cases it doesn't make a difference one way or the other, but I'd like to keep a consistent style between my normal code and my optimized inner loops...

Counterintuitively, the fastest version, on Hotspot 8, is:
MyClass[] arr = myList.toArray(new MyClass[0]);
I have run a micro benchmark using jmh the results and code are below, showing that the version with an empty array consistently outperforms the version with a presized array. Note that if you can reuse an existing array of the correct size, the result may be different.
Benchmark results (score in microseconds, smaller = better):
Benchmark (n) Mode Samples Score Error Units
c.a.p.SO29378922.preSize 1 avgt 30 0.025 ▒ 0.001 us/op
c.a.p.SO29378922.preSize 100 avgt 30 0.155 ▒ 0.004 us/op
c.a.p.SO29378922.preSize 1000 avgt 30 1.512 ▒ 0.031 us/op
c.a.p.SO29378922.preSize 5000 avgt 30 6.884 ▒ 0.130 us/op
c.a.p.SO29378922.preSize 10000 avgt 30 13.147 ▒ 0.199 us/op
c.a.p.SO29378922.preSize 100000 avgt 30 159.977 ▒ 5.292 us/op
c.a.p.SO29378922.resize 1 avgt 30 0.019 ▒ 0.000 us/op
c.a.p.SO29378922.resize 100 avgt 30 0.133 ▒ 0.003 us/op
c.a.p.SO29378922.resize 1000 avgt 30 1.075 ▒ 0.022 us/op
c.a.p.SO29378922.resize 5000 avgt 30 5.318 ▒ 0.121 us/op
c.a.p.SO29378922.resize 10000 avgt 30 10.652 ▒ 0.227 us/op
c.a.p.SO29378922.resize 100000 avgt 30 139.692 ▒ 8.957 us/op
For reference, the code:
#State(Scope.Thread)
#BenchmarkMode(Mode.AverageTime)
public class SO29378922 {
#Param({"1", "100", "1000", "5000", "10000", "100000"}) int n;
private final List<Integer> list = new ArrayList<>();
#Setup public void populateList() {
for (int i = 0; i < n; i++) list.add(0);
}
#Benchmark public Integer[] preSize() {
return list.toArray(new Integer[n]);
}
#Benchmark public Integer[] resize() {
return list.toArray(new Integer[0]);
}
}
You can find similar results, full analysis, and discussion in the blog post Arrays of Wisdom of the Ancients. To summarize: the JVM and JIT compiler contains several optimizations that enable it to cheaply create and initialize a new correctly sized array, and those optimizations can not be used if you create the array yourself.

As of ArrayList in Java 5, the array will be filled already if it has the right size (or is bigger). Consequently
MyClass[] arr = myList.toArray(new MyClass[myList.size()]);
will create one array object, fill it and return it to "arr". On the other hand
MyClass[] arr = myList.toArray(new MyClass[0]);
will create two arrays. The second one is an array of MyClass with length 0. So there is an object creation for an object that will be thrown away immediately. As far as the source code suggests the compiler / JIT cannot optimize this one so that it is not created. Additionally, using the zero-length object results in casting(s) within the toArray() - method.
See the source of ArrayList.toArray():
public <T> T[] toArray(T[] a) {
if (a.length < size)
// Make a new array of a's runtime type, but my contents:
return (T[]) Arrays.copyOf(elementData, size, a.getClass());
System.arraycopy(elementData, 0, a, 0, size);
if (a.length > size)
a[size] = null;
return a;
}
Use the first method so that only one object is created and avoid (implicit but nevertheless expensive) castings.

From JetBrains Intellij Idea inspection:
There are two styles to convert a collection to an array: either using
a pre-sized array (like c.toArray(new String[c.size()])) or
using an empty array (like c.toArray(new String[0]). In
older Java versions using pre-sized array was recommended, as the
reflection call which is necessary to create an array of proper size
was quite slow. However since late updates of OpenJDK 6 this call
was intrinsified, making the performance of the empty array version
the same and sometimes even better, compared to the pre-sized
version. Also passing pre-sized array is dangerous for a concurrent or
synchronized collection as a data race is possible between the
size and toArray call which may result in extra nulls
at the end of the array, if the collection was concurrently shrunk
during the operation. This inspection allows to follow the
uniform style: either using an empty array (which is recommended in
modern Java) or using a pre-sized array (which might be faster in
older Java versions or non-HotSpot based JVMs).

Modern JVMs optimise reflective array construction in this case, so the performance difference is tiny. Naming the collection twice in such boilerplate code is not a great idea, so I'd avoid the first method. Another advantage of the second is that it works with synchronised and concurrent collections. If you want to make optimisation, reuse the empty array (empty arrays are immutable and can be shared), or use a profiler(!).

toArray checks that the array passed is of the right size (that is, large enough to fit the elements from your list) and if so, uses that. Consequently if the size of the array provided it smaller than required, a new array will be reflexively created.
In your case, an array of size zero, is immutable, so could safely be elevated to a static final variable, which might make your code a little cleaner, which avoids creating the array on each invocation. A new array will be created inside the method anyway, so it's a readability optimisation.
Arguably the faster version is to pass the array of a correct size, but unless you can prove this code is a performance bottleneck, prefer readability to runtime performance until proven otherwise.

The first case is more efficient.
That is because in the second case:
MyClass[] arr = myList.toArray(new MyClass[0]);
the runtime actually creates an empty array (with zero size) and then inside the toArray method creates another array to fit the actual data. This creation is done using reflection using the following code (taken from jdk1.5.0_10):
public <T> T[] toArray(T[] a) {
if (a.length < size)
a = (T[])java.lang.reflect.Array.
newInstance(a.getClass().getComponentType(), size);
System.arraycopy(elementData, 0, a, 0, size);
if (a.length > size)
a[size] = null;
return a;
}
By using the first form, you avoid the creation of a second array and also avoid the reflection code.

The second one is marginally mor readable, but there so little improvement that it's not worth it. The first method is faster, with no disadvantages at runtime, so that's what I use. But I write it the second way, because it's faster to type. Then my IDE flags it as a warning and offers to fix it. With a single keystroke, it converts the code from the second type to the first one.

Using 'toArray' with the array of the correct size will perform better as the alternative will create first the zero sized array then the array of the correct size. However, as you say the difference is likely to be negligible.
Also, note that the javac compiler does not perform any optimization. These days all optimizations are performed by the JIT/HotSpot compilers at runtime. I am not aware of any optimizations around 'toArray' in any JVMs.
The answer to your question, then, is largely a matter of style but for consistency's sake should form part of any coding standards you adhere to (whether documented or otherwise).

How does the time it takes Math.random() to run compare to that of simple arithmetic operation?

In Java, how long relative to a simple arithmetic operation does it take Math.random() to generate a number? I am trying to randomly distribute objects in an ArrayList that already contains values such that a somewhat even, but not completely even distribution is created, and I am not sure if choosing a random index for every insertion point by using Math.random() is the best approach to take.
Clarification: the distribution of inserted objects is meant to be even enough that the values are not all concentrated in one area, but also uneven enough that the distribution is not predictable (if someone were to go through the values one by one, they would not be able to determine if the next value was going to be a newly inserted value by detecting a constant pattern).

Do not use Math.random. It relies on a global instance of java.util.Random that uses AtomicLong under the hood. Though the PRNG algorithm used in java.util.Random is pretty simple, the performance is mostly affected by the atomic CAS and the related cache-coherence traffic.
This can be particularly bad for multithread appliсations (like in this example), but has also a penalty even in a single-threaded case.
ThreadLocalRandom is always preferable to Math.random. It does not rely on atomic operations and does not suffer from contention. It only updates a thread-local state and uses a couple of arithmetic and bitwise operations.
Here is a JMH benchmark to compare the performance of Math.random() and ThreadLocalRandom.current().nextDouble() to a simple arithmetic operation.
package bench;
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;
#State(Scope.Thread)
public class RandomBench {
double x = 1;
#Benchmark
public double multiply() {
return x * Math.PI;
}
#Benchmark
public double mathRandom() {
return Math.random();
}
#Benchmark
public double threadLocalRandom() {
return ThreadLocalRandom.current().nextDouble();
}
}
The results show that ThreadLocalRandom works in just a few nanoseconds, its performance is comparable to a simple arithmetic operation and perfectly scales in multithreaded environment unlike Math.random.
Benchmark Threads Score Error Units
RandomBench.mathRandom 1 34.265 ± 1.709 ns/op
RandomBench.multiply 1 4.531 ± 0.108 ns/op
RandomBench.threadLocalRandom 1 8.322 ± 0.047 ns/op
RandomBench.mathRandom 2 366.589 ± 63.899 ns/op
RandomBench.multiply 2 4.627 ± 0.118 ns/op
RandomBench.threadLocalRandom 2 8.342 ± 0.079 ns/op
RandomBench.mathRandom 4 1328.472 ± 177.216 ns/op
RandomBench.multiply 4 4.592 ± 0.091 ns/op
RandomBench.threadLocalRandom 4 8.474 ± 0.157 ns/op

Java documentations report this as the implementation for Random.nextDouble(), which is what Math.random() ends up invoking.
public double nextDouble() {
return (((long)next(26) << 27) + next(27))
/ (double)(1L << 53);
}
Where next updates the seed to (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1) and returns (int)(seed >>> (48 - bits)).
As you can see it uses a simple algorithm to generate pseudo-random values. It requires just a few cheap operations and so I wouldn't worry about using it.

Creating a random number is a simple operation, you shouldn't worry about it.
But you should keep in mind several things
It is better to reuse Random instance, creating new Random() instance every time you need a random value usually is a bad decision.
But do not use the same Random instance across several threads simultaneously to avoid contention, you can use ThreadLocalRandom.current() instead.
If you do dome cryptography use SecureRandom instead.

Why don't primitive Stream have collect(Collector)?

I'm writing a library for novice programmers so I'm trying to keep the API as clean as possible.
One of the things my Library needs to do is perform some complex computations on a large collection of ints or longs. There are lots of scenarios and business objects that my users need to compute these values from, so I thought the best way would be to use streams to allow users to map business objects to IntStream or LongStream and then compute the computations inside of a collector.
However IntStream and LongStream only have the 3 parameter collect method:
collect(Supplier<R> supplier, ObjIntConsumer<R> accumulator, BiConsumer<R,R> combiner)
And doesn't have the simplier collect(Collector) method that Stream<T> has.
So instead of being able to do
Collection<T> businessObjs = ...
MyResult result = businessObjs.stream()
.mapToInt( ... )
.collect( new MyComplexComputation(...));
I have to do provide Suppliers, accumulators and combiners like this:
MyResult result = businessObjs.stream()
.mapToInt( ... )
.collect(
()-> new MyComplexComputationBuilder(...),
(builder, v)-> builder.add(v),
(a,b)-> a.merge(b))
.build(); //prev collect returns Builder object
This is way too complicated for my novice users and is very error prone.
My work around is to make static methods that take an IntStream or LongStream as input and hide the collector creation and execution for you
public static MyResult compute(IntStream stream, ...){
return .collect(
()-> new MyComplexComputationBuilder(...),
(builder, v)-> builder.add(v),
(a,b)-> a.merge(b))
.build();
}
But that doesn't follow the normal conventions of working with Streams:
IntStream tmpStream = businessObjs.stream()
.mapToInt( ... );
MyResult result = MyUtil.compute(tmpStream, ...);
Because you have to either save a temp variable and pass that to the static method, or create the Stream inside the static call which may be confusing when it's is mixed in with the other parameters to my computation.
Is there a cleaner way to do this while still working with IntStream or LongStream ?

We did in fact prototype some Collector.OfXxx specializations. What we found -- in addition to the obvious annoyance of more specialized types -- was that this was not really very useful without having a full complement of primitive-specialized collections (like Trove does, or GS-Collections, but which the JDK does not have). Without an IntArrayList, for example, a Collector.OfInt merely pushes the boxing somewhere else -- from the Collector to the container -- which no big win, and lots more API surface.

Perhaps if method references are used instead of lambdas, the code needed for the primitive stream collect will not seem as complicated.
MyResult result = businessObjs.stream()
.mapToInt( ... )
.collect(
MyComplexComputationBuilder::new,
MyComplexComputationBuilder::add,
MyComplexComputationBuilder::merge)
.build(); //prev collect returns Builder object
In Brian's definitive answer to this question, he mentions two other Java collection frameworks that do have primitive collections that actually can be used with the collect method on primitive streams. I thought it might be useful to illustrate some examples of how to use the primitive containers in these frameworks with primitive streams. The code below will also work with a parallel stream.
// Eclipse Collections
List<Integer> integers = Interval.oneTo(5).toList();
Assert.assertEquals(
IntInterval.oneTo(5),
integers.stream()
.mapToInt(Integer::intValue)
.collect(IntArrayList::new, IntArrayList::add, IntArrayList::addAll));
// Trove Collections
Assert.assertEquals(
new TIntArrayList(IntStream.range(1, 6).toArray()),
integers.stream()
.mapToInt(Integer::intValue)
.collect(TIntArrayList::new, TIntArrayList::add, TIntArrayList::addAll));
Note: I am a committer for Eclipse Collections.

I've implemented the primitive collectors in my library StreamEx (since version 0.3.0). There are interfaces IntCollector, LongCollector and DoubleCollector which extend the Collector interface and specialized to work with primitives. There's an additional minor difference in combining procedure as methods like IntStream.collect accept a BiConsumer instead of BinaryOperator.
There is a bunch of predefined collection methods to join numbers to string, store to primitive array, to BitSet, find min, max, sum, calculate summary statistics, perform group-by and partition-by operations. Of course, you can define your own collectors. Here's several usage examples (assumed that you have int[] input array with input data).
Join numbers as string with separator:
String nums = IntStreamEx.of(input).collect(IntCollector.joining(","));
Grouping by last digit:
Map<Integer, int[]> groups = IntStreamEx.of(input)
.collect(IntCollector.groupingBy(i -> i % 10));
Sum positive and negative numbers separately:
Map<Boolean, Integer> sums = IntStreamEx.of(input)
.collect(IntCollector.partitioningBy(i -> i > 0, IntCollector.summing()));
Here's a simple benchmark which compares these collectors and usual object collectors.
Note that my library does not provide (and will not provide in future) any user-visible data structures like maps on primitives, so grouping is performed into usual HashMap. However if you are using Trove/GS/HFTC/whatever, it's not so difficult to write additional primitive collectors for the data structures defined in these libraries to gain more performance.

Convert the primitive streams to boxed object streams if there are methods you're missing.
MyResult result = businessObjs.stream()
.mapToInt( ... )
.boxed()
.collect( new MyComplexComputation(...));
Or don't use the primitive streams in the first place and work with Integers the whole time.
MyResult result = businessObjs.stream()
.map( ... ) // map to Integer not int
.collect( new MyComplexComputation(...));

Mr. Geotz provided the definitive answer for why the decision was made not to include specialized Collectors, however, I wanted to further investigate how much this decision affected performance.
I thought I would post my results as an answer.
I used the jmh microbenchmark framework to time how long it takes to compute calculations using both kinds of Collectors over collections of sizes 1, 100, 1000, 100,000 and 1 million:
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#State(Scope.Thread)
public class MyBenchmark {
#Param({"1", "100", "1000", "100000", "1000000"})
public int size;
List<BusinessObj> seqs;
#Setup
public void setup(){
seqs = new ArrayList<BusinessObj>(size);
Random rand = new Random();
for(int i=0; i< size; i++){
//these lengths are random but over 128 so no caching of Longs
seqs.add(BusinessObjFactory.createOfRandomLength());
}
}
#Benchmark
public double objectCollector() {
return seqs.stream()
.map(BusinessObj::getLength)
.collect(MyUtil.myCalcLongCollector())
.getAsDouble();
}
#Benchmark
public double primitiveCollector() {
LongStream stream= seqs.stream()
.mapToLong(BusinessObj::getLength);
return MyUtil.myCalc(stream)
.getAsDouble();
}
public static void main(String[] args) throws RunnerException{
Options opt = new OptionsBuilder()
.include(MyBenchmark.class.getSimpleName())
.build();
new Runner(opt).run();
}
}
Here are the results:
# JMH 1.9.3 (released 4 days ago)
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.sample.MyBenchmark.objectCollector
# Run complete. Total time: 01:30:31
Benchmark (size) Mode Cnt Score Error Units
MyBenchmark.objectCollector 1 avgt 200 140.803 ± 1.425 ns/op
MyBenchmark.objectCollector 100 avgt 200 5775.294 ± 67.871 ns/op
MyBenchmark.objectCollector 1000 avgt 200 70440.488 ± 1023.177 ns/op
MyBenchmark.objectCollector 100000 avgt 200 10292595.233 ± 101036.563 ns/op
MyBenchmark.objectCollector 1000000 avgt 200 100147057.376 ± 979662.707 ns/op
MyBenchmark.primitiveCollector 1 avgt 200 140.971 ± 1.382 ns/op
MyBenchmark.primitiveCollector 100 avgt 200 4654.527 ± 87.101 ns/op
MyBenchmark.primitiveCollector 1000 avgt 200 60929.398 ± 1127.517 ns/op
MyBenchmark.primitiveCollector 100000 avgt 200 9784655.013 ± 113339.448 ns/op
MyBenchmark.primitiveCollector 1000000 avgt 200 94822089.334 ± 1031475.051 ns/op
As you can see, the primitive Stream version is slightly faster, but even when there are 1 million elements in the collection, it is only 0.05 seconds faster (on average).
For my API I would rather keep to the cleaner Object Stream conventions and use the Boxed version since it is such a minor performance penalty.
Thanks to everyone who shed insight into this issue.

Is it true that having lots of small methods helps the JIT compiler optimize?

In a recent discussion about how to optimize some code, I was told that breaking code up into lots of small methods can significantly increase performance, because the JIT compiler doesn't like to optimize large methods.
I wasn't sure about this since it seems that the JIT compiler should itself be able to identify self-contained segments of code, irrespective of whether they are in their own method or not.
Can anyone confirm or refute this claim?

The Hotspot JIT only inlines methods that are less than a certain (configurable) size. So using smaller methods allows more inlining, which is good.
See the various inlining options on this page.
EDIT
To elaborate a little:
if a method is small it will get inlined so there is little chance to get penalised for splitting the code in small methods.
in some instances, splitting methods may result in more inlining.
Example (full code to have the same line numbers if you try it)
package javaapplication27;
public class TestInline {
private int count = 0;
public static void main(String[] args) throws Exception {
TestInline t = new TestInline();
int sum = 0;
for (int i = 0; i < 1000000; i++) {
sum += t.m();
}
System.out.println(sum);
}
public int m() {
int i = count;
if (i % 10 == 0) {
i += 1;
} else if (i % 10 == 1) {
i += 2;
} else if (i % 10 == 2) {
i += 3;
}
i += count;
i *= count;
i++;
return i;
}
}
When running this code with the following JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:FreqInlineSize=50 -XX:MaxInlineSize=50 -XX:+PrintInlining (yes I have used values that prove my case: m is too big but both the refactored m and m2 are below the threshold - with other values you might get a different output).
You will see that m() and main() get compiled, but m() does not get inlined:
56 1 javaapplication27.TestInline::m (62 bytes)
57 1 % javaapplication27.TestInline::main # 12 (53 bytes)
# 20 javaapplication27.TestInline::m (62 bytes) too big
You can also inspect the generated assembly to confirm that m is not inlined (I used these JVM flags: -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel) - it will look like this:
0x0000000002780624: int3 ;*invokevirtual m
; - javaapplication27.TestInline::main#20 (line 10)
If you refactor the code like this (I have extracted the if/else in a separate method):
public int m() {
int i = count;
i = m2(i);
i += count;
i *= count;
i++;
return i;
}
public int m2(int i) {
if (i % 10 == 0) {
i += 1;
} else if (i % 10 == 1) {
i += 2;
} else if (i % 10 == 2) {
i += 3;
}
return i;
}
You will see the following compilation actions:
60 1 javaapplication27.TestInline::m (30 bytes)
60 2 javaapplication27.TestInline::m2 (40 bytes)
# 7 javaapplication27.TestInline::m2 (40 bytes) inline (hot)
63 1 % javaapplication27.TestInline::main # 12 (53 bytes)
# 20 javaapplication27.TestInline::m (30 bytes) inline (hot)
# 7 javaapplication27.TestInline::m2 (40 bytes) inline (hot)
So m2 gets inlined into m, which you would expect so we are back to the original scenario. But when main gets compiled, it actually inlines the whole thing. At the assembly level, it means you won't find any invokevirtual instructions any more. You will find lines like this:
0x00000000026d0121: add ecx,edi ;*iinc
; - javaapplication27.TestInline::m2#7 (line 33)
; - javaapplication27.TestInline::m#7 (line 24)
; - javaapplication27.TestInline::main#20 (line 10)
where basically common instructions are "mutualised".
Conclusion
I am not saying that this example is representative but it seems to prove a few points:
using smaller method improves readability in your code
smaller methods will generally be inlined, so you will most likely not pay the cost of the extra method call (it will be performance neutral)
using smaller methods might improve inlining globally in some circumstances, as shown by the example above
And finally: if a portion of your code is really critical for performance that these considerations matter, you should examine the JIT output to fine tune your code and importantly profile before and after.

If you take the exact same code and just break them up into lots of small methods, that is not going to help JIT at all.
A better way to put it is that modern HotSpot JVMs do not penalize you for writing a lot of small methods. They do get aggressively inlined, so at runtime you do not really pay the cost of function calls. This is true even for invokevirtual calls, such as the one that calls an interface method.
I did a blog post several years ago that describes how you can see JVM is inlining methods. The technique is still applicable to modern JVMs. I also found it useful to look at the discussions related to invokedynamic, where how the modern HotSpot JVMs compiles Java byte code gets discussed extensively.

I've read numerous articles which have stated that smaller methods (as measured in the number of bytes required to represent the method as Java bytecode) are more likely to be eligible for inlining by the JIT (just-in-time compiler) when it compiles hot methods (those which are being run most frequently) into machine code. And they describe how method inlining produces better performance of the resulting machine code. In short: smaller methods give the JIT more options in terms of how to compile bytecode into machine code when it identifies a hot method, and this allows more sophisticated optimizations.
To test this theory, I created a JMH class with two benchmark methods, each containing identical behaviour but factored differently. The first benchmark is named monolithicMethod (all code in a single method), and the second benchmark is named smallFocusedMethods and has been refactored so that each major behaviour has been moved out into its own method. The smallFocusedMethods benchmark look like this:
#Benchmark
public void smallFocusedMethods(TestState state) {
int i = state.value;
if (i < 90) {
actionOne(i, state);
} else {
actionTwo(i, state);
}
}
private void actionOne(int i, TestState state) {
state.sb.append(Integer.toString(i)).append(
": has triggered the first type of action.");
int result = i;
for (int j = 0; j < i; ++j) {
result += j;
}
state.sb.append("Calculation gives result ").append(Integer.toString(
result));
}
private void actionTwo(int i, TestState state) {
state.sb.append(i).append(" has triggered the second type of action.");
int result = i;
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
result *= k * j + i;
}
}
state.sb.append("Calculation gives result ").append(Integer.toString(
result));
}
and you can imagine how monolithicMethod looks (same code but entirely contained within the one method). The TestState simply does the work of creating a new StringBuilder (so that the creation of this object is not counted in the benchmark time) and of choosing a random number between 0 and 100 for each invocation (and this has been deliberately configured so that both benchmarks use exactly the same sequence of random numbers, to avoid the risk of bias).
After running the benchmark with six "forks", each involving five warmups of one second, followed by six iterations of five seconds, the results look like this:
Benchmark Mode Cnt Score Error Units
monolithicMethod thrpt 30 7609784.687 ± 118863.736 ops/s
monolithicMethod:·gc.alloc.rate thrpt 30 1368.296 ± 15.834 MB/sec
monolithicMethod:·gc.alloc.rate.norm thrpt 30 270.328 ± 0.016 B/op
monolithicMethod:·gc.churn.G1_Eden_Space thrpt 30 1357.303 ± 16.951 MB/sec
monolithicMethod:·gc.churn.G1_Eden_Space.norm thrpt 30 268.156 ± 1.264 B/op
monolithicMethod:·gc.churn.G1_Old_Gen thrpt 30 0.186 ± 0.001 MB/sec
monolithicMethod:·gc.churn.G1_Old_Gen.norm thrpt 30 0.037 ± 0.001 B/op
monolithicMethod:·gc.count thrpt 30 2123.000 counts
monolithicMethod:·gc.time thrpt 30 1060.000 ms
smallFocusedMethods thrpt 30 7855677.144 ± 48987.206 ops/s
smallFocusedMethods:·gc.alloc.rate thrpt 30 1404.228 ± 8.831 MB/sec
smallFocusedMethods:·gc.alloc.rate.norm thrpt 30 270.320 ± 0.001 B/op
smallFocusedMethods:·gc.churn.G1_Eden_Space thrpt 30 1393.473 ± 10.493 MB/sec
smallFocusedMethods:·gc.churn.G1_Eden_Space.norm thrpt 30 268.250 ± 1.193 B/op
smallFocusedMethods:·gc.churn.G1_Old_Gen thrpt 30 0.186 ± 0.001 MB/sec
smallFocusedMethods:·gc.churn.G1_Old_Gen.norm thrpt 30 0.036 ± 0.001 B/op
smallFocusedMethods:·gc.count thrpt 30 1986.000 counts
smallFocusedMethods:·gc.time thrpt 30 1011.000 ms
In short, these numbers show that the smallFocusedMethods approach ran 3.2% faster, and the difference was statistically significant (with 99.9% confidence). And note that the memory usage (based on garbage collection profiling) was not significantly different. So you get faster performance without increased overhead.
I've run a variety of similar benchmarks to test whether small, focused methods give better throughput, and I've found that the improvement is between 3% and 7% in all cases I've tried. But it's likely that the actual gain depends strongly upon the version of the JVM being used, the distribution of executions across your if/else blocks (I've gone for 90% on the first and 10% on the second to exaggerate the heat on the first "action", but I've seen throughput improvements even with a more equal spread across a chain of if/else blocks), and the actual complexity of the work being done by each of the possible actions. So be sure to write your own specific benchmarks if you need to determine what works for your specific application.
My advice is this: write small, focused methods because it makes the code tidier, easier to read, and much easier to override specific behaviours when inheritance is involved. The fact that the JIT is likely to reward you with slightly better performance is a bonus, but tidy code should be your main goal in the majority of cases. Oh, and it's also important to give each method a clear, descriptive name which exactly summarises the responsibility of the method (unlike the terrible names I've used in my benchmark).

I don't really understand how it works, but based on the link AurA provided, I would guess that the JIT compiler will have to compile less bytecode if the same bits are being reused, rather than having to compile different bytecode that is similar across different methods.
Aside from that, the more you are able to break down your code into pieces of sense, the more reuse you are going to get out of your code and that is something that will allow optimization for the VM running it (you are providing more schema to work with).
However I doubt it will have any good impact if you break your code down without any sense that provides no code reuse.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.