Why don't primitive Streams have collect(Collector)? - java

I'm writing a library for novice programmers, so I'm trying to keep the API as clean as possible.
One of the things my library needs to do is perform some complex computations on a large collection of ints or longs. There are lots of scenarios and business objects that my users need to compute these values from, so I thought the best approach would be to use streams: let users map business objects to an IntStream or LongStream and then perform the computations inside a collector.
However, IntStream and LongStream only have the three-argument collect method:
collect(Supplier<R> supplier, ObjIntConsumer<R> accumulator, BiConsumer<R,R> combiner)
and lack the simpler collect(Collector) method that Stream<T> has.
So instead of being able to do
Collection<T> businessObjs = ...
MyResult result = businessObjs.stream()
.mapToInt( ... )
.collect( new MyComplexComputation(...));
I have to provide suppliers, accumulators and combiners like this:
MyResult result = businessObjs.stream()
                              .mapToInt( ... )
                              .collect(
                                  () -> new MyComplexComputationBuilder(...),
                                  (builder, v) -> builder.add(v),
                                  (a, b) -> a.merge(b))
                              .build(); //previous collect returns a Builder object
This is way too complicated for my novice users and is very error-prone.
My workaround is to provide static methods that take an IntStream or LongStream as input and hide the collector creation and execution from the user:
public static MyResult compute(IntStream stream, ...){
    return stream.collect(
                () -> new MyComplexComputationBuilder(...),
                (builder, v) -> builder.add(v),
                (a, b) -> a.merge(b))
            .build();
}
But that doesn't follow the normal conventions of working with Streams:
IntStream tmpStream = businessObjs.stream()
                                  .mapToInt( ... );
MyResult result = MyUtil.compute(tmpStream, ...);
You either have to save a temp variable and pass it to the static method, or create the Stream inside the static call, which may be confusing when it is mixed in with the other parameters to my computation.
Is there a cleaner way to do this while still working with IntStream or LongStream?

We did in fact prototype some Collector.OfXxx specializations. What we found -- in addition to the obvious annoyance of more specialized types -- was that this was not really very useful without having a full complement of primitive-specialized collections (like Trove does, or GS-Collections, but which the JDK does not have). Without an IntArrayList, for example, a Collector.OfInt merely pushes the boxing somewhere else -- from the Collector to the container -- which is no big win, and lots more API surface.
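To make the trade-off concrete, here is a purely hypothetical sketch (no Collector.OfInt type exists in the JDK) of what a primitive-specialized collector interface might have looked like, and why it only moves the boxing when there is no primitive container to collect into:
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;
import java.util.function.ObjIntConsumer;
import java.util.function.Supplier;

// Hypothetical sketch only; this interface is not part of the JDK.
interface IntCollectorSketch<A, R> {
    Supplier<A> supplier();
    ObjIntConsumer<A> intAccumulator(); // no boxing at the collector boundary...
    BiConsumer<A, A> combiner();
    Function<A, R> finisher();
}

class BoxingDemo {
    // ...but without a primitive container (an IntArrayList) in the JDK,
    // each int is still boxed when it lands in the List<Integer>:
    static final ObjIntConsumer<List<Integer>> ACC = (list, v) -> list.add(v); // boxes v
}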

Perhaps if method references are used instead of lambdas, the code needed for the primitive stream collect will not seem as complicated.
MyResult result = businessObjs.stream()
                              .mapToInt( ... )
                              .collect(
                                  MyComplexComputationBuilder::new,
                                  MyComplexComputationBuilder::add,
                                  MyComplexComputationBuilder::merge)
                              .build(); //previous collect returns a Builder object
In Brian's definitive answer to this question, he mentions two other Java collection frameworks that do have primitive collections that actually can be used with the collect method on primitive streams. I thought it might be useful to illustrate some examples of how to use the primitive containers in these frameworks with primitive streams. The code below will also work with a parallel stream.
// Eclipse Collections
List<Integer> integers = Interval.oneTo(5).toList();
Assert.assertEquals(
        IntInterval.oneTo(5),
        integers.stream()
                .mapToInt(Integer::intValue)
                .collect(IntArrayList::new, IntArrayList::add, IntArrayList::addAll));
// Trove Collections
Assert.assertEquals(
        new TIntArrayList(IntStream.range(1, 6).toArray()),
        integers.stream()
                .mapToInt(Integer::intValue)
                .collect(TIntArrayList::new, TIntArrayList::add, TIntArrayList::addAll));
Note: I am a committer for Eclipse Collections.

I've implemented primitive collectors in my library StreamEx (since version 0.3.0). There are interfaces IntCollector, LongCollector and DoubleCollector which extend the Collector interface and are specialized to work with primitives. There's a minor difference in the combining procedure, as methods like IntStream.collect accept a BiConsumer instead of a BinaryOperator.
There is a bunch of predefined collectors to join numbers into a string, store them into a primitive array or a BitSet, find the min, max or sum, calculate summary statistics, and perform group-by and partition-by operations. Of course, you can define your own collectors. Here are several usage examples (assuming that you have an int[] input array with the input data).
Join numbers as string with separator:
String nums = IntStreamEx.of(input).collect(IntCollector.joining(","));
Grouping by last digit:
Map<Integer, int[]> groups = IntStreamEx.of(input)
        .collect(IntCollector.groupingBy(i -> i % 10));
Sum positive and negative numbers separately:
Map<Boolean, Integer> sums = IntStreamEx.of(input)
        .collect(IntCollector.partitioningBy(i -> i > 0, IntCollector.summing()));
Here's a simple benchmark which compares these collectors with the usual object collectors.
Note that my library does not provide (and will not provide in the future) any user-visible data structures like maps on primitives, so grouping is performed into a usual HashMap. However, if you are using Trove/GS/HFTC/whatever, it's not so difficult to write additional primitive collectors for the data structures defined in these libraries to gain more performance; an illustrative sketch for Trove follows.
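For illustration, a custom primitive collector targeting Trove's TIntArrayList might look like the sketch below. It assumes the IntCollector.of(supplier, intAccumulator, merger) factory method; verify the exact signature and package name against the StreamEx version you use:
import gnu.trove.list.array.TIntArrayList;
import one.util.streamex.IntCollector; // package name depends on the StreamEx version
import one.util.streamex.IntStreamEx;

public class TroveCollectorSketch {
    public static void main(String[] args) {
        int[] input = {3, 1, 4, 1, 5, 9, 2, 6};

        // Assumed factory: IntCollector.of(supplier, int accumulator, combiner).
        IntCollector<TIntArrayList, TIntArrayList> toTIntList = IntCollector.of(
                TIntArrayList::new,     // supplier: a fresh primitive list
                TIntArrayList::add,     // accumulator: adds an int, no boxing
                TIntArrayList::addAll); // combiner: merges partial results

        TIntArrayList list = IntStreamEx.of(input).collect(toTIntList);
        System.out.println(list);
    }
}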

Convert the primitive streams to boxed object streams if there are methods you're missing.
MyResult result = businessObjs.stream()
                              .mapToInt( ... )
                              .boxed()
                              .collect( new MyComplexComputation(...));
Or don't use the primitive streams in the first place and work with Integers the whole time.
MyResult result = businessObjs.stream()
                              .map( ... ) // map to Integer, not int
                              .collect( new MyComplexComputation(...));
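If MyComplexComputation is meant to be a Collector, the standard Collector.of factory can produce one from the builder sketched in the question (MyComplexComputationBuilder and its add/merge/build methods are the question's hypothetical API, not a real library). A minimal sketch:
import java.util.stream.Collector;

// Minimal sketch: wraps the question's hypothetical builder in a standard
// Collector, so boxed streams can use .collect(Collector) directly.
public static Collector<Integer, ?, MyResult> myComplexComputation() {
    return Collector.of(
            MyComplexComputationBuilder::new,    // supplier
            MyComplexComputationBuilder::add,    // accumulator (unboxes the Integer)
            (a, b) -> { a.merge(b); return a; }, // combiner must return a builder
            MyComplexComputationBuilder::build); // finisher produces MyResult
}
With that in place, businessObjs.stream().map( ... ).collect(myComplexComputation()) reads much like the regular object-stream convention the question asks for.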

Mr. Goetz provided the definitive answer for why the decision was made not to include specialized Collectors; however, I wanted to investigate further how much this decision affected performance.
I thought I would post my results as an answer.
I used the JMH microbenchmark framework to time how long it takes to compute calculations using both kinds of Collectors over collections of sizes 1, 100, 1,000, 100,000 and 1,000,000:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// BusinessObj, BusinessObjFactory and MyUtil are the author's own classes (not shown).
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class MyBenchmark {

    @Param({"1", "100", "1000", "100000", "1000000"})
    public int size;

    List<BusinessObj> seqs;

    @Setup
    public void setup(){
        seqs = new ArrayList<BusinessObj>(size);
        for(int i = 0; i < size; i++){
            //these lengths are random but over 128 so no caching of Longs
            seqs.add(BusinessObjFactory.createOfRandomLength());
        }
    }

    @Benchmark
    public double objectCollector() {
        return seqs.stream()
                .map(BusinessObj::getLength)
                .collect(MyUtil.myCalcLongCollector())
                .getAsDouble();
    }

    @Benchmark
    public double primitiveCollector() {
        LongStream stream = seqs.stream()
                .mapToLong(BusinessObj::getLength);
        return MyUtil.myCalc(stream)
                .getAsDouble();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}
Here are the results:
# JMH 1.9.3 (released 4 days ago)
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.sample.MyBenchmark.objectCollector
# Run complete. Total time: 01:30:31
Benchmark                       (size)  Mode  Cnt          Score         Error  Units
MyBenchmark.objectCollector          1  avgt  200        140.803 ±       1.425  ns/op
MyBenchmark.objectCollector        100  avgt  200       5775.294 ±      67.871  ns/op
MyBenchmark.objectCollector       1000  avgt  200      70440.488 ±    1023.177  ns/op
MyBenchmark.objectCollector     100000  avgt  200   10292595.233 ±  101036.563  ns/op
MyBenchmark.objectCollector    1000000  avgt  200  100147057.376 ±  979662.707  ns/op
MyBenchmark.primitiveCollector       1  avgt  200        140.971 ±       1.382  ns/op
MyBenchmark.primitiveCollector     100  avgt  200       4654.527 ±      87.101  ns/op
MyBenchmark.primitiveCollector    1000  avgt  200      60929.398 ±    1127.517  ns/op
MyBenchmark.primitiveCollector  100000  avgt  200    9784655.013 ±  113339.448  ns/op
MyBenchmark.primitiveCollector 1000000  avgt  200   94822089.334 ± 1031475.051  ns/op
As you can see, the primitive Stream version is slightly faster, but even when there are 1 million elements in the collection, it is only about 5 milliseconds faster (on average).
For my API I would rather keep to the cleaner object Stream conventions and use the boxed version, since it is such a minor performance penalty.
Thanks to everyone who shed insight into this issue.

Related

Is array shallow copy cheap, regardless of size?

I'm testing System.arraycopy, which performs a shallow copy that I assumed was cheap, no matter how many elements I want to copy.
private static int[] randomInts = new int[800];

@Setup
public void setup() {
    Random rand = new Random();
    for (int i = 0; i < 800; i++) {
        randomInts[i] = rand.nextInt(10000) + 1;
    }
}

// Copy first 20 elements from randomInts
@Benchmark
public int[] testFirst20() {
    final int[] toArray = new int[20];
    System.arraycopy(randomInts, 0, toArray, 0, 20);
    return toArray;
}

// Copy the whole randomInts
@Benchmark
public int[] testFull() {
    final int[] toArray = new int[800];
    System.arraycopy(randomInts, 0, toArray, 0, 800);
    return toArray;
}
Results:
Benchmark                    Mode  Cnt       Score      Error   Units
BenchAddToArray.testFirst20  thrpt   5  116484.035 ± 9793.886  ops/ms
BenchAddToArray.testFull     thrpt   5    3348.843 ±  429.225  ops/ms
So I guess we should be careful about performance even though it's a shallow copy.
I knew taking a full copy should be more expensive than taking a smaller copy, but I didn't expect the difference to be so big.
Could anyone give me some insight into which part of System.arraycopy resulted in the difference between copying arrays of different lengths?
Thanks to @Sweeper's comment, I also tested another version where the array initialisation is done in @Setup, so that the benchmark is purely on System.arraycopy. I got the following results:
Benchmark                    Mode  Cnt      Score     Error   Units
BenchAddToArray.testFirst20  thrpt   5  30061.031 ± 4521.763  ops/ms
BenchAddToArray.testFull     thrpt   5   2753.594 ±  117.719  ops/ms
So array init does account for a large portion of the difference.
The short answer is No. The cost of arraycopy is proportional to the size of the array that you are copying.
Indeed, your last benchmark demonstrates this to be true.
Frankly, it does not make physical (electrical) sense for copying an N element array to be better than O(N). That's not how computer memory hardware works.
At a low level (beneath all of the pipelining, caching, etc.) copying an array in memory involves RAM reads from one series of locations and RAM writes to another series of locations. The RAM hardware is only capable of reading / writing 8 or 16 (or something) bytes (let's call the number W) per operation, because the memory data bus is only 64 or 128 (or something) bits wide. Therefore copying N bytes takes N / W memory operations, which makes the copy take O(N) clock cycles. With W = 8, for example, the full 800-int copy (3,200 bytes) needs about 400 such operations versus about 10 for the 20-int copy, which is consistent with the large gap measured above.

Should I create a full-size array for toArray(T[] a)? [duplicate]

Assuming I have an ArrayList
ArrayList<MyClass> myList;
And I want to call toArray, is there a performance reason to use
MyClass[] arr = myList.toArray(new MyClass[myList.size()]);
over
MyClass[] arr = myList.toArray(new MyClass[0]);
?
I prefer the second style, since it's less verbose, and I assumed that the compiler would make sure the empty array doesn't really get created, but I've been wondering if that's true.
Of course, in 99% of the cases it doesn't make a difference one way or the other, but I'd like to keep a consistent style between my normal code and my optimized inner loops...
Counterintuitively, the fastest version, on HotSpot 8, is:
MyClass[] arr = myList.toArray(new MyClass[0]);
I have run a microbenchmark using JMH; the results and code are below, showing that the version with an empty array consistently outperforms the version with a presized array. Note that if you can reuse an existing array of the correct size, the result may be different.
Benchmark results (score in microseconds, smaller = better):
Benchmark                      (n)  Mode  Samples    Score    Error  Units
c.a.p.SO29378922.preSize         1  avgt       30    0.025 ±  0.001  us/op
c.a.p.SO29378922.preSize       100  avgt       30    0.155 ±  0.004  us/op
c.a.p.SO29378922.preSize      1000  avgt       30    1.512 ±  0.031  us/op
c.a.p.SO29378922.preSize      5000  avgt       30    6.884 ±  0.130  us/op
c.a.p.SO29378922.preSize     10000  avgt       30   13.147 ±  0.199  us/op
c.a.p.SO29378922.preSize    100000  avgt       30  159.977 ±  5.292  us/op
c.a.p.SO29378922.resize          1  avgt       30    0.019 ±  0.000  us/op
c.a.p.SO29378922.resize        100  avgt       30    0.133 ±  0.003  us/op
c.a.p.SO29378922.resize       1000  avgt       30    1.075 ±  0.022  us/op
c.a.p.SO29378922.resize       5000  avgt       30    5.318 ±  0.121  us/op
c.a.p.SO29378922.resize      10000  avgt       30   10.652 ±  0.227  us/op
c.a.p.SO29378922.resize     100000  avgt       30  139.692 ±  8.957  us/op
For reference, the code:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
public class SO29378922 {

    @Param({"1", "100", "1000", "5000", "10000", "100000"}) int n;

    private final List<Integer> list = new ArrayList<>();

    @Setup public void populateList() {
        for (int i = 0; i < n; i++) list.add(0);
    }

    @Benchmark public Integer[] preSize() {
        return list.toArray(new Integer[n]);
    }

    @Benchmark public Integer[] resize() {
        return list.toArray(new Integer[0]);
    }
}
You can find similar results, full analysis, and discussion in the blog post Arrays of Wisdom of the Ancients. To summarize: the JVM and JIT compiler contain several optimizations that enable them to cheaply create and initialize a new correctly sized array, and those optimizations cannot be used if you create the array yourself.
As of ArrayList in Java 5, the array will already be filled if it has the right size (or is bigger). Consequently
MyClass[] arr = myList.toArray(new MyClass[myList.size()]);
will create one array object, fill it and return it to "arr". On the other hand
MyClass[] arr = myList.toArray(new MyClass[0]);
will create two arrays. The second one is an array of MyClass with length 0, so there is an object creation for an object that is thrown away immediately. As far as the source code suggests, the compiler / JIT cannot optimize this creation away. Additionally, using the zero-length array results in casts within the toArray() method.
See the source of ArrayList.toArray():
public <T> T[] toArray(T[] a) {
    if (a.length < size)
        // Make a new array of a's runtime type, but my contents:
        return (T[]) Arrays.copyOf(elementData, size, a.getClass());
    System.arraycopy(elementData, 0, a, 0, size);
    if (a.length > size)
        a[size] = null;
    return a;
}
Use the first method so that only one object is created, and avoid the (implicit but nevertheless expensive) casts.
From the JetBrains IntelliJ IDEA inspection:
There are two styles to convert a collection to an array: either using a pre-sized array (like c.toArray(new String[c.size()])) or using an empty array (like c.toArray(new String[0])). In older Java versions using a pre-sized array was recommended, as the reflection call which is necessary to create an array of proper size was quite slow. However, since late updates of OpenJDK 6 this call was intrinsified, making the performance of the empty array version the same, and sometimes even better, compared to the pre-sized version. Also, passing a pre-sized array is dangerous for a concurrent or synchronized collection, as a data race is possible between the size and toArray calls, which may result in extra nulls at the end of the array if the collection was concurrently shrunk during the operation. This inspection allows following a uniform style: either using an empty array (which is recommended in modern Java) or using a pre-sized array (which might be faster in older Java versions or non-HotSpot based JVMs).
Modern JVMs optimise reflective array construction in this case, so the performance difference is tiny. Naming the collection twice in such boilerplate code is not a great idea, so I'd avoid the first method. Another advantage of the second is that it works with synchronised and concurrent collections. If you want to optimise, reuse the empty array (empty arrays are immutable and can be shared), or use a profiler(!).
toArray checks that the array passed is of the right size (that is, large enough to fit the elements from your list) and, if so, uses it. Consequently, if the size of the array provided is smaller than required, a new array will be created reflectively.
In your case, an array of size zero is immutable, so it could safely be elevated to a static final constant, which might make your code a little cleaner and avoids creating the array on each invocation. A new array will be created inside the method anyway, so it's a readability optimisation.
Arguably the faster version is to pass an array of the correct size, but unless you can prove this code is a performance bottleneck, prefer readability to runtime performance until proven otherwise.
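A minimal sketch of that constant-array idiom (the EMPTY_ARRAY name is illustrative, not from the original post):
import java.util.List;

class MyClass { } // stand-in for the question's MyClass

class MyClassArrays {
    // Zero-length arrays are immutable, so a single shared instance is safe.
    private static final MyClass[] EMPTY_ARRAY = new MyClass[0];

    static MyClass[] toArray(List<MyClass> myList) {
        // toArray allocates the correctly sized result internally; the shared
        // empty array only conveys the element type.
        return myList.toArray(EMPTY_ARRAY);
    }
}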
The first case is more efficient.
That is because in the second case:
MyClass[] arr = myList.toArray(new MyClass[0]);
the runtime actually creates an empty array (with zero size) and then, inside the toArray method, another array is created reflectively to fit the actual data, using the following code (taken from jdk1.5.0_10):
public <T> T[] toArray(T[] a) {
    if (a.length < size)
        a = (T[]) java.lang.reflect.Array.
                newInstance(a.getClass().getComponentType(), size);
    System.arraycopy(elementData, 0, a, 0, size);
    if (a.length > size)
        a[size] = null;
    return a;
}
By using the first form, you avoid the creation of a second array and also avoid the reflection code.
The second one is marginally more readable, but the improvement is so small that it's not worth it. The first method is faster, with no disadvantages at runtime, so that's what I use. But I write it the second way, because it's faster to type. Then my IDE flags it as a warning and offers to fix it. With a single keystroke, it converts the code from the second form to the first one.
Using toArray with an array of the correct size will perform better, as the alternative will first create the zero-sized array and then the array of the correct size. However, as you say, the difference is likely to be negligible.
Also, note that the javac compiler does not perform any optimization here. These days all optimizations are performed by the JIT/HotSpot compilers at runtime, and I am not aware of any optimizations around toArray in any JVMs.
The answer to your question, then, is largely a matter of style, but for consistency's sake it should form part of any coding standards you adhere to (whether documented or otherwise).

How does the time it takes Math.random() to run compare to that of simple arithmetic operation?

In Java, how long relative to a simple arithmetic operation does it take Math.random() to generate a number? I am trying to randomly distribute objects in an ArrayList that already contains values such that a somewhat even, but not completely even distribution is created, and I am not sure if choosing a random index for every insertion point by using Math.random() is the best approach to take.
Clarification: the distribution of inserted objects is meant to be even enough that the values are not all concentrated in one area, but also uneven enough that the distribution is not predictable (if someone were to go through the values one by one, they would not be able to determine if the next value was going to be a newly inserted value by detecting a constant pattern).
Do not use Math.random. It relies on a global instance of java.util.Random that uses an AtomicLong under the hood. Though the PRNG algorithm used in java.util.Random is pretty simple, the performance is mostly affected by the atomic CAS and the related cache-coherence traffic.
This can be particularly bad for multithreaded applications (like in this example), but it also carries a penalty even in the single-threaded case.
ThreadLocalRandom is always preferable to Math.random. It does not rely on atomic operations and does not suffer from contention. It only updates a thread-local state and uses a couple of arithmetic and bitwise operations.
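Applied to the question's scenario of picking a random insertion index, a minimal sketch (the method and variable names are illustrative):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class RandomInsertDemo {
    static <T> void insertAtRandomIndex(List<T> list, T value) {
        // nextInt(bound) is exclusive, so size() + 1 allows insertion at the end too.
        int index = ThreadLocalRandom.current().nextInt(list.size() + 1);
        list.add(index, value);
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        insertAtRandomIndex(values, 42);
        System.out.println(values);
    }
}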
Here is a JMH benchmark to compare the performance of Math.random() and ThreadLocalRandom.current().nextDouble() to a simple arithmetic operation.
package bench;

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Thread)
public class RandomBench {
    double x = 1;

    @Benchmark
    public double multiply() {
        return x * Math.PI;
    }

    @Benchmark
    public double mathRandom() {
        return Math.random();
    }

    @Benchmark
    public double threadLocalRandom() {
        return ThreadLocalRandom.current().nextDouble();
    }
}
The results show that ThreadLocalRandom works in just a few nanoseconds; its performance is comparable to a simple arithmetic operation, and it scales perfectly in a multithreaded environment, unlike Math.random.
Benchmark                      Threads     Score     Error  Units
RandomBench.mathRandom               1    34.265 ±   1.709  ns/op
RandomBench.multiply                 1     4.531 ±   0.108  ns/op
RandomBench.threadLocalRandom        1     8.322 ±   0.047  ns/op
RandomBench.mathRandom               2   366.589 ±  63.899  ns/op
RandomBench.multiply                 2     4.627 ±   0.118  ns/op
RandomBench.threadLocalRandom        2     8.342 ±   0.079  ns/op
RandomBench.mathRandom               4  1328.472 ± 177.216  ns/op
RandomBench.multiply                 4     4.592 ±   0.091  ns/op
RandomBench.threadLocalRandom        4     8.474 ±   0.157  ns/op
The Java documentation reports this as the implementation of Random.nextDouble(), which is what Math.random() ends up invoking:
public double nextDouble() {
    return (((long) next(26) << 27) + next(27))
        / (double) (1L << 53);
}
Where next updates the seed to (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1) and returns (int)(seed >>> (48 - bits)).
As you can see it uses a simple algorithm to generate pseudo-random values. It requires just a few cheap operations and so I wouldn't worry about using it.
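A standalone transcription of the quoted algorithm shows just how little work is involved (seed handling is simplified to a plain field here, so unlike the real java.util.Random this sketch is not thread-safe, and it skips the initial seed scrambling):
// Transcription of the quoted LCG steps; for illustration only.
class LcgSketch {
    private long seed = 42; // the real Random scrambles the initial seed

    private int next(int bits) {
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        return (int) (seed >>> (48 - bits));
    }

    double nextDouble() {
        // exactly the expression quoted from Random.nextDouble()
        return (((long) next(26) << 27) + next(27)) / (double) (1L << 53);
    }

    public static void main(String[] args) {
        System.out.println(new LcgSketch().nextDouble());
    }
}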
Creating a random number is a simple operation; you shouldn't worry about it.
But you should keep in mind several things:
It is better to reuse a Random instance; creating a new Random() instance every time you need a random value is usually a bad decision.
But do not use the same Random instance across several threads simultaneously, to avoid contention; you can use ThreadLocalRandom.current() instead.
If you do some cryptography, use SecureRandom instead.
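A brief sketch of those three recommendations side by side:
import java.security.SecureRandom;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

class RandomUsage {
    // Reuse one instance rather than calling new Random() per value (single-threaded use).
    private static final Random SHARED = new Random();

    static int singleThreaded() {
        return SHARED.nextInt(100);
    }

    static int multiThreaded() {
        // Per-thread generator; no contention between threads.
        return ThreadLocalRandom.current().nextInt(100);
    }

    static byte[] cryptographic() {
        // Cryptographically strong; slower, but required for keys, tokens, etc.
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        return key;
    }
}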

Concatenating parallel streams

Suppose that I have two int[] arrays input1 and input2. I want to take only positive numbers from the first one, take distinct numbers from the second one, merge them together, sort and store into the resulting array. This can be performed using streams:
int[] result = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
Arrays.stream(input2).distinct()).sorted().toArray();
I want to speed up the task, so I consider to make the stream parallel. Usually this just means that I can insert .parallel() anywhere between the stream construction and terminal operation and the result will be the same. The JavaDoc for IntStream.concat says that the resulting stream will be parallel if any of the input streams is parallel. So I thought that making parallel() either input1 stream or input2 stream or the concatenated stream will produce the same result.
Actually I was wrong: if I add .parallel() to the resulting stream, it seems that the input streams remain sequential. Moreover, I can mark the input streams (either of them or both) as .parallel(), then turn the resulting stream to .sequential(), but the input remains parallel. So there are actually 8 possibilities: each of input1, input2 and the concatenated stream can be parallel or not:
int[] sss = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
Arrays.stream(input2).distinct()).sorted().toArray();
int[] ssp = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
Arrays.stream(input2).distinct()).parallel().sorted().toArray();
int[] sps = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
Arrays.stream(input2).parallel().distinct()).sequential().sorted().toArray();
int[] spp = IntStream.concat(Arrays.stream(input1).filter(x -> x > 0),
Arrays.stream(input2).parallel().distinct()).sorted().toArray();
int[] pss = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
Arrays.stream(input2).distinct()).sequential().sorted().toArray();
int[] psp = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
Arrays.stream(input2).distinct()).sorted().toArray();
int[] pps = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
Arrays.stream(input2).parallel().distinct()).sequential().sorted().toArray();
int[] ppp = IntStream.concat(Arrays.stream(input1).parallel().filter(x -> x > 0),
Arrays.stream(input2).parallel().distinct()).sorted().toArray();
I benchmarked all the versions for different input sizes (using JDK 8u45 64bit on Core i5 4xCPU, Win7) and got different results for every case:
Benchmark           (n)  Mode  Cnt        Score       Error  Units
ConcatTest.SSS      100  avgt   20        7.094 ±     0.069  us/op
ConcatTest.SSS    10000  avgt   20     1542.820 ±    22.194  us/op
ConcatTest.SSS  1000000  avgt   20   350173.723 ±  7140.406  us/op
ConcatTest.SSP      100  avgt   20        6.176 ±     0.043  us/op
ConcatTest.SSP    10000  avgt   20      907.855 ±     8.448  us/op
ConcatTest.SSP  1000000  avgt   20   264193.679 ±  6744.169  us/op
ConcatTest.SPS      100  avgt   20       16.548 ±     0.175  us/op
ConcatTest.SPS    10000  avgt   20     1831.569 ±    13.582  us/op
ConcatTest.SPS  1000000  avgt   20   500736.204 ± 37932.197  us/op
ConcatTest.SPP      100  avgt   20       23.871 ±     0.285  us/op
ConcatTest.SPP    10000  avgt   20     1141.273 ±     9.310  us/op
ConcatTest.SPP  1000000  avgt   20   400582.847 ± 27330.492  us/op
ConcatTest.PSS      100  avgt   20        7.162 ±     0.241  us/op
ConcatTest.PSS    10000  avgt   20     1593.332 ±     7.961  us/op
ConcatTest.PSS  1000000  avgt   20   383920.286 ±  6650.890  us/op
ConcatTest.PSP      100  avgt   20        9.877 ±     0.382  us/op
ConcatTest.PSP    10000  avgt   20      883.639 ±    13.596  us/op
ConcatTest.PSP  1000000  avgt   20   257921.422 ±  7649.434  us/op
ConcatTest.PPS      100  avgt   20       16.412 ±     0.129  us/op
ConcatTest.PPS    10000  avgt   20     1816.782 ±    10.875  us/op
ConcatTest.PPS  1000000  avgt   20   476311.713 ± 19154.558  us/op
ConcatTest.PPP      100  avgt   20       23.078 ±     0.622  us/op
ConcatTest.PPP    10000  avgt   20     1128.889 ±     7.964  us/op
ConcatTest.PPP  1000000  avgt   20   393699.222 ± 56397.445  us/op
From these results I can only conclude that parallelization of distinct() step reduces the overall performance (at least in my tests).
So I have the following questions:
Are there any official guidelines on how to better use the parallelization with concatenated streams? It's not always feasible to test all possible combinations (especially when concatenating more than two streams), so having some "rules of thumb" would be nice.
It seems that if I concatenate streams created directly from a collection/array (without intermediate operations performed before concatenation), then the results do not depend so much on the location of parallel(). Is this true?
Are there any other cases besides concatenation where the result depends on at which point the stream pipeline is parallelized?
The specification precisely describes what you get, when you consider that, unlike with other operations, we are not talking about a single pipeline but about three distinct streams which retain their properties independently of the others.
The specification says: “The resulting stream is […] parallel if either of the input streams is parallel.” and that's what you get; if either input stream is parallel, the resulting stream is parallel (but you can turn it sequential afterwards). But changing the resulting stream to parallel or sequential does not change the nature of the input streams, nor does feeding a parallel and a sequential stream into concat.
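A minimal sketch of that rule, using BaseStream.isParallel():
import java.util.stream.IntStream;

class ConcatParallelDemo {
    public static void main(String[] args) {
        IntStream parallelInput = IntStream.of(1, 2, 3).parallel();
        IntStream sequentialInput = IntStream.of(4, 5, 6);

        // The result is parallel because one of the inputs is parallel.
        IntStream combined = IntStream.concat(parallelInput, sequentialInput);
        System.out.println(combined.isParallel()); // true

        // Turning the result sequential is a property of the resulting stream;
        // it does not retroactively change the nature of the inputs.
        System.out.println(combined.sequential().isParallel()); // false
    }
}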
Regarding the performance consequences, consult the documentation, paragraph “Stream operations and pipelines”:
Intermediate operations are further divided into stateless and stateful operations. Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream. As a result, under parallel computation, some pipelines containing stateful intermediate operations may require multiple passes on the data or may need to buffer significant data. Pipelines containing exclusively stateless intermediate operations can be processed in a single pass, whether sequential or parallel, with minimal data buffering.
You have chosen the very two named stateful operations and combined them. So the .sorted() operation of the resulting stream requires buffering of the entire content before it can start sorting, which implies completion of the distinct operation. The distinct operation is obviously hard to parallelize, as the threads have to synchronize on the already-seen values.
So to answer your first question, it's not about concat, but simply that distinct doesn't benefit from parallel execution.
This also renders your second question obsolete, as you are performing entirely different operations on the two concatenated streams, so you can't do the same with a pre-concatenated collection/array. Concatenating the arrays and running distinct on the resulting array is unlikely to yield better results.
Regarding your third question, flatMap’s behavior regarding parallel streams may be a source of surprises…

Creating distinct list from existing list in Java 7 and 8?

If I have:
List<Integer> listInts = { 1, 1, 3, 77, 2, 19, 77, 123, 14, 123... }
in Java what is an efficient way of creating a List<Integer> listDistinctInts containing only the distinct values from listInts?
My immediate thought is to create a Set<Integer> setInts containing all the values from listInts then call List<Integer> listDistinctInts = new ArrayList<>(setInts);
But this seems potentially inefficient - is there a better solution using Java 7?
I'm not using Java 8, but I believe using it I could do something like this(?):
List<Integer> listDistinctInts = listInts.stream().distinct().collect(Collectors.toList());
Would this be more performant than the approach above and/or is there any more efficient way of doing this in Java 8?
Finally (and I'm aware that asking multiple questions might be frowned upon, but it's directly related): if I only cared about the count of distinct elements in listInts, is there a more efficient way to get that value (in Java 7 and 8) without first creating a list or set of all the distinct elements?
I'm most interested in native Java ways of accomplishing this and avoiding re-inventing any wheels but would consider hand-rolled code or libraries if they offer better clarity or performance. I've read this related question Java - Distinct List of Objects but it's not entirely clear about the differences in performance between the Java 7 and 8 approaches or whether there might be better techniques?
I've now microbenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance-related questions, the answer as to which is best is "it depends".
All my testing was performed with JMH, the Java Microbenchmark Harness.
Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:
1. Java 8 Stream - The solution using stream() I had proposed as a possibility when using Java 8:
public List<Integer> testJava8Stream(List<Integer> listInts) {
    return listInts.stream().distinct().collect(Collectors.toList());
}
pros modern Java 8 approach, no 3rd party dependencies
cons Requires Java 8
2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:
public List<Integer> testAddingToList(List<Integer> listInts) {
    List<Integer> listDistinctInts = new ArrayList<>(listInts.size());
    for (Integer i : listInts) {
        if (!listDistinctInts.contains(i)) { listDistinctInts.add(i); }
    }
    return listDistinctInts;
}
pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps
cons Needs to repeatedly check the List for existing values as we build it
3. GS Collections Fast (now Eclipse Collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList:
public List<Integer> testGsCollectionsFast(FastList<Integer> listFast)
{
    return listFast.distinct();
}
pros Reportedly very quick, simple expressive code, works in Java 7 and 8
cons Requires 3rd party library and a FastList rather than a regular List<Integer>
4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList passed to the method rather than a good ol' ArrayList<Integer> so I also tested the adapter method Craig proposed:
public List<Integer> testGsCollectionsAdapted(List<Integer> listInts)
{
    return ListAdapter.adapt(listInts).distinct();
}
pros Doesn't require a FastList, works in Java 7 and 8
cons Has to adapt List so may not perform as well, needs 3rd party library
5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:
public List<Integer> testGuavaImmutable(List<Integer> listInts)
{
    return ImmutableSet.copyOf(listInts).asList();
}
pros Reportedly very fast, works in Java 7 or 8
cons Returns an Immutable List, can't handle nulls in the input List, and requires 3rd party library
6. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef):
public List<Integer> testHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new HashSet<Integer>(listInts));
}
pros Works in Java 7 and 8, no 3rd party dependencies
cons Doesn't retain original order of list, has to construct set then copy to list.
7. LinkedHashSet - Since the HashSet solution didn't preserve the order of the Integers in the original list, I also tested a version which uses LinkedHashSet to preserve order:
public List<Integer> testLinkedHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts));
}
pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies
cons Unlikely to be as fast as regular HashSet approach
Results
Here are my results for various different sizes of listInts (results ordered from slowest to fastest):
1. taking distinct from an ArrayList of 100,000 random ints between 0-50,000 (i.e. big list, some duplicates)
Benchmark             Mode   Samples     Mean  Mean error  Units
AddingToList          thrpt       10    0.505       0.012  ops/s
Java8Stream           thrpt       10  234.932      31.959  ops/s
LinkedHashSet         thrpt       10  262.185      16.679  ops/s
HashSet               thrpt       10  264.295      24.154  ops/s
GsCollectionsAdapted  thrpt       10  357.998      18.468  ops/s
GsCollectionsFast     thrpt       10  363.443      40.089  ops/s
GuavaImmutable        thrpt       10  469.423      26.056  ops/s
2. taking distinct from an ArrayList of 1000 random ints between 0-50 (i.e. medium list, many duplicates)
Benchmark             Mode   Samples        Mean  Mean error  Units
AddingToList          thrpt       10   32794.698    1154.113  ops/s
HashSet               thrpt       10   61622.073    2752.557  ops/s
LinkedHashSet         thrpt       10   67155.865    1690.119  ops/s
Java8Stream           thrpt       10   87440.902   13517.925  ops/s
GsCollectionsFast     thrpt       10  103490.738   35302.201  ops/s
GsCollectionsAdapted  thrpt       10  143135.973    4733.601  ops/s
GuavaImmutable        thrpt       10  186301.330   13421.850  ops/s
3. taking distinct from an ArrayList of 100 random ints between 0-100 (i.e. small list, some duplicates)
Benchmark             Mode   Samples         Mean  Mean error  Units
AddingToList          thrpt       10   278435.085   14229.285  ops/s
Java8Stream           thrpt       10   397664.052   24282.858  ops/s
LinkedHashSet         thrpt       10   462701.618   20098.435  ops/s
GsCollectionsAdapted  thrpt       10   477097.125   15212.580  ops/s
GsCollectionsFast     thrpt       10   511248.923   48155.211  ops/s
HashSet               thrpt       10   512003.713   25886.696  ops/s
GuavaImmutable        thrpt       10  1082006.560   18716.012  ops/s
4. taking distinct from an ArrayList of 10 random ints between 0-50 (i.e. tiny list, few duplicates)
Benchmark             Mode   Samples         Mean   Mean error  Units
Java8Stream           thrpt       10  2739774.758   306124.297  ops/s
LinkedHashSet         thrpt       10  3607479.332   150331.918  ops/s
HashSet               thrpt       10  4238393.657   185624.358  ops/s
GsCollectionsAdapted  thrpt       10  5919254.755   495444.800  ops/s
GsCollectionsFast     thrpt       10  7916079.963  1708778.450  ops/s
AddingToList          thrpt       10  7931479.667   966331.036  ops/s
GuavaImmutable        thrpt       10  9021621.880   845936.861  ops/s
Conclusions
If you're only taking the distinct items from a list once, and the list isn't very long, any of these methods should be adequate.
The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.
You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.
The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.
The Guava ImmutableSet.copyOf(listInts).asList() method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable and the input list cannot contain nulls.
The HashSet method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).
The LinkedHashSet approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.
Both the HashSet and LinkedHashSet methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foos from a List<Foo>.
If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.
Disappointingly, Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct() call than the way I used, so comments or other answers are of course welcome.
NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.
If you're using Eclipse Collections (formerly GS Collections), you can use the method distinct().
ListIterable<Integer> listInts = FastList.newListWith(1, 1, 3, 77, 2, 19, 77, 123, 14, 123);
Assert.assertEquals(
        FastList.newListWith(1, 3, 77, 2, 19, 123, 14),
        listInts.distinct());
The advantage of using distinct() instead of converting to a Set and then back to a List is that distinct() preserves the order of the original List, retaining the first occurrence of each element. It's implemented by using both a Set and a List.
MutableSet<T> seenSoFar = UnifiedSet.newSet();
int size = list.size();
for (int i = 0; i < size; i++)
{
    T item = list.get(i);
    if (seenSoFar.add(item))
    {
        targetCollection.add(item);
    }
}
return targetCollection;
If you cannot convert your original List into a GS Collections type, you can use ListAdapter to get the same API.
MutableList<Integer> distinct = ListAdapter.adapt(integers).distinct();
There's no way to avoid the creation of the Set. Still, UnifiedSet is more efficient than HashSet so there will be some speed benefit.
If all you want is the number of distinct items, it's more efficient to just create a set without creating the list.
Verify.assertSize(7, UnifiedSet.newSet(listInts));
Eclipse Collections 8.0 requires Java 8. Eclipse Collections 7.x works well with Java 8, but only requires Java 5.
Note: I am a committer for Eclipse Collections.
You should try new LinkedList<>(new HashSet<>(listInts)).
Guava can be your choice:
ImmutableSet<Integer> set = ImmutableSet.copyOf(listInts);
The API is extremely optimized.
It is FASTER than listInts.stream().distinct() and new LinkedHashSet<>(listInts).
When adding a value to listInts, check first:
int valueToAdd;
//...
if (!listInts.contains(valueToAdd)) { listInts.add(valueToAdd); }
If you have an existing list, use a for-each statement to copy all the values from that list to a new one that you want to be "distinct":
List<Integer> listWithRepeatedValues;
List<Integer> distinctList;
//...
for (Integer i : listWithRepeatedValues) {
    if (!distinctList.contains(i)) { distinctList.add(i); }
}
Don't worry. Using a HashSet is a pretty easy and efficient way to eliminate duplicates:
Set<Integer> uniqueList = new HashSet<>();
uniqueList.addAll(listInts);      // add all elements, eliminating duplicates
for (int n : uniqueList)          // check the results (in no particular order)
    System.out.println(n);
System.out.println("Number of distinct values: " + uniqueList.size());
In a more specific scenario, in case the range of possible values is known and not very large while listInts is very large, the most efficient way of counting the number of unique entries in the list that I can think of is:
boolean[] counterTable = new boolean[124]; // large enough for the example's value range (0-123)
int counter = 0;
for (int n : listInts)
    if (!counterTable[n]) {
        counter++;
        counterTable[n] = true;
    }
System.out.println("Number of distinct values: " + counter);
This should work, where YourWrapper is a wrapper class (the names are placeholders) that overrides equals and hashCode and exposes a method returning the final output:
yourList.stream().map(YourWrapper::new).distinct()
        .map(YourWrapper::unwrap).collect(Collectors.toList());
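A concrete sketch of that wrapper idea; the Wrapper and Point types and the unwrap method are illustrative names, not from the original answer:
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

// The wrapper defines what "distinct" means for Point (here: same coordinates).
class Wrapper {
    final Point p;
    Wrapper(Point p) { this.p = p; }
    Point unwrap() { return p; }

    @Override public boolean equals(Object o) {
        return o instanceof Wrapper && ((Wrapper) o).p.x == p.x && ((Wrapper) o).p.y == p.y;
    }
    @Override public int hashCode() { return Objects.hash(p.x, p.y); }
}

class DistinctByWrapper {
    public static void main(String[] args) {
        List<Point> points = Arrays.asList(new Point(1, 2), new Point(1, 2), new Point(3, 4));
        List<Point> distinct = points.stream()
                .map(Wrapper::new)
                .distinct()            // uses Wrapper.equals/hashCode
                .map(Wrapper::unwrap)
                .collect(Collectors.toList());
        System.out.println(distinct.size()); // prints 2
    }
}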
