JAVA 8 Array Stream Results Timing Longer Than Sequencing [duplicate] - java

I can create a Stream from an array using Arrays.stream(array) or Stream.of(values). Similarly, is it possible to create a ParallelStream directly from an array, without creating an intermediate collection as in Arrays.asList(array).parallelStream()?

Stream.of(array).parallel()
or
Arrays.stream(array).parallel()

TLDR;
Any sequential Stream can be converted into a parallel one by calling .parallel() on it. So all you need is:
Create a stream
Invoke method parallel() on it.
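As a minimal sketch of these two steps (the class and method names are made up for illustration), the following builds a parallel stream straight from an array, with no intermediate collection:

```java
import java.util.Arrays;
import java.util.stream.Stream;

class ArrayParallelStreamDemo {

    // Sums a primitive array through an IntStream created directly from the array.
    static int parallelSum(int[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    // Reports whether a stream built from an object array is parallel after parallel().
    static boolean isParallelAfterCall(String[] values) {
        Stream<String> stream = Arrays.stream(values).parallel();
        return stream.isParallel();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(new int[] {1, 2, 3, 4, 5}));            // 15
        System.out.println(isParallelAfterCall(new String[] {"a", "b", "c"})); // true
    }
}
```

Arrays.stream(int[]) yields an IntStream, which avoids boxing; Stream.of(int[]) would instead produce a stream with a single int[] element, so Arrays.stream is usually the safer choice for primitive arrays.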
Long answer
The question is fairly old, but some additional explanation should make things clearer.
All Java stream implementations implement the BaseStream interface, which the Javadoc describes as:
Base interface for streams, which are sequences of elements supporting sequential and parallel aggregate operations.
From the API's point of view there is no difference between sequential and parallel streams. They share the same aggregate operations.
In order to distinguish between sequential and parallel streams, the aggregate methods call the BaseStream::isParallel method.
Let's explore the implementation of the isParallel method in AbstractPipeline:
@Override
public final boolean isParallel() {
    return sourceStage.parallel;
}
As you can see, the only thing isParallel does is check the boolean flag in the source stage:
/**
* True if pipeline is parallel, otherwise the pipeline is sequential; only
* valid for the source stage.
*/
private boolean parallel;
So what does the parallel() method do then? How does it turn a sequential stream into a parallel one?
@Override
@SuppressWarnings("unchecked")
public final S parallel() {
    sourceStage.parallel = true;
    return (S) this;
}
Well, it only sets the parallel flag to true. That's all it does.
As you can see, in the current implementation of the Java Stream API it doesn't matter how you create a stream (or receive it as a method parameter). You can always turn a stream into a parallel one at essentially zero cost.

Related

Java ParallelStream: several map or single map

Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering whether it's possible to make it more multi-threaded.
Several small maps
I was wondering whether using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        .map(Document::parse)
        .map(InsertOneModel::new)
        .toList();
Single big map
For example a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
        .toList();
Question
Is one of these solutions more suitable for java.util.Collection.parallelStream(), or is there no significant difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation. So there is almost no difference between one big map() call or several small map() calls.
Splitting the work across several map() calls also makes no difference for a parallel Stream: in either case each input object is processed through the whole chain by the same thread.
Also note: a parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection, or a Collection that does not allow random access, a parallel Stream may behave much like a sequential Stream.
I don't think chaining multiple maps will do any better. If your code is not very complex, I would prefer a single big map.
To understand this we have to check the code inside the map function.
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
            StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, a lot happens behind the scenes: several objects are created and several methods are called, and all of this is repeated for each chained map call.
Coming back to parallel streams: they work on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
A parallel stream calls spliterator() on the collection object, which returns a Spliterator implementation that provides the logic for splitting a task. Every source or collection has its own spliterator implementation. Using these spliterators, the parallel stream keeps splitting the task for as long as possible; once a sub-task becomes too small it is executed sequentially, and the partial results from all sub-tasks are merged.
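To make the splitting step concrete, here is a small sketch that calls trySplit() by hand on an ArrayList spliterator. The class and method names are illustrative only, and the exact split sizes are an implementation detail of the collection rather than something the Spliterator contract guarantees:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

class SpliteratorSplitDemo {

    // Splits the list's spliterator once and returns the estimated sizes of both parts.
    static long[] splitOnce(List<Integer> list) {
        Spliterator<Integer> rest = list.spliterator();
        // trySplit() hands off a prefix of the elements, or returns null if it cannot split.
        Spliterator<Integer> prefix = rest.trySplit();
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] {prefixSize, rest.estimateSize()};
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            list.add(i);
        }
        long[] sizes = splitOnce(list);
        System.out.println(sizes[0] + " + " + sizes[1]);
    }
}
```

With OpenJDK's ArrayList the eight elements are split evenly into 4 + 4, which is the cheap, balanced splitting that makes array-backed sources a good fit for parallel streams.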
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
The existing implementation has performance issues
I don't already have multi-threaded processes running, as that would add to the complexity.
Performance Implications
Overhead: When the dataset is small, converting a sequential stream into a parallel one can result in worse performance. The overhead of managing threads, sources and results can cost more than the actual work.
Splitting: Arrays can split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping to sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
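The equivalence can be checked with a small sketch. The functions F and G below are simple stand-ins (not the Gson or MongoDB calls from the question), chosen only so that the example runs without external libraries:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class MapFusionDemo {

    // Hypothetical stand-ins for the real conversion steps.
    static final Function<Integer, String> F = i -> "n" + i;
    static final Function<String, Integer> G = String::length;

    // Two separate map stages: s.map(f).map(g)
    static List<Integer> chained(List<Integer> input) {
        return input.parallelStream().map(F).map(G).collect(Collectors.toList());
    }

    // One fused map stage: s.map(g.compose(f))
    static List<Integer> fused(List<Integer> input) {
        return input.parallelStream().map(G.compose(F)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 22, 333);
        System.out.println(chained(input).equals(fused(input))); // true
    }
}
```

Both variants return [2, 3, 4] here; the equivalence holds as long as the mapping functions are free of side effects.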
However, in general you should be careful using Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM. The size of the pool is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may also be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down, for example if another part of the code has temporarily exhausted the common thread pool.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);
public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
    // Wrap each conversion in a Callable so that it can run on the dedicated pool.
    final Collection<Callable<InsertOneModel<Document>>> callables =
            puzzles.map(puzzle ->
                    (Callable<InsertOneModel<Document>>)
                            () -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
            ).collect(Collectors.toList());
    // invokeAll blocks until every task has completed; the futures are then unwrapped in order.
    return EX_SVC.invokeAll(callables).stream()
            .map(fut -> {
                try {
                    return fut.get();
                } catch (ExecutionException | InterruptedException ex) {
                    throw new RuntimeException(ex);
                }
            }).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved that the second style was quicker, I would still prefer to see and use the first style in code I had to maintain.
The first, multi-map style is easier for others to understand, easier to maintain and easier to debug, for example by adding peek stages at any point in the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        // easy to make changes for debug, moving peek up/down
        // .peek(System.out::println)
        .map(Document::parse)
        // easy to filter:
        // .filter(this::somecondition)
        .map(InsertOneModel::new)
        .toList();
If your requirements change, such as needing to filter the output or to capture the intermediate data by splitting into two collections, the first approach beats the second every time.

Searching a list using multiple threads and find element (without using parallel streams)

I have a method
public boolean contains(int valueToFind, List<Integer> list) {
//
}
How can I split the list into x chunks and have a separate thread search each chunk for the value? If one of them finds it, I would like to stop the other threads from searching.
I see there are lots of examples for simply splitting work between threads, but how do I structure it so that once one thread returns true, all threads stop and that result is returned as the answer?
I do not want to use parallel streams for this reason (from source):
If you do, please look at the previous example again. There is a big
error. Do you see it? The problem is that all parallel streams use
common fork-join thread pool, and if you submit a long-running task,
you effectively block all threads in the pool. Consequently, you block
all other tasks that are using parallel streams. Imagine a servlet
environment, when one request calls getStockInfo() and another one
countPrimes(). One will block the other one even though each of them
requires different resources. What's worse, you can not specify thread
pool for parallel streams; the whole class loader has to use the same
one.
You could use the built-in Stream API:
//For a List
public boolean contains(int valueToFind, List<Integer> list) {
return list.parallelStream().anyMatch(Integer.valueOf(valueToFind)::equals);
}
//For an array
public boolean contains(int valueToFind, int[] arr){
return Arrays.stream(arr).parallel().anyMatch(x -> x == valueToFind);
}
Executing Streams in Parallel:
You can execute streams in serial or in parallel. When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate operations iterate over and process these substreams in parallel and then combine the results.
When you create a stream, it is always a serial stream unless otherwise specified. To create a parallel stream, invoke the operation Collection.parallelStream.
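For the original constraint of avoiding parallel streams and the common pool, a chunked search can be sketched with a dedicated ExecutorService and a shared AtomicBoolean that tells the other workers to stop. This is only one possible approach; the class name, the extra chunks parameter and the choice to create a pool per call are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

class ChunkedSearch {

    // Searches the list in up to `chunks` slices on a dedicated pool (chunks must be >= 1).
    static boolean contains(int valueToFind, List<Integer> list, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        AtomicBoolean found = new AtomicBoolean(false);
        try {
            int chunkSize = (list.size() + chunks - 1) / chunks; // ceiling division
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int start = 0; start < list.size(); start += chunkSize) {
                List<Integer> slice = list.subList(start, Math.min(start + chunkSize, list.size()));
                tasks.add(() -> {
                    for (Integer candidate : slice) {
                        if (found.get()) {
                            return null; // another worker already found the value
                        }
                        if (candidate == valueToFind) {
                            found.set(true); // signal every other worker to stop
                            return null;
                        }
                    }
                    return null;
                });
            }
            for (Future<Void> future : pool.invokeAll(tasks)) {
                future.get(); // surfaces any exception thrown inside a worker
            }
            return found.get();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(4, 8, 15, 16, 23, 42);
        System.out.println(contains(23, numbers, 3)); // true
        System.out.println(contains(7, numbers, 3));  // false
    }
}
```

In a real application the pool would normally be created once and reused, since starting threads for every call can easily cost more than the search itself.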

Collecting java stream matters if underlying stream is parallel or not

I have the following function:
public Stream getStream(boolean isParallel) {
    ...
    return someStreamFromHere;
}
This function will return a parallel stream if "isParallel" is true, otherwise a sequential stream. Now I want to collect this parallel/sequential stream. Does the caller function need to implement this logic:
boolean isParallel = isParallel();
Stream stream = getStream(isParallel);
List list;
if (isParallel) {
list = stream.parallel().collect(Collectors.toList());
} else {
list = stream.collect(Collectors.toList());
}
Or can I simply collect the stream regardless, so that if it's parallel it is collected in parallel, and if it's sequential it is collected in a single thread?
Parallelism is a property of the stream. So, if you have a parallel stream, calling .parallel() on it is a no-op. It does absolutely nothing whatsoever.
Note that processing a parallel stream gives up any guarantee about the order in which elements are processed; for an ordered source, however, collect(Collectors.toList()) still returns the elements in encounter order.
Your code can just be List list = stream.collect(Collectors.toList());.
Note that as a general rule, if parallelism matters at all, collecting into a list seems... bizarre. Whatever performance benefits you think you're getting from processing in parallel can be largely cancelled out when you do this.
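A small check of the "just collect it" advice, assuming an ordered source. The getStream method below is a hypothetical stand-in for the one in the question, backed by IntStream.range so that the example is self-contained:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class CollectRegardlessDemo {

    // Hypothetical stand-in for getStream(boolean): an ordered stream of 0..999.
    static Stream<Integer> getStream(boolean isParallel) {
        Stream<Integer> stream = IntStream.range(0, 1000).boxed();
        return isParallel ? stream.parallel() : stream;
    }

    // The caller collects in exactly the same way for both execution modes.
    static List<Integer> collect(boolean isParallel) {
        return getStream(isParallel).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(collect(true).equals(collect(false))); // true
    }
}
```

The caller needs no branch on isParallel: the stream carries its own execution mode, and Collectors.toList() produces the same list contents in the same encounter order either way.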
Why do you pass the boolean to the function if you then use it again after the function returns? Either the function receives the boolean and acts on it, or it doesn't receive it and the test sits outside, as you wrote.
By the way, functions with boolean parameters are often considered a code smell, as they tend to do more than one thing.

Does the ordering of calls to sequential() and parallel() matter when processing a Java 8 stream pipeline?

Does the placement of calls to sequential() and parallel() change how a Java 8 stream's pipeline is executed?
For example, suppose I have this code:
new ArrayList().stream().parallel().filter(...).count();
In this example, it's pretty clear that filter() will run in parallel. However, what if I have this code:
new ArrayList().stream().filter(...).parallel().count();
Does filter() still run in parallel or does it run sequentially? The reason it's not clear is that intermediate operations like filter() are lazy, i.e., they won't run until a terminal operation such as count() is invoked. So by the time count() is invoked we have a parallel stream pipeline, but is filter() performed sequentially because it came before the call to parallel()?
Note the end of the Stream class documentation:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution. (For example, Collection.stream() creates a sequential stream, and Collection.parallelStream() creates a parallel one.) This choice of execution mode may be modified by the BaseStream.sequential() or BaseStream.parallel() methods, and may be queried with the BaseStream.isParallel() method.
In other words, calling sequential() or parallel() only changes a property of the stream and its state at the point when the terminal operation is commenced determines the execution mode of the entire pipeline.
This might not be documented that clearly everywhere, because it wasn't always so. During early development there were prototypes with different modes for different stages. This mail from March 2013 explains the change.
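The "single property for the whole pipeline" behaviour can be observed with isParallel(). This is a sketch against the OpenJDK implementation, with class and method names invented for the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class ExecutionModeDemo {

    // parallel() placed after filter(): the pipeline as a whole reports parallel.
    static boolean parallelAfterFilter(List<Integer> source) {
        Stream<Integer> pipeline = source.stream().filter(i -> i % 2 == 0).parallel();
        return pipeline.isParallel();
    }

    // parallel() first and sequential() later: the last call decides the mode.
    static boolean sequentialAfterParallel(List<Integer> source) {
        Stream<Integer> pipeline = source.stream().parallel().filter(i -> i % 2 == 0).sequential();
        return pipeline.isParallel();
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(parallelAfterFilter(source));     // true
        System.out.println(sequentialAfterParallel(source)); // false
    }
}
```

Because the flag lives in the source stage, the position of parallel() or sequential() relative to filter() does not matter here; only the value in effect when the terminal operation starts is used.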
It appears that at least in the standard Oracle Java 8 implementation, although the parallel() method is defined as an "intermediate operation", it is not exactly lazy. That is, it has an immediate effect, regardless of whether you have a terminal operation or not. Consider the following example:
public class SimpleTest {
    public static void main(String[] args) {
        Stream<Integer> s = Stream.of(1,2,3,4,5,6,7,8,9,10);
        System.out.println(s.isParallel());
        Stream<Integer> s1 = s.parallel();
        System.out.println(s.isParallel());
        System.out.println(s == s1);
    }
}
The output on my machine is:
false
true
true
Which tells us that parallel() immediately changes the state of the underlying stream (and returns that stream).
However, the Javadoc is written in such a way that it allows this but does not require it, which means that other stream implementations are free to execute the operations before the parallel() call in a different execution mode than those after it.
In short, it's not a behavior you can rely on, either way.

Obtaining a parallel Stream from a Collection

Is it correct that with Java 8 you need to execute the following code to be sure of obtaining a parallel stream from a Collection?
private <E> void process(final Collection<E> collection) {
Stream<E> stream = collection.parallelStream().parallel();
//processing
}
From the Collection API:
default Stream parallelStream()
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
From the BaseStream API:
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
Is it not awkward that I need to call a function that supposedly parallelizes the stream twice?
Basically the default implementation of Collection.parallelStream() does create a parallel stream. The implementation looks like this:
default Stream<E> parallelStream() {
return StreamSupport.stream(spliterator(), true);
}
But since this is a default method, it is perfectly valid for an implementing class to override it and return a sequential stream instead. For example, suppose I create a MySequentialArrayList:
class MySequentialArrayList extends ArrayList<String> {
    @Override
    public Stream<String> parallelStream() {
        return StreamSupport.stream(spliterator(), false);
    }
}
For an object of that class, the following code will print false as expected:
ArrayList<String> arrayList = new MySequentialArrayList();
System.out.println(arrayList.parallelStream().isParallel());
In this case invoking BaseStream#parallel() method ensures that the stream returned is always parallel. Either it was already parallel, or it makes it parallel, by setting the parallel field to true:
public final S parallel() {
sourceStage.parallel = true;
return (S) this;
}
This is the implementation of AbstractPipeline#parallel() method.
So the following code for the same object will print true:
System.out.println(arrayList.parallelStream().parallel().isParallel());
But if the stream is already parallel, then yes, it is an extra method invocation, but it ensures you always get a parallel stream. I have not yet dug much into the parallelization of streams, so I can't comment on what kind of Collection, or in what cases, parallelStream() would give you a sequential stream.
