Calling sequential on parallel stream makes all previous operations sequential - java

I have a large data set. I want to call a slow but clean (side-effect-free) method on each element, and then call a fast method with side effects on the result of the first one. I'm not interested in the intermediate results, so I would prefer not to collect them.
The obvious solution is to create a parallel stream, make the slow call, make the stream sequential again, and make the fast call. The problem is that ALL the code executes in a single thread; there is no actual parallelism.
Example code:
@Test
public void testParallelStream() throws ExecutionException, InterruptedException
{
    ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors() * 2);
    Set<String> threads = forkJoinPool.submit(()-> new Random().ints(100).boxed()
            .parallel()
            .map(this::slowOperation)
            .sequential()
            .map(Function.identity())//some fast operation, but must be in single thread
            .collect(Collectors.toSet())
    ).get();
    System.out.println(threads);
    Assert.assertEquals(Runtime.getRuntime().availableProcessors() * 2, threads.size());
}
private String slowOperation(int value)
{
    try
    {
        Thread.sleep(100);
    }
    catch (InterruptedException e)
    {
        e.printStackTrace();
    }
    return Thread.currentThread().getName();
}
If I remove sequential(), the code executes as expected, but then, obviously, the non-parallel operation is called from multiple threads.
Could you recommend some references about this behavior, or some way to avoid temporary collections?

Switching the stream from parallel() to sequential() worked in the initial Stream API design, but it caused many problems, and the implementation was eventually changed so that these calls just turn the parallel flag on or off for the whole pipeline. The current documentation is indeed vague, but it was improved in Java 9:
The stream pipeline is executed sequentially or in parallel depending on the mode of the stream on which the terminal operation is invoked. The sequential or parallel mode of a stream can be determined with the BaseStream.isParallel() method, and the stream's mode can be modified with the BaseStream.sequential() and BaseStream.parallel() operations. The most recent sequential or parallel mode setting applies to the execution of the entire stream pipeline.
As for your problem, you can collect everything into an intermediate List and start a new sequential pipeline:
new Random().ints(100).boxed()
        .parallel()
        .map(this::slowOperation)
        .collect(Collectors.toList())
        // Start new stream here
        .stream()
        .map(Function.identity())//some fast operation, but must be in single thread
        .collect(Collectors.toSet());
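The two-phase approach above can be checked with a small self-contained sketch. The class name TwoPhasePipeline and the body of slowOperation are hypothetical stand-ins, not the asker's code; the sketch only records which threads run the second phase.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TwoPhasePipeline {
    // Hypothetical stand-in for the slow, side-effect-free operation.
    static int slowOperation(int value) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return value * 2;
    }

    // Returns the names of the threads that ran the second (sequential) phase.
    static Set<String> run() {
        Set<String> fastPhaseThreads = ConcurrentHashMap.newKeySet();
        List<Integer> intermediate = IntStream.range(0, 100).boxed()
                .parallel()
                .map(TwoPhasePipeline::slowOperation)
                .collect(Collectors.toList());   // terminal operation ends the parallel pipeline
        intermediate.stream()                    // a new pipeline, sequential by default
                .forEach(v -> fastPhaseThreads.add(Thread.currentThread().getName()));
        return fastPhaseThreads;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the second pipeline is sequential, its forEach runs on the calling thread only, so the returned set holds a single name. The cost is the intermediate List, which is exactly the temporary collection the question hoped to avoid.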

In the current implementation, a Stream is either all parallel or all sequential. The Javadoc isn't explicit about this, and it could change in the future, but it does say this is possible.
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
If you need the function to be single threaded, I suggest you use a Lock or synchronized block/method.
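A minimal sketch of the synchronized variant, assuming the real requirement is mutual exclusion rather than one specific thread. The class GuardedStage and both operations are hypothetical. Note that synchronized guarantees only that one thread at a time runs the fast operation; different pool threads will still take turns calling it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class GuardedStage {
    // Plain ArrayList: not thread-safe, so all access goes through synchronized methods.
    private final List<Integer> sink = new ArrayList<>();

    // Hypothetical fast operation with a side effect, guarded by the instance lock.
    private synchronized int fastOperation(int value) {
        sink.add(value);
        return value;
    }

    private synchronized int sinkSize() {
        return sink.size();
    }

    // Hypothetical slow, side-effect-free operation.
    private static int slowOperation(int value) {
        return value * 2;
    }

    // Runs the whole pipeline in parallel and returns how many side effects were recorded.
    static int run(int n) {
        GuardedStage stage = new GuardedStage();
        IntStream.range(0, n).boxed()
                .parallel()
                .map(GuardedStage::slowOperation)
                .map(stage::fastOperation)
                .forEach(v -> { });   // forEach always traverses the elements
        return stage.sinkSize();
    }

    public static void main(String[] args) {
        System.out.println(run(1000));
    }
}
```

If the fast operation is cheap, contention on the lock should be small compared with the slow stage, but that depends on the workload.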

Related

Java ParallelStream: several map or single map

Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering whether it's possible to make it more multi-threaded.
Several small map
I was wondering whether using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
.map(Document::parse)
.map(InsertOneModel::new)
.toList();
Single big map
For example a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
.toList();
Question
Is one of the solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation. So there is almost no difference between one big map() call or several small map() calls.
And for the map() operation, a parallel Stream makes no difference at all: each input object will be processed through to the end of the chain by the same thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.
I don't think it will do any better if you chain it with multiple maps. In case your code is not very complex I would prefer to use a single big map.
To understand this we have to check the code inside the map function.
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                 StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, a lot happens behind the scenes: multiple objects are created and multiple methods are called. All of this is repeated for each chained map call.
Now, coming back to parallel streams: they work on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
ParallelStream calls spliterator() on the collection object which returns a Spliterator implementation that provides the logic of splitting a task. Every source or collection has their own spliterator implementations. Using these spliterators, parallel stream splits the task as long as possible and finally when the task becomes too small it executes it sequentially and merges partial results from all the sub tasks.
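The splitting step described above can be observed directly by calling trySplit() by hand. This is only an illustrative sketch with a hypothetical class name, SplitDemo; the stream framework decides on its own how often to split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    // Splits the source once and returns {size of the split-off part, size of the remainder}.
    static long[] splitOnce(List<Integer> source) {
        Spliterator<Integer> rest = source.spliterator();
        Spliterator<Integer> prefix = rest.trySplit();   // null if the source declines to split
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] { prefixSize, rest.estimateSize() };
    }

    public static void main(String[] args) {
        long[] sizes = splitOnce(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        System.out.println(sizes[0] + " + " + sizes[1]);
    }
}
```

For an array-backed list the two parts together cover all elements, and each part can be split again and handed to a different worker thread.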
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
I have performance issues with the existing implementation
I don't already have multi-threaded processing running, as it would add to the complexity.
Performance Implications
Overhead: Sometimes, when the data set is small, converting a sequential stream into a parallel one results in worse performance. The overhead of managing threads, sources and results is more expensive than doing the actual work.
Splitting: Arrays can split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping to sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
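The equivalence of the two forms can be checked with a small sketch. F and G are hypothetical stand-ins for the real steps (toJson, parse, new), and MapFusion is an invented class name.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapFusion {
    static final Function<Integer, Integer> F = x -> x + 1;    // hypothetical first step
    static final Function<Integer, String> G = x -> "#" + x;   // hypothetical second step

    // Two chained map() calls.
    static List<String> chained(List<Integer> in) {
        return in.parallelStream().map(F).map(G).collect(Collectors.toList());
    }

    // One map() call with the composed function: G.compose(F) applies F first, then G.
    static List<String> composed(List<Integer> in) {
        return in.parallelStream().map(G.compose(F)).collect(Collectors.toList());
    }

    static List<Integer> sample() {
        return IntStream.range(0, 1000).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(chained(sample()).equals(composed(sample())));
    }
}
```

Both methods produce the same list, so the choice between them comes down to readability rather than results.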
However, in general you should be careful using Collection.parallelStream. It uses the common forkJoinPool, essentially a fixed pool of threads shared across the entire JVM. The size of the pool is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may also be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down - if another part of the code has temporarily exhausted the common thread pool, for example.
It is often preferable to create your own ExecutorService by using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);
public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
final Collection<Callable<InsertOneModel<Document>>> callables =
puzzles.map(puzzle ->
(Callable<InsertOneModel<Document>>)
() -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
).collect(Collectors.toList());
return EX_SVC.invokeAll(callables).stream()
.map(fut -> {
try {
return fut.get();
} catch (ExecutionException|InterruptedException ex) {
throw new RuntimeException(ex);
}
}).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved that the second style performs better, I would still prefer to see and use the first style in code I had to maintain.
The first, multi-map style is easier for others to understand, easier to maintain and easier to debug, for example by adding peek stages at any point in the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
.map(gson::toJson)
// easy to make changes for debug, moving peek up/down
// .peek(System.out::println)
.map(Document::parse)
// easy to filter:
// .filter(this::somecondition)
.map(InsertOneModel::new)
.toList();
If your requirements change, such as needing to filter the output or to capture the intermediate data by splitting into 2 collections, the first approach beats the second every time.

Collecting java stream matters if underlying stream is parallel or not

I have the following function:
public Stream getStream(boolean isParallel) {
...
return someSteamFromHere;
}
This function will return a parallel stream if "isParallel" is true, otherwise a sequential stream. Now I want to collect this parallel/sequential stream. Does the caller function need to implement this logic:
boolean isParallel = isParallel();
Stream stream = getStream(isParallel);
List list;
if (isParallel) {
list = stream.parallel().collect(Collectors.toList());
} else {
list = stream.collect(Collectors.toList());
}
Or can I simply collect the stream regardless, so that if it's parallel it will be collected in parallel, and if it's sequential it will be collected in a single thread?
Parallelism is a property of the stream. So, if you have a parallel stream, calling .parallel() on it is a no-op. It does absolutely nothing whatsoever.
Note that with a parallel stream the order in which elements are processed is right out the window, although Collectors.toList() still returns the elements in encounter order when the stream is ordered.
Your code can just be List list = stream.collect(Collectors.toList());.
Note that as a general rule, if parallelism matters at all, collecting it into a list seems... bizarre. Whatever performance benefits you think you're getting from treating it parallel are pretty much obliterated when you do this.
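A small sketch of the caller not branching on the mode. The factory getStream here is a hypothetical version of the one in the question, and CollectEitherWay is an invented class name.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class CollectEitherWay {
    // Hypothetical factory mirroring getStream(boolean) from the question.
    static Stream<Integer> getStream(boolean parallel) {
        Stream<Integer> s = IntStream.range(0, 1000).boxed();
        return parallel ? s.parallel() : s;
    }

    // The caller does not check the flag again; the stream carries its own mode.
    static List<Integer> collect(Stream<Integer> s) {
        return s.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(getStream(true).isParallel());
        System.out.println(collect(getStream(true)).size());
        System.out.println(collect(getStream(false)).size());
    }
}
```

The same collect call works for both modes, so the if/else in the question can be dropped.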
Why do you pass in the boolean to the function if you use it after the function's return? Either the function receives the boolean and uses it or it doesn't get it and the test sits outside as you wrote.
Btw, functions with boolean parameters are often considered a code smell, as they tend to do more than one thing.

Join multiple parallel object into a single List

I have a map of key-value pairs. I iterate over the keys, call a service for each one and, based on the response, add the results to some uberList.
How can I execute the different operations concurrently? Will changing stream() to parallelStream() do the trick? Does it synchronize when it adds to uberList?
The idea is to minimize the response time.
List<MyClass> uberList = new LinkedList<>();
Map<String, List<MyOtherClass>> map = new HashMap();
//Populate map
map.entrySet().stream().filter(s -> s.getValue().size() > 0 && s.getValue().values().size() > 0).forEach(
y -> {
// Do stuff
if(noError) {
uberList.add(MyClass3);
}
}
);
//Do stuff on uberList
How can I execute the different operations concurrently?
One thread can do one task at a time. If you want to do multiple operations concurrently, you have to offload the work to other threads.
You can either create new Thread instances yourself or use an ExecutorService to manage a thread pool, queue the tasks and execute them for you.
Will changing stream() to parallelStream() do the trick?
Yes, it does. Internally, parallelStream() uses ForkJoinPool.commonPool() to run the tasks for you. But keep in mind that parallelStream() does not guarantee that the returned stream is parallel (although the current implementation does return a parallel one).
Does it synchronize when it adds to uberList?
It's up to you to do the synchronization in the forEach pipeline. Normally you do not want to call collection.add() inside forEach to build a collection. Instead you should use the .map().collect(toX()) methods, which free you from the synchronization part:
It does not need to know about your local variable (in this case uberList), and it does not modify it during execution, which helps avoid a lot of strange concurrency bugs.
You can freely change the type of collection in the .collect() part, which gives you more control over the result type.
It does not require a thread-safe or synchronized collection when used with a parallel stream, because "multiple intermediate results may be instantiated, populated, and merged so as to maintain isolation of mutable data structures" (Read more about this here)
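The contrast between the two styles can be sketched as follows. The class and method names are hypothetical, and integers stand in for the service responses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CollectVsForEach {
    // forEach with an external list: the list itself must be thread-safe,
    // and the order of the added elements is not defined.
    static List<Integer> viaForEach(int n) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, n).boxed().parallel().forEach(out::add);
        return out;
    }

    // collect: each worker fills its own partial list and the partial lists are merged,
    // so no external synchronization is needed.
    static List<Integer> viaCollect(int n) {
        return IntStream.range(0, n).boxed().parallel().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(viaForEach(1000).size());
        System.out.println(viaCollect(1000).size());
    }
}
```

With a plain ArrayList in viaForEach, elements could be lost or an exception thrown, which is why the collect form is usually the safer default.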
So what you want is to execute multiple similar service call at the same time and collect your result into a list.
You can do it simply by parallel stream:
uberList = map.entrySet().stream()
        .parallel() // Use .stream().parallel() to force parallelism. .parallelStream() does not guarantee that the returned stream is a parallel stream
        .filter(yourCondition)
        .map(e -> yourService.methodCall(e))
        .collect(Collectors.toList());
Pretty cool, isn't it?
But as I stated, the default parallel stream uses ForkJoinPool.commonPool() for queueing and executing tasks.
The bad part is that if yourService.methodCall(e) does heavy IO (such as an HTTP call or even a database call) or is a long-running task, it may exhaust the pool, and other incoming tasks will be queued for a long time waiting for execution.
So typically all other tasks that depend on this common pool (not only your own yourService.methodCall(e), but every other parallel stream) will slow down because of the queueing time.
To solve this problem, you can force the parallel execution onto your own fork-join pool:
ForkJoinPool forkJoinPool = new ForkJoinPool(4); // Typically set it to Runtime.getRuntime().availableProcessors()
uberList = forkJoinPool.submit(() -> {
    return map.entrySet().stream()
            .parallel() // Use .stream().parallel() to force parallelism. .parallelStream() does not guarantee that the returned stream is a parallel stream
            .filter(yourCondition)
            .map(e -> yourService.methodCall(e))
            .collect(Collectors.toList());
}).get();
You probably don't want to use parallelStream for concurrency, only for parallelism. (That is: use it for tasks where you want to use multiple physical processors efficiently on a task that's conceptually sequential, not for tasks where you want multiple things going on at the same time conceptually.)
In your case you would probably be better off using an ExecutorService, or more specifically com.google.common.util.concurrent.ListeningExecutorService from Google Guava (warning: I haven't tried to compile the code below, so there may be syntax errors):
int MAX_NUMBER_OF_SIMULTANEOUS_REQUESTS = 100;
ListeningExecutorService myExecutor =
MoreExecutors.listeningDecorator(
Executors.newFixedThreadPool(MAX_NUMBER_OF_SIMULTANEOUS_REQUESTS));
List<ListenableFuture<Optional<MyClass>>> futures = new ArrayList<>();
for (Map.Entry<String, List<MyOtherClass>> entry : map.entrySet()) {
if (entry.getValue().size() > 0 && entry.getValue().values().size() > 0) {
futures.add(myExecutor.submit(() -> {
// Do stuff
if(noError) {
return Optional.of(MyClass3);
} else {
return Optional.empty();
}
}));
}
}
List<MyClass> uberList = Futures.successfulAsList(futures)
.get(1, TimeUnit.MINUTES /* adjust as necessary */)
.stream()
.filter(Optional::isPresent)
.map(Optional::get)
.collect(Collectors.toList());
The advantage of this code is that it allows you to explicitly specify that the tasks should all start at the "same time" (at least conceptually) and allows you to control your concurrency explicitly (how many simultaneous requests are allowed? What do we do if some of the tasks fail? How long are we willing to wait? etc). Parallel streams aren't really for that.
A parallel stream will help with concurrent execution. But it is not recommended to use a forEach loop and add elements to an outside list. If you do that, you have to make sure the external list is synchronized. A better way is to use map and collect the result into a list. In this case, the parallel stream takes care of the synchronization.
List<MyClass> uberList = map.entrySet().parallelStream()
        .filter(s -> s.getValue().size() > 0 && s.getValue().values().size() > 0)
        .map(y -> {
            // Do stuff
            return MyClass3;
        })
        .filter(t -> /* check the no-error condition */)
        .collect(Collectors.toList());

Does the ordering of calls to sequential() and parallel() matter when processing a Java 8 stream pipeline?

Does the placement of calls to sequential() and parallel() change how a Java 8 stream's pipeline is executed?
For example, suppose I have this code:
new ArrayList().stream().parallel().filter(...).count();
In this example, it's pretty clear that filter() will run in parallel. However, what if I have this code:
new ArrayList().stream().filter(...).parallel().count();
Does filter() still run in parallel or does it run sequentially? The reason it's not clear is because intermediate operations like filter() are lazy, i.e., they won't run until a terminal operation is invoked like count(). As such, by the time count() is invoked, we have a parallel stream pipeline but is filter() performed sequentially because it came before the call to parallel()?
Note the end of the Stream’s class documentation:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution. (For example, Collection.stream() creates a sequential stream, and Collection.parallelStream() creates a parallel one.) This choice of execution mode may be modified by the BaseStream.sequential() or BaseStream.parallel() methods, and may be queried with the BaseStream.isParallel() method.
In other words, calling sequential() or parallel() only changes a property of the stream and its state at the point when the terminal operation is commenced determines the execution mode of the entire pipeline.
This might not be documented that clearly at all places, because, it wasn’t always so. In the early development there were prototypes having different mode for the stages. This mail from March 2013 explains the change.
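The claim that the mode at the terminal operation governs the whole pipeline can be probed with a sketch like the one below. ModePlacement is a hypothetical class; it records which threads evaluate a filter that is placed before the call to parallel().

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class ModePlacement {
    // Names of the threads that evaluated the filter predicate during the last run.
    static final Set<String> FILTER_THREADS = ConcurrentHashMap.newKeySet();

    static long run() {
        FILTER_THREADS.clear();
        return IntStream.range(0, 10_000)
                .filter(i -> {
                    FILTER_THREADS.add(Thread.currentThread().getName());
                    return i % 2 == 0;
                })
                .parallel()   // placed after filter(), yet it sets the mode of the whole pipeline
                .count();
    }

    public static void main(String[] args) {
        System.out.println(run());
        System.out.println(FILTER_THREADS);
    }
}
```

On a multi-core machine the printed set will typically contain the main thread plus several ForkJoinPool.commonPool-worker threads, which indicates that filter() itself ran in parallel. The exact names and their number vary between runs and machines.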
It appears that at least in the standard Oracle Java 8 implementation, although the parallel() method is defined as an "intermediate operation", it is not exactly lazy. That is, it has an immediate effect, regardless of whether you have a terminal operation or not. Consider the following example:
public class SimpleTest {
public static void main(String[] args) {
Stream<Integer> s = Stream.of(1,2,3,4,5,6,7,8,9,10);
System.out.println(s.isParallel());
Stream<Integer> s1 = s.parallel();
System.out.println(s.isParallel());
System.out.println(s == s1);
}
}
The output on my machine is:
false
true
true
Which tells us that parallel() immediately changes the state of the underlying stream (and returns that stream).
However, the Javadoc is written in such a way that it allows this, but does not require this. Which means that other stream implementations are free to execute the operations before the parallel() operations in a different execution mode than those after it.
In short, it's not a behavior you can rely on, either way.

Java-8 Stream returned by .map will be parallel or sequential?

Is the Stream returned by the map or mapToObj methods always sequential, or does it depend on whether the calling stream was parallel?
The documentation of IntStream does not answer this explicitly or I cannot understand it properly:
I am wondering if my stream from the following example will be parallel up to the end or it will change at some point.
IntStream.range(1, array_of_X.size())
.parallel()
.mapToObj (index -> array_of_X.get(index)) // mapping#1
.filter (filter_X)
.map (X_to_Y) //mapping#2
.filter (filter_Y)
.mapToInt (obj_Y_to_int) //mapping#3
.collect(value -> Collectors.summingInt(value));
No, it will never change (unless you explicitly change it yourself).
What you have written corresponds to a Stream pipeline and a single pipeline has a single orientation: parallel or sequential. So there is no "parallel up to the end" because either the whole pipeline will be executed in parallel or it will be executed sequentially.
Quoting the Stream package Javadoc:
The only difference between the serial and parallel versions of this example is the creation of the initial stream, using "parallelStream()" instead of "stream()". When the terminal operation is initiated, the stream pipeline is executed sequentially or in parallel depending on the orientation of the stream on which it is invoked. Whether a stream will execute in serial or parallel can be determined with the isParallel() method, and the orientation of a stream can be modified with the BaseStream.sequential() and BaseStream.parallel() operations. When the terminal operation is initiated, the stream pipeline is executed sequentially or in parallel depending on the mode of the stream on which it is invoked.
This means that the only way for a Stream pipeline to change its orientation is by calling one of sequential() or parallel() method. Since this is global to the Stream API, this is not written for every operation but in the package Javadoc instead.
With the code in your question, the Stream pipeline will be executed in parallel because you explicitly changed the Stream orientation by invoking parallel().
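A compilable sketch of a pipeline with the same shape as the one in the question. The data, the predicates and the class name are hypothetical, the range starts at 0 rather than 1, and sum() is used as the terminal operation because IntStream has no single-argument collect.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

public class WholePipelineParallel {
    // Sums the lengths of all non-empty words whose length is even.
    static int totalEvenLengths(List<String> words) {
        return IntStream.range(0, words.size())
                .parallel()                          // applies to every stage below
                .mapToObj(words::get)                // mapping #1: IntStream -> Stream<String>
                .filter(w -> !w.isEmpty())
                .map(String::length)                 // mapping #2: Stream<String> -> Stream<Integer>
                .filter(len -> len % 2 == 0)
                .mapToInt(Integer::intValue)         // mapping #3: Stream<Integer> -> IntStream
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(totalEvenLengths(Arrays.asList("ab", "abc", "", "abcd")));
    }
}
```

None of the mapping stages resets the mode: the stream stays parallel through mapToObj, map and mapToInt until the terminal operation runs.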
It is important to note that the resulting orientation of the Stream is determined by the last call made to parallel() or sequential(). Consider the following three examples:
public static void main(String[] args) {
System.out.println(IntStream.range(0, 10).isParallel());
System.out.println(IntStream.range(0, 10).parallel().isParallel());
System.out.println(IntStream.range(0, 10).parallel().map(i -> 2*i).sequential().isParallel());
}
The first one will print false, since IntStream.range returns a sequential Stream.
The second one will print true, since we invoked parallel().
The third one will print false: even though we called parallel() in the pipeline, we invoked sequential() afterwards, so it reset the orientation of the whole Stream pipeline to serial.
Note that, still quoting:
The stream implementations in the JDK create serial streams unless parallelism is explicitly requested.
so every Stream you retrieve will be sequential unless you explicitly request a parallel Stream.
