Is it correct that with Java 8 you need to execute the following code to surely obtain a parallel stream from a Collection?
private <E> void process(final Collection<E> collection) {
    Stream<E> stream = collection.parallelStream().parallel();
    // processing
}
From the Collection API:
default Stream<E> parallelStream()
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
From the BaseStream API:
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
Is it not awkward that I need to call a function that supposedly parallelizes the stream twice?
Basically the default implementation of Collection.parallelStream() does create a parallel stream. The implementation looks like this:
default Stream<E> parallelStream() {
    return StreamSupport.stream(spliterator(), true);
}
But since this is a default method, it is perfectly valid for an implementing class to override it and create a sequential stream instead. For example, suppose I create a MySequentialArrayList:
class MySequentialArrayList extends ArrayList<String> {
    @Override
    public Stream<String> parallelStream() {
        return StreamSupport.stream(spliterator(), false);
    }
}
For an object of that class, the following code will print false as expected:
ArrayList<String> arrayList = new MySequentialArrayList();
System.out.println(arrayList.parallelStream().isParallel());
In this case invoking BaseStream#parallel() method ensures that the stream returned is always parallel. Either it was already parallel, or it makes it parallel, by setting the parallel field to true:
public final S parallel() {
    sourceStage.parallel = true;
    return (S) this;
}
This is the implementation of AbstractPipeline#parallel() method.
So the following code for the same object will print true:
System.out.println(arrayList.parallelStream().parallel().isParallel());
If the stream is already parallel, then yes, it is an extra method invocation, but it ensures you always get a parallel stream. I haven't yet dug much into the parallelization of streams, so I can't comment on which kinds of Collection, or in which cases, parallelStream() would give you a sequential stream.
I can create a Stream from an array using Arrays.stream(array) or Stream.of(values). Similarly, is it possible to create a ParallelStream directly from an array, without creating an intermediate collection as in Arrays.asList(array).parallelStream()?
Stream.of(array).parallel()
or
Arrays.stream(array).parallel()
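A minimal sketch of both calls, in case it helps. The class and method names (ArrayParallelStreams, parallelSum, parallelCount) are made up for illustration. One thing worth knowing: for a primitive array such as int[], Stream.of(array) produces a Stream<int[]> with a single element, so Arrays.stream(array) is the one to use there.

```java
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical helper class, only for illustration.
class ArrayParallelStreams {

    // Primitive array: Arrays.stream gives an IntStream over the elements.
    static int parallelSum(int[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    // Object array: Stream.of (or Arrays.stream) gives a Stream<String>.
    static long parallelCount(String[] values) {
        return Stream.of(values).parallel().filter(s -> !s.isEmpty()).count();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(new int[] {1, 2, 3, 4}));          // 10
        System.out.println(parallelCount(new String[] {"a", "", "b"}));   // 2
        // Stream.of on an int[] wraps the whole array as one element.
        System.out.println(Stream.of(new int[] {1, 2, 3}).count());       // 1
    }
}
```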
TL;DR
Any sequential Stream can be converted into a parallel one by calling .parallel() on it. So all you need is:
Create a stream
Invoke method parallel() on it.
Long answer
The question is pretty old, but I believe some additional explanation will make things much clearer.
All implementations of Java streams implement the BaseStream interface, which the Javadoc describes as:
Base interface for streams, which are sequences of elements supporting sequential and parallel aggregate operations.
From the API's point of view there is no difference between sequential and parallel streams. They share the same aggregate operations.
To distinguish between sequential and parallel streams, the aggregate methods call the BaseStream::isParallel method.
Let's explore the implementation of isParallel method in AbstractPipeline:
@Override
public final boolean isParallel() {
    return sourceStage.parallel;
}
As you can see, the only thing isParallel does is check the boolean flag in the source stage:
/**
* True if pipeline is parallel, otherwise the pipeline is sequential; only
* valid for the source stage.
*/
private boolean parallel;
So what does the parallel() method do then? How does it turn a sequential stream into a parallel one?
@Override
@SuppressWarnings("unchecked")
public final S parallel() {
    sourceStage.parallel = true;
    return (S) this;
}
Well, it only sets the parallel flag to true. That's all it does.
As you can see, in the current implementation of the Java Stream API it doesn't matter how you create a stream (or receive it as a method parameter). You can always turn a stream into a parallel one at essentially zero cost.
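The flag flip can be observed from the outside with isParallel(). This is just a small sketch; the class and method names (ParallelFlagDemo, flags, countInParallel) are invented for the example.

```java
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical demo class, only for illustration.
class ParallelFlagDemo {

    // Returns {mode before parallel(), mode after parallel()}.
    static boolean[] flags() {
        Stream<Integer> stream = Arrays.asList(1, 2, 3).stream();
        boolean before = stream.isParallel();
        Stream<Integer> switched = stream.parallel();
        boolean after = switched.isParallel();
        return new boolean[] {before, after};
    }

    // A stream received as a parameter can be switched the same way.
    static <T> long countInParallel(Stream<T> stream) {
        return stream.parallel().count();
    }

    public static void main(String[] args) {
        boolean[] f = flags();
        System.out.println(f[0]); // false
        System.out.println(f[1]); // true
        System.out.println(countInParallel(Stream.of("a", "b", "c"))); // 3
    }
}
```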
I have a method
public boolean contains(int valueToFind, List<Integer> list) {
//
}
How can I split the list into x chunks and have a new thread search each chunk for the value? If one of the threads finds the value, I would like to stop the other threads from searching.
I see there are lots of examples for simply splitting work between threads, but how do I structure it so that once one thread returns true, all the other threads stop and true is returned as the answer?
I do not want to use parallel streams for this reason (from source):
If you do, please look at the previous example again. There is a big
error. Do you see it? The problem is that all parallel streams use
common fork-join thread pool, and if you submit a long-running task,
you effectively block all threads in the pool. Consequently, you block
all other tasks that are using parallel streams. Imagine a servlet
environment, when one request calls getStockInfo() and another one
countPrimes(). One will block the other one even though each of them
requires different resources. What's worse, you can not specify thread
pool for parallel streams; the whole class loader has to use the same
one.
You could use the built-in Stream API:
// For a List
public boolean contains(int valueToFind, List<Integer> list) {
    return list.parallelStream().anyMatch(Integer.valueOf(valueToFind)::equals);
}
// For an array
public boolean contains(int valueToFind, int[] arr) {
    return Arrays.stream(arr).parallel().anyMatch(x -> x == valueToFind);
}
Executing Streams in Parallel:
You can execute streams in serial or in parallel. When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate operations iterate over and process these substreams in parallel and then combine the results.
When you create a stream, it is always a serial stream unless otherwise specified. To create a parallel stream, invoke the operation Collection.parallelStream.
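If you would rather keep the search off the common fork-join pool, as the question asks, one possible shape is a dedicated ExecutorService with a shared flag that lets the other workers stop early. This is only a sketch under assumptions: the class name ChunkedSearch, the extra chunks parameter and the AtomicBoolean flag are my own illustration, and the list is assumed not to be modified while the search runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper class, only for illustration.
class ChunkedSearch {

    static boolean contains(int valueToFind, List<Integer> list, int chunks) {
        if (list.isEmpty()) {
            return false;
        }
        int workers = Math.max(1, Math.min(chunks, list.size()));
        int chunkSize = (list.size() + workers - 1) / workers; // ceiling division
        AtomicBoolean found = new AtomicBoolean(false);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (int start = 0; start < list.size(); start += chunkSize) {
            List<Integer> chunk = list.subList(start, Math.min(start + chunkSize, list.size()));
            tasks.add(() -> {
                for (Integer v : chunk) {
                    if (found.get()) {
                        break; // another worker already found the value
                    }
                    if (v == valueToFind) {
                        found.set(true);
                        break;
                    }
                }
                return null;
            });
        }

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            pool.invokeAll(tasks); // waits until every chunk has finished or bailed out
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return found.get();
    }

    public static void main(String[] args) {
        System.out.println(contains(7, Arrays.asList(1, 2, 7, 9), 2)); // true
        System.out.println(contains(5, Arrays.asList(1, 2, 7, 9), 2)); // false
    }
}
```

The workers are not interrupted here; they simply check the flag between elements, which is usually enough when each comparison is cheap.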
I have the following function:
public Stream getStream(boolean isParallel) {
    ...
    return someStreamFromHere;
}
This function will return a parallel stream if "isParallel" is true, otherwise a sequential stream. Now I want to collect this parallel/sequential stream. Does the caller function need to implement this logic:
boolean isParallel = isParallel();
Stream stream = getStream(isParallel);
List list;
if (isParallel) {
    list = stream.parallel().collect(Collectors.toList());
} else {
    list = stream.collect(Collectors.toList());
}
Or can I simply collect the stream regardless, so that if it's parallel it will be collected in parallel, and if it's sequential it will be collected in a single thread?
Parallelism is a property of the stream. So, if you have a parallel stream, calling .parallel() on it is effectively a no-op: the flag is already set and nothing changes.
Note that a parallel stream processes its elements in no particular order, although collect(Collectors.toList()) still assembles the result in encounter order when the source is ordered.
Your code can just be List list = stream.collect(Collectors.toList());.
Note that as a general rule, if parallelism matters at all, collecting it into a list seems... bizarre. Whatever performance benefits you think you're getting from treating it parallel are pretty much obliterated when you do this.
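To make the "just collect it" advice concrete, here is a small sketch. The names (CollectEitherMode, getStream, collect) are placeholders loosely modelled on the question, not the asker's real code. For an ordered source, collect(Collectors.toList()) keeps the encounter order in both modes, so the two lists come out equal.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical class, only for illustration.
class CollectEitherMode {

    // Stand-in for the question's getStream(boolean).
    static Stream<Integer> getStream(boolean parallel) {
        Stream<Integer> stream = IntStream.range(0, 1000).boxed();
        return parallel ? stream.parallel() : stream;
    }

    // The caller does not need to branch on the mode.
    static List<Integer> collect(boolean parallel) {
        return getStream(parallel).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(collect(true).equals(collect(false))); // true
    }
}
```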
Why do you pass in the boolean to the function if you use it after the function's return? Either the function receives the boolean and uses it or it doesn't get it and the test sits outside as you wrote.
Btw, functions with boolean parameters are considered code smell as they clearly do more than one thing. Have a look here.
I have read somewhere that a stream operation always returns a new collection at the terminal operation and doesn't change the original collection on which the stream operation was applied.
But in my case the original list has been modified.
return subscriptions.stream()
    .filter(alertPrefSubscriptionsBO -> (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT
        || alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.SECONDARY_CONTACT))
    .map(alertPrefSubscriptionsBO -> {
        if (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT) {
            alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.PRIMARY);
        } else {
            alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.SECONDARY);
        }
        return alertPrefSubscriptionsBO;
    })
    .collect(groupingBy(AlertPrefSubscriptionsBO::isActiveStatus,
        groupingBy(AlertPrefSubscriptionsBO::getAlertLabel,
            Collectors.mapping((AlertPrefSubscriptionsBO o) -> o.getType().getContactId(), toSet())
    )));
After this operation the subscriptions list had been modified and contained only AlertPrefContactTypeEnum.PRIMARY and AlertPrefContactTypeEnum.SECONDARY objects. I mean, the size of the list remained the same but the values changed.
That is because you are violating the contract of the map(Function<? super T,? extends R> mapper) method:
Parameters:
mapper - a non-interfering, stateless function to apply to each element
You're violating the "stateless" part:
Stateless behaviors
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline. An example of a stateful lambda is the parameter to map() in:
Set<Integer> seen = Collections.synchronizedSet(new HashSet<>());
stream.parallel().map(e -> { if (seen.add(e)) return 0; else return e; })...
Here, if the mapping operation is performed in parallel, the results for the same input could vary from run to run, due to thread scheduling differences, whereas, with a stateless lambda expression the results would always be the same.
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
The correct way to implement that map operation is to copy the alertPrefSubscriptionsBO and give the copy a new type.
Following the style used by the java.time classes (see, for example, all the withXxx(...) methods of ZonedDateTime), you would make or treat the alertPrefSubscriptionsBO object as immutable and give it methods that return a copy with one property changed. With a withType(...) method on the class and static imports of the AlertPrefContactTypeEnum constants, your code could be:
.map(bo -> bo.withType(bo.getType() == PRIMARY_CONTACT ? PRIMARY : SECONDARY))
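A rough sketch of what such a copy-returning class might look like. Everything here (WithTypeSketch, Subscription, Type, normalize) is a simplified stand-in invented for the example, not the asker's real AlertPrefSubscriptionsBO.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the question's classes, only for illustration.
class WithTypeSketch {

    enum Type { PRIMARY_CONTACT, SECONDARY_CONTACT, PRIMARY, SECONDARY }

    static final class Subscription {
        private final String label;
        private final Type type;

        Subscription(String label, Type type) {
            this.label = label;
            this.type = type;
        }

        String getLabel() { return label; }

        Type getType() { return type; }

        // Returns a copy with a different type; this instance is left untouched.
        Subscription withType(Type newType) {
            return new Subscription(label, newType);
        }
    }

    static List<Subscription> normalize(List<Subscription> source) {
        return source.stream()
            .map(s -> s.withType(s.getType() == Type.PRIMARY_CONTACT ? Type.PRIMARY : Type.SECONDARY))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Subscription> source = Arrays.asList(
            new Subscription("a", Type.PRIMARY_CONTACT),
            new Subscription("b", Type.SECONDARY_CONTACT));
        List<Subscription> result = normalize(source);
        System.out.println(source.get(0).getType()); // PRIMARY_CONTACT
        System.out.println(result.get(0).getType()); // PRIMARY
    }
}
```

Because map now produces new objects, the elements of the source list keep their original type.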
I've got a significant set of data and want to call a slow but clean method, and then call a fast method with side effects on the result of the first one. I'm not interested in intermediate results, so I would rather not collect them.
The obvious solution is to create a parallel stream, make the slow call, make the stream sequential again, and make the fast call. The problem is that ALL the code executes in a single thread; there is no actual parallelism.
Example code:
@Test
public void testParallelStream() throws ExecutionException, InterruptedException
{
    ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors() * 2);
    Set<String> threads = forkJoinPool.submit(() -> new Random().ints(100).boxed()
        .parallel()
        .map(this::slowOperation)
        .sequential()
        .map(Function.identity()) // some fast operation, but must be in single thread
        .collect(Collectors.toSet())
    ).get();
    System.out.println(threads);
    Assert.assertEquals(Runtime.getRuntime().availableProcessors() * 2, threads.size());
}

private String slowOperation(int value)
{
    try
    {
        Thread.sleep(100);
    }
    catch (InterruptedException e)
    {
        e.printStackTrace();
    }
    return Thread.currentThread().getName();
}
If I remove sequential(), the code executes as expected, but, obviously, the non-parallel operation is then called from multiple threads.
Could you recommend some references about such behavior, or maybe some way to avoid temporary collections?
Switching the stream from parallel() to sequential() worked in the initial Stream API design, but it caused many problems, and the implementation was finally changed so that these calls just turn the parallel flag on and off for the whole pipeline. The current documentation is indeed vague, but it was improved in Java 9:
The stream pipeline is executed sequentially or in parallel depending on the mode of the stream on which the terminal operation is invoked. The sequential or parallel mode of a stream can be determined with the BaseStream.isParallel() method, and the stream's mode can be modified with the BaseStream.sequential() and BaseStream.parallel() operations. The most recent sequential or parallel mode setting applies to the execution of the entire stream pipeline.
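The "most recent setting applies" rule is easy to check with isParallel(). A minimal sketch, with class and method names (LastModeWins and friends) invented for the example:

```java
import java.util.stream.IntStream;

// Hypothetical demo class, only for illustration.
class LastModeWins {

    // parallel() first, sequential() last: the whole pipeline is sequential.
    static boolean modeAfterSequential() {
        return IntStream.range(0, 10)
            .parallel()
            .map(x -> x * 2)
            .sequential()
            .isParallel();
    }

    // sequential() first, parallel() last: the whole pipeline is parallel.
    static boolean modeAfterParallel() {
        return IntStream.range(0, 10)
            .sequential()
            .map(x -> x * 2)
            .parallel()
            .isParallel();
    }

    public static void main(String[] args) {
        System.out.println(modeAfterSequential()); // false
        System.out.println(modeAfterParallel());   // true
    }
}
```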
As for your problem, you can collect everything into intermediate List and start new sequential pipeline:
new Random().ints(100).boxed()
    .parallel()
    .map(this::slowOperation)
    .collect(Collectors.toList())
    // Start new stream here
    .stream()
    .map(Function.identity()) // some fast operation, but must be in single thread
    .collect(Collectors.toSet());
In the current implementation a Stream is either all parallel or all sequential. The Javadoc isn't explicit about this, and it could change in the future, but it does say this is possible.
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
If you need the function to be single threaded, I suggest you use a Lock or synchronized block/method.
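A sketch of the synchronized variant, assuming the fast operation only needs mutual exclusion and does not have to run on one particular thread. The names (GuardedSideEffect, squares, the lock object) are made up, and x * x merely stands in for the slow, side-effect-free call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical demo class, only for illustration.
class GuardedSideEffect {

    static List<Integer> squares(int n) {
        Object lock = new Object();
        List<Integer> sink = new ArrayList<>();
        IntStream.range(0, n)
            .parallel()
            .map(x -> x * x)              // stand-in for the slow, pure operation
            .forEach(v -> {
                synchronized (lock) {     // the side effect runs one thread at a time
                    sink.add(v);
                }
            });
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(squares(100).size()); // 100
    }
}
```

The order in which elements reach the guarded block is not deterministic, so this only fits side effects that do not depend on ordering.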