Does collecting a Java stream depend on whether the underlying stream is parallel or not? - java

I have the following function:
public Stream getStream(boolean isParallel) {
    ...
    return someStreamFromHere;
}
This function will return a parallel stream if "isParallel" is true, otherwise a sequential stream. Now I want to collect this parallel/sequential stream. Does the caller function need to implement this logic:
boolean isParallel = isParallel();
Stream stream = getStream(isParallel);
List list;
if (isParallel) {
    list = stream.parallel().collect(Collectors.toList());
} else {
    list = stream.collect(Collectors.toList());
}
Or can I simply collect the stream regardless, and if it's parallel it will be collected in parallel, and if it's sequential it will be collected in a single thread?

Parallelism is a property of the stream. So, if you have a parallel stream, calling .parallel() on it is a no-op. It does absolutely nothing whatsoever.
Note that with a parallel stream any concept of processing 'order' goes right out the window (an ordered source collected with Collectors.toList() does, however, still keep its encounter order).
Your code can just be List list = stream.collect(Collectors.toList());.
Note that, as a general rule, if parallelism matters at all, collecting into a list seems... bizarre. Whatever performance benefit you think you're getting from processing it in parallel is pretty much obliterated when you do this.
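For illustration, the simplified caller could look like this (a minimal sketch, assuming the getStream method from the question):

// works the same whether getStream returned a sequential or a parallel stream;
// the collect terminal operation simply honors whatever mode the stream already has
List list = getStream(isParallel).collect(Collectors.toList());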

Why do you pass the boolean into the function if you only use it after the function returns? Either the function receives the boolean and uses it, or it doesn't get it and the test sits outside, as you wrote.
Btw, functions with boolean parameters are considered a code smell, as they clearly do more than one thing. Have a look here.

Related

JAVA 8 Array Stream Results Timing Longer Than Sequencing [duplicate]

I can create a Stream from an array using Arrays.stream(array) or Stream.of(values). Similarly, is it possible to create a ParallelStream directly from an array, without creating an intermediate collection as in Arrays.asList(array).parallelStream()?
Stream.of(array).parallel()
or
Arrays.stream(array).parallel()
TLDR;
Any sequential Stream can be converted into a parallel one by calling .parallel() on it. So all you need is:
Create a stream
Invoke the parallel() method on it.
Long answer
The question is pretty old, but I believe some additional explanation will make the things much clearer.
All implementations of Java streams implement the BaseStream interface, which, per its JavaDoc, is:
Base interface for streams, which are sequences of elements supporting sequential and parallel aggregate operations.
From the API's point of view there is no difference between sequential and parallel streams; they share the same aggregate operations.
In order to distinguish between sequential and parallel streams, the aggregate operations call the BaseStream::isParallel method.
Let's explore the implementation of the isParallel method in AbstractPipeline:
@Override
public final boolean isParallel() {
    return sourceStage.parallel;
}
As you can see, the only thing isParallel does is check the boolean flag in the source stage:
/**
* True if pipeline is parallel, otherwise the pipeline is sequential; only
* valid for the source stage.
*/
private boolean parallel;
So what does the parallel() method do then? How does it turn a sequential stream into a parallel one?
@Override
@SuppressWarnings("unchecked")
public final S parallel() {
    sourceStage.parallel = true;
    return (S) this;
}
Well it only sets the parallel flag to true. That's all it does.
As you can see, in the current implementation of the Java Stream API it doesn't matter how you create a stream (or receive it as a method parameter): you can always turn it into a parallel one at zero cost.
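A quick way to see this in action (my own small sketch, not part of the original answer):

Stream<String> stream = Stream.of("a", "b", "c"); // sequential by default
System.out.println(stream.isParallel());          // false
stream = stream.parallel();                       // only flips the flag on the source stage
System.out.println(stream.isParallel());          // true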

How to safely consume Java Streams without isFinite() and isOrdered() methods?

There is a question on whether Java methods should return Collections or Streams, in which Brian Goetz answers that, even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be performed safely, and defensive code guards are not possible, because Streams do not reveal whether they are infinite or unordered.
If parallelism were a problem for the operations I want to perform on a Stream, I could call isParallel() to check, or sequential() to make sure the computation is not done in parallel (if I remember to).
But if orderedness or finiteness (sizedness) is relevant to the safety of my program, I cannot write safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
    public Stream<Integer> coordinates();

    // example implementations:

    // finite, ordered, sequential
    // IntStream.range(0, 100).boxed()

    // infinite, unordered, sequential
    // final AtomicInteger atomic = new AtomicInteger();
    // Stream.generate(() -> atomic.incrementAndGet())

    // infinite, unordered, parallel
    // Stream.generate(() -> atomic.incrementAndGet()).parallel()

    // finite, ordered, sequential, should-be-closed
    // Files.lines(Path.of("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems that if I want to write the elements to a file as a side effect, I need to be concerned about the stream being parallel:
// if the stream is parallel, in which order will the elements be written to the file?
coordinates().peek(i -> writeToFile(i)).count();
// how should I remember to always add sequential() in such cases?
And also, if it is parallel, which thread pool is it parallel on?
If I want to sort the stream (or apply other non-short-circuiting operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will the result list maintain the same order as a sequential run?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. But how can I guard against this, other than not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (or from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to be able to check whether the stream has already been consumed.
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like:
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
        stream,
        /* max threshold of elements before IllegalArgumentException */
        10_000,
        /* fail with IllegalArgumentException if not ordered */
        true
);
After looking at things a bit (some experimentation, and here), as far as I can see there is no way to know definitively whether a stream is finite or not.
More than that, sometimes it is not even determined until runtime (such as, in Java 11, IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty that it is finite, in a few ways (note that receiving false from these does not mean it is infinite, only that it may be so):
stream.spliterator().getExactSizeIfKnown() - if this has a known exact size, it is finite; otherwise it will return -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - if it is SIZED, this will return true.
You can safeguard yourself by assuming the worst (depending on your case).
stream.sequential() / stream.parallel() - explicitly set your preferred consumption mode.
With a potentially infinite stream, assume your worst case in each scenario.
For example, assume you want to listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. In this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something with all the elements, or only to try a certain number of them (similar to a timeout). For this, I'd recommend always calling stream.limit(x) before your terminal operation (collect, allMatch or similar), where x is the number of elements you're willing to tolerate. A rough illustration of both safeguards follows below.
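Here is my own sketch of those safeguards; server stands for an instance of the question's CoordinateServer, and 10_000 is an arbitrary threshold:

Stream<Integer> coordinates = server.coordinates();
Spliterator<Integer> spliterator = coordinates.spliterator();

// finite for sure only if the spliterator reports an exact size or the SIZED characteristic
boolean knownFinite = spliterator.getExactSizeIfKnown() >= 0
        || spliterator.hasCharacteristics(Spliterator.SIZED);
System.out.println("known to be finite: " + knownFinite);

// rebuild a stream from the spliterator, force sequential mode, and cap the element count
// as a worst-case guard before any non-short-circuiting operation
List<Integer> result = StreamSupport.stream(spliterator, false)
        .limit(10_000)
        .collect(Collectors.toList());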
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.

Calling sequential on parallel stream makes all previous operations sequential

I've got a significant set of data, and want to call a slow but clean method and then call a fast method with side effects on the result of the first one. I'm not interested in intermediate results, so I would like not to collect them.
The obvious solution is to create a parallel stream, make the slow call, make the stream sequential again, and make the fast call. The problem is that ALL the code executes in a single thread; there is no actual parallelism.
Example code:
@Test
public void testParallelStream() throws ExecutionException, InterruptedException
{
    ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors() * 2);
    Set<String> threads = forkJoinPool.submit(() -> new Random().ints(100).boxed()
            .parallel()
            .map(this::slowOperation)
            .sequential()
            .map(Function.identity()) // some fast operation, but must be in a single thread
            .collect(Collectors.toSet())
    ).get();
    System.out.println(threads);
    Assert.assertEquals(Runtime.getRuntime().availableProcessors() * 2, threads.size());
}
private String slowOperation(int value)
{
    try
    {
        Thread.sleep(100);
    }
    catch (InterruptedException e)
    {
        e.printStackTrace();
    }
    return Thread.currentThread().getName();
}
If I remove sequential, the code executes as expected, but, obviously, the non-parallel operation would be called from multiple threads.
Could you recommend some references about such behavior, or perhaps some way to avoid temporary collections?
Switching the stream from parallel() to sequential() worked in the initial Stream API design, but caused many problems and finally the implementation was changed, so it just turns the parallel flag on and off for the whole pipeline. The current documentation is indeed vague, but it was improved in Java-9:
The stream pipeline is executed sequentially or in parallel depending on the mode of the stream on which the terminal operation is invoked. The sequential or parallel mode of a stream can be determined with the BaseStream.isParallel() method, and the stream's mode can be modified with the BaseStream.sequential() and BaseStream.parallel() operations. The most recent sequential or parallel mode setting applies to the execution of the entire stream pipeline.
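A tiny illustration of that rule (my own sketch, not from the quoted documentation):

boolean parallel = Stream.of(1, 2, 3)
        .parallel()   // earlier setting...
        .map(i -> i * 2)
        .sequential() // ...is overridden by the most recent one
        .isParallel();
System.out.println(parallel); // false: the whole pipeline will run sequentially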
As for your problem, you can collect everything into an intermediate List and start a new sequential pipeline:
new Random().ints(100).boxed()
    .parallel()
    .map(this::slowOperation)
    .collect(Collectors.toList())
    // Start new stream here
    .stream()
    .map(Function.identity()) // some fast operation, but must be in a single thread
    .collect(Collectors.toSet());
In the current implementation a Stream is either all parallel or all sequential. While the Javadoc isn't explicit about this, and it could change in the future, it does say this is possible.
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
If you need the function to be single threaded, I suggest you use a Lock or synchronized block/method.
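If you go that route, a minimal sketch (my own, reusing the slowOperation from the question) keeps the whole pipeline parallel and serializes only the fast, side-effecting call:

private final Object lock = new Object();

private String fastOperation(String value)
{
    synchronized (lock) // only one thread at a time runs the side-effecting part
    {
        // ... side effects here ...
        // note: calls may still come from different worker threads, just never concurrently
        return value;
    }
}

// usage: the pipeline stays parallel end to end, only fastOperation is serialized
Set<String> result = new Random().ints(100).boxed()
        .parallel()
        .map(this::slowOperation)
        .map(this::fastOperation)
        .collect(Collectors.toSet());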

Sequential streams and shared state

The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
    .filter(x -> [some boolean expression])
    .forEach(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        } else {
            map.put(x, aStringFunction(x));
        }
    });
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous about whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization and concurrency-safe collections. So the current code is safe. Note, however, that it just looks dirty.
If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
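For completeness, the toMap overload with a merge function that the question alludes to could look roughly like this (a sketch only: somePredicate stands in for the question's filter expression, and UserDefinedException is assumed to be unchecked, as it must be in the question's forEach as well):

Map<SomeClass, String> map = list.stream()
        .filter(x -> somePredicate(x))   // placeholder for the question's boolean expression
        .collect(Collectors.toMap(
                x -> x,                  // key mapper
                x -> aStringFunction(x), // value mapper
                (first, second) -> {     // merge function: called only when a duplicate key appears
                    throw new UserDefinedException("duplicates detected in input");
                }));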

Obtaining a parallel Stream from a Collection

Is it correct that with Java 8 you need to execute the following code to surely obtain a parallel stream from a Collection?
private <E> void process(final Collection<E> collection) {
    Stream<E> stream = collection.parallelStream().parallel();
    //processing
}
From the Collection API:
default Stream<E> parallelStream()
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
From the BaseStream API:
S parallel()
Returns an equivalent stream that is parallel. May return itself, either because the stream was already parallel, or because the underlying stream state was modified to be parallel.
Is it not awkward that I need to call a function that supposedly parallelizes the stream twice?
Basically the default implementation of Collection.parallelStream() does create a parallel stream. The implementation looks like this:
default Stream<E> parallelStream() {
    return StreamSupport.stream(spliterator(), true);
}
But this being a default method, it is perfectly valid for an implementing class to provide a different implementation that creates a sequential stream instead. For example, suppose I create a MySequentialArrayList:
class MySequentialArrayList extends ArrayList<String> {
    @Override
    public Stream<String> parallelStream() {
        return StreamSupport.stream(spliterator(), false);
    }
}
For an object of that class, the following code will print false as expected:
ArrayList<String> arrayList = new MySequentialArrayList();
System.out.println(arrayList.parallelStream().isParallel());
In this case, invoking the BaseStream#parallel() method ensures that the stream returned is always parallel. Either it was already parallel, or it is made parallel by setting the parallel field to true:
public final S parallel() {
    sourceStage.parallel = true;
    return (S) this;
}
This is the implementation of AbstractPipeline#parallel() method.
So the following code for the same object will print true:
System.out.println(arrayList.parallelStream().parallel().isParallel());
But if the stream is already parallel then yes, it is an extra method invocation, though one that ensures you always get a parallel stream. I've not yet dug much into the parallelization of streams, so I can't comment on what kinds of Collection, or in what cases, parallelStream() would give you a sequential stream.
