Are java.util.stream.Collectors::joining implementations thread-safe? Can I do something like
public final class SomeClass {
    private static final Collector<CharSequence, ?, String> jc = Collectors.joining(",");

    public String someMethod(List<String> someList) {
        return someList.parallelStream().collect(jc);
    }
}
without fear of running into concurrency issues?
You can use this collector like any other collector provided in the Collectors class without fear of running into concurrency issues. A Collector need not care about thread safety unless it has the CONCURRENT characteristic. It only needs its functions to be non-interfering, stateless and associative. The Stream pipeline does the rest: it uses the collector functions in a way that requires no additional synchronization. In particular, when the accumulator or combiner function is called, it's guaranteed that no other thread is operating on the same accumulated value at that moment. This is specified in the Collector documentation:
Libraries that implement reduction based on Collector, such as Stream.collect(Collector), must adhere to the following constraints:
<...>
For non-concurrent collectors, any result returned from the result supplier, accumulator, or combiner functions must be serially thread-confined. This enables collection to occur in parallel without the Collector needing to implement any additional synchronization. The reduction implementation must manage that the input is properly partitioned, that partitions are processed in isolation, and combining happens only after accumulation is complete.
Note that the collector itself is stateless, as are the functions it provides, so it's also safe to keep it in a static field. The state is held in the external accumulator object, which is returned by the supplier and passed back to the accumulator, combiner and finisher. So even if the same collector is reused by several stream operations, they don't interfere.
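As a rough illustration of the point about reuse, the sketch below runs many parallel collections against one static joining collector and checks that each result matches the sequential String.join output. The class and method names (SharedJoiningCollector, join) are made up for this example.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SharedJoiningCollector {
    // One collector instance shared by every caller, as in the question.
    private static final Collector<CharSequence, ?, String> JOINER = Collectors.joining(",");

    static String join(List<String> items) {
        // Each collect() call asks the collector's supplier for fresh containers.
        return items.parallelStream().collect(JOINER);
    }

    public static void main(String[] args) {
        List<String> items = IntStream.range(0, 1_000)
                .mapToObj(Integer::toString)
                .collect(Collectors.toList());
        String expected = String.join(",", items);
        // Run many collections concurrently against the same static collector.
        boolean allEqual = IntStream.range(0, 100)
                .parallel()
                .allMatch(i -> join(items).equals(expected));
        System.out.println(allEqual);
    }
}
```

A passing run does not prove thread safety by itself; the guarantee comes from the Collector contract quoted above.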
Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering whether it's possible to make it more multi-threaded.
Several small map
I was wondering whether using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        .map(Document::parse)
        .map(InsertOneModel::new)
        .toList();
Single big map
For example a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
        .toList();
Question
Is one of these solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation. So there is almost no difference between one big map() call or several small map() calls.
And for the map() operation a parallel Stream makes no difference at all. Meaning each input object will be processed until the end by the same Thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.
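One way to observe the "same thread until the end" behavior is to record which thread handles each element in two consecutive map stages. This is only a sketch: the class and method names are invented, the input must contain distinct elements, and the result reflects how the OpenJDK implementation fuses stateless stages rather than a documented guarantee.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class MapFusionDemo {
    /** Returns true if every (distinct) element passed through both map stages on one thread. */
    static boolean sameThreadForBothStages(List<Integer> input) {
        Map<Integer, Thread> firstStage = new ConcurrentHashMap<>();
        Map<Integer, Thread> secondStage = new ConcurrentHashMap<>();
        // The maps are used for observation only; they do not affect the stream result.
        input.parallelStream()
                .map(i -> { firstStage.put(i, Thread.currentThread()); return i; })
                .map(i -> { secondStage.put(i, Thread.currentThread()); return i; })
                .collect(Collectors.toList());
        return firstStage.equals(secondStage);
    }

    public static void main(String[] args) {
        List<Integer> input = IntStream.range(0, 10_000).boxed().collect(Collectors.toList());
        System.out.println(sameThreadForBothStages(input));
    }
}
```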
I don't think it will do any better if you chain it with multiple maps. In case your code is not very complex I would prefer to use a single big map.
To understand this, we have to look at the code of the map method in the JDK's ReferencePipeline class:
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
            StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    // Apply the mapper and hand the result straight to the next stage.
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, a lot happens behind the scenes: several objects are created and several methods are called. All of this is repeated for each chained map call.
Now, coming back to parallel streams: they are built on the concept of parallelism.
Streams Documentation
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the common ForkJoinPool, whose default parallelism is one less than the number of processors returned by Runtime.getRuntime().availableProcessors(), because the thread that invokes the terminal operation also takes part in the work. You can change the size of this pool with the system property java.util.concurrent.ForkJoinPool.common.parallelism.
A parallel stream calls spliterator() on the collection object, which returns a Spliterator implementation that provides the splitting logic. Every source or collection has its own spliterator implementation. Using these spliterators, the parallel stream keeps splitting the task for as long as possible; when a subtask becomes too small to split further, it is executed sequentially, and the partial results from all subtasks are merged.
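To make the splitting step concrete, here is a small sketch that splits an ArrayList's spliterator once by hand and reports the sizes of the two parts. The stream framework does this recursively for you; the class and method names are only illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

final class SplitDemo {
    /** Splits the list's spliterator once and returns the estimated sizes of both parts. */
    static long[] splitSizes(List<Integer> list) {
        Spliterator<Integer> rest = list.spliterator();
        Spliterator<Integer> prefix = rest.trySplit(); // null if the source cannot be split
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] { prefixSize, rest.estimateSize() };
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        long[] sizes = splitSizes(list);
        System.out.println(sizes[0] + " + " + sizes[1]); // 4 + 4 for an 8-element ArrayList
    }
}
```

An array-backed list splits into two equal halves, which is the "cheap and even" splitting that makes such sources good candidates for parallel streams.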
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
There are performance issues with the existing implementation
I don't already have multi-threaded processes running, as parallelStream would add to the complexity.
Performance Implications
Overhead: When the dataset is small, converting a sequential stream into a parallel one can result in worse performance. The overhead of managing threads, sources and results can cost more than the actual work.
Splitting: Arrays can be split cheaply and evenly, while LinkedList has neither of these properties. TreeMap and HashSet split better than LinkedList but not as well as arrays.
Merging: The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping into sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
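The equivalence s.map(g.compose(f)) == s.map(f).map(g) can be checked on a toy example. The sketch below uses two simple stand-in functions instead of the gson and Document calls from the question; all names are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class MapCompositionDemo {
    /** Two chained map calls: first f, then g. */
    static <A, B, C> List<C> chained(List<A> in, Function<A, B> f, Function<B, C> g) {
        return in.stream().map(f).map(g).collect(Collectors.toList());
    }

    /** One map call with the composed function g(f(x)). */
    static <A, B, C> List<C> composed(List<A> in, Function<A, B> f, Function<B, C> g) {
        return in.stream().map(g.compose(f)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<Integer, Integer> f = x -> x + 1;
        Function<Integer, String> g = x -> "#" + x;
        List<Integer> in = Arrays.asList(1, 2, 3);
        System.out.println(chained(in, f, g));  // [#2, #3, #4]
        System.out.println(composed(in, f, g)); // [#2, #3, #4]
    }
}
```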
However, in general you should be careful when using Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM. The size of the pool is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down, for example if another part of the code has temporarily exhausted the common thread pool.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);

public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
    final Collection<Callable<InsertOneModel<Document>>> callables =
            puzzles.map(puzzle ->
                    (Callable<InsertOneModel<Document>>)
                            () -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
            ).collect(Collectors.toList());
    return EX_SVC.invokeAll(callables).stream()
            .map(fut -> {
                try {
                    return fut.get();
                } catch (ExecutionException | InterruptedException ex) {
                    throw new RuntimeException(ex);
                }
            }).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved that the second style was faster, I would still prefer to see and use the first style in code I had to maintain.
The first multi-map style is easier for others to understand, it is easier to maintain and easier to debug - for example adding peek stages for any stage of the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        // easy to make changes for debug, moving peek up/down
        // .peek(System.out::println)
        .map(Document::parse)
        // easy to filter:
        // .filter(this::somecondition)
        .map(InsertOneModel::new)
        .toList();
If your requirements change, such as needing to filter the output or to capture the intermediate data by splitting into two collections, the first approach beats the second every time.
I have read somewhere that a stream operation always returns a new collection at the terminal operation and doesn't change the original collection to which the stream operation has been applied.
But in my case the original list has been modified.
return subscriptions.stream()
        .filter(alertPrefSubscriptionsBO -> (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT || alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.SECONDARY_CONTACT))
        .map(alertPrefSubscriptionsBO -> {
            if (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT) {
                alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.PRIMARY);
            } else {
                alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.SECONDARY);
            }
            return alertPrefSubscriptionsBO;
        })
        .collect(groupingBy(AlertPrefSubscriptionsBO::isActiveStatus,
                groupingBy(AlertPrefSubscriptionsBO::getAlertLabel,
                        Collectors.mapping((AlertPrefSubscriptionsBO o) -> o.getType().getContactId(), toSet()))));
After this operation the subscriptions list has been modified so that it contains only AlertPrefContactTypeEnum.PRIMARY and AlertPrefContactTypeEnum.SECONDARY objects. I mean the size of the list remained the same, but the values changed.
That is because you are violating the contract of the map(Function<? super T,? extends R> mapper) method:
Parameters:
mapper - a non-interfering, stateless function to apply to each element
You're violating the "stateless" part:
Stateless behaviors
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline. An example of a stateful lambda is the parameter to map() in:
Set<Integer> seen = Collections.synchronizedSet(new HashSet<>());
stream.parallel().map(e -> { if (seen.add(e)) return 0; else return e; })...
Here, if the mapping operation is performed in parallel, the results for the same input could vary from run to run, due to thread scheduling differences, whereas, with a stateless lambda expression the results would always be the same.
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
The correct way to implement that map operation is to copy the alertPrefSubscriptionsBO and give the copy a new type.
Following the style used by the java.time classes (see, for example, all the withXxx(...) methods of ZonedDateTime), you would make or treat the alertPrefSubscriptionsBO object as immutable, and provide methods that return a copy with one property changed. For example, with a withType(...) method on the class and static imports of the AlertPrefContactTypeEnum constants, your code could be:
.map(bo -> bo.withType(bo.getType() == PRIMARY_CONTACT ? PRIMARY : SECONDARY))
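A minimal sketch of what such a withType method might look like follows. The Subscription class and ContactType enum are hypothetical stand-ins for AlertPrefSubscriptionsBO and AlertPrefContactTypeEnum, whose real definitions are not shown in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class WithTypeDemo {
    // Hypothetical stand-in for AlertPrefContactTypeEnum.
    enum ContactType { PRIMARY_CONTACT, SECONDARY_CONTACT, PRIMARY, SECONDARY }

    /** Hypothetical immutable stand-in for AlertPrefSubscriptionsBO. */
    static final class Subscription {
        private final String label;
        private final ContactType type;

        Subscription(String label, ContactType type) {
            this.label = label;
            this.type = type;
        }

        String getLabel() { return label; }
        ContactType getType() { return type; }

        /** Returns a copy with the given type; this object is left untouched. */
        Subscription withType(ContactType newType) {
            return new Subscription(label, newType);
        }
    }

    static List<Subscription> normalize(List<Subscription> subscriptions) {
        return subscriptions.stream()
                .map(s -> s.withType(s.getType() == ContactType.PRIMARY_CONTACT
                        ? ContactType.PRIMARY : ContactType.SECONDARY))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Subscription> original = Arrays.asList(
                new Subscription("a", ContactType.PRIMARY_CONTACT),
                new Subscription("b", ContactType.SECONDARY_CONTACT));
        List<Subscription> result = normalize(original);
        System.out.println(original.get(0).getType()); // PRIMARY_CONTACT
        System.out.println(result.get(0).getType());   // PRIMARY
    }
}
```

Because the mapper only creates new objects, the elements of the source list keep their original type.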
Does the placement of calls to sequential() and parallel() change how a Java 8 stream's pipeline is executed?
For example, suppose I have this code:
new ArrayList().stream().parallel().filter(...).count();
In this example, it's pretty clear that filter() will run in parallel. However, what if I have this code:
new ArrayList().stream().filter(...).parallel().count();
Does filter() still run in parallel or does it run sequentially? The reason it's not clear is because intermediate operations like filter() are lazy, i.e., they won't run until a terminal operation is invoked like count(). As such, by the time count() is invoked, we have a parallel stream pipeline but is filter() performed sequentially because it came before the call to parallel()?
Note the end of the Stream’s class documentation:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution. (For example, Collection.stream() creates a sequential stream, and Collection.parallelStream() creates a parallel one.) This choice of execution mode may be modified by the BaseStream.sequential() or BaseStream.parallel() methods, and may be queried with the BaseStream.isParallel() method.
In other words, calling sequential() or parallel() only changes a property of the stream and its state at the point when the terminal operation is commenced determines the execution mode of the entire pipeline.
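A quick way to see that the mode is a property of the whole pipeline is to query isParallel() after mixing the calls. This sketch reflects the behavior of the standard JDK implementation; the class and method names are made up.

```java
import java.util.stream.Stream;

final class ExecutionModeDemo {
    /** parallel() is called after filter(); the pipeline as a whole is still parallel. */
    static boolean filterThenParallel() {
        return Stream.of(1, 2, 3, 4).filter(x -> x % 2 == 0).parallel().isParallel();
    }

    /** sequential() is the last mode switch, so it overrides the earlier parallel(). */
    static boolean parallelThenSequential() {
        return Stream.of(1, 2, 3, 4).parallel().filter(x -> x % 2 == 0).sequential().isParallel();
    }

    public static void main(String[] args) {
        System.out.println(filterThenParallel());     // true
        System.out.println(parallelThenSequential()); // false
    }
}
```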
This might not be documented that clearly everywhere, because it wasn't always so. In early development there were prototypes with a different mode for each stage. This mail from March 2013 explains the change.
It appears that at least in the standard Oracle Java 8 implementation, although the parallel() method is defined as an "intermediate operation", it is not exactly lazy. That is, it has an immediate effect, regardless of whether you have a terminal operation or not. Consider the following example:
import java.util.stream.Stream;

public class SimpleTest {
    public static void main(String[] args) {
        Stream<Integer> s = Stream.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(s.isParallel());
        Stream<Integer> s1 = s.parallel();
        System.out.println(s.isParallel());
        System.out.println(s == s1);
    }
}
The output on my machine is:
false
true
true
Which tells us that parallel() immediately changes the state of the underlying stream (and returns that stream).
However, the Javadoc is written in such a way that it allows this, but does not require this. Which means that other stream implementations are free to execute the operations before the parallel() operations in a different execution mode than those after it.
In short, it's not a behavior you can rely on, either way.
In Aggregating with Streams, Brian Goetz compares populating a collection using Stream.collect() and doing the same using Stream.forEach(), with the following two snippets:
Set<String> uniqueStrings = strings.stream()
        .collect(HashSet::new,
                 HashSet::add,
                 HashSet::addAll);
And,
Set<String> set = new HashSet<>();
strings.stream().forEach(s -> set.add(s));
Then he explains:
The key difference is that, with the forEach() version, multiple threads are trying to access a single result container simultaneously, whereas with parallel collect(), each thread has its own local result container, the results of which are merged afterward.
To my understanding, multiple threads would be working in the forEach() case only if the stream is parallel. However, in the example given, forEach() is operating on a sequential stream (no call to parallelStream()).
So, is it that forEach() always works in parallel, or that the code snippet should call parallelStream() instead of stream()? (Or am I missing something?)
No, forEach() doesn't parallelize if the stream isn't parallel. I think he simplified the example for the sake of discussion.
As evidence, this code is inside the AbstractPipeline class's evaluate method (which is called from forEach)
return isParallel()
? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
: terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
The whole quote goes as follows:
Just as reduction can parallelize safely provided the combining function is associative and free of interfering side effects, mutable reduction with Stream.collect() can parallelize safely if it meets certain simple consistency requirements (outlined in the specification for collect()).
And then what you've quoted:
The key difference is that, with the forEach() version, multiple threads are trying to access a single result container simultaneously, whereas with parallel collect(), each thread has its own local result container, the results of which are merged afterward.
Since the first sentence clearly speaks of parallelization, my understanding is that both forEach() and collect() are spoken of in the context of parallel streams.
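In that parallel context, the collect() form from the article can be sketched as below, using parallelStream() explicitly. Each worker fills its own HashSet and the sets are merged with addAll, so no synchronization is needed. The class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelCollectDemo {
    static Set<String> uniqueStrings(List<String> strings) {
        // supplier: new container per task; accumulator: add one element;
        // combiner: merge two partial containers.
        return strings.parallelStream()
                .collect(HashSet::new, HashSet::add, HashSet::addAll);
    }

    public static void main(String[] args) {
        // 10,000 strings with only 100 distinct values.
        List<String> strings = IntStream.range(0, 10_000)
                .mapToObj(i -> "s" + (i % 100))
                .collect(Collectors.toList());
        System.out.println(uniqueStrings(strings).size()); // 100
    }
}
```

The forEach() variant that adds to one shared HashSet would be a data race on a parallel stream, which is the situation the article warns about.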
In the javadoc for the stream package, at the end of the section Parallelism, I read:
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
I mean, I know it is possible, especially when using sequential streams, but the same javadoc clearly states:
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
And also:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?
A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.
Examples of stateful stream lambdas:
collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).
So, Collector and Consumer are two lambda interfaces that by definition are stateful.
All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.
I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
Suppose the following scenario. You have a Stream<String> and you need to list the items in natural order, prefixing each one with its order number. So, for example, on input you have: Banana, Apple and Grape. The output should be:
1. Apple
2. Banana
3. Grape
How do you solve this task with the Java Stream API? Pretty easily:
List<String> f = asList("Banana", "Apple", "Grape");
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
        .sorted()
        .sequential()
        .map(i -> String.format("%d. %s", number.incrementAndGet(), i))
        .collect(Collectors.joining("\n"));
Now if you look at this pipeline you'll see 3 stateful operations:
sorted(): stateful by definition. See the documentation of Stream.sorted():
This is a stateful intermediate operation
map(): by itself it could be stateless or not, but in this case it is not. To label positions you need to keep track of how many items have already been labeled;
collect(): a mutable reduction operation (from the docs of Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
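For comparison, the same numbered output can be produced without a stateful mapper by sorting first and then streaming over the indices. This is only an alternative sketch, not part of the answer's approach; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class NumberedListDemo {
    static String numbered(List<String> items) {
        List<String> sorted = items.stream().sorted().collect(Collectors.toList());
        // The mapper depends only on its own index, so it keeps no state between calls.
        return IntStream.range(0, sorted.size())
                .mapToObj(i -> (i + 1) + ". " + sorted.get(i))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(numbered(Arrays.asList("Banana", "Apple", "Grape")));
    }
}
```

Since each label is derived from the index alone, this version would give the same result on a parallel stream.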
There is some controversy about why sorted() is stateful. From the Stream API documentation:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
So when applying the terms stateful/stateless to the Stream API, we're talking about the function that processes a single element of a stream, not about a function that processes the stream as a whole.
Also note that there is some confusion between the terms stateless and deterministic. They are not the same.
A deterministic function provides the same result given the same arguments.
A stateless function retains no state from previous calls.
Those are different definitions, and in the general case they don't depend on each other. Determinism is about a function's result value, while statelessness is about a function's implementation.
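To illustrate that the two properties are independent under these definitions, the sketch below pairs a function that keeps state (a cache) yet always returns the same result for the same argument with one that keeps no state yet returns a different value on later calls. All names are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StateVsDeterminismDemo {
    private static final Map<Integer, Integer> CACHE = new ConcurrentHashMap<>();

    /** Stateful (remembers earlier calls in CACHE) but deterministic. */
    static int cachedSquare(int x) {
        return CACHE.computeIfAbsent(x, k -> k * k);
    }

    static int cacheSize() {
        return CACHE.size();
    }

    /** Stateless (retains nothing between calls) but not deterministic. */
    static long now(int ignored) {
        return System.nanoTime();
    }

    public static void main(String[] args) {
        System.out.println(cachedSquare(7)); // 49
        System.out.println(cachedSquare(7)); // 49, served from the cache
        System.out.println(now(7));          // varies from call to call
    }
}
```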
When in doubt, simply check the documentation of the specific operation. Examples:
Stream.map mapper parameter:
mapper - a non-interfering, stateless function to apply to each element
Here documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:
action - a non-interfering action to perform on the elements
Here it's not specified that the action is stateless, thus it can be stateful.
In general, this is stated explicitly in each method's documentation.
A stateless function returns the same output for the same inputs, "no matter what".
It's easy to create non-stateless functions in an imperative language like Java. e.g.
func = input -> currentTime();
If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).
If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.
Note that "no matter what" implies that a stateless function must be thread-safe.
If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.
If func has no "important" side effects, it's safe to invoke func arbitrarily. For example, stream.map(func) can safely invoke func multiple times even on the same element. (But don't worry, Stream is never gonna do that).
What is an "important" side effect? That is very subjective.
At the very least, invoking func will cost some CPU time, which is not exactly free. This might be concerning for performance-critical applications, or on expensive platforms (cough AWS).
If func logs something to the hard disk, it may or may not be an "important" side effect. (It too costs $$)
If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.
Now, forget about money. Purely from the application logic point of view, func could mutate some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses the list, the resulting list will depend on how func is invoked at runtime. This is frowned upon by functional programmers.
If we do stream.forEach( e->log(e) ), is it stateless? We can consider it stateless if
we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic