The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
    .filter(x -> [some boolean expression])
    .forEach(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        } else {
            map.put(x, aStringFunction(x));
        }
    });
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous about whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
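(For reference, the three-argument toMap mentioned above would have looked roughly like this. This is only a sketch, reusing the placeholder names from my snippet: someBooleanExpression stands in for the original filter, and it assumes UserDefinedException is unchecked so it can be thrown from the merge function.)
// Sketch only: the merge function runs whenever two elements map to the
// same key; here it reproduces the "fail on duplicates" behavior.
Map<SomeClass, String> map = list.stream()
    .filter(x -> someBooleanExpression(x))
    .collect(Collectors.toMap(
        x -> x,                       // key
        x -> aStringFunction(x),      // value
        (a, b) -> { throw new UserDefinedException("duplicates detected in input"); }));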
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization or concurrent-safe collections. The current code is therefore safe. Note, however, that it just looks dirty.
If you rely on your forEach being executed sequentially, consider using forEachOrdered instead, even if the stream is sequential. Not only will that get you an explicit guarantee from the API that the code will be executed sequentially, it will also make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
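Applied to the question's code, that change is small (a sketch, reusing the question's placeholder names):
// Same pipeline as in the question, but with the explicit
// encounter-order guarantee of forEachOrdered.
list.stream()
    .filter(x -> someBooleanExpression(x))   // placeholder for the original filter
    .forEachOrdered(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        }
        map.put(x, aStringFunction(x));
    });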
Related
Introduction
I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering if it's possible to make it more multi-threaded.
Several small maps
I was wondering if using multiple map() calls might allow java.util.Collection.parallelStream() to distribute the tasks better:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
    .map(gson::toJson)
    .map(Document::parse)
    .map(InsertOneModel::new)
    .toList();
Single big map
For example, a better distribution than:
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
    .map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
    .toList();
Question
Is one of these solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?
I looked into the Stream source code. The result of a map operation is just fed into the next operation, so there is almost no difference between one big map() call and several small map() calls.
And for the map() operation, a parallel Stream makes no difference at all: each input object will be processed all the way through the chain by the same thread in any case.
Also note: A parallel Stream only splits up the work if the operation chain allows it and there is enough data to process. So for a small Collection or a Collection that allows no random access, a parallel Stream behaves like a sequential Stream.
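A tiny self-contained sketch makes that fusion visible: each element flows through the whole chain before the next element is pulled, so several small map() calls behave like one big one.
import java.util.stream.Stream;

public class MapFusionDemo {
    public static void main(String[] args) {
        Stream.of(1, 2, 3)
              .map(i -> { System.out.println("first map:  " + i); return i * 2; })
              .map(i -> { System.out.println("second map: " + i); return i + 1; })
              .forEach(i -> System.out.println("terminal:   " + i));
        // Output interleaves per element (first map: 1, second map: 2, terminal: 3,
        // then first map: 2, ...): the stages are fused, not run as separate passes.
    }
}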
I don't think it will do any better if you chain multiple maps. If your code is not very complex, I would prefer a single big map.
To understand this, we have to check the code inside the map function (from the JDK's ReferencePipeline source):
public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
    Objects.requireNonNull(mapper);
    return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
        @Override
        Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
            return new Sink.ChainedReference<P_OUT, R>(sink) {
                @Override
                public void accept(P_OUT u) {
                    downstream.accept(mapper.apply(u));
                }
            };
        }
    };
}
As you can see, a lot happens behind the scenes. Multiple objects are created and multiple methods are called, and all of this is repeated for each chained map call.
Now, coming back to parallel streams: they work on the concept of parallelism.
From the Streams documentation:
A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
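For illustration, the property has to take effect before the common pool is first touched, so set it very early (or pass it as a -D JVM flag). A sketch:
// Sketch: resize the common pool via the property quoted above.
// Equivalent JVM flag: -Djava.util.concurrent.ForkJoinPool.common.parallelism=12
public class PoolSizeDemo {
    public static void main(String[] args) {
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "12");
        System.out.println(java.util.concurrent.ForkJoinPool.commonPool().getParallelism());
    }
}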
A parallel stream calls spliterator() on the collection object, which returns a Spliterator implementation that provides the logic for splitting a task. Every source or collection has its own spliterator implementation. Using these spliterators, the parallel stream splits the task as long as possible, and finally, when a task becomes too small, executes it sequentially and merges the partial results from all the subtasks.
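A rough sketch of that splitting mechanism (List.of requires Java 9+; trySplit may return null when the source cannot be split further):
// Sketch: what the framework does under the hood with a Spliterator.
java.util.Spliterator<Integer> whole = java.util.List.of(1, 2, 3, 4).spliterator();
java.util.Spliterator<Integer> half  = whole.trySplit(); // covers a prefix, or null
// `half` now holds part of the elements and `whole` the rest; a parallel
// stream keeps splitting like this until the chunks are small enough.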
So I would prefer parallelStream when:
I have a huge amount of data to process at a time
I have multiple cores to process the data
there are performance issues with the existing implementation
I don't already have other multi-threaded processes running, as parallelism would add to the complexity.
Performance Implications
Overhead: Sometimes, when the dataset is small, converting a sequential stream into a parallel one results in worse performance. The overhead of managing threads, sources and results is more expensive than doing the actual work.
Splitting: Arrays can split cheaply and evenly, while LinkedList has none of these properties. TreeMap and HashSet split better than LinkedList, but not as well as arrays.
Merging: The merge operation is really cheap for some operations, such as reduction and addition, but merge operations like grouping to sets or maps can be quite expensive.
Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
The three steps (toJson/parse/new) have to be executed sequentially, so all you're effectively doing is comparing s.map(g.compose(f)) and s.map(f).map(g). By virtue of being a monad, Java Streams are functors, and the 2nd functor law states that, in essence, s.map(g.compose(f)) == s.map(f).map(g), meaning that the two alternative ways of expressing the computation will produce identical results. From a performance standpoint the difference between the two is likely to be minimal.
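To see the law concretely, a minimal sketch (assuming puzzles, gson and Document from the question, and Java 16+ for toList()):
// Sketch: two equivalent ways to express the same computation.
// Imports assumed: java.util.List, java.util.function.Function
Function<Puzzle, String> f = gson::toJson;
Function<String, Document> g = Document::parse;

List<Document> viaTwoMaps = puzzles.stream().map(f).map(g).toList();
List<Document> viaCompose = puzzles.stream().map(g.compose(f)).toList();
// Both lists hold equal documents in the same encounter order.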
However, in general you should be careful with Collection.parallelStream. It uses the common ForkJoinPool, essentially a fixed pool of threads shared across the entire JVM, whose size is determined by the number of cores on the host. The problem with using the common pool is that other threads in the same process may be using it at the same time as your code. This can lead to your code randomly and inexplicably slowing down, for example if another part of the code has temporarily exhausted the common pool.
It is preferable to create your own ExecutorService using one of the factory methods on Executors, and then submit your tasks to that.
private static final ExecutorService EX_SVC = Executors.newFixedThreadPool(16);

public static List<InsertOneModel<Document>> process(Stream<Puzzle> puzzles) throws InterruptedException {
    final Collection<Callable<InsertOneModel<Document>>> callables =
        puzzles.map(puzzle ->
            (Callable<InsertOneModel<Document>>)
                () -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle)))
        ).collect(Collectors.toList());

    return EX_SVC.invokeAll(callables).stream()
        .map(fut -> {
            try {
                return fut.get();
            } catch (ExecutionException | InterruptedException ex) {
                throw new RuntimeException(ex);
            }
        }).collect(Collectors.toList());
}
I doubt that there is much difference in performance, but even if you proved that one had quicker performance, I would still prefer to see and use the first style in code I had to maintain.
The first, multi-map style is easier for others to understand, easier to maintain and easier to debug - for example, by adding peek stages at any point in the processing chain.
List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
    .map(gson::toJson)
    // easy to make changes for debug, moving peek up/down
    // .peek(System.out::println)
    .map(Document::parse)
    // easy to filter:
    // .filter(this::somecondition)
    .map(InsertOneModel::new)
    .toList();
If your requirements change - such as needing to filter the output, or to capture the intermediate data by splitting it into two collections - the first approach beats the second every time.
There is a question on whether Java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be safely performed, and defensive code guards are not possible because Streams do not reveal if they are infinite or unordered.
If parallelism were a problem for the operations I want to perform on a Stream, I could call isParallel() to check, or call sequential() to make sure the computation is not parallel (if I remember to).
But if orderedness or finiteness (sizedness) is relevant to the safety of my program, I cannot write such safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
    public Stream<Integer> coordinates();

    // example implementations:

    // finite, ordered, sequential
    // IntStream.range(0, 100).boxed()

    // infinite, unordered, sequential
    // final AtomicInteger atomic = new AtomicInteger();
    // Stream.generate(() -> atomic.incrementAndGet())

    // infinite, unordered, parallel
    // Stream.generate(() -> atomic.incrementAndGet()).parallel()

    // finite, ordered, sequential, should-be-closed
    // Files.lines(Path.of("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems that if I want to write the elements to a file as a side effect, I need to be concerned about the stream being parallel:
// if the stream is parallel, in which order will the elements be written to the file?
coordinates().peek(i -> writeToFile(i)).count();
// how should I remember to always add sequential() in such cases?
And if it is parallel, on which thread pool does it run?
If I want to sort the stream (or apply other non-short-circuiting operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. How can I guard against this, other than by not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (or from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to check whether a stream has already been consumed.
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like:
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
    stream,
    /* max threshold of elements before IllegalArgumentException */
    10_000,
    /* fail with IllegalArgumentException if not ordered */
    true
);
After looking at things a bit (some experimentation, and here), as far as I can see there is no way to know definitively whether a stream is finite or not.
More than that, sometimes it is not even determined until runtime (such as, in Java 11, IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty if it is finite, in a few ways (notice that receiving false on these does not mean it is infinite, only that it may be so):
stream.spliterator().getExactSizeIfKnown() - if the spliterator has a known exact size, the stream is finite; otherwise this returns -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - if it is SIZED, this will return true (both checks are combined in the sketch below).
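Note that calling spliterator() consumes the stream, so after inspecting it you have to rebuild the stream from the spliterator. A minimal sketch (coordinates() as in the question):
// Sketch: inspect the stream's spliterator, then rebuild the stream from it.
// Imports assumed: java.util.Spliterator, java.util.stream.Stream, java.util.stream.StreamSupport
Spliterator<Integer> sp = coordinates().spliterator(); // consumes the original stream
boolean knownFinite = sp.getExactSizeIfKnown() >= 0
                   || sp.hasCharacteristics(Spliterator.SIZED);
boolean ordered     = sp.hasCharacteristics(Spliterator.ORDERED);
Stream<Integer> rebuilt = StreamSupport.stream(sp, false); // false = sequential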
You can safe-guard yourself, by assuming the worst (depends on your case).
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With potentially infinite stream, assume your worst case on each scenario.
For example assume you want listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain number of them (similar to a timeout). For this, I'd recommend always calling stream.limit(x) before calling your terminal operation (collect or allMatch or similar), where x is the number of elements you're willing to tolerate.
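For example (a sketch; the bound is an arbitrary choice):
// Sketch: bound a potentially infinite stream before a non-short-circuiting operation.
List<Integer> sample = coordinates()
    .limit(10_000)                 // worst-case cap, pick any bound you can tolerate
    .collect(Collectors.toList()); // now guaranteed to terminate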
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.
In Aggregating with Streams, Brian Goetz compares populating a collection using Stream.collect() and doing the same using Stream.forEach(), with the following two snippets:
Set<String> uniqueStrings = strings.stream()
.collect(HashSet::new,
HashSet::add,
HashSet::addAll);
And,
Set<String> set = new HashSet<>();
strings.stream().forEach(s -> set.add(s));
Then he explains:
The key difference is that, with the forEach() version, multiple threads are trying to access a single result container simultaneously, whereas with parallel collect(), each thread has its own local result container, the results of which are merged afterward.
To my understanding, multiple threads would be working in the forEach() case only if the stream is parallel. However, in the example given, forEach() is operating on a sequential stream (no call to parallelStream()).
So, is it that forEach() always works in parallel, or should the code snippet call parallelStream() instead of stream()? (Or am I missing something?)
No, forEach() doesn't parallelize if the stream isn't parallel. I think he simplified the example for the sake of discussion.
As evidence, this code is inside the AbstractPipeline class's evaluate method (which is called from forEach)
return isParallel()
? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
: terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
The whole quote goes as follows:
Just as reduction can parallelize safely provided the combining function is associative and free of interfering side effects, mutable reduction with Stream.collect() can parallelize safely if it meets certain simple consistency requirements (outlined in the specification for collect()).
And then what you've quoted:
The key difference is that, with the forEach() version, multiple threads are trying to access a single result container simultaneously, whereas with parallel collect(), each thread has its own local result container, the results of which are merged afterward.
Since the first sentence clearly speaks of parallelization, my understanding is that both forEach() and collect() are spoken of in the context of parallel streams.
I have two loops. In the inner loop, I hit a database, get the result, perform some computations on the result (which involves calling other private methods) and put the result in a map.
Will this approach cause any problem like putting null for any of the keys?
No two threads will update the same value, i.e. the key that is computed will be unique. (If it loops n times, there will be n keys.)
Map<String, String> m = new ConcurrentHashMap<>();
obj1.getProp().parallelStream().forEach(k1 -> {   // obj1.getProp() returns a list
    obj2.parallelStream().forEach(k2 -> {         // obj2 is a list
        String key = constructKey(k1, k2);
        // Hit a DB and get the result
        // Computations on the result
        // Call some other methods
        m.put(key, result);
    });
});
You should not use the Stream API unless you've fully understood that it is more than an alternative spelling for loops. Generally, if your code contains a forEach on a stream, you should ask yourself at least once whether this is really the best solution for your task; but if your code contains nested forEach calls, you should know that it can't be the right thing.
It might work, as when adding to a concurrent map like in your question; however, it defeats the purpose of the Stream API.
Besides that, arrays don’t have a parallelStream() method, thus, when the result type of obj.getProp() and the type of obj2 are arrays, as your comments say, you have to use Arrays.stream(…) to construct a stream.
What you want to do can be implemented as
Map<String, String> m =
    Arrays.stream(obj1.getProp()).parallel()
          .flatMap(k1 -> Arrays.stream(obj2).map(k2 -> constructKey(k1, k2)))
          .collect(Collectors.toConcurrentMap(key -> key, key -> {
              // Hit a DB and get the result
              // Computations on the result
              // Call some other methods
              return result;
          }));
The benefit of this is not only a better utilization of parallel processing, but also that it works even if you use Collectors.toMap, creating a non-concurrent Map, instead of Collectors.toConcurrentMap; the framework will take care of producing it in a thread-safe manner.
So unless you definitely need a concurrent map for subsequent concurrent processing, you can use either; which one will perform better depends on factors whose discussion would exceed the scope of this answer.
So with the correct usage of the Stream API, it will be thread safe, regardless of which Map type you produce, and the remaining question is whether the database access is thread safe, which, as already explained in this answer, depends on a lot of factors that you didn't include in your question, so we can't answer that.
Your question boils down to the parts "can I add to a concurrent hash map from multiple threads?" and "can I access my database in parallel?"
The answer to the first is: "yes", the answer to the second is "it depends"
Or a little longer: the two parallel streams which you use basically just start the inner lambda on multiple threads in the execution pool. The adding to the map itself is not a problem, that is what the concurrent hash map was made for.
Regarding the database, it depends on how you query it and on which level you share the object. If you use a connection pool with a different connection for each thread, you will probably be fine. For most databases, sharing a connection and getting a new statement per thread is also fine. Sharing a statement and getting a new result set leads to problems for quite a number of database drivers.
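To illustrate the "different connection for each thread" case: a sketch only, not the asker's code, assuming a pool-backed javax.sql.DataSource (e.g. HikariCP) so each parallel worker borrows its own connection.
// Sketch: no JDBC object is shared between threads; each call borrows
// a fresh connection from the pool and returns it when done.
// Imports assumed: javax.sql.DataSource, java.sql.*
String lookup(DataSource dataSource, String key) throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement ps = conn.prepareStatement("SELECT v FROM t WHERE k = ?")) {
        ps.setString(1, key);
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}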
In the javadoc for the stream package, at the end of the section Parallelism, I read:
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
I have hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
I mean, I know it is possible, especially when using sequential streams, but the same javadoc clearly states:
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
And also:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?
A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.
Examples of stateful stream lambdas:
collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).
So, Collector and Consumer are two lambda interfaces that by definition are stateful.
All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.
I have hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
Suppose the following scenario: you have a Stream<String> and you need to list the items in natural order, prefixing each one with its order number. For example, on input you have Banana, Apple and Grape. The output should be:
1. Apple
2. Banana
3. Grape
How do you solve this task with the Java Stream API? Pretty easily:
List<String> f = asList("Banana", "Apple", "Grape");
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
    .sorted()
    .sequential()
    .map(i -> String.format("%d. %s", number.incrementAndGet(), i))
    .collect(Collectors.joining("\n"));
Now if you look at this pipeline you'll see 3 stateful operations:
sorted() – stateful by definition. See the documentation for Stream.sorted():
This is a stateful intermediate operation
map() – by itself could be stateless or not, but in this case it is not. To label positions, you need to keep track of how many items have already been labeled;
collect() – is a mutable reduction operation (from the docs for Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
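For contrast, the same numbering task can be restructured to avoid the mutable AtomicInteger entirely, along the lines the javadoc recommends. A sketch (sort into a list first, then number by index; assumes java.util.stream.IntStream):
// Sketch: stateless alternative - no shared counter is mutated.
List<String> sorted = f.stream().sorted().collect(Collectors.toList());
String result = IntStream.range(0, sorted.size())
    .mapToObj(i -> String.format("%d. %s", i + 1, sorted.get(i)))
    .collect(Collectors.joining("\n"));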
There is some controversy about why sorted() is stateful. From the Stream API documentation:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
So when applying the term stateful/stateless to the Stream API, we're talking more about the function processing an element of the stream, not about a function processing the stream as a whole.
Also note that there is some confusion between terms stateless and deterministic. They are not the same.
A deterministic function provides the same result given the same arguments.
A stateless function retains no state from previous calls.
Those are different definitions, and in the general case they don't depend on each other. Determinism is about a function's result value, while statelessness is about a function's implementation.
When in doubt, simply check the documentation for the specific operation. Examples:
Stream.map mapper parameter:
mapper - a non-interfering, stateless function to apply to each element
Here documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:
action - a non-interfering action to perform on the elements
Here it's not specified that the action is stateless, thus it can be stateful.
In general it's always explicitly written on every method documentation.
A stateless function returns the same output for the same inputs, "no matter what".
It's easy to create non-stateless functions in an imperative language like Java. e.g.
func = input -> currentTime();
If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).
If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.
Note that "no matter what" implies that a stateless function must be thread-safe.
If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.
If func has no "important" side effects, it's safe to invoke func arbitrarily. For example, stream.map(func) can safely invoke func multiple times even on the same element. (But don't worry, Stream is never gonna do that.)
What is an "important" side effect? That is very subjective.
At the very least, invoking func will cost some CPU time, which is not exactly free. This might be concerning for performance-critical applications, or on expensive platforms (cough AWS).
If func logs something to the hard disk, it may or may not be an "important" side effect. (It too costs $$.)
If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.
Now, forget about money. Purely from an application-logic point of view, func could cause mutation to some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses the list, the resulting list will depend on how func was invoked at runtime. This is frowned upon by functional programmers.
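A tiny sketch of that problematic pattern:
// Sketch: func returns the same output for the same input, yet is not
// "stateless" - it mutates a list that the application reads later.
// Imports assumed: java.util.*, java.util.function.Function, java.util.stream.*
List<String> seen = new ArrayList<>();
Function<String, String> func = s -> { seen.add(s); return s.toUpperCase(); };
List<String> out = Stream.of("b", "a").map(func).collect(Collectors.toList());
// The contents and order of `seen` depend on how map() invoked func.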
If we do stream.forEach(e -> log(e)), is it stateless? We can consider it stateless if:
we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic