In which cases should Stream operations be stateful? - java

In the javadoc for the stream package, at the end of the section Parallelism, I read:
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
I mean, I know it is possible, especially when using sequential streams, but the same javadoc clearly states:
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
And also:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?
A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.

Examples of stateful stream lambdas:
collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).
So, Collector and Consumer are two lambda interfaces that by definition are stateful.
All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.

I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
Suppose the following scenario. You have a Stream<String> and you need to list the items in natural order, prefixing each one with its order number. So, for example, on input you have: Banana, Apple and Grape. The output should be:
1. Apple
2. Banana
3. Grape
How do you solve this task with the Java Stream API? Pretty easily:
List<String> f = asList("Banana", "Apple", "Grape");
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
.sorted()
.sequential()
.map(i -> String.format("%d. %s", number.incrementAndGet(), i))
.collect(Collectors.joining("\n"));
Now if you look at this pipeline you'll see 3 stateful operations:
sorted() – stateful by definition. See the documentation of Stream.sorted():
This is a stateful intermediate operation
map() – by itself could be stateless or not, but in this case it is not. To label positions you need to keep track of how many items have already been labeled;
collect() – is mutable reduction operation (from docs to Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
There is some controversy about why sorted() is stateful. From the Stream API documentation:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
So when applying the terms stateful/stateless to the Stream API, we're talking about the function that processes an element of the stream, not about the function that processes the stream as a whole.
Also note that there is some confusion between the terms stateless and deterministic. They are not the same.
A deterministic function provides the same result given the same arguments.
A stateless function retains no state from previous calls.
Those are different definitions, and in the general case they don't depend on each other. Determinism is about the function's result value, while statelessness is about the function's implementation.
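For comparison, the numbering task above can also be written without a shared counter. This is only a sketch, and the class and method names (NumberedLines, numbered) are made up for illustration: it sorts into a list first and then derives each number from the element's index, so the mapping function keeps no state between calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class NumberedLines {
    // Sorts the items, then numbers them by index instead of by a shared counter.
    static String numbered(List<String> items) {
        List<String> sorted = items.stream().sorted().collect(Collectors.toList());
        return IntStream.range(0, sorted.size())
                .mapToObj(i -> String.format("%d. %s", i + 1, sorted.get(i)))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(numbered(Arrays.asList("Banana", "Apple", "Grape")));
    }
}
```

Because each label depends only on the index, this version gives the same result whether the IntStream runs sequentially or in parallel.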

When in doubt, simply check the documentation of the specific operation. Examples:
Stream.map mapper parameter:
mapper - a non-interfering, stateless function to apply to each element
Here documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:
action - a non-interfering action to perform on the elements
Here it's not specified that the action is stateless, thus it can be stateful.
In general, this is stated explicitly in each method's documentation.

A stateless function returns the same output for the same inputs, "no matter what".
It's easy to create non-stateless functions in an imperative language like Java. e.g.
func = input -> currentTime();
If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).
If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.
Note that "no matter what" implies that a stateless function must be thread-safe.
If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.
If func has no "important" side effects, it's safe to invoke func arbitrarily. For example, stream.map(func) can safely invoke func multiple times even on the same element. (But don't worry, Stream is never gonna do that).
What is an "important" side effect? That is very subjective.
At the very least, invoking func will cost some CPU time, which is not exactly free. This might be concerning for performance critical applications; or on expensive platforms (cough AWS).
If func logs something on the hard disk, it may or may not be an "important" side effect. (It too costs $$)
If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.
Now, forget about money. Purely from an application logic point of view, func could cause mutation to some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses the list, the resulting list will depend on how func is invoked at runtime. This is frowned upon by functional programmers.
If we do stream.forEach( e->log(e) ), is it stateless? We can consider it stateless if
we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic
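To make the "func adds each element to a list" case concrete, here is a small sketch (the class and method names are invented for this example). The first method fills a list as a side effect of map on a parallel stream, so the order of that list depends on thread scheduling. The second method lets collect build the list, which keeps the encounter order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SideEffectDemo {
    // Side-effecting mapper: the order of 'seen' depends on how threads are scheduled.
    static List<Integer> viaSideEffect(int n) {
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, n).parallel().boxed()
                .map(e -> { seen.add(e); return e; })
                .forEach(e -> { });
        return seen;
    }

    // No side effects in the pipeline: collect() assembles the list in encounter order.
    static List<Integer> viaCollect(int n) {
        return IntStream.range(0, n).parallel().boxed()
                .collect(Collectors.toList());
    }
}
```

Running viaSideEffect a few times with a large n will typically show differently ordered lists, while viaCollect returns the same list every time.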

Related

Alternative to peek in Java streams

There are lots of questions regarding peek in the Java Streams API. I'm looking for a way to complete the following common pattern using Java Streams. I can make it work with Streams, but it is non-obvious, which means it is slightly dangerous without a comment, which is not ideal.
boolean anyPricingComponentsChanged = false;
for (var pc : plan.getPricingComponents()) {
if (pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0) {
anyPricingComponentsChanged = true;
pc.setValidTill(dateNow);
}
}
My option:
long numberChanged = plan.getPricingComponents()
.stream()
.filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
.peek(pc -> pc.setValidTill(dateNow))
.count(); //`count` rather than `findAny` to ensure that `peek` processes all components.
boolean anyPricingComponentsChanged = numberChanged != 0;
As an aside, whilst compareTo is not an expensive operation here and consistently returns the same result, in other cases this might not be true, and I'd rather avoid running it multiple times for this pattern.
// to ensure that peek processes all components
You can't really ensure that peek() would process all the stream elements that should be modified. In some cases, this operation can be elided from the pipeline, and you should not perform any important actions via peek().
Here's a quote from the documentation of peek():
API Note:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline ...
In cases where the stream implementation is able to optimize away the production of some or all the elements (such as with short-circuiting operations like findFirst, or in the example described in count()), the action will not be invoked for those elements.
Also, here's what Stream API documentation says regarding Side-effects:
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:
the visibility of those side-effects to other threads;
that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation.
...
The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation. (For a specific example see the API note documented on the count operation.)
Emphasis added
Since peek is not meant to contribute to the result of the stream execution, Stream implementations are free to throw it away.
Instead of relying on peek() you can do the following:
List<PricingComponent> componentsToChange = plan.getPricingComponents()
.stream()
.filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
.toList();
componentsToChange.forEach(pc -> pc.setValidTill(dateNow));
boolean anyPricingComponentsChanged = componentsToChange.size() != 0;
If you don't want to materialize the objects that need to be modified as a List, then stick with a for-loop.
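For reference, here is a self-contained sketch of the collect-then-mutate approach shown above. The PricingComponent class and the use of LocalDate for validTill are assumptions made only for illustration, since the question doesn't show the real types. It uses Collectors.toList() so that it also compiles on Java versions before 16.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

final class PricingUpdate {
    // Hypothetical stand-in for the question's PricingComponent.
    static final class PricingComponent {
        private LocalDate validTill;
        PricingComponent(LocalDate validTill) { this.validTill = validTill; }
        LocalDate getValidTill() { return validTill; }
        void setValidTill(LocalDate validTill) { this.validTill = validTill; }
    }

    // Selects the components to change first, then mutates them outside the pipeline.
    static boolean expire(List<PricingComponent> components, LocalDate dateNow) {
        List<PricingComponent> toChange = components.stream()
                .filter(pc -> pc.getValidTill() == null
                        || pc.getValidTill().compareTo(dateNow) <= 0)
                .collect(Collectors.toList());
        toChange.forEach(pc -> pc.setValidTill(dateNow));
        return !toChange.isEmpty();
    }
}
```

The mutation happens in Iterable.forEach on the collected list, not inside a stream stage, so nothing here can be elided by the stream implementation, and the filter predicate runs once per element.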
Note
The quotes above from the API documentation, such as "stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation", apply to any intermediate operation with an embedded side-effect. Either the side-effect can be elided, or the whole pipeline stage (stream operation) can be optimized away if it has no impact on the result. And to be on the same page regarding the terminology: in short, a side-effect is anything that a function does apart from producing the required result (e.g. i -> { side-effect; return i * 2; }).
Although it's not advisable to give peek() an action that must be executed under all circumstances, at least this choice doesn't contradict the semantics of peek. By contrast, performing side-effects via filter, map, or other operations that are not designed to operate through side-effects not only doesn't resolve the problem, but is also weird, since it goes against the semantics of these operations and violates the Principle of least astonishment.

How to safely consume Java Streams without isFinite() and isOrdered() methods?

There is a question on whether Java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be safely performed, and defensive code guards are not possible because Streams do not reveal if they are infinite or unordered.
If parallelism were a problem for the operations I want to perform on a Stream, I can call isParallel() to check, or sequential() to make sure the computation is not run in parallel (if I remember to).
But if orderedness or finiteness (sizedness) were relevant to the safety of my program, I cannot write safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
public Stream<Integer> coordinates();
// example implementations:
// finite, ordered, sequential
// IntStream.range(0, 100).boxed()
// infinite, unordered, sequential
// final AtomicInteger atomic = new AtomicInteger();
// Stream.generate(() -> atomic.incrementAndGet())
// infinite, unordered, parallel
// Stream.generate(() -> atomic.incrementAndGet()).parallel()
// finite, ordered, sequential, should-be-closed
// Files.lines(Paths.get("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems that if I want to write the elements to a file as a side-effect, I need to be concerned about the stream being parallel:
// if stream is parallel, which order will be written to file?
coordinates().peek(i -> writeToFile(i)).count();
// how should I remember to always add sequential() in such cases?
And also, if it is parallel, which thread pool does it run on?
If I want to sort the stream (or other non-short-circuit operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. But how can I guard against this, other than not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (or from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to check if the stream has already been consumed.
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like:
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
stream,
/*maxThreshold or elements before IllegalArgumentException*/
10_000,
/* fail with IllegalArgumentException if not ordered */
true
);
After looking at things a bit (some experimentation and here) as far as I see, there is no way to know definitely whether a stream is finite or not.
More than that, sometimes even it is not determined except at runtime (such as in java 11 - IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty if it is finite, in a few ways (notice that receiving false on these does not mean it is infinite, only that it may be so):
stream.spliterator().getExactSizeIfKnown() - if this has a known exact size, it is finite, otherwise it will return -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - if it is SIZED will return true.
You can safe-guard yourself, by assuming the worst (depends on your case).
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With potentially infinite stream, assume your worst case on each scenario.
For example, assume you want to listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain amount of time (similar to timeout). For this, I'd recommend always calling stream.limit(x) before calling your operation (collect or allMatch or similar) where x is the amount of tries you're willing to tolerate.
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.
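A rough sketch of what a StreamPreconditions-style helper could look like, built only from the spliterator checks listed above. The class and method names are hypothetical (nothing like this exists in the JDK), and it is a best-effort guard, not a proof of finiteness: when the size is unknown it truncates with limit() instead of throwing.

```java
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class StreamPreconditions {
    // Hypothetical helper, not part of the JDK.
    static <T> Stream<T> checked(Stream<T> stream, long maxElements, boolean requireOrdered) {
        boolean parallel = stream.isParallel();
        // spliterator() is a terminal operation: 'stream' itself is consumed from here on.
        Spliterator<T> sp = stream.spliterator();
        if (requireOrdered && !sp.hasCharacteristics(Spliterator.ORDERED)) {
            throw new IllegalArgumentException("stream is not ordered");
        }
        long size = sp.getExactSizeIfKnown(); // -1 when the size is unknown
        if (size > maxElements) {
            throw new IllegalArgumentException("stream reports " + size + " elements");
        }
        // Rebuild a stream over the same spliterator; limit() bounds sources of unknown size.
        return StreamSupport.stream(sp, parallel)
                .limit(maxElements)
                .onClose(stream::close);
    }
}
```

If the incoming stream has already been consumed, the call to spliterator() throws IllegalStateException, which at least surfaces that case early. The caller must use the returned stream, not the original one.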

Stream operation modifying the actual list

I have read somewhere that a stream operation always returns a new collection at the terminal operation and doesn't change the original collection on which the stream operation has been applied.
But in my case original list has been modified.
return subscriptions.stream()
.filter(alertPrefSubscriptionsBO -> (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT || alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.SECONDARY_CONTACT))
.map(alertPrefSubscriptionsBO -> {
if (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT) {
alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.PRIMARY);
} else
alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.SECONDARY);
return alertPrefSubscriptionsBO;
})
.collect(groupingBy(AlertPrefSubscriptionsBO::isActiveStatus, groupingBy(AlertPrefSubscriptionsBO::getAlertLabel, Collectors.mapping((AlertPrefSubscriptionsBO o) -> o.getType()
.getContactId(), toSet())
)));
After this operation, the subscriptions list has been modified and contains only AlertPrefContactTypeEnum.PRIMARY and AlertPrefContactTypeEnum.SECONDARY objects. I mean, the size of the list remained the same, but the values got changed.
That is because you are violating the contract of the map(Function<? super T,? extends R> mapper) method:
Parameters:
mapper - a non-interfering, stateless function to apply to each element
You're violating the "stateless" part:
Stateless behaviors
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline. An example of a stateful lambda is the parameter to map() in:
Set<Integer> seen = Collections.synchronizedSet(new HashSet<>());
stream.parallel().map(e -> { if (seen.add(e)) return 0; else return e; })...
Here, if the mapping operation is performed in parallel, the results for the same input could vary from run to run, due to thread scheduling differences, whereas, with a stateless lambda expression the results would always be the same.
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
The correct way to implement that map operation is to copy the alertPrefSubscriptionsBO and give the copy a new type.
Following the style used by the java.time classes (e.g. see all the withXxx(...) methods of ZonedDateTime), you would make or treat the alertPrefSubscriptionsBO object as immutable, and have methods for getting a copy with a property changed. E.g. with a method withType(...) on the class, and using static imports of the AlertPrefContactTypeEnum enums, your code could be:
.map(bo -> bo.withType(bo.getType() == PRIMARY_CONTACT ? PRIMARY : SECONDARY))
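A minimal sketch of what such a withType(...) method could look like. The Subscription class and ContactType enum below are simplified, hypothetical stand-ins for AlertPrefSubscriptionsBO and AlertPrefContactTypeEnum, whose real definitions are not shown in the question.

```java
import java.util.List;
import java.util.stream.Collectors;

final class WithTypeDemo {
    enum ContactType { PRIMARY_CONTACT, SECONDARY_CONTACT, PRIMARY, SECONDARY }

    // Hypothetical immutable stand-in for AlertPrefSubscriptionsBO.
    static final class Subscription {
        private final String label;
        private final ContactType type;
        Subscription(String label, ContactType type) { this.label = label; this.type = type; }
        String getLabel() { return label; }
        ContactType getType() { return type; }
        // Returns a copy with the new type; 'this' is never modified.
        Subscription withType(ContactType newType) { return new Subscription(label, newType); }
    }

    // Maps each element to a modified copy, leaving the source list untouched.
    static List<Subscription> normalize(List<Subscription> subscriptions) {
        return subscriptions.stream()
                .map(s -> s.withType(s.getType() == ContactType.PRIMARY_CONTACT
                        ? ContactType.PRIMARY : ContactType.SECONDARY))
                .collect(Collectors.toList());
    }
}
```

Since map now returns new objects, the elements of the original subscriptions list keep their PRIMARY_CONTACT / SECONDARY_CONTACT types after the pipeline has run.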

Does Stream.forEach() always work in parallel?

In Aggregating with Streams, Brian Goetz compares populating a collection using Stream.collect() and doing the same using Stream.forEach(), with the following two snippets:
Set<String> uniqueStrings = strings.stream()
.collect(HashSet::new,
HashSet::add,
HashSet::addAll);
And,
Set<String> set = new HashSet<>();
strings.stream().forEach(s -> set.add(s));
Then he explains:
The key
difference is that, with the forEach() version, multiple threads are trying to access a single result
container simultaneously, whereas with parallel collect(), each thread has its own local result
container, the results of which are merged afterward.
To my understanding, multiple threads would be working in the forEach() case only if the stream is parallel. However, in the example given, forEach() is operating on a sequential stream (no call to parallelStream()).
So, is it that forEach() always works in parallel, or that the code snippet should call parallelStream() instead of stream()? (Or am I missing something?)
No, forEach() doesn't parallelize if the stream isn't parallel. I think he simplified the example for the sake of discussion.
As evidence, this code is inside the AbstractPipeline class's evaluate method (which is called from forEach)
return isParallel()
? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
: terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
The whole quote goes as follows:
Just as reduction can parallelize safely provided the combining function is associative and free of interfering side effects, mutable reduction with Stream.collect() can parallelize safely if it meets certain simple consistency requirements (outlined in the specification for collect()).
And then what you've quoted:
The key difference is that, with the forEach() version, multiple threads are trying to access a single result container simultaneously, whereas with parallel collect(), each thread has its own local result container, the results of which are merged afterward.
Since the first sentence clearly speaks of parallelization, my understanding is that both forEach() and collect() are spoken of in the context of parallel streams.
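To see the two variants side by side in the parallel setting the article is talking about, here is a small sketch. The class and method names are invented for the example, and the forEach variant uses a concurrent set, which is my substitution: with a plain HashSet, as in the quoted snippet, the parallel forEach version would be a data race.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ParallelUnique {
    // Each worker fills its own HashSet; the partial sets are merged with addAll.
    static Set<String> viaCollect(List<String> strings) {
        HashSet<String> result = strings.parallelStream()
                .collect(HashSet::new, HashSet::add, HashSet::addAll);
        return result;
    }

    // forEach writes to one shared container, so that container must be thread-safe.
    static Set<String> viaForEach(List<String> strings) {
        Set<String> set = ConcurrentHashMap.newKeySet();
        strings.parallelStream().forEach(set::add);
        return set;
    }
}
```

Both return the same set of elements. The difference is that viaCollect needs no synchronization at all, while viaForEach pays for contention on the shared set.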

Sequential streams and shared state

The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
.filter(x -> [some boolean expression])
.forEach(x -> {
if (map.containsKey(x)) {
throw new UserDefinedException("duplicates detected in input");
} else {
map.put(x, aStringFunction(x));
}
});
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization or concurrent-safe collections. So the current code is safe. Note, however, that it just looks dirty.
If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
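For completeness, here is a sketch of the toMap overload with a mergeFunction that the question mentions as the better solution. The String element type, the filter condition, the toUpperCase value function and the UserDefinedException class are placeholders for the parts the question elides. The merge function is only called when two elements produce the same key, so throwing from it reports duplicates without any shared mutable map.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DuplicateCheck {
    // Hypothetical stand-in for the question's UserDefinedException.
    static final class UserDefinedException extends RuntimeException {
        UserDefinedException(String message) { super(message); }
    }

    static Map<String, String> index(List<String> items) {
        return items.stream()
                .filter(x -> !x.isEmpty()) // placeholder for [some boolean expression]
                .collect(Collectors.toMap(
                        Function.identity(),
                        s -> s.toUpperCase(), // placeholder for aStringFunction(x)
                        (first, second) -> {
                            throw new UserDefinedException("duplicates detected in input");
                        }));
    }
}
```

This version has no state outside the collector, so it behaves the same for sequential and parallel streams.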
