There are lots of questions regarding peek in the Java Streams API. I'm looking for a way to complete the following common pattern using Java Streams. I can make it work with Streams, but it is non-obvious which means slightly dangerous without a comment which it is not ideal.
boolean anyPricingComponentsChanged = false;
for (var pc : plan.getPricingComponents()) {
if (pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0) {
anyPricingComponentsChanged = true;
pc.setValidTill(dateNow);
}
}
My option:
long numberChanged = plan.getPricingComponents()
.stream()
.filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
.peek(pc -> pc.setValidTill(dateNow))
.count(); //`count` rather than `findAny` to ensure that `peek` processes all components.
boolean anyPricingComponentsChanged = numberChanged != 0;
As an aside, whilst compareTo is not an expensive operation here and consistently returns the same result, in other cases this might not be true, and I'd rather avoid running it multiple times for this pattern.
// to ensure that peek processes all components
You can't really ensure that peek() would process all the stream elements that should be modified. In some cases, this operation can be elided from the pipeline, and you should not perform any important actions via peek().
Here's a quote from the documenation of peek():
API Note:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline ...
In cases where the stream implementation is able to optimize away the production of some or all the elements (such as with short-circuiting operations like findFirst, or in the example described in count()), the action will not be invoked for those elements.
Also, here's what Stream API documentation says regarding Side-effects:
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:
the visibility of those side-effects to other threads;
that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a
stream pipeline if it can prove that it would not affect the result of
the computation.
...
The eliding of side-effects may also be surprising. With the exception
of terminal operations forEach and forEachOrdered, side-effects of
behavioral parameters may not always be executed when the stream
implementation can optimize away the execution of behavioral
parameters without affecting the result of the computation. (For a
specific example see the API note documented on the count operation.)
Amphesys added
Since peek is not meant to contribute to the result of the stream execution Stream implementations are free to throw it away.
Instead of relying on peek() you can do the following:
List<PricingComponent> componentsToChange = plan.getPricingComponents()
.stream()
.filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
.toList();
componentsToChange.forEach(pc -> pc.setValidTill(dateNow));
boolean anyPricingComponentsChanged = componentsToChange.size() != 0;
If you don't want to materialize the objects that need to be modified as a List, then stick with a for-loop.
Note
The quotes above from the API documentation like "stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation" are applicable to any intermediate operation having an embedded side-effect. Either a side-effect can be elided, or the whole pipeline stage (stream operation) optimized away if it has no impact on the result. And to be on the same page regurding the terminology, in short, side-effect - is anything that a function does apart from producing the required result (e.g. i -> { side-effect; return i * 2; })
Although it's not advisable to assign peek() with an action which should be executed at any circumstances, at least is choice doesn't contradicts the semantics of peek. To the contrary, performing side-effects via filter, map, or other operation which are not designed to operate through side-effects not only doesn't resolve the problem, but is also weird since it goes against the semantics of these operations and violates the Principle of least astonishment.
Related
I've come across a rule in Sonar which says:
A key difference with other intermediate Stream operations is that the Stream implementation is free to skip calls to peek() for optimization purpose. This can lead to peek() being unexpectedly called only for some or none of the elements in the Stream.
Also, it's mentioned in the Javadoc which says:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline
In which case can java.util.Stream.peek() be skipped? Is it related to debugging?
Not only peek but also map can be skipped. It is for sake of optimization.
For example, when the terminal operation count() is called, it makes no sense to peek or map the individual items as such operations do not change the number/count of the present items.
Here are two examples:
1. Map and peek are not skipped because the filter can change the number of items beforehand.
long count = Stream.of("a", "aa")
.peek(s -> System.out.println("#1"))
.filter(s -> s.length() < 2)
.peek(s -> System.out.println("#2"))
.map(s -> {
System.out.println("#3");
return s.length();
})
.count();
#1
#2
#3
#1
1
2. Map and peek are skipped because the number of items is unchanged.
long count = Stream.of("a", "aa")
.peek(s -> System.out.println("#1"))
//.filter(s -> s.length() < 2)
.peek(s -> System.out.println("#2"))
.map(s -> {
System.out.println("#3");
return s.length();
})
.count();
2
Important: The methods should have no side-effects (they do above, but only for the sake of example).
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
The following implementation is dangerous. Assuming callRestApi method performs a REST call, it won't be performed as the Stream violates the side-effect.
long count = Stream.of("url1", "url2")
.map(string -> callRestApi(HttpMethod.POST, string))
.count();
/**
* Performs a REST call
*/
public String callRestApi(HttpMethod httpMethod, String url);
peek() is an intermediate operation, and it expects a consumer which perform an action (side-effect) on elements of the stream.
In case when a stream pipe-line doesn't contain intermediate operations which can change the number of elements in the stream, like takeWhile, filter, limit, etc., and ends with terminal operation count() and when the stream-source allows evaluating the number of elements in it, then count() simply interrogates the source and returns the result. All intermediate operations get optimized away.
Note: this optimization of count() operation, which exists since Java 9 (see the API Note), is not directly related to peek(), it would affect every intermediate operation which doesn't change the number of elements in the stream (for now these are map(), sorted(), peek()).
There's More to it
peek() has a very special niche among other intermediate operations.
By its nature, peek() differs from other intermediate operations like map() as well as from the terminal operations that cause side-effects (like peek() does), performing a final action for each element that reaches them, which are forEach() and forEachOrdered().
The key point is that peek() doesn't contribute to the result of stream execution. It never affects the result produced by the terminal operation, whether it's a value or a final action.
In other words, if we throw away peek() from the pipeline, it would not affect the terminal operation.
Documentation of the method peek() as well the Stream API documentation warns its action could be elided, and you shouldn't rely on it.
A quote from the documentation of peek():
In cases where the stream implementation is able to optimize away the
production of some or all the elements (such as with short-circuiting
operations like findFirst, or in the example described in count()),
the action will not be invoked for those elements.
A quote from the API documentation, paragraph Side-effects:
The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation.
Here's an example of the stream (link to the source) where none of the intermediate operations gets elided apart from peek():
Stream.of(1, 2, 3)
.parallel()
.peek(System.out::println)
.skip(1)
.map(n -> n * 10)
.forEach(System.out::println);
In this pipe-line peek() presides skip() therefor you might expect it to display every element from the source on the console. However, it doesn't happen (element 1 will not be printed). Due to the nature of peek() it might be optimized away without breaking the code, i.e. without affecting the terminal operation.
That's why documentation explicitly states that this operation is provided exclusively for debugging purposes, and it should not be assigned with an action which needs to be executed at any circumstances.
The referenced optimization at this thread is the known architecture of java streams which is based on lazy computation.
Streams are lazy; computation on the source data is only performed
when the terminal operation is initiated, and source elements are
consumed only as needed. (java doc)
Also
Intermediate operations return a new stream. They are always lazy;
executing an intermediate operation such as filter() does not actually
perform any filtering, but instead creates a new stream that, when
traversed, contains the elements of the initial stream that match the
given predicate. Traversal of the pipeline source does not begin until
the terminal operation of the pipeline is executed. (java doc)
This lazy computation affects several other operators not just .peek. In the same way that peek (which is an intermediate operation) is affected by this lazy computation are also all other intermediate operations affected (filter, map, mapToInt, mapToDouble, mapToLong, flatMap, flatMapToInt, flatMapToDouble, flatMapToLong). But probably someone not understanding the concept of lazy computation can be caught in the trap with .peek that sonar reports here.
So the example that the Sonar correctly reports
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e));
should not be used as is, because no terminal operation in the above example exists. So Streams will not invoke at all the intermidiate .peek operator, even though 2 elements ( "three", "four") are eligible to pass through the stream pipeline.
Example 1. Add a terminal operator like the following:
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e))
.collect(Collectors.toList()); // <----
and the elements passed through would be also passed through .peek intermediate operator. Never an element would be skipped on this example.
Example 2. Now here is the interesting part, if you use some other terminal operator for example the .findFirst because the Stream Api is based on lazy computation
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e))
.findFirst(); // <----
Only 1 element will pass through the operator .peek and not 2.
But as long as you know what you are doing (example 1) and you have understood lazy computation, you can expect that in certain cases .peek will be invoked for every element passing down the stream channel and no element would be skipped, and in other cases you would know which elements are to be skipped from .peek.
But extremely caution if you use .peek with parallel streams since there exists another set of traps which can arise. As the java API for .peek mentions:
For parallel stream pipelines, the action may be called at
* whatever time and in whatever thread the element is made available by the
* upstream operation. If the action modifies shared state,
* it is responsible for providing the required synchronization.
There is the question on whether java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be safely performed, and defensive code guards are not possible because Streams do not reveal if they are infinite or unordered.
If parallel was a problem to the operations I want to perform on a Stream(), I can call isParallel() to check or sequential to make sure computation is in parallel (if i remember to).
But if orderedness or finity(sizedness) was relevant to the safety of my program, I cannot write safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
public Stream<Integer> coordinates();
// example implementations:
// finite, ordered, sequential
// IntStream.range(0, 100).boxed()
// final AtomicInteger atomic = new AtomicInteger();
// // infinite, unordered, sequential
// Stream.generate(() -> atomic2.incrementAndGet())
// infinite, unordered, parallel
// Stream.generate(() -> atomic2.incrementAndGet()).parallel()
// finite, ordered, sequential, should-be-closed
// Files.lines(Path.path("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems if I maybe want to do write the elements to a file as a side-effect, I need to be concerned about the stream being parallel:
// if stream is parallel, which order will be written to file?
coordinates().peek(i -> {writeToFile(i)}).count();
// how should I remember to always add sequential() in such cases?
And also if it is parallel, based on what Threadpool is it parallel?
If I want to sort the stream (or other non-short-circuit operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. But how can I guard against this, other than not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (of from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to check if the stream has already been consumed;
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like>
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
stream,
/*maxThreshold or elements before IllegalArgumentException*/
10_000,
/* fail with IllegalArgumentException if not ordered */
true
)
After looking at things a bit (some experimentation and here) as far as I see, there is no way to know definitely whether a stream is finite or not.
More than that, sometimes even it is not determined except at runtime (such as in java 11 - IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x))).
What you can do is:
You can find out with certainty if it is finite, in a few ways (notice that receiving false on these does not mean it is infinite, only that it may be so):
stream.spliterator().getExactSizeIfKnown() - if this has an known exact size, it is finite, otherwise it will return -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - if it is SIZED will return true.
You can safe-guard yourself, by assuming the worst (depends on your case).
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With potentially infinite stream, assume your worst case on each scenario.
For example assume you want listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain amount of time (similar to timeout). For this, I'd recommend always calling stream.limit(x) before calling your operation (collect or allMatch or similar) where x is the amount of tries you're willing to tolerate.
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.
I have read somewhere that stream operation always return a new collection at the terminal operation and don't change the original collection on which stream operation has been applied.
But in my case original list has been modified.
return subscriptions.stream()
.filter(alertPrefSubscriptionsBO -> (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT || alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.SECONDARY_CONTACT))
.map(alertPrefSubscriptionsBO -> {
if (alertPrefSubscriptionsBO.getType() == AlertPrefContactTypeEnum.PRIMARY_CONTACT) {
alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.PRIMARY);
} else
alertPrefSubscriptionsBO.setType(AlertPrefContactTypeEnum.SECONDARY);
return alertPrefSubscriptionsBO;
})
.collect(groupingBy(AlertPrefSubscriptionsBO::isActiveStatus, groupingBy(AlertPrefSubscriptionsBO::getAlertLabel, Collectors.mapping((AlertPrefSubscriptionsBO o) -> o.getType()
.getContactId(), toSet())
)));
After this operation subscriptions list has been modified containing only AlertPrefContactTypeEnum.PRIMARY and AlertPrefContactTypeEnum.SECONDARY objects. I mean size of list remained same but values got changed.
That is because you are violating the contract of the map(Function<? super T,? extends R> mapper) method:
Parameters:
mapper - a non-interfering, stateless function to apply to each element
You're violating the "stateless" part:
Stateless behaviors
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline. An example of a stateful lambda is the parameter to map() in:
Set<Integer> seen = Collections.synchronizedSet(new HashSet<>());
stream.parallel().map(e -> { if (seen.add(e)) return 0; else return e; })...
Here, if the mapping operation is performed in parallel, the results for the same input could vary from run to run, due to thread scheduling differences, whereas, with a stateless lambda expression the results would always be the same.
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
The correct way to implement that map operation is to copy the alertPrefSubscriptionsBO and give the copy a new type.
Following the style used by the java.time classes, e.g. see all the withXxx(...) methods of ZonedDateTime, you would make or treat the alertPrefSubscriptionsBO object as immutable, and have methods for getting a copy with a property changed, e.g. with method withType(...) on the class and using static imports of the AlertPrefContactTypeEnum enums, you code could be:
.map(bo -> bo.withType(bo.getType() == PRIMARY_CONTACT ? PRIMARY : SECONDARY))
In the javaodoc for the stream package, at the end of the section Parallelism, I read:
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
I have hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
I mean, I know it is possible, specially when using sequential streams, but the same javadoc clearly states:
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
And also:
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.
So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?
A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.
Examples of stateful stream lambdas:
collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).
So, Collector and Consumer are two lambda interfaces that by definition are stateful.
All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.
I have hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?
Suppose following scenario. You have a Stream<String> and you need to list the items in natural order prefexing each one with order number. So, for example on input you have: Banana, Apple and Grape. Output should be:
1. Apple
2. Banana
3. Grape
How you solve this task in Java Stream API? Pretty easily:
List<String> f = asList("Banana", "Apple", "Grape");
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
.sorted()
.sequential()
.map(i -> String.format("%d. %s", number.incrementAndGet(), i))
.collect(Collectors.joining("\n"));
Now if you look at this pipeline you'll see 3 stateful operations:
sorted() – stateful by definition. See documetation to Stream.sorted():
This is a stateful intermediate operation
map() – by itself could be stateless or not, but in this case it is not. To label positions you need to keep track of how much items already labeled;
collect() – is mutable reduction operation (from docs to Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
There are some controversy about why sorted() is stateful. From the Stream API documentation:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
So when applying term stateful/stateless to a Stream API we're talking more about function processing element of a stream, and not about function processing stream as a whole.
Also note that there is some confusion between terms stateless and deterministic. They are not the same.
Deterministic function provide same result given same arguments.
Stateless function retain no state from previous calls.
Those are different definitions. And in general case doesn't depend on each other. Determinism is about function result value while statelessness about function implementation.
When in doubt simply check the documentation to the specific operation. Examples:
Stream.map mapper parameter:
mapper - a non-interfering, stateless function to apply to each element
Here documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:
action - a non-interfering action to perform on the elements
Here it's not specified that the action is stateless, thus it can be stateful.
In general it's always explicitly written on every method documentation.
A stateless function returns the same output for the same inputs, "no matter what".
It's easy to create non-stateless functions in an imperative language like Java. e.g.
func = input -> currentTime();
If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).
If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.
Note that "no matter what" implies that a stateless function must be thread-safe.
If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.
If func has no "important" side effects, it's safe to invoke func arbitarily. For example, stream.map(func) can safely invoke func multiple times even on the same element. (But don't worry, Stream is never gonna do that).
What is an "important" side effect? That is very subjective.
At the very least, invoking fun will cost some CPU time, which is not exactly free. This might be concerning for performance critical applications; or on expensive platforms (cough AWS).
If func logs something on hardisk, it may or may not be an "important" side effect. (It too costs $$)
If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.
Now, forget about money. Purely from application logic point of view, func could cause mutation to some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses the list, the resulting list will depend on how func is invoked at runtime. This is frawned upon by functional-programmers.
If we do stream.forEach( e->log(e) ), is it stateless? We can consider it stateless if
we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic
The javadoc for java.util.stream implies that "behavioral operations" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
.filter(x -> [some boolean expression])
.forEach(x -> {
if (map.containsKey(x) {
throw new UserDefinedException("duplicates detected in input");
} else {
map.put(x, aStringFunction(x));
}
});
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
The sequential stream is by definition executes everything in the caller thread, thus if you are not going to parallelize your stream in future, you can safely use shared state without additional synchronization and concurrent-safe collections. So the current code is safe. Note however that it just looks dirty.
If you rely on your forEach to be executed sequentially, consider using forEachOrdered instead even if the stream is sequential. Not only will that get the explicit guarantee from the api that the code will be executed sequentially, it will make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.