Stream.peek() can be skipped for optimization - java

I've come across a rule in Sonar which says:
A key difference with other intermediate Stream operations is that the Stream implementation is free to skip calls to peek() for optimization purpose. This can lead to peek() being unexpectedly called only for some or none of the elements in the Stream.
Also, it's mentioned in the Javadoc, which says:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline
In which cases can java.util.Stream.peek() be skipped? Is this related to debugging?

Not only peek but also map can be skipped, for the sake of optimization.
For example, when the terminal operation count() is called, it makes no sense to peek or map the individual items, as such operations do not change the number of items present.
Here are two examples:
1. Map and peek are not skipped because the filter can change the number of items beforehand.
long count = Stream.of("a", "aa")
        .peek(s -> System.out.println("#1"))
        .filter(s -> s.length() < 2)
        .peek(s -> System.out.println("#2"))
        .map(s -> {
            System.out.println("#3");
            return s.length();
        })
        .count();
System.out.println(count);
#1
#2
#3
#1
1
2. Map and peek are skipped because the number of items is unchanged.
long count = Stream.of("a", "aa")
        .peek(s -> System.out.println("#1"))
        //.filter(s -> s.length() < 2)
        .peek(s -> System.out.println("#2"))
        .map(s -> {
            System.out.println("#3");
            return s.length();
        })
        .count();
System.out.println(count);
2
Important: The methods should have no side-effects (they do above, but only for the sake of example).
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
The following implementation is dangerous. Assuming the callRestApi method performs a REST call, that call will never be made: the result of count() does not depend on map(), so the whole mapping stage, together with its side-effect, is optimized away.
long count = Stream.of("url1", "url2")
        .map(string -> callRestApi(HttpMethod.POST, string))
        .count();

/**
 * Performs a REST call
 */
public String callRestApi(HttpMethod httpMethod, String url);
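A safer sketch, under the same assumptions (the hypothetical callRestApi declared above): make the mapped values part of the pipeline's result, so the calls cannot be optimized away.
// collect(toList()) needs every mapped value, so every REST call is actually
// performed; the count is then taken from the materialized list.
List<String> responses = Stream.of("url1", "url2")
        .map(url -> callRestApi(HttpMethod.POST, url))
        .collect(Collectors.toList());
long count = responses.size();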

peek() is an intermediate operation, and it expects a consumer which performs an action (side-effect) on the elements of the stream.
When a stream pipeline doesn't contain intermediate operations that can change the number of elements in the stream (like takeWhile, filter, limit, etc.), ends with the terminal operation count(), and the stream source allows evaluating the number of its elements, then count() simply interrogates the source and returns the result. All intermediate operations get optimized away.
Note: this optimization of the count() operation, which exists since Java 9 (see the API Note), is not specific to peek(); it affects every intermediate operation that doesn't change the number of elements in the stream (currently these are map(), sorted(), and peek()).
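For illustration, a minimal sketch (assumption: behavior as observed on JDK 9+, where this count() optimization exists):
long n = Stream.of(3, 1, 2)
        .sorted((a, b) -> {
            System.out.println("comparing"); // never printed: sorted() is elided
            return Integer.compare(a, b);
        })
        .count(); // the pipeline is SIZED, so count() reads the size from the source
System.out.println(n); // 3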
There's More to it
peek() has a very special niche among other intermediate operations.
By its nature, peek() differs both from other intermediate operations like map() and from the side-effecting terminal operations, forEach() and forEachOrdered(), which perform a final action for each element that reaches them.
The key point is that peek() doesn't contribute to the result of stream execution. It never affects the result produced by the terminal operation, whether it's a value or a final action.
In other words, if we throw away peek() from the pipeline, it would not affect the terminal operation.
The documentation of the method peek(), as well as the Stream API documentation, warns that its action can be elided and that you shouldn't rely on it.
A quote from the documentation of peek():
In cases where the stream implementation is able to optimize away the production of some or all the elements (such as with short-circuiting operations like findFirst, or in the example described in count()), the action will not be invoked for those elements.
A quote from the API documentation, paragraph Side-effects:
The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation.
Here's an example of the stream (link to the source) where none of the intermediate operations gets elided apart from peek():
Stream.of(1, 2, 3)
      .parallel()
      .peek(System.out::println)
      .skip(1)
      .map(n -> n * 10)
      .forEach(System.out::println);
In this pipeline peek() precedes skip(), therefore you might expect it to print every element from the source to the console. However, that doesn't happen (element 1 will not be printed). Due to the nature of peek(), it can be optimized away without breaking the code, i.e. without affecting the terminal operation.
That's why the documentation explicitly states that this operation is provided mainly for debugging purposes, and it should not be given an action that needs to be executed under all circumstances.

The optimization referenced in this thread follows from the architecture of Java streams, which is based on lazy evaluation.
Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed. (java doc)
Also
Intermediate operations return a new stream. They are always lazy; executing an intermediate operation such as filter() does not actually perform any filtering, but instead creates a new stream that, when traversed, contains the elements of the initial stream that match the given predicate. Traversal of the pipeline source does not begin until the terminal operation of the pipeline is executed. (java doc)
This lazy evaluation affects several other operations, not just .peek. All the other intermediate operations (filter, map, mapToInt, mapToDouble, mapToLong, flatMap, flatMapToInt, flatMapToDouble, flatMapToLong) are affected in exactly the same way as .peek, which is itself an intermediate operation. But someone who has not understood the concept of lazy evaluation can easily be caught in the .peek trap that Sonar reports here.
So the example that Sonar correctly reports
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e));
should not be used as is, because no terminal operation exists in the example. The stream will therefore never invoke the intermediate .peek operation at all, even though two elements ("three", "four") are eligible to pass through the stream pipeline.
Example 1. Add a terminal operation like the following:
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e))
.collect(Collectors.toList()); // <----
and every element that passes the filter will also pass through the .peek intermediate operation; no element is skipped in this example.
Example 2. Now here is the interesting part: use some other terminal operation, for example the short-circuiting .findFirst, and because the Stream API is based on lazy evaluation,
Stream.of("one", "two", "three", "four")
.filter(e -> e.length() > 3)
.peek(e -> System.out.println("Filtered value: " + e))
.findFirst(); // <----
only one element will pass through the .peek operation, not two.
But as long as you know what you are doing (example 1) and have understood lazy evaluation, you can expect that in certain cases .peek will be invoked for every element passing down the stream pipeline with no element skipped, and in other cases you will know which elements will be skipped by .peek.
But be extremely cautious if you use .peek with parallel streams, since another set of traps arises there. As the Java API documentation for .peek mentions:
For parallel stream pipelines, the action may be called at whatever time and in whatever thread the element is made available by the upstream operation. If the action modifies shared state, it is responsible for providing the required synchronization.
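For instance, a minimal sketch of such synchronization (assuming you really do need the shared state; collect() is used rather than count() so the traversal isn't elided):
// peek's action may run on arbitrary fork-join worker threads,
// so the shared list must be thread-safe.
List<String> seen = Collections.synchronizedList(new ArrayList<>());
List<String> result = Stream.of("one", "two", "three", "four")
        .parallel()
        .filter(e -> e.length() > 3)
        .peek(seen::add)
        .collect(Collectors.toList());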

Related

Alternative to peek in Java streams

There are lots of questions regarding peek in the Java Streams API. I'm looking for a way to express the following common pattern using Java Streams. I can make it work with Streams, but it is non-obvious, which makes it slightly dangerous without a comment, and that is not ideal.
boolean anyPricingComponentsChanged = false;
for (var pc : plan.getPricingComponents()) {
    if (pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0) {
        anyPricingComponentsChanged = true;
        pc.setValidTill(dateNow);
    }
}
My option:
long numberChanged = plan.getPricingComponents()
        .stream()
        .filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
        .peek(pc -> pc.setValidTill(dateNow))
        .count(); // `count` rather than `findAny` to ensure that `peek` processes all components.
boolean anyPricingComponentsChanged = numberChanged != 0;
As an aside, whilst compareTo is not an expensive operation here and consistently returns the same result, in other cases this might not be true, and I'd rather avoid running it multiple times for this pattern.
// to ensure that peek processes all components
You can't really ensure that peek() would process all the stream elements that should be modified. In some cases, this operation can be elided from the pipeline, and you should not perform any important actions via peek().
Here's a quote from the documentation of peek():
API Note:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline ...
In cases where the stream implementation is able to optimize away the production of some or all the elements (such as with short-circuiting operations like findFirst, or in the example described in count()), the action will not be invoked for those elements.
Also, here's what Stream API documentation says regarding Side-effects:
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:
the visibility of those side-effects to other threads;
that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation.
...
The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation. (For a specific example see the API note documented on the count operation.)
Emphasis added.
Since peek is not meant to contribute to the result of the stream execution, stream implementations are free to throw it away.
Instead of relying on peek() you can do the following:
List<PricingComponent> componentsToChange = plan.getPricingComponents()
        .stream()
        .filter(pc -> pc.getValidTill() == null || pc.getValidTill().compareTo(dateNow) <= 0)
        .toList();

componentsToChange.forEach(pc -> pc.setValidTill(dateNow));
boolean anyPricingComponentsChanged = !componentsToChange.isEmpty();
If you don't want to materialize the objects that need to be modified as a List, then stick with a for-loop.
Note
The quotes above from the API documentation, like "stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation", apply to any intermediate operation carrying an embedded side-effect. Either the side-effect can be elided, or the whole pipeline stage (stream operation) can be optimized away if it has no impact on the result. And to be on the same page regarding terminology: in short, a side-effect is anything a function does apart from producing its required result (e.g. i -> { side-effect; return i * 2; }).
Although it's not advisable to give peek() an action that must be executed under all circumstances, at least this choice doesn't contradict the semantics of peek. On the contrary, performing side-effects via filter, map, or other operations that are not designed to operate through side-effects not only fails to resolve the problem, but is also weird, since it goes against the semantics of those operations and violates the principle of least astonishment.
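For instance, a sketch of that anti-pattern (reusing the hypothetical plan/dateNow from the question): besides being misleading, it can silently break.
// The side-effect is smuggled into map(). Without a filter, this pipeline is
// SIZED, so on JDK 9+ count() may elide the map() entirely and setValidTill
// never runs.
long n = plan.getPricingComponents().stream()
        .map(pc -> { pc.setValidTill(dateNow); return pc; })
        .count();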

How to safely consume Java Streams without isFinite() and isOrdered() methods?

There is the question on whether java methods should return Collections or Streams, in which Brian Goetz answers that even for finite sequences, Streams should usually be preferred.
But it seems to me that currently many operations on Streams that come from other places cannot be safely performed, and defensive code guards are not possible because Streams do not reveal if they are infinite or unordered.
If parallelism were a problem for the operations I want to perform on a Stream, I could call isParallel() to check, or sequential() to make sure computation is not parallel (if I remember to).
But if orderedness or finiteness (sizedness) is relevant to the safety of my program, I cannot write such safeguards.
Assuming I consume a library implementing this fictitious interface:
public interface CoordinateServer {
    public Stream<Integer> coordinates();

    // example implementations:

    // finite, ordered, sequential
    // IntStream.range(0, 100).boxed()

    // infinite, unordered, sequential
    // final AtomicInteger atomic = new AtomicInteger();
    // Stream.generate(() -> atomic.incrementAndGet())

    // infinite, unordered, parallel
    // Stream.generate(() -> atomic.incrementAndGet()).parallel()

    // finite, ordered, sequential, should-be-closed
    // Files.lines(Path.of("coordinates.txt")).map(Integer::parseInt)
}
Then what operations can I safely call on this stream to write a correct algorithm?
It seems that if I want to write the elements to a file as a side-effect, I need to be concerned about the stream being parallel:
// if the stream is parallel, in which order will elements be written to the file?
coordinates().peek(i -> writeToFile(i)).count();
// how should I remember to always add sequential() in such cases?
And also, if it is parallel, on which thread pool is it parallel?
If I want to sort the stream (or other non-short-circuit operations), I somehow need to be cautious about it being infinite:
coordinates().sorted().limit(1000).collect(toList()); // will this terminate?
coordinates().allMatch(x -> x > 0); // will this terminate?
I can impose a limit before sorting, but which magic number should that be, if I expect a finite stream of unknown size?
Finally maybe I want to compute in parallel to save time and then collect the result:
// will result list maintain the same order as sequential?
coordinates().map(i -> complexLookup(i)).parallel().collect(toList());
But if the stream is not ordered (in that version of the library), then the result might become mangled due to the parallel processing. But how can I guard against this, other than not using parallel (which defeats the performance purpose)?
Collections are explicit about being finite or infinite, about having an order or not, and they do not carry the processing mode or threadpools with them. Those seem like valuable properties for APIs.
Additionally, Streams may sometimes need to be closed, but most commonly not. If I consume a stream from a method (or from a method parameter), should I generally call close?
Also, streams might already have been consumed, and it would be good to be able to handle that case gracefully, so it would be good to check whether a stream has already been consumed.
I would wish for some code snippet that can be used to validate assumptions about a stream before processing it, like:
Stream<X> stream = fooLibrary.getStream();
Stream<X> safeStream = StreamPreconditions(
        stream,
        /* maxThreshold of elements before IllegalArgumentException */
        10_000,
        /* fail with IllegalArgumentException if not ordered */
        true
);
After looking at things a bit (some experimentation and here), as far as I can see there is no way to know definitively whether a stream is finite or not.
More than that, sometimes it is not even determined until runtime (such as IntStream.generate(() -> 1).takeWhile(x -> externalCondition(x)), available since Java 9).
What you can do is:
You can find out with certainty that it is finite, in a few ways (note that receiving false from these checks does not mean the stream is infinite, only that it may be):
stream.spliterator().getExactSizeIfKnown() - if this has a known exact size, the stream is finite; otherwise it returns -1.
stream.spliterator().hasCharacteristics(Spliterator.SIZED) - returns true if the spliterator is SIZED.
You can safeguard yourself by assuming the worst (depending on your case):
stream.sequential()/stream.parallel() - explicitly set your preferred consumption type.
With a potentially infinite stream, assume the worst case for each scenario.
For example, assume you want to listen to a stream of tweets until you find one by Venkat - it is a potentially infinite operation, but you'd like to wait until such a tweet is found. So in this case, simply go for stream.filter(tweet -> isByVenkat(tweet)).findAny() - it will iterate until such a tweet comes along (or forever).
A different scenario, and probably the more common one, is wanting to do something on all the elements, or only to try a certain amount of time (similar to a timeout). For this, I'd recommend always calling stream.limit(x) before your terminal operation (collect, allMatch or similar), where x is the number of elements you're willing to tolerate (a sketch of both safeguards follows).
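A minimal sketch of both safeguards, under stated assumptions (spliterator() is itself a terminal operation, so the stream is rebuilt from the spliterator afterwards; 10_000 is an arbitrary threshold):
// 1. Probe finiteness via the spliterator, then continue with the same elements.
Spliterator<Integer> sp = coordinates().spliterator();
boolean knownFinite = sp.getExactSizeIfKnown() >= 0
        || sp.hasCharacteristics(Spliterator.SIZED);
Stream<Integer> replay = StreamSupport.stream(sp, false);

// 2. Bound a potentially infinite stream before a non-short-circuiting operation.
List<Integer> firstBatch = replay
        .limit(10_000)
        .collect(Collectors.toList());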
After all this, I'll just mention that I think returning a stream is generally not a good idea, and I'd try to avoid it unless there are large benefits.

How to implement Stack Iteration using Java 8 Stream

I have a Stack<Object> and the following piece of code:
while (!stack.isEmpty()) {
    Object object = stack.pop();
    // do some operation on object
}
How can this iteration be implemented using a Java 8 Stream, so that it loops until the stack is empty and pops one element from the top in every iteration?
In Java 9, there will be a 3-arg version of Stream.iterate (like a for loop -- initial value, lambda for determining end-of-input, lambda for determining next input) that could do this, though it would be a little strained:
if (!stack.isEmpty()) {
    Stream.iterate(stack.pop(),
                   e -> !stack.isEmpty(),
                   e -> stack.pop())
          ...
}
In case you don’t want to wait for the Java 9 solution, here’s a stream factory which works under Java 8.
public static <T> Stream<T> pop(Stack<T> stack) {
    return StreamSupport.stream(new Spliterators.AbstractSpliterator<T>(
            stack.size(), Spliterator.ORDERED | Spliterator.SIZED) {
        public boolean tryAdvance(Consumer<? super T> action) {
            if (stack.isEmpty()) return false;
            action.accept(stack.pop());
            return true;
        }
    }, false);
}
Note that this reports the initial size of the stack, taking it for granted, which implies that you must not change the stack in-between (modifying a stream source in-between is a bad idea anyway). On the other hand, this will make certain Stream operations more efficient than the iterate variant.
Now, a general warning that applies to both variants. Stream sources that are modified due to an ongoing Stream operation, like popping the elements which the Stream consumes, can leave the source in an unpredictable state. Short circuiting operations may not consume all elements and in combination with parallel Streams, they still may consume more elements than needed for the terminal operation.
So analogous to BufferedReader.lines()
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
you should not make any assumptions about the Stack contents after consuming elements this way.
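A hypothetical usage of this factory:
Stack<String> stack = new Stack<>();
stack.push("a");
stack.push("b");
stack.push("c");
pop(stack).forEach(System.out::println); // prints c, b, a and empties the stack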
This is not possible using the stack's own stream: first, the elements would appear in first-in-first-out order (Stack extends Vector, so its stream traverses from the bottom of the stack up), and second, since that stream is based on an iterator, popping during traversal would throw a ConcurrentModificationException. The following is still possible, but of course not recommended compared to the simple loop:
IntStream.range(0, stack.size()).forEach(i -> stack.pop());

Java-8 Stream returned by .map will be parallel or sequential?

Is the Stream returned by the map or mapToObj methods always sequential, or does it depend on whether the calling stream was parallel?
The documentation of IntStream does not answer this explicitly, or I cannot understand it properly.
I am wondering if my stream from the following example will be parallel up to the end or it will change at some point.
IntStream.range(1, array_of_X.size())
         .parallel()
         .mapToObj(index -> array_of_X.get(index)) // mapping#1
         .filter(filter_X)
         .map(X_to_Y)                              // mapping#2
         .filter(filter_Y)
         .mapToInt(obj_Y_to_int)                   // mapping#3
         .sum();
No, it will never change (unless you explicitly change it yourself).
What you have written corresponds to a Stream pipeline and a single pipeline has a single orientation: parallel or sequential. So there is no "parallel up to the end" because either the whole pipeline will be executed in parallel or it will be executed sequentially.
Quoting the Stream package Javadoc:
The only difference between the serial and parallel versions of this example is the creation of the initial stream, using "parallelStream()" instead of "stream()". When the terminal operation is initiated, the stream pipeline is executed sequentially or in parallel depending on the mode of the stream on which it is invoked. Whether a stream will execute in serial or parallel can be determined with the isParallel() method, and the orientation of a stream can be modified with the BaseStream.sequential() and BaseStream.parallel() operations.
This means that the only way for a Stream pipeline to change its orientation is by calling one of sequential() or parallel() method. Since this is global to the Stream API, this is not written for every operation but in the package Javadoc instead.
With the code in your question, the Stream pipeline will be executed in parallel because you explicitly changed the Stream orientation by invoking parallel().
It is important to note that the resulting orientation of the Stream is determined by the last call made to parallel() or sequential(). Consider the three following examples:
public static void main(String[] args) {
    System.out.println(IntStream.range(0, 10).isParallel());
    System.out.println(IntStream.range(0, 10).parallel().isParallel());
    System.out.println(IntStream.range(0, 10).parallel().map(i -> 2 * i).sequential().isParallel());
}
The first one will print false since IntStream.range returns a sequential Stream.
The second one will print true since we invoked parallel().
The third one will print false since, even though we call parallel() in the pipeline, we invoke sequential() afterwards, which resets the overall orientation of the Stream pipeline to serial.
Note that, still quoting:
The stream implementations in the JDK create serial streams unless parallelism is explicitly requested.
so every Stream you retrieve will be sequential unless you explicitly requested a parallel Stream.

Sequential streams and shared state

The javadoc for java.util.stream implies that "behavioral parameters" in a stream pipeline must usually be stateless. However, the examples it shows of how not to write a pipeline all seem to involve parallel streams.
To what extent does this apply to sequential streams?
In particular, I was looking over a colleague's code that looked essentially like this:
List<SomeClass> list = ...;
Map<SomeClass, String> map = new HashMap<>();
list.stream()
    .filter(x -> [some boolean expression])
    .forEach(x -> {
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        } else {
            map.put(x, aStringFunction(x));
        }
    });
[The author had tried using Collectors.toMap(), but it threw an IllegalStateException when there were duplicates, and neither of us knew about the toMap that takes a mergeFunction. That last would have been the best solution, but I'd like an answer anyway because of the more general principle involved.]
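For reference, a sketch of that toMap overload (reusing the question's hypothetical aStringFunction and UserDefinedException, assuming the latter is unchecked; somePredicate is a hypothetical stand-in for the boolean expression):
// The merge function receives both values mapped to the same key, so duplicates
// can be rejected explicitly instead of surfacing as an IllegalStateException.
Map<SomeClass, String> map = list.stream()
    .filter(x -> somePredicate(x))
    .collect(Collectors.toMap(
            x -> x,
            x -> aStringFunction(x),
            (a, b) -> { throw new UserDefinedException("duplicates detected in input"); }));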
I was nervous about this code, since it wasn't clear to me whether the execution of the block in the forEach could overlap for different elements, even for a sequential stream. The javadoc for forEach() is a bit ambiguous about whether synchronization is necessary for accessing shared state in a sequential stream. Eventually the author changed the code to use a ConcurrentHashMap and map.putIfAbsent().
My question is: was I right to be nervous, or is the code above trustworthy?
Suppose the expression in the filter() did something that used some shared state. Can we trust that it will work OK when using a sequential stream?
A sequential stream by definition executes everything in the caller thread, so if you are not going to parallelize your stream in the future, you can safely use shared state without additional synchronization or concurrent-safe collections. The current code is therefore safe. Note, however, that it just looks dirty.
If you rely on your forEach being executed sequentially, consider using forEachOrdered instead, even if the stream is sequential. Not only will that get you the explicit guarantee from the API that the code will be executed sequentially, it will also make the code more self-documenting and provide some measure of protection against somebody coming along and changing your stream to parallel.
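A sketch of that change against the question's pipeline (same hypothetical somePredicate as above):
list.stream()
    .filter(x -> somePredicate(x))
    .forEachOrdered(x -> {           // explicit sequential, in-order guarantee
        if (map.containsKey(x)) {
            throw new UserDefinedException("duplicates detected in input");
        }
        map.put(x, aStringFunction(x));
    });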
