Is the skip() method a short-circuiting operation? - java

I am reading about Java streams' short-circuiting operations and found in some articles that skip() is a short-circuiting operation.
In another article they didn't mention skip() as a short-circuiting operation.
Now I am confused; is skip() a short-circuiting operation or not?

From the Java docs, under the "Stream operations and pipelines" section:
An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result. A terminal operation is short-circuiting if, when presented with infinite input, it may terminate in finite time.
Emphasis mine.
If you were to call skip() on an infinite input, it would not produce a finite stream; hence it is not a short-circuiting operation.
The only short-circuiting intermediate operation in JDK 8 is limit(), as it allows computations on infinite streams to complete in finite time.
Example:
If you were to execute this program using skip():
String[] skip = Stream.generate(() -> "test") // returns an infinite stream
        .skip(20)
        .toArray(String[]::new);
it would not produce a finite stream, and you would eventually end up with something along the lines of "java.lang.OutOfMemoryError: Java heap space".
whereas if you were to execute this program with limit, the computation finishes in finite time:
String[] limit = Stream.generate(() -> "test") // returns an infinite stream
        .limit(20)
        .toArray(String[]::new);
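To make the difference concrete, you can count how many times the infinite supplier is actually invoked; limit() stops pulling from the source after 20 elements. A small sketch (the class and method names are just for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LimitDemo {
    // Counts how many times the infinite supplier is actually invoked.
    static int generatedCount() {
        AtomicInteger generated = new AtomicInteger();
        String[] limited = Stream.generate(() -> {
                    generated.incrementAndGet();
                    return "test";
                })
                .limit(20)                 // short-circuits the infinite source
                .toArray(String[]::new);
        return generated.get();
    }

    public static void main(String[] args) {
        // In this sequential pipeline the supplier runs exactly 20 times.
        System.out.println(generatedCount()); // 20
    }
}
```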

Just want to add my two cents here: this general idea of short-circuiting a stream is infinitely complicated (at least to me, in the sense that I usually have to scratch my head twice). I will get to skip at the end of the answer, by the way.
Let's take this for example:
Stream.generate(() -> Integer.MAX_VALUE);
This is an infinite stream; we can all agree on that. Let's short-circuit it via an operation that is documented as such (unlike skip):
Stream.generate(() -> Integer.MAX_VALUE).anyMatch(x -> true);
This works nicely, how about adding a filter:
Stream.generate(() -> Integer.MAX_VALUE)
        .filter(x -> x < 100) // well sort of useless...
        .anyMatch(x -> true);
What will happen here? Well, this never finishes, even though there is a short-circuiting operation like anyMatch at the end - it is never reached, so it can't actually short-circuit anything.
On the other hand, filter is not a short-circuiting operation, but you can make it act like one (just as an example):
someList.stream()
        .filter(x -> {
            if (x > 3) throw new AssertionError("Just because");
            return true;
        })
        .forEach(System.out::println);
Yes, it's ugly, but it's short-circuiting... That's how we (emphasis on we, since lots of people disagree) implement a short-circuiting reduce: throw an Exception that carries no stack trace.
In Java 9 another short-circuiting intermediate operation was added: takeWhile, which acts sort of like limit, but cuts the stream off at a certain condition.
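For example, takeWhile can turn an infinite stream into a finite one as soon as its predicate fails. A minimal sketch (Java 9+; the class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class TakeWhileDemo {
    // takeWhile cuts an infinite stream off at the first element
    // that fails the predicate.
    static int[] firstFive() {
        return IntStream.iterate(0, i -> i + 1) // infinite: 0, 1, 2, ...
                .takeWhile(i -> i < 5)          // short-circuits here
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstFive())); // [0, 1, 2, 3, 4]
    }
}
```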
And to be fair, the bulk of the answer about skip was already given by Aomine, but the simplest answer is that it is not documented as such. In general (barring the cases when documentation gets corrected), the documentation is the number one indication you should look at. See limit and takeWhile, for example, whose documentation clearly says:
This is a short-circuiting stateful intermediate operation

Related

Java-Stream - mapMulti() with Infinite Streams

I thought that all stream pipelines written using flatMap() could be converted to use mapMulti(). It looks like I was wrong for the case where flatMap() or mapMulti() returns/operates on an infinite stream.
Note: this is for educational purposes only.
When we map an element to an infinite stream inside a flatMap() followed by a limit(), the stream pipeline is lazy and evaluates only the required number of elements.
list.stream()
        .flatMap(element -> Stream.generate(() -> 1))
        .limit(3)
        .forEach(System.out::println);
Output:
1
1
1
But when doing the same in a mapMulti(), the pipeline is still lazy, i.e., it doesn't consume the entire infinite stream. Yet when running this in an IDE (IntelliJ), it hangs and doesn't terminate (I guess it waits to consume further elements) and never comes out of the stream pipeline execution.
With a mapMulti():
list.stream()
        .mapMulti((element, consumer) -> {
            Stream.generate(() -> 1)
                    .forEach(consumer);
        })
        .limit(3)
        .forEach(System.out::println);
System.out.println("Done"); // Never gets here
Output:
1
1
1
But the last print (Done) doesn't get executed.
Is this the expected behaviour?
I couldn't find any warning or points on infinite stream and mapMulti() in Javadoc.
The advantage of mapMulti() is that it consumes the new elements which become part of the stream, replacing the initial element (as opposed to flatMap(), which internally generates a new stream for each element). If you're running a fully-fledged stream with a terminal operation inside the mapMulti(), it will be executed. And you've created an infinite stream which can't terminate (as #Lino has pointed out in the comments).
On the contrary, flatMap() expects a function producing a stream, i.e. the function only returns the stream; it does not process it.
Here's a quote from the API note that emphasizes the difference between the two operations:
API Note:
This method is similar to flatMap in that it applies a one-to-many transformation to the elements of the stream and flattens the result elements into a new stream. This method is preferable to flatMap in the following circumstances:
When replacing each stream element with a small (possibly zero) number of elements. Using this method avoids the overhead of creating a new Stream instance for every group of result elements, as required by flatMap.
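In other words, mapMulti() works fine as long as each element pushes only a finite number of results to the consumer. A sketch along those lines (Java 16+; the expansion rule here is made up for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapMultiDemo {
    // Each element n is replaced by n copies of itself - a finite
    // expansion, so the pipeline terminates even with a limit() after it.
    static List<Integer> expand() {
        return List.of(1, 2, 3).stream()
                .<Integer>mapMulti((element, consumer) -> {
                    for (int i = 0; i < element; i++) {
                        consumer.accept(element); // finitely many calls per element
                    }
                })
                .limit(4) // elements pushed after the limit is hit are discarded
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(expand()); // [1, 2, 2, 3]
    }
}
```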

Why is the stream with a limit infinite?

I have run the following code in Eclipse:
Stream.generate(() -> "Elsa")
        .filter(n -> n.length() == 4)
        .sorted()
        .limit(2)
        .forEach(System.out::println);
The output is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
What I was expecting since the limit is two:
Elsa
Elsa
Can someone please explain why this is an infinite stream?
The first thing is that Stream::generate creates an infinite stream. That's why the stream is initially infinite.
You limit the stream to two elements by using Stream::limit, which would make it finite.
However, the problem is that you call sorted(), which tries to consume the whole stream. You need to limit the stream before you sort:
Stream.generate(() -> "Elsa")
        .filter(n -> n.length() == 4)
        .limit(2)
        .sorted()
        .forEach(System.out::println);
The documentation says that Stream::sorted() "is a stateful intermediate operation". The Streams documentation about a stateful intermediate operation explains it very well:
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream.
Emphasis mine.
There it is. Also note that for all Stream operations, their operation type is mentioned in the Javadocs.
Can someone please explain why this is an infinite stream?
Because the javadoc says that is precisely what Stream.generate() creates:
Returns an infinite sequential unordered stream where each element is generated by the provided Supplier
Then, when you combine that with sorted(), you tell it to sort an infinite sequence, which will obviously cause the JVM to run out of memory.

Make Stream Parallel on the Result of flatMap

Consider the following simple code:
Stream.of(1)
        .flatMap(x -> IntStream.range(0, 1024).boxed())
        .parallel() // Moving this before flatMap has the same effect because it's just a property of the entire stream
        .forEach(x -> {
            System.out.println("Thread: " + Thread.currentThread().getName());
        });
For a long time, I thought that Java would execute the elements in parallel even after flatMap. But the above code prints "Thread: main" for every element, which proves my thought wrong.
A simple way to make it parallel after flatMap would be to collect and then stream again:
Stream.of(1)
        .flatMap(x -> IntStream.range(0, 1024).boxed())
        .parallel() // Moving this before flatMap has the same effect because it's just a property of the entire stream
        .collect(Collectors.toList())
        .parallelStream()
        .forEach(x -> {
            System.out.println("Thread: " + Thread.currentThread().getName());
        });
I was wondering whether there is a better way, and about the design choice of flatMap that only parallelizes the stream before the call, but not after the call.
========= More Clarification about the Question ========
From some answers, it seems that my question is not fully conveyed. As #Andreas said, if I start with a Stream of 3 elements, there could be 3 threads running.
But my question really is: Java Stream uses a common ForkJoinPool that has a default size equal to one less than the number of cores, according to this post. Now suppose I have 64 cores, then I expect my above code would see many different threads after flatMap, but in fact, it sees only one (or 3 in Andreas' case). By the way, I did use isParallel to observe that the stream is parallel.
To be honest, I wasn't asking this question for pure academic interest. I ran into this problem in a project that presents a long chain of stream operations for transforming a dataset. The chain starts with a single file, and explodes to a lot of elements through flatMap. But apparently, in my experiment, it does NOT fully exploit my machine (which has 64 cores), but only uses one core (from observation of the cpu usage).
I was wondering [...] about the design choice of flatMap that only parallelizes the stream before the call, but not after the call.
You're mistaken. All steps both before and after the flatMap are run in parallel, but it only splits the original stream between threads. The flatMap operation is then handled by one such thread, and its stream isn't split.
Since your original stream only has 1 element, it cannot be split, and hence parallel has no effect.
Try changing to Stream.of(1, 2, 3), and you will see that the forEach, which is after the flatMap, is actually run in 3 different threads.
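A sketch of that experiment, with a counter to check that all elements still flow through (the class name and the thread-name set are illustrative additions; the exact number of threads used depends on the machine):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelDemo {
    // With 3 source elements the stream can be split, so the stages
    // after flatMap may run on several worker threads.
    static int processedCount() {
        Set<String> threads = ConcurrentHashMap.newKeySet();
        AtomicInteger processed = new AtomicInteger();
        Stream.of(1, 2, 3)
                .flatMap(x -> IntStream.range(0, 1024).boxed())
                .parallel()
                .forEach(x -> {
                    threads.add(Thread.currentThread().getName());
                    processed.incrementAndGet();
                });
        System.out.println("Distinct threads: " + threads.size()); // often > 1
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(processedCount()); // 3 * 1024 = 3072 elements in total
    }
}
```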
The documentation for forEach specifies:
For any given element, the action may be performed at whatever time and in whatever thread the library chooses.
In particular, "execute all the operations on the invoking thread" seems like a good broadly-safe implementation.
Note that your attempt to parallelize the stream does not require any specific parallelism, but you'd be much more likely to see an effect with this:
IntStream.range(0, 1024).boxed()
        .parallel()
        .map(i -> "Thread: " + Thread.currentThread().getName())
        .forEach(System.out::println);
For anyone like me who has a dire need to parallelize flatMap and wants a practical solution, not only history and theory - and for those who don't want to collect all the items in between before parallelizing them.
The simplest solution I came up with is to do the flattening by hand, basically by replacing flatMap with map + reduce(Stream::concat).
I've already answered the same question in another thread; see the details at https://stackoverflow.com/a/66386078/3606820
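A minimal sketch of that map + reduce(Stream::concat) idea (class and method names are illustrative; note that the Stream.concat Javadoc warns that deeply nested concatenation of very many sub-streams can get expensive):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ConcatFlatten {
    // Flatten "by hand": map each element to a sub-stream, fold the
    // sub-streams together with Stream.concat, then parallelize the result.
    static List<Integer> flatten() {
        return Stream.of(1, 2, 3)
                .map(x -> IntStream.range(0, 4).boxed()) // one sub-stream per element
                .reduce(Stream::concat)
                .orElseGet(Stream::empty)
                .parallel() // the flattened stream itself can now be split
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(flatten().size()); // 3 * 4 = 12 elements
    }
}
```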

Java 8 stream short-circuit

Reading up a bit on Java 8, I got to this blog post explaining a bit about streams and reduction of them, and when it would be possible to short-circuit the reduction. At the bottom it states:
Note in the case of findFirst or findAny we only need the first value which matches the predicate (although findAny is not guaranteed to return the first). However if the stream has no ordering then we’d expect findFirst to behave like findAny. The operations allMatch, noneMatch and anyMatch may not short-circuit the stream at all since it may take evaluating all the values to determine whether the operator is true or false. Thus an infinite stream using these may not terminate.
I get that findFirst or findAny may short-circuit the reduction, because as soon af you find an element, you don't need to process any further.
But why would this not be possible for allMatch, noneMatch and anyMatch? For allMatch, if you find one element which doesn't match the predicate, you can stop processing. Same for noneMatch. And anyMatch especially doesn't make sense to me, as it is pretty much equal to findAny (except for what is returned)?
Saying that these three may not short-circuit, because it may take evaluating all the values, could also be said for findFirst/Any.
Is there some fundamental difference I'm missing? Am I not really understanding what is going on?
There's a subtle difference, because anyMatch family uses a predicate, while findAny family does not. Technically findAny() looks like anyMatch(x -> true) and anyMatch(pred) looks like filter(pred).findAny(). So here we have another issue. Consider we have a simple infinite stream:
Stream<Integer> s = Stream.generate(() -> 1);
So it's true that applying findAny() to such stream will always short-circuit and finish while applying anyMatch(pred) depends on the predicate. However let's filter our infinite stream:
Stream<Integer> s = Stream.generate(() -> 1).filter(x -> x < 0);
Is the resulting stream infinite as well? That's a tricky question. It actually contains no elements, but to determine this (for example, using .iterator().hasNext()) we have to check an infinite number of underlying stream elements, so this operation will never finish. I would call such a stream infinite as well. However, using such a stream, both anyMatch and findAny will never finish:
Stream.generate(() -> 1).filter(x -> x < 0).anyMatch(x -> true);
Stream.generate(() -> 1).filter(x -> x < 0).findAny();
So findAny() is not guaranteed to finish either, it depends on the previous intermediate stream operations.
To conclude, I would rate that blog post as very misleading. In my opinion, infinite-stream behavior is better explained in the official JavaDoc.
Answer Updated
I'd say the blog post is wrong when it says "findFirst or findAny we only need the first value which matches the predicate".
In the javadoc for allMatch(Predicate), anyMatch(Predicate), noneMatch(Predicate), findAny(), and findFirst():
This is a short-circuiting terminal operation.
However, note that findFirst and findAny don't take a Predicate. So they can both return immediately upon seeing the first/any value. The other three are conditional and may loop forever if the condition never fires.
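You can observe the short-circuiting directly by counting how many elements flow through the pipeline before anyMatch fires. A small sequential-stream sketch (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class MatchDemo {
    // anyMatch stops pulling elements as soon as the predicate fires.
    static int elementsSeen() {
        AtomicInteger seen = new AtomicInteger();
        boolean found = IntStream.range(0, 1_000)
                .peek(i -> seen.incrementAndGet())
                .anyMatch(i -> i == 5);
        return seen.get(); // only 0..5 were evaluated, not all 1000
    }

    public static void main(String[] args) {
        System.out.println(elementsSeen()); // 6
    }
}
```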
According to Oracle's Stream Documentation:
https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#StreamOps
A terminal operation is short-circuiting if, when presented with infinite input, it may terminate in finite time. Having a short-circuiting operation in the pipeline is a necessary, but not sufficient, condition for the processing of an infinite stream to terminate normally in finite time.
All five functions have the line:
This is a short-circuiting terminal operation.
in their descriptions.
When the blog post says "may not short-circuit", it is merely pointing out that, depending on the values, the entire stream may end up being processed.
findFirst and findAny on the other hand, are guaranteed to short circuit since they never need to process the rest of the stream once they are satisfied.
anyMatch, noneMatch and allMatch return boolean values, so they may have to check all to prove the logic.
findFirst and findAny just care about finding the first they can and returning that.
Edit:
For a given dataset, the Match methods are guaranteed to always return the same value; the Find methods, however, are not, because the encounter order may vary and affect which value is returned.
The short-circuiting described is talking about the Find methods lacking that consistency for a given dataset.
LongStream.range(0, Long.MAX_VALUE).allMatch(x -> x >= 0)
LongStream.range(0, Long.MAX_VALUE).allMatch(x -> x > 0)
The first one runs practically forever, since every value must be checked; the second one returns false immediately, because the first element, 0, fails the predicate.

Java 8 stream peek and limit interaction

Why does this code in Java 8:
IntStream.range(0, 10)
        .peek(System.out::print)
        .limit(3)
        .count();
output:
012
I'd expect it to output 0123456789, because peek precedes limit.
It seems even more peculiar to me because this:
IntStream.range(0, 10)
        .peek(System.out::print)
        .map(x -> x * 2)
        .count();
outputs 0123456789 as expected (not 02481012141618).
P.S.: .count() here is used just to consume the stream; it can be replaced with anything else.
The most important thing to know about streams is that they do not contain elements themselves (like collections do) but work like a pipe whose values are lazily evaluated. That means the statements that build up a stream - including mapping, filtering, or whatever - are not evaluated until the terminal operation runs.
In your first example, the stream tries to count from 0 to 9, doing the following for one element at a time:
print the value
check whether 3 values have passed through (if yes, terminate)
So you really get the output 012.
In your second example, the stream again counts from 0 to 9, doing the following for one element at a time:
print the value
map x to x * 2, thus forwarding double the value to the next step
As you can see the output comes before the mapping and thus you get the result 0123456789. Try to switch the peek and the map calls. Then you will get your expected output.
From the docs:
limit() is a short-circuiting stateful intermediate operation.
map() is an intermediate operation
Again from the docs, what that essentially means is that limit() will return a stream with at most x values from the stream it received.
An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result.
Streams are defined to do lazy processing. So in order to complete your count() operation it doesn’t need to look at the other items. Otherwise, it would be broken, as limit(…) is defined to be a proper way of processing infinite streams in a finite time (by not processing more than limit items).
In principle, it would be possible to complete your request without ever looking at the int values at all, as the operation chain limit(3).count() doesn’t need any processing of the previous operations (other than verifying whether the stream has at least 3 items).
Streams use lazy evaluation; intermediate operations such as peek() are not executed until the terminal operation runs.
For instance, the following code will just print 1. In fact, as soon as the first element of the stream, 1, reaches the terminal operation, findAny(), the stream execution ends.
Arrays.asList(1, 2, 3)
        .stream()
        .peek(System.out::print)
        .filter(n -> n < 3)
        .findAny();
Conversely, in the following example, 123 will be printed. In fact, the terminal operation, noneMatch(), needs to evaluate all the elements of the stream in order to make sure there is no match with its predicate, n > 4:
Arrays.asList(1, 2, 3)
        .stream()
        .peek(System.out::print)
        .noneMatch(n -> n > 4);
For future readers struggling to understand how the count method can avoid executing the peek step before it, I thought I'd add this additional note:
As of Java 9, the Java documentation for the count method states:
An implementation may choose to not execute the stream pipeline (either sequentially or in parallel) if it is capable of computing the count directly from the stream source.
This means terminating the stream with count is no longer enough to ensure the execution of all previous steps, such as peek.
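A quick way to check this on your own JDK (a sketch; whether the peek counter stays at 0 depends on the Java version, so only the count itself is guaranteed):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class CountDemo {
    // The source size is known, so Java 9+ may skip the peek step entirely.
    static long countWithPeek(AtomicInteger peeked) {
        return IntStream.range(0, 10)
                .peek(i -> peeked.incrementAndGet()) // may never run on Java 9+
                .count();
    }

    public static void main(String[] args) {
        AtomicInteger peeked = new AtomicInteger();
        System.out.println(countWithPeek(peeked)); // 10
        System.out.println(peeked.get()); // 0 on Java 9+, 10 on Java 8
    }
}
```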
