Please, consider this code:
System.out.println("#1");
Stream.of(0, 1, 2, 3)
.peek(e -> System.out.println(e))
.sorted()
.findFirst();
System.out.println("\n#2");
IntStream.range(0, 4)
.peek(e -> System.out.println(e))
.sorted()
.findFirst();
The output will be:
#1
0
1
2
3
#2
0
Could anyone explain, why output of two streams are different?
Well, IntStream.range() returns a sequential ordered IntStream from startInclusive(inclusive) to endExclusive (exclusive) by an incremental step of 1, which means it's already sorted. Since it's already sorted, it makes sense that the following .sorted() intermediate operation does nothing. As a result, peek() is executed on just the first element (since the terminal operation only requires the first element).
On the other hand, the elements passed to Stream.of() are not necessarily sorted (and the of() method doesn't check if they are sorted). Therefore, .sorted() must traverse all the elements in order to produce a sorted stream, which allows the findFirst() terminal operation to return the first element of the sorted stream. As a result, peek is executed on all the elements, even though the terminal operation only needs the first element.
IntStream.range is already sorted:
// reports true
System.out.println(
IntStream.range(0, 4)
.spliterator()
.hasCharacteristics(Spliterator.SORTED)
);
So when sorted() method on the Stream is hit, internally, it will become a NO-OP.
Otherwise, as you already see in your first example, all the elements have to be sorted, only then findFirst can tell who is "really the first one".
Just notice that this optimization only works for naturally sorted streams. For example:
// prints too much you say?
Stream.of(new User(30), new User(25), new User(34))
.peek(x -> System.out.println("1 : before I call first sorted"))
.sorted(Comparator.comparing(User::age))
.peek(x -> System.out.println("2 : before I call second sorted"))
.sorted(Comparator.comparing(User::age))
.findFirst();
where (for brevity):
record User(int age) { }
Related
I have a question on the intermediate stages sequential state - are the operations from a stage applied to all the input stream (items) or are all the stages / operations applied to each stream item?
I'm aware the question might not be easy to understand, so I'll give an example. On the following stream processing:
List<String> strings = Arrays.asList("Are Java streams intermediate stages sequential?".split(" "));
strings.stream()
.filter(word -> word.length() > 4)
.peek(word -> System.out.println("f: " + word))
.map(word -> word.length())
.peek(length -> System.out.println("m: " + length))
.forEach(length -> System.out.println("-> " + length + "\n"));
My expectation for this code is that it will output:
f: streams
f: intermediate
f: stages
f: sequential?
m: 7
m: 12
m: 6
m: 11
-> 7
-> 12
-> 6
-> 11
Instead, the output is:
f: streams
m: 7
-> 7
f: intermediate
m: 12
-> 12
f: stages
m: 6
-> 6
f: sequential?
m: 11
-> 11
Are the items just displayed for all the stages, due to the console output? Or are they also processed for all the stages, one at a time?
I can further detail the question, if it's not clear enough.
This behaviour enables optimisation of the code. If each intermediate operation were to process all elements of a stream before proceeding to the next intermediate operation then there would be no chance of optimisation.
So to answer your question, each element moves along the stream pipeline vertically one at a time (except for some stateful operations discussed later), therefore enabling optimisation where possible.
Explanation
Given the example you've provided, each element will move along the stream pipeline vertically one by one as there is no stateful operation included.
Another example, say you were looking for the first String whose length is greater than 4, processing all the elements prior to providing the result is unnecessary and time-consuming.
Consider this simple illustration:
List<String> stringsList = Arrays.asList("1","12","123","1234","12345","123456","1234567");
int result = stringsList.stream()
.filter(s -> s.length() > 4)
.mapToInt(Integer::valueOf)
.findFirst().orElse(0);
The filter intermediate operation above will not find all the elements whose length is greater than 4 and return a new stream of them but rather what happens is as soon as we find the first element whose length is greater than 4, that element goes through to the .mapToInt which then findFirst says "I've found the first element" and execution stops there. Therefore the result will be 12345.
Behaviour of stateful and stateless intermediate operations
Note that when a stateful intermediate operation as such of sorted is included in a stream pipeline then that specific operation will traverse the entire stream. If you think about it, this makes complete sense as in order to sort elements you'll need to see all the elements to determine which elements come first in the sort order.
The distinct intermediate operation is also a stateful operation, however, as #Holger has mentioned unlike sorted, it does not require traversing the entire stream as each distinct element can get passed down the pipeline immediately and may fulfil a short-circuiting condition.
stateless intermediate operations such as filter , map etc do not have to traverse the entire stream and can freely process one element at a time vertically as mentioned above.
Lastly, but not least it's also important to note that, when the terminal operation is a short-circuiting operation the terminal-short-circuiting methods can finish before traversing all the elements of the underlying stream.
reading: Java 8 stream tutorial
Your answer is loop fusion. What we see is that the four
intermediate operations filter() – peek() – map() – peek() – println using forEach() which is a kinda terminal operation have been logically
joined together to constitute a single pass. They are executed in
order for each of the individual element. This joining
together of operations in a single pass is an optimization technique
known as loop fusion.
More for reading: Source
An intermediate operation is always lazily executed. That is to say
they are not run until the point a terminal operation is reached.
A few of the most popular intermediate operations used in a stream
filter – the filter operation returns a stream of elements that
satisfy the predicate passed in as a parameter to the operation. The
elements themselves before and after the filter will have the same
type, however the number of elements will likely change
map – the map operation returns a stream of elements after they have
been processed by the function passed in as a parameter. The
elements before and after the mapping may have a different type, but
there will be the same total number of elements.
distinct – the distinct operation is a special case of the filter
operation. Distinct returns a stream of elements such that each
element is unique in the stream, based on the equals method of the
elements
.java-8-streams-cheat-sheet
Apart from optimisation, the order of processing you'd describe wouldn't work for streams of indeterminate length, like this:
DoubleStream.generate(Math::random).filter(d -> d > 0.9).findFirst();
Admittedly this example doesn't make much sense in practice, but the point is that rather than backed by a fixed-size collection,DoubleStream.generate() creates a potentially infinite stream. The only way to process this is element by element.
I was practicing stream. Then got this strange behavior. I can't figure out the reason. That's why asking for help.
I wrote this:
IntStream.iterate(10, i -> i - 2)
.limit(5)
.sorted()
.takeWhile(i -> i > 2)
.forEach(System.out::println);
This is giving me nothing in the console, which is expected. When I am seeing the stream chain in IntelliJ IDEA, getting this:
So, after the execution of sorted() it is returning only one element (that element is 2 in the image) to the pipeline.
Now I commented the takeWhile() call. Like this:
IntStream.iterate(10, i -> i - 2)
.limit(5)
.sorted()
// .takeWhile(i -> i > 2)
.forEach(System.out::println);
This is also printing in the console as expected(from 10 to 2 with a difference of 2).
But problem is, this time sorted() returned 5 elements to pipeline. When I am seeing the stream chain in IntelliJ IDEA, getting this:
My question is, why the difference is seen in calling sorted()? If there is a takeWhile() after sorted() then it is returning one element. If I am commenting takeWhile(), sorted() is returning stream of 5 elements.
Is JVM doing anything here for achieving better performance?
Thanks in advance.....
Stream.sorted() is a stateful operation, so it is actually maintaining the [2, 4, 6, 8, 10] state. However, it will only send 1 item at a time.
So, Initially it will send 1 entry: '2' to takeWhile operator which can then reject the entry and stop the stream chain. That is why you are seeing sorted returning only 2 to takeWhile operator.
But, when sorted is not present, takeWhile will get 10 as the entry. And as the condition is true, stream will continue until takeWhile receives 2.
Edit:
When takeWhile is not present, you are getting all 5 items as output because you have 5 items in stream chain and there is no takeWhile to stop the stream chain.
In Intellij's Stream Trace, On the right side of sorted you are seeing how many elements are being passed to next operator after performing sorted() operation.
In the first case, only 1 element is being passed to takeWhile() as takeWhile() is terminating the stream chain after receiving '2'. But in the second case all the elements are being passed to forEach.
Given a List of numbers: { 4, 5, 7, 3, 5, 4, 2, 4 }
The desired output would be: { 7, 3, 2 }
The solution I am thinking of is create below HashMap from the given List:
Map<Integer, Integer> numbersCountMap = new HashMap();
where key is the value of from the list and value is the occurrences count.
Then loop through the HashMap entry set and where ever the number contains count greater than one remove that number from the list.
for (Map.Entry<Int, Int> numberCountEntry : numbersCountMap.entrySet()) {
if(numberCountEntry.getValue() > 1) {
testList.remove(numberCountEntry.getKey());
}
}
I am not sure whether this is an efficient solution to this problem, as the remove(Integer) operation on a list can be expensive. Also I am creating additional Map data structure. And looping twice, once on the original list to create the Map and then on the map to remove duplicates.
Could you please suggest a better way. May be Java 8 has better way of implementing this.
Also can we do it in few lines using Streams and other new structures in Java 8?
By streams you can use:
Map<Integer, Long> grouping = integers.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
grouping.values().removeIf(c -> c > 1);
Set<Integer> result = grouping.keySet();
Or as #Holger mentioned, all you want to know, is whether there is more than one integer in your list, so just do:
Map<Integer, Boolean> grouping = integers.stream()
.collect(Collectors.toMap(Function.identity(),
x -> false, (a, b) -> true,
HashMap::new));
grouping.values().removeIf(b -> b);
// or
grouping.values().removeAll(Collections.singleton(true));
Set<Integer> result = grouping.keySet();
While YCF_L's answer does the thing and yields the correct result, I don't think it's a good solution to go with, since it mixes functional and procedural approaches by mutating the intermediary collection.
A functional approach would assume either of the following solutions:
Using intermediary variable:
Map<Integer, Boolean> map =
integers.stream()
.collect(toMap(identity(), x -> true, (a, b) -> false));
List<Integer> result = map.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList());
Note that we don't even care about the mutability of the map variable. Thus we can omit the 4th parameter of toMap collector.
Chaining two pipelines (similar to Alex Rudenko's answer):
List<Integer> result =
integers.stream()
.collect(toMap(identity(), x -> true, (a, b) -> false))
.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList());
This code is still safe, but less readable. Chaining two or more pipelines is discouraged.
Pure functional approach:
List<Integer> result =
integers.stream()
.collect(collectingAndThen(
toMap(identity(), x -> true, (a, b) -> false),
map -> map.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList())));
The intermediary state (the grouped map) does not get exposed to the outside world. So we may be sure nobody will modify it while we're processing the result.
It's overengineered for just this problem. Also, your code is faulty:
It's Integer, not Int (minor niggle)
More importantly, a remove call removes the first matching element, and to make matters considerably worse, remove on lists is overloaded: There's remove(int) which removes an element by index, and remove(Object) which removes an element by looking it up. In a List<Integer>, it is very difficult to know which one you're calling. You want the 'remove by lookup' one.
On complexity:
On modern CPUs, it's not that simple. The CPU works on 'pages' of memory, and because fetching a new page takes on the order of 500 cycles or more, it makes more sense to simplify matters and consider any operation that does NOT require a new page of memory to be loaded, to be instantaneous.
That means that if we're talking about a list of, say, 10,000 numbers or fewer? None of it matters. It'll fly by. Any debate about 'efficiency' is meaningless until we get to larger counts.
Assuming that 'efficiency' is still relevant:
integers don't have hashcode collisions.
hashmaps with few to no key hash collisions are effectively O(1) on all single element ops such as 'add' and 'get'.
arraylist's .remove(Object) method is O(n). It takes longer the larger the list is. In fact, it is doubly O(n): it takes O(n) time to find the element you want to remove, and then O(n) time again to actually remove it. For fundamental informatics twice O(n) is still O(n) but pragmatically speaking, arrayList.remove(item) is pricey.
You're calling .remove about O(n) times, making the total complexity O(n^2). Not great, and not the most efficient algorithm. Practically or fundamentally.
An efficient strategy is probably to just commit to copying. A copy operation is O(n). For the whole thing, instead of O(n) per item. Sorting is O(n log n). This gives us a trivial algorithm:
Sort the input. Note that you can do this with an int[] too; until java 16 is out and you can use primitives in collections, int[] is an order of magnitude more efficient than a List<Integer>.
loop through the sorted input. Don't immediately copy, but use an intermediate: For the 0th item in the list, remember only 'the last item was FOO' and 'how many times did I see foo?'. Then, for any item, check if it is the same as the previous. If yes, increment count. If not, check the count: if it was 1, add it to the output, otherwise don't. In any case, update the 'last seen value' to the new value and set the count to 1. At the end, make sure to add the last remembered value if the count is 1, and make sure your code works even for empty lists.
That's O(n log n) + O(n) complexity, which is O(n log n) - a lot better than your O(n^2) take.
Use int[], and add another step that you first go through juuust to count how large the output would be (because arrays cannot grow/shrink), and now you have a time complexity of O(n log n) + 2*O(n) which is still O(n log n), and the lowest possible memory complexity, as sort is in-place and doesn't cost any extra.
If you really want to tweak it, you can go with a space complexity of 0 (you can write the reduced list inside the input).
One problem with this strategy is that you mess up the ordering in the input. The algorithm would produce 2, 3, 7. If you want to preserve order, you can combine the hashmap solution with the sort, and make a copy as you loop solution.
You may count the frequency of each number into LinkedHashMap to keep insertion order if it's relevant, then filter out the single numbers from the entrySet() and keep the keys.
List<Integer> data = Arrays.asList(4, 5, 7, 3, 5, 4, 2, 4);
List<Integer> singles = data.stream()
.collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()))
.entrySet().stream()
.filter(e -> e.getValue() == 1)
.map(Map.Entry::getKey)
.collect(Collectors.toList());
System.out.println(singles);
Printed output:
[7, 3, 2]
You can use 3-argument reduce method and walk through the stream only once, maintaining two sets of selected and rejected values.
final var nums = Stream.of(4, 5, 7, 3, 5, 4, 2, 4);
final var init = new Tuple<Set<Integer>>(new LinkedHashSet<Integer>(), new LinkedHashSet<Integer>());
final var comb = (BinaryOperator<Tuple<Set<Integer>>>) (a, b) -> a;
final var accum = (BiFunction<Tuple<Set<Integer>>, Integer, Tuple<Set<Integer>>>) (t, elem) -> {
if (t.fst().contains(elem)) {
t.snd().add(elem);
t.fst().remove(elem);
} else if (!t.snd().contains(elem)) {
t.fst().add(elem);
}
return t;
};
Assertions.assertEquals(nums.reduce(init, accum, comb).fst(), Set.of(7, 3, 2));
In this example, Tuple were defined as record Tuple<T> (T fst, T snd) { }
Decided against the sublist method due to poor performance on large data sets. The following alternative is faster, and holds its own against stream solutions. Probably because Set access to an element is in constant time. The downside is that it requires extra data structures. Give an ArrayList list of elements, this seems to work quite well.
Set<Integer> dups = new HashSet<>(list.size());
Set<Integer> result = new HashSet<>(list.size());
for (int i : list) {
if (dups.add(i)) {
result.add(i);
continue;
}
result.remove(i);
}
I have a question on the intermediate stages sequential state - are the operations from a stage applied to all the input stream (items) or are all the stages / operations applied to each stream item?
I'm aware the question might not be easy to understand, so I'll give an example. On the following stream processing:
List<String> strings = Arrays.asList("Are Java streams intermediate stages sequential?".split(" "));
strings.stream()
.filter(word -> word.length() > 4)
.peek(word -> System.out.println("f: " + word))
.map(word -> word.length())
.peek(length -> System.out.println("m: " + length))
.forEach(length -> System.out.println("-> " + length + "\n"));
My expectation for this code is that it will output:
f: streams
f: intermediate
f: stages
f: sequential?
m: 7
m: 12
m: 6
m: 11
-> 7
-> 12
-> 6
-> 11
Instead, the output is:
f: streams
m: 7
-> 7
f: intermediate
m: 12
-> 12
f: stages
m: 6
-> 6
f: sequential?
m: 11
-> 11
Are the items just displayed for all the stages, due to the console output? Or are they also processed for all the stages, one at a time?
I can further detail the question, if it's not clear enough.
This behaviour enables optimisation of the code. If each intermediate operation were to process all elements of a stream before proceeding to the next intermediate operation then there would be no chance of optimisation.
So to answer your question, each element moves along the stream pipeline vertically one at a time (except for some stateful operations discussed later), therefore enabling optimisation where possible.
Explanation
Given the example you've provided, each element will move along the stream pipeline vertically one by one as there is no stateful operation included.
Another example, say you were looking for the first String whose length is greater than 4, processing all the elements prior to providing the result is unnecessary and time-consuming.
Consider this simple illustration:
List<String> stringsList = Arrays.asList("1","12","123","1234","12345","123456","1234567");
int result = stringsList.stream()
.filter(s -> s.length() > 4)
.mapToInt(Integer::valueOf)
.findFirst().orElse(0);
The filter intermediate operation above will not find all the elements whose length is greater than 4 and return a new stream of them but rather what happens is as soon as we find the first element whose length is greater than 4, that element goes through to the .mapToInt which then findFirst says "I've found the first element" and execution stops there. Therefore the result will be 12345.
Behaviour of stateful and stateless intermediate operations
Note that when a stateful intermediate operation as such of sorted is included in a stream pipeline then that specific operation will traverse the entire stream. If you think about it, this makes complete sense as in order to sort elements you'll need to see all the elements to determine which elements come first in the sort order.
The distinct intermediate operation is also a stateful operation, however, as #Holger has mentioned unlike sorted, it does not require traversing the entire stream as each distinct element can get passed down the pipeline immediately and may fulfil a short-circuiting condition.
stateless intermediate operations such as filter , map etc do not have to traverse the entire stream and can freely process one element at a time vertically as mentioned above.
Lastly, but not least it's also important to note that, when the terminal operation is a short-circuiting operation the terminal-short-circuiting methods can finish before traversing all the elements of the underlying stream.
reading: Java 8 stream tutorial
Your answer is loop fusion. What we see is that the four
intermediate operations filter() – peek() – map() – peek() – println using forEach() which is a kinda terminal operation have been logically
joined together to constitute a single pass. They are executed in
order for each of the individual element. This joining
together of operations in a single pass is an optimization technique
known as loop fusion.
More for reading: Source
An intermediate operation is always lazily executed. That is to say
they are not run until the point a terminal operation is reached.
A few of the most popular intermediate operations used in a stream
filter – the filter operation returns a stream of elements that
satisfy the predicate passed in as a parameter to the operation. The
elements themselves before and after the filter will have the same
type, however the number of elements will likely change
map – the map operation returns a stream of elements after they have
been processed by the function passed in as a parameter. The
elements before and after the mapping may have a different type, but
there will be the same total number of elements.
distinct – the distinct operation is a special case of the filter
operation. Distinct returns a stream of elements such that each
element is unique in the stream, based on the equals method of the
elements
.java-8-streams-cheat-sheet
Apart from optimisation, the order of processing you'd describe wouldn't work for streams of indeterminate length, like this:
DoubleStream.generate(Math::random).filter(d -> d > 0.9).findFirst();
Admittedly this example doesn't make much sense in practice, but the point is that rather than backed by a fixed-size collection,DoubleStream.generate() creates a potentially infinite stream. The only way to process this is element by element.
I have a question on the intermediate stages sequential state - are the operations from a stage applied to all the input stream (items) or are all the stages / operations applied to each stream item?
I'm aware the question might not be easy to understand, so I'll give an example. On the following stream processing:
List<String> strings = Arrays.asList("Are Java streams intermediate stages sequential?".split(" "));
strings.stream()
.filter(word -> word.length() > 4)
.peek(word -> System.out.println("f: " + word))
.map(word -> word.length())
.peek(length -> System.out.println("m: " + length))
.forEach(length -> System.out.println("-> " + length + "\n"));
My expectation for this code is that it will output:
f: streams
f: intermediate
f: stages
f: sequential?
m: 7
m: 12
m: 6
m: 11
-> 7
-> 12
-> 6
-> 11
Instead, the output is:
f: streams
m: 7
-> 7
f: intermediate
m: 12
-> 12
f: stages
m: 6
-> 6
f: sequential?
m: 11
-> 11
Are the items just displayed for all the stages, due to the console output? Or are they also processed for all the stages, one at a time?
I can further detail the question, if it's not clear enough.
This behaviour enables optimisation of the code. If each intermediate operation were to process all elements of a stream before proceeding to the next intermediate operation then there would be no chance of optimisation.
So to answer your question, each element moves along the stream pipeline vertically one at a time (except for some stateful operations discussed later), therefore enabling optimisation where possible.
Explanation
Given the example you've provided, each element will move along the stream pipeline vertically one by one as there is no stateful operation included.
Another example, say you were looking for the first String whose length is greater than 4, processing all the elements prior to providing the result is unnecessary and time-consuming.
Consider this simple illustration:
List<String> stringsList = Arrays.asList("1","12","123","1234","12345","123456","1234567");
int result = stringsList.stream()
.filter(s -> s.length() > 4)
.mapToInt(Integer::valueOf)
.findFirst().orElse(0);
The filter intermediate operation above will not find all the elements whose length is greater than 4 and return a new stream of them but rather what happens is as soon as we find the first element whose length is greater than 4, that element goes through to the .mapToInt which then findFirst says "I've found the first element" and execution stops there. Therefore the result will be 12345.
Behaviour of stateful and stateless intermediate operations
Note that when a stateful intermediate operation as such of sorted is included in a stream pipeline then that specific operation will traverse the entire stream. If you think about it, this makes complete sense as in order to sort elements you'll need to see all the elements to determine which elements come first in the sort order.
The distinct intermediate operation is also a stateful operation, however, as #Holger has mentioned unlike sorted, it does not require traversing the entire stream as each distinct element can get passed down the pipeline immediately and may fulfil a short-circuiting condition.
stateless intermediate operations such as filter , map etc do not have to traverse the entire stream and can freely process one element at a time vertically as mentioned above.
Lastly, but not least it's also important to note that, when the terminal operation is a short-circuiting operation the terminal-short-circuiting methods can finish before traversing all the elements of the underlying stream.
reading: Java 8 stream tutorial
Your answer is loop fusion. What we see is that the four
intermediate operations filter() – peek() – map() – peek() – println using forEach() which is a kinda terminal operation have been logically
joined together to constitute a single pass. They are executed in
order for each of the individual element. This joining
together of operations in a single pass is an optimization technique
known as loop fusion.
More for reading: Source
An intermediate operation is always lazily executed. That is to say
they are not run until the point a terminal operation is reached.
A few of the most popular intermediate operations used in a stream
filter – the filter operation returns a stream of elements that
satisfy the predicate passed in as a parameter to the operation. The
elements themselves before and after the filter will have the same
type, however the number of elements will likely change
map – the map operation returns a stream of elements after they have
been processed by the function passed in as a parameter. The
elements before and after the mapping may have a different type, but
there will be the same total number of elements.
distinct – the distinct operation is a special case of the filter
operation. Distinct returns a stream of elements such that each
element is unique in the stream, based on the equals method of the
elements
.java-8-streams-cheat-sheet
Apart from optimisation, the order of processing you'd describe wouldn't work for streams of indeterminate length, like this:
DoubleStream.generate(Math::random).filter(d -> d > 0.9).findFirst();
Admittedly this example doesn't make much sense in practice, but the point is that rather than backed by a fixed-size collection,DoubleStream.generate() creates a potentially infinite stream. The only way to process this is element by element.