Combiner not executed in the Java Stream Reducer [duplicate] - java

I wrote this code to reduce a list of words to a long count of how many words start with an 'A'. I'm just writing it to learn Java 8, so I'd like to understand it a little better [Disclaimer: I realize this is probably not the best way to write this code; it's just for practice!].
Long countOfAWords = results.stream().reduce(
0L,
(a, b) -> b.charAt(0) == 'A' ? a + 1 : a,
Long::sum);
The middle parameter/lambda (called the accumulator) would seem to be capable of reducing the full list without the final 'Combiner' parameter. In fact, the Javadoc actually says:
The {@code accumulator} function acts as a fused mapper and accumulator, which can sometimes be more efficient than separate mapping and reduction, such as when knowing the previously reduced value allows you to avoid some computation.
[Edit From Author] - The following statement is wrong, so don't let it confuse you; I'm just keeping it here so I don't ruin the original context of the answers.
Anyway, I can infer that the accumulator must just be outputting 1's and 0's which the combiner combines. I didn't find this particularly obvious from the documentation though.
My Question
Is there a way to see what the output would be before the combiner executes so I can see the list of 1's and 0's that the combiner combines? This would be helpful in debugging more complex situations which I'm sure I'll come across eventually.

The combiner does not reduce a list of 0's and 1's. When the stream is not run in parallel, the combiner is not used at all, so the reduction is equivalent to the following loop:
U result = identity;
for (T element : this stream)
result = accumulator.apply(result, element);
return result;
When you run the stream in parallel, the task is split across multiple threads. For example, the data in the pipeline is partitioned into chunks that are evaluated independently, each producing its own result. The combiner is then used to merge these results.
So you won't see a list being reduced, but rather pairs of values, each either the identity value or a partial result computed by a task, that are summed. For example, if you add a print statement in the combiner
(i1, i2) -> {System.out.println("Merging: "+i1+"-"+i2); return i1+i2;});
you could see something like this:
Merging: 0-0
Merging: 0-0
Merging: 1-0
Merging: 1-0
Merging: 1-1
This would be helpful in debugging more complex situations which I'm sure I'll come across eventually.
More generally if you want to see the data on the pipeline on the go you can use peek (or the debugger could also help). So applied to your example:
long countOfAWords = results.stream().map(s -> s.charAt(0) == 'A' ? 1 : 0).peek(System.out::print).mapToLong(l -> l).sum();
which can output:
100100
[Disclaimer: I realize this is probably not the best way to write this
code; it's just for practice!].
The idiomatic way to achieve your task would be to filter the stream and then simply use count:
long countOfAWords = results.stream().filter(s -> s.charAt(0) == 'A').count();
Hope it helps! :)

One way to see what's going on is to replace the method reference Long::sum by a lambda that includes a println.
List<String> results = Arrays.asList("A", "B", "A", "A", "C", "A", "A");
Long countOfAWords = results.stream().reduce(
0L,
(a, b) -> b.charAt(0) == 'A' ? a + 1 : a,
(a, b) -> {
System.out.println(a + " " + b);
return Long.sum(a, b);
});
In this case, we can see that the combiner is not actually used. This is because the stream is not parallel. All we are really doing is using the accumulator to successively combine each String with the current Long result; no two Long values are ever combined.
If you replace stream by parallelStream you can see that the combiner is used and look at the values it combines.
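The difference between the two modes can be made visible by counting combiner invocations instead of printing them. The following is a minimal sketch; the class name CombinerDemo and the helper countAWords are made up for illustration, and the number of combiner calls in the parallel case depends on how the runtime splits the work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class CombinerDemo {
    // Counts words starting with 'A' and records how often the combiner ran.
    // (Hypothetical helper, not part of the original question.)
    static long countAWords(List<String> words, boolean parallel, AtomicInteger combinerCalls) {
        return (parallel ? words.parallelStream() : words.stream()).reduce(
                0L,
                (count, word) -> word.charAt(0) == 'A' ? count + 1 : count,
                (left, right) -> {
                    combinerCalls.incrementAndGet(); // only reached when partial results are merged
                    return left + right;
                });
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("A", "B", "A", "A", "C", "A", "A");
        AtomicInteger sequentialCalls = new AtomicInteger();
        AtomicInteger parallelCalls = new AtomicInteger();
        long sequential = countAWords(results, false, sequentialCalls);
        long parallel = countAWords(results, true, parallelCalls);
        System.out.println("sequential: " + sequential + ", combiner calls: " + sequentialCalls);
        System.out.println("parallel:   " + parallel + ", combiner calls: " + parallelCalls);
    }
}
```

Both runs return the same count; only the parallel run is expected to report a non-zero number of combiner calls.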

Related

Remove all instances of an item in a list if it appears more than once

Given a List of numbers: { 4, 5, 7, 3, 5, 4, 2, 4 }
The desired output would be: { 7, 3, 2 }
The solution I am thinking of is create below HashMap from the given List:
Map<Integer, Integer> numbersCountMap = new HashMap<>();
where the key is a value from the list and the value is its occurrence count.
Then loop through the HashMap entry set and, wherever a number has a count greater than one, remove that number from the list.
for (Map.Entry<Int, Int> numberCountEntry : numbersCountMap.entrySet()) {
if(numberCountEntry.getValue() > 1) {
testList.remove(numberCountEntry.getKey());
}
}
I am not sure whether this is an efficient solution, as the remove(Integer) operation on a list can be expensive. I am also creating an additional Map data structure and looping twice: once over the original list to build the Map, and then over the map to remove duplicates.
Could you please suggest a better way? Maybe Java 8 has a better way of implementing this.
Also, can we do it in a few lines using Streams and other new structures in Java 8?
With streams you can use:
Map<Integer, Long> grouping = integers.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
grouping.values().removeIf(c -> c > 1);
Set<Integer> result = grouping.keySet();
Or, as @Holger mentioned, all you need to know is whether an integer occurs more than once in your list, so just do:
Map<Integer, Boolean> grouping = integers.stream()
.collect(Collectors.toMap(Function.identity(),
x -> false, (a, b) -> true,
HashMap::new));
grouping.values().removeIf(b -> b);
// or
grouping.values().removeAll(Collections.singleton(true));
Set<Integer> result = grouping.keySet();
While YCF_L's answer does the thing and yields the correct result, I don't think it's a good solution to go with, since it mixes functional and procedural approaches by mutating the intermediary collection.
A functional approach would assume either of the following solutions:
Using intermediary variable:
Map<Integer, Boolean> map =
integers.stream()
.collect(toMap(identity(), x -> true, (a, b) -> false));
List<Integer> result = map.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList());
Note that we don't even care about the mutability of the map variable. Thus we can omit the 4th parameter of toMap collector.
Chaining two pipelines (similar to Alex Rudenko's answer):
List<Integer> result =
integers.stream()
.collect(toMap(identity(), x -> true, (a, b) -> false))
.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList());
This code is still safe, but less readable. Chaining two or more pipelines is discouraged.
Pure functional approach:
List<Integer> result =
integers.stream()
.collect(collectingAndThen(
toMap(identity(), x -> true, (a, b) -> false),
map -> map.entrySet()
.stream()
.filter(Entry::getValue)
.map(Entry::getKey)
.collect(toList())));
The intermediary state (the grouped map) does not get exposed to the outside world. So we may be sure nobody will modify it while we're processing the result.
It's overengineered for just this problem. Also, your code is faulty:
It's Integer, not Int (minor niggle)
More importantly, a remove call removes the first matching element, and to make matters considerably worse, remove on lists is overloaded: There's remove(int) which removes an element by index, and remove(Object) which removes an element by looking it up. In a List<Integer>, it is very difficult to know which one you're calling. You want the 'remove by lookup' one.
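The two remove overloads can be told apart with a small experiment. This is only an illustrative sketch; the class and method names are invented here, and each helper works on a copy so the source list is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveOverloadDemo {
    // An int argument selects remove(int): the element at that index is removed.
    static List<Integer> removeByIndex(List<Integer> source, int index) {
        List<Integer> copy = new ArrayList<>(source);
        copy.remove(index);
        return copy;
    }

    // An Integer argument selects remove(Object): the first matching element is removed.
    static List<Integer> removeByValue(List<Integer> source, int value) {
        List<Integer> copy = new ArrayList<>(source);
        copy.remove(Integer.valueOf(value));
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(4, 5, 7, 3, 5, 4, 2, 4);
        System.out.println(removeByIndex(numbers, 4)); // drops the element at index 4
        System.out.println(removeByValue(numbers, 4)); // drops only the first 4
    }
}
```

Note that even the lookup variant removes a single occurrence, so the later 4s stay in the list.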
On complexity:
On modern CPUs, it's not that simple. The CPU works on 'pages' of memory, and because fetching a new page takes on the order of 500 cycles or more, it makes more sense to simplify matters and consider any operation that does NOT require a new page of memory to be loaded, to be instantaneous.
That means that if we're talking about a list of, say, 10,000 numbers or fewer? None of it matters. It'll fly by. Any debate about 'efficiency' is meaningless until we get to larger counts.
Assuming that 'efficiency' is still relevant:
integers don't have hashcode collisions.
hashmaps with few to no key hash collisions are effectively O(1) on all single element ops such as 'add' and 'get'.
arraylist's .remove(Object) method is O(n). It takes longer the larger the list is. In fact, it is doubly O(n): it takes O(n) time to find the element you want to remove, and then O(n) time again to actually remove it. For fundamental informatics twice O(n) is still O(n) but pragmatically speaking, arrayList.remove(item) is pricey.
You're calling .remove about O(n) times, making the total complexity O(n^2). Not great, and not the most efficient algorithm. Practically or fundamentally.
An efficient strategy is probably to just commit to copying. A copy operation is O(n). For the whole thing, instead of O(n) per item. Sorting is O(n log n). This gives us a trivial algorithm:
Sort the input. Note that you can do this with an int[] too; until primitives can be used in collections (a Project Valhalla goal that did not ship with Java 16), int[] is an order of magnitude more efficient than a List<Integer>.
loop through the sorted input. Don't immediately copy, but use an intermediate: For the 0th item in the list, remember only 'the last item was FOO' and 'how many times did I see foo?'. Then, for any item, check if it is the same as the previous. If yes, increment count. If not, check the count: if it was 1, add it to the output, otherwise don't. In any case, update the 'last seen value' to the new value and set the count to 1. At the end, make sure to add the last remembered value if the count is 1, and make sure your code works even for empty lists.
That's O(n log n) + O(n) complexity, which is O(n log n) - a lot better than your O(n^2) take.
Use int[], and add another step that you first go through juuust to count how large the output would be (because arrays cannot grow/shrink), and now you have a time complexity of O(n log n) + 2*O(n) which is still O(n log n), and the lowest possible memory complexity, as sort is in-place and doesn't cost any extra.
If you really want to tweak it, you can get by with O(1) extra space (you can write the reduced list inside the input).
One problem with this strategy is that you mess up the ordering of the input. The algorithm would produce 2, 3, 7. If you want to preserve order, you can combine the hashmap solution with the 'sort, then copy as you loop' solution.
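The sort-then-scan strategy described above could be sketched roughly as follows. This is one possible reading, not the answerer's own code: the class and method names are invented, it tracks runs of equal values with two indexes rather than a 'last seen value' and a counter, and it sorts a copy so the caller's array keeps its order.

```java
import java.util.Arrays;

class SortAndScan {
    // Returns the values that occur exactly once, in ascending order.
    static int[] uniqueValues(int[] input) {
        int[] sorted = input.clone(); // copy, so the input order is not disturbed
        Arrays.sort(sorted);          // O(n log n)
        int[] out = new int[sorted.length];
        int outSize = 0;
        int i = 0;
        while (i < sorted.length) {   // single O(n) pass over runs of equal values
            int j = i + 1;
            while (j < sorted.length && sorted[j] == sorted[i]) {
                j++;                  // extend the current run
            }
            if (j - i == 1) {         // run of length 1: the value is unique
                out[outSize++] = sorted[i];
            }
            i = j;                    // jump to the start of the next run
        }
        return Arrays.copyOf(out, outSize);
    }

    public static void main(String[] args) {
        int[] numbers = { 4, 5, 7, 3, 5, 4, 2, 4 };
        System.out.println(Arrays.toString(uniqueValues(numbers))); // [2, 3, 7]
    }
}
```

An empty input simply skips the loop and yields an empty array.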
You may count the frequency of each number into LinkedHashMap to keep insertion order if it's relevant, then filter out the single numbers from the entrySet() and keep the keys.
List<Integer> data = Arrays.asList(4, 5, 7, 3, 5, 4, 2, 4);
List<Integer> singles = data.stream()
.collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()))
.entrySet().stream()
.filter(e -> e.getValue() == 1)
.map(Map.Entry::getKey)
.collect(Collectors.toList());
System.out.println(singles);
Printed output:
[7, 3, 2]
You can use 3-argument reduce method and walk through the stream only once, maintaining two sets of selected and rejected values.
final var nums = Stream.of(4, 5, 7, 3, 5, 4, 2, 4);
final var init = new Tuple<Set<Integer>>(new LinkedHashSet<Integer>(), new LinkedHashSet<Integer>());
final var comb = (BinaryOperator<Tuple<Set<Integer>>>) (a, b) -> a;
final var accum = (BiFunction<Tuple<Set<Integer>>, Integer, Tuple<Set<Integer>>>) (t, elem) -> {
if (t.fst().contains(elem)) {
t.snd().add(elem);
t.fst().remove(elem);
} else if (!t.snd().contains(elem)) {
t.fst().add(elem);
}
return t;
};
Assertions.assertEquals(nums.reduce(init, accum, comb).fst(), Set.of(7, 3, 2));
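In this example, Tuple was defined as record Tuple<T> (T fst, T snd) { }. Note that the accumulator mutates the identity value and the combiner simply discards its second argument, so this only works on a sequential stream.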
I decided against the sublist method due to poor performance on large data sets. The following alternative is faster and holds its own against the stream solutions, probably because Set access to an element takes constant time. The downside is that it requires extra data structures. Given an ArrayList list of elements, this seems to work quite well.
Set<Integer> dups = new HashSet<>(list.size());
Set<Integer> result = new HashSet<>(list.size());
for (int i : list) {
if (dups.add(i)) {
result.add(i);
continue;
}
result.remove(i);
}

How do elements go through the stream?

How do elements of a stream go through the stream itself? Does it take one element and pass it through all functions (map, then sort, then collect), and then take the second element and repeat the cycle? Or does it take all elements, map them, then sort, and finally collect?
new ArrayList<Integer>().stream()
.map(x -> x.byteValue())
.sorted()
.collect(Collectors.toList());
It depends entirely on the stream. It is usually evaluated lazily, which means it takes it one element at a time, but under certain conditions it needs to get all the elements before it continues to the next step. For example, consider the following code:
IntStream.generate(() -> (int) (Math.random() * 100))
.limit(20)
.filter(i -> i % 2 == 0)
.sorted()
.forEach(System.out::println);
This stream generates random numbers from 0 to 99, limits them to 20 elements, and then keeps only the even ones. Up to this point, elements are processed one at a time. That changes when you request sorting. The sorted() method sorts the stream by the natural ordering of the elements, or by a provided comparator. To sort, you need access to all elements, because you don't know the last element's value until you get it, and it could end up first after sorting. So sorted() waits for the entire upstream, sorts it, and passes the sorted elements on. After that, the code prints the sorted elements one at a time.
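The barrier effect of sorted() can be observed by logging elements on both sides of it. The sketch below uses fixed values instead of random ones so the order is reproducible; the class name and the log format are made up for illustration, and it assumes a sequential stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class SortedBarrierDemo {
    // Records when each element passes the stages before and after sorted().
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Stream.of(3, 1, 2)
              .peek(x -> log.add("before " + x)) // runs as elements flow into sorted()
              .sorted()                          // buffers everything, then emits in order
              .peek(x -> log.add("after " + x))
              .forEach(x -> { });
        return log;
    }

    public static void main(String[] args) {
        // [before 3, before 1, before 2, after 1, after 2, after 3]
        System.out.println(trace());
    }
}
```

All "before" entries appear ahead of the first "after" entry, which shows that nothing leaves sorted() until it has seen every element.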
That depends on the actual Stream implementation. This mostly applies to parallel streams, because spliterators tend to split the data into chunks, and you don't know which element will be processed when.
In general, a stream goes through each element in order (but doesn't have to). The simplest way to check this behaviour is to put in some breakpoints and see when they actually hit.
Also, certain operations may wait until all prior operations are executed (namely collect()).
I advise checking the Javadoc and reading it carefully, because it gives enough hints to know what to expect.
Something like this, yes.
If you have a stream of integers, say 1, 2, 3, 4, 5, and you do some operations on it, say stream().map(x -> x*3).filter(x -> x%2==0).findFirst(),
it will take the first value (1), multiply it by 3, and then check whether the result is even.
Because it's not, it will take the second one (2), multiply it by 3 (= 6), check whether it is even (it is), and pass it to findFirst.
That is the first match, so the stream stops and returns.
This means the other integers from the stream won't be evaluated (multiplied and checked for evenness), as that is not necessary.
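The walk-through above can be checked by logging each step. This is a minimal sketch for a sequential stream; the class name and the log wording are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

class LazyPipelineDemo {
    // Logs which elements are pulled from the source and what map produces.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Optional<Integer> first = Stream.of(1, 2, 3, 4, 5)
                .peek(x -> log.add("take " + x))
                .map(x -> x * 3)
                .peek(x -> log.add("mapped " + x))
                .filter(x -> x % 2 == 0)
                .findFirst();                    // short-circuits at the first even value
        log.add("result " + first.get());
        return log;
    }

    public static void main(String[] args) {
        // [take 1, mapped 3, take 2, mapped 6, result 6]
        System.out.println(trace());
    }
}
```

The values 3, 4 and 5 never show up in the log because the pipeline stops once findFirst has its answer.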

java 8 make a stream of the multiples of two

I'm practicing streams in Java 8 and I'm trying to make a Stream<Integer> containing the multiples of 2. There are several tasks in one main class, so I won't post the whole block, but what I have so far is this:
Integer twoToTheZeroth = 1;
UnaryOperator<Integer> doubler = (Integer x) -> 2 * x;
Stream<Integer> result = ?;
My question probably isn't strongly related to streams; it's more about syntax: how should I use the doubler to get the result?
Thanks in advance!
You can use Stream.iterate.
Stream<Integer> result = Stream.iterate(twoToTheZeroth, doubler);
or using the lambda directly
Stream.iterate(1, x -> 2*x);
The first argument is the "seed" (ie first element of the stream), the operator gets applied consecutively with every element access.
EDIT:
As Vinay points out, this will result in the stream being filled with 0s eventually (this is due to int overflow). To prevent that, maybe use BigInteger:
Stream.iterate(BigInteger.ONE,
x -> x.multiply(BigInteger.valueOf(2)))
.forEach(System.out::println);
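Because Stream.iterate produces an infinite stream, a forEach on it never finishes; in practice you would usually bound it with limit. The sketch below shows both the bounded BigInteger variant and where the int variant overflows. The class and method names are made up for illustration.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PowersOfTwo {
    // First n powers of two, computed without overflow.
    static List<BigInteger> firstPowers(int n) {
        return Stream.iterate(BigInteger.ONE, x -> x.multiply(BigInteger.valueOf(2)))
                     .limit(n)
                     .collect(Collectors.toList());
    }

    // Element at the given index of the int-based stream (subject to overflow).
    static int intPowerAt(int index) {
        return Stream.iterate(1, x -> 2 * x).skip(index).findFirst().get();
    }

    public static void main(String[] args) {
        System.out.println(firstPowers(5));            // [1, 2, 4, 8, 16]
        System.out.println(intPowerAt(30));            // 1073741824
        System.out.println(intPowerAt(31));            // -2147483648 (overflow)
        System.out.println(intPowerAt(32));            // 0, and it stays 0 from here on
        System.out.println(firstPowers(33).get(32));   // 4294967296
    }
}
```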
Arrays.asList(1,2,3,4,5).stream().map(x -> x * x).forEach(x -> System.out.println(x));
So you can use the doubler in the map call.

When should I use IntStream.range in Java?

I would like to know when I can use IntStream.range effectively. I have three reasons why I am not sure how useful IntStream.range is.
(Please think of start and end as integers.)
If I want an array, [start, start+1, ..., end-2, end-1], the code below is much faster.
int[] arr = new int[end - start];
int index = 0;
for(int i = start; i < end; i++)
arr[index++] = i;
This is probably because toArray() in IntStream.range(start, end).toArray() is very slow.
I use MersenneTwister to shuffle arrays. (I downloaded MersenneTwister class online.) I do not think there is a way to shuffle IntStream using MersenneTwister.
I do not think just getting int numbers from start to end-1 is useful. I can use for(int i = start; i < end; i++), which seems easier and not slow.
Could you tell me when I should choose IntStream.range?
There are several uses for IntStream.range.
One is to use the int values themselves:
IntStream.range(start, end).filter(i -> isPrime(i))....
Another is to do something N times:
IntStream.range(0, N).forEach(this::doSomething);
Your case (1) is to create an array filled with a range:
int[] arr = IntStream.range(start, end).toArray();
You say this is "very slow" but, like other respondents, I suspect your benchmark methodology. For small arrays there is indeed more overhead with stream setup, but this should be so small as to be unnoticeable. For large arrays the overhead should be negligible, as filling a large array is dominated by memory bandwidth.
Sometimes you need to fill an existing array. You can do that this way:
int[] arr = new int[end - start];
IntStream.range(0, end - start).forEach(i -> arr[i] = i + start);
There's a utility method Arrays.setAll that can do this even more concisely:
int[] arr = new int[end - start];
Arrays.setAll(arr, i -> i + start);
There is also Arrays.parallelSetAll which can fill an existing array in parallel. Internally, it simply uses an IntStream and calls parallel() on it. This should provide a speedup for large arrays on a multicore system.
I've found that a fair number of my answers on Stack Overflow involve using IntStream.range. You can search for them using these search criteria in the search box:
user:1441122 IntStream.range
One application of IntStream.range I find particularly useful is to operate on elements of an array, where the array indexes as well as the array's values participate in the computation. There's a whole class of problems like this.
For example, suppose you want to find the locations of increasing runs of numbers within an array. The result is an array of indexes into the first array, where each index points to the start of a run.
To compute this, observe that a run starts at a location where the value is less than the previous value. (A run also starts at location 0). Thus:
int[] arr = { 1, 3, 5, 7, 9, 2, 4, 6, 3, 5, 0 };
int[] runs = IntStream.range(0, arr.length)
.filter(i -> i == 0 || arr[i-1] > arr[i])
.toArray();
System.out.println(Arrays.toString(runs));
[0, 5, 8, 10]
Of course, you could do this with a for-loop, but I find that using IntStream is preferable in many cases. For example, it's easy to store an unknown number of results into an array using toArray(), whereas with a for-loop you have to handle copying and resizing, which distracts from the core logic of the loop.
Finally, it's much easier to run IntStream.range computations in parallel.
Here's an example:
public class Test {
public static void main(String[] args) {
System.out.println(sum(LongStream.of(40,2))); // call A
System.out.println(sum(LongStream.range(1,100_000_000))); //call B
}
public static long sum(LongStream in) {
return in.sum();
}
}
So, let's look at what sum() does: it counts the sum of an arbitrary stream of numbers. We call it in two different ways: once with an explicit list of numbers, and once with a range.
If you only had call A, you might be tempted to put the two numbers into an array and pass it to sum() but that's clearly not an option with call B (you'd run out of memory). Likewise you could just pass the start and end for call B, but then you couldn't support the case of call A.
So to sum it up, ranges are useful here because:
We need to pass them around between methods
The target method doesn't just work on ranges but any stream of numbers
But it only operates on individual numbers of the stream, reading them sequentially. (This is why shuffling with streams is a terrible idea in general.)
There is also the readability argument: code using streams can be much more concise than loops, and thus more readable, but I wanted to show an example where a solution relying on IntStreams is functionally superior too.
I used LongStream to emphasise the point, but the same goes for IntStream
And yes, for simple summing this may look like a bit of an overkill, but consider for example reservoir sampling
IntStream.range returns a range of integers as a stream so you can do stream processing over it.
like taking square of each element
IntStream.range(1, 10).map(i -> i * i);
Here are a few differences that come to mind between IntStream.range and traditional for loops:
IntStreams are lazily evaluated; the pipeline is traversed only when a terminal operation is called. For loops evaluate at each iteration.
IntStream provides some functions that are commonly applied to a range of ints, such as sum and average.
IntStream allows you to code multiple operations over a range of ints in a functional way, which reads more fluently, especially if you have a lot of operations.
So basically use IntStream when one or more of these differences are useful to you.
But please bear in mind that shuffling a Stream sounds quite strange, as a Stream is not a data structure and therefore it does not really make sense to shuffle it (in case you were planning on building a special IntSupplier). Shuffle the result instead.
As for performance, while there may be some overhead, you still iterate N times in both cases, so it should not matter much.
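Shuffling the result rather than the stream could look like the sketch below: materialize the range with toArray() and then run a Fisher-Yates shuffle over the array. java.util.Random is used here only as a stand-in for the asker's MersenneTwister class, whose API is not shown in the question; the class and method names are likewise made up.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

class ShuffledRange {
    // Builds [start, end) as an array and shuffles it in place (Fisher-Yates).
    static int[] shuffledRange(int start, int end, Random rng) {
        int[] arr = IntStream.range(start, end).toArray();
        for (int i = arr.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // random index in 0..i inclusive
            int tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
        }
        return arr;
    }

    public static void main(String[] args) {
        // Prints some permutation of 0..9; the order depends on the generator.
        System.out.println(Arrays.toString(shuffledRange(0, 10, new Random())));
    }
}
```

Any generator that can produce a bounded random int could replace Random here.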
Basically, if you want Stream operations, you can use the range() method. For example, to use concurrency or want to use map() or reduce(). Then you are better off with IntStream.
For example:
IntStream.range(1, 5).parallel().forEach(i -> heavyOperation());
Or:
IntStream.range(1, 5).reduce(1, (x, y) -> x * y)
// > 24
You can achieve the second example also with a for-loop, but you need intermediate variables etc.
Also, if you want the first match for example, you can use findFirst() and cousins to stop consuming the rest of the Stream
It totally depends on the use case. However, the stream API adds a lot of easy one-liners which can definitely replace conventional loops.
IntStream is really helpful as syntactic sugar in some cases:
IntStream.range(1, 101).sum();
IntStream.range(1, 101).average();
IntStream.range(1, 101).filter(i -> i % 2 == 0).count();
//... and so on
Whatever you can do with IntStream you can also do with conventional loops, but a one-liner is more concise and easier to understand and maintain.
Still, for descending loops we cannot use IntStream#range directly, as it only counts upward in increments of one. So the following cannot be expressed with range alone:
for(int i = 100; i > 1; i--) {
// Negative loop
}
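A descending loop like the one above can still be emulated by mapping an ascending range onto descending values. This is a workaround sketch rather than something the answer itself proposes; the class and method names are invented, and it assumes from - toExclusive does not overflow.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class DescendingRange {
    // Yields from, from - 1, ..., toExclusive + 1,
    // like for (int i = from; i > toExclusive; i--).
    static IntStream descending(int from, int toExclusive) {
        return IntStream.range(0, from - toExclusive) // empty if from <= toExclusive
                        .map(k -> from - k);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(descending(5, 1).toArray())); // [5, 4, 3, 2]
    }
}
```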
Case 1 : Yes, the conventional loop is much faster in this case, as toArray has a bit of overhead.
Case 2 : I don't know anything about it, my apologies.
Case 3 : IntStream is not slow at all, IntStream.range and conventional loop are almost same in terms of performance.
See :
Java 8 nested loops with streams & performance
You could implement your Mersenne Twister as an Iterator and stream from that.

Java 8 stream peek and limit interaction

Why does this code in Java 8:
IntStream.range(0, 10)
.peek(System.out::print)
.limit(3)
.count();
outputs:
012
I'd expect it to output 0123456789, because peek precedes limit.
It seems to me even more peculiar because of the fact that this:
IntStream.range(0, 10)
.peek(System.out::print)
.map(x -> x * 2)
.count();
outputs 0123456789 as expected (not 02481012141618).
P.S.: .count() here is used just to consume stream, it can be replaced with anything else
The most important thing to know about streams is that they do not contain elements themselves (like collections do) but work like a pipe whose values are lazily evaluated. That means that the statements that build up a stream, including mapping, filtering, or whatever, are not evaluated until the terminal operation runs.
In your first example, the stream tries to count from 0 to 9, one value at a time, doing the following for each:
print out the value
check whether 3 values are passed (if yes, terminate)
So you really get the output 012.
In your second example, the stream again counts from 0 to 9, one value at a time, doing the following for each:
print out the value
map x to x*2, thus forwarding double the value to the next step
As you can see, the output comes before the mapping, and thus you get the result 0123456789. Try switching the peek and the map calls; then peek will print the doubled values, 02481012141618.
From the docs:
limit() is a short-circuiting stateful intermediate operation.
map() is an intermediate operation
What that essentially means is that limit(x) returns a stream with at most x values from the stream it received. Again from the docs:
An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result.
Streams are defined to do lazy processing. So in order to complete your count() operation it doesn’t need to look at the other items. Otherwise, it would be broken, as limit(…) is defined to be a proper way of processing infinite streams in a finite time (by not processing more than limit items).
In principle, it would be possible to complete your request without ever looking at the int values at all, as the operation chain limit(3).count() doesn’t need any processing of the previous operations (other than verifying whether the stream has at least 3 items).
Streams use lazy evaluation: intermediate operations, e.g. peek(), are not executed until the terminal operation runs.
For instance, the following code will just print 1. As soon as the first element of the stream, 1, reaches the terminal operation, findAny(), the stream execution ends.
Arrays.asList(1,2,3)
.stream()
.peek(System.out::print)
.filter((n)->n<3)
.findAny();
Conversely, the following example prints 123, because the terminal operation, noneMatch(), needs to evaluate all the elements of the stream to make sure none of them matches its predicate, n > 4:
Arrays.asList(1, 2, 3)
.stream()
.peek(System.out::print)
.noneMatch(n -> n > 4);
For future readers struggling to understand how the count method doesn't execute the peek method before it, I thought I'd add this note:
As of Java 9, the Java documentation for the count method states that:
An implementation may choose to not execute the stream pipeline
(either sequentially or in parallel) if it is capable of computing the
count directly from the stream source.
This means terminating the stream with count is no longer enough to ensure the execution of all previous steps, such as peek.
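If the goal is only to observe what peek sees, one option is to end the pipeline with a reduction that has to visit each element, such as sum(), instead of count(). The sketch below assumes a sequential stream and relies on how current JDK implementations behave rather than on a documented guarantee; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class PeekWithLimitDemo {
    // Values observed by peek when limit(3) follows it.
    static List<Integer> peekedBeforeLimit() {
        List<Integer> seen = new ArrayList<>();
        IntStream.range(0, 10)
                 .peek(x -> seen.add(x))
                 .limit(3)
                 .map(x -> 1)
                 .sum();   // counts the elements by summing ones, traversing the pipeline
        return seen;
    }

    // Values observed by peek when nothing short-circuits the pipeline.
    static List<Integer> peekedWithoutLimit() {
        List<Integer> seen = new ArrayList<>();
        IntStream.range(0, 10)
                 .peek(x -> seen.add(x))
                 .map(x -> x * 2)
                 .sum();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(peekedBeforeLimit());   // [0, 1, 2]
        System.out.println(peekedWithoutLimit());  // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```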
