I am currently reading O'Reilly's Java 8 Lambdas, which is a really good book. I came across an example like this.
I have a
private final BiFunction<StringBuilder,String,StringBuilder>accumulator=
(builder,name)->{if(builder.length()>0)builder.append(",");builder.append("Mister:").append(name);return builder;};
final Stream<String>stringStream = Stream.of("John Lennon","Paul Mccartney"
,"George Harrison","Ringo Starr");
final StringBuilder reduce = stringStream
.filter(a->a!=null)
.reduce(new StringBuilder(),accumulator,(left,right)->left.append(right));
System.out.println(reduce);
System.out.println(reduce.length());
This produces the right output:
Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr
My question is about the last parameter of the reduce method, which is a BinaryOperator.
What is this parameter used for? If I change it to
.reduce(new StringBuilder(),accumulator,(left,right)->new StringBuilder());
the output is the same; if I pass null, a NullPointerException is thrown.
What is this parameter used for?
Update
Why do I get different results when I run it on a parallelStream?
First run:
returned StringBuilder length = 420
Second run:
returned StringBuilder length = 546
Third run:
returned StringBuilder length = 348
and so on. Why is this? Should it not return all the values on every run?
The method reduce in the interface Stream is overloaded. The parameters for the method with three arguments are:
identity
accumulator
combiner
The combiner supports parallel execution. Apparently, it is not used for sequential streams. However, there is no such guarantee. If you change your stream into a parallel stream, I expect you will see a difference:
Stream<String>stringStream = Stream.of(
"John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr")
.parallel();
Here is an example of how the combiner can be used to transform a sequential reduction into a reduction that supports parallel execution. Assume a stream with four Strings, and let acc be an abbreviation for accumulator.apply. Then the result of the reduction can be computed as follows:
acc(acc(acc(acc(identity, "one"), "two"), "three"), "four");
With a compatible combiner, the above expression can be transformed into the following expression. Now it is possible to execute the two sub-expressions in different threads.
combiner.apply(
acc(acc(identity, "one"), "two"),
acc(acc(identity, "three"), "four"));
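To make the transformation above concrete, here is a minimal sketch that evaluates both expressions by hand with an immutable String accumulator. The class and method names are made up for illustration, and String::concat is only one possible choice of accumulator and combiner.

```java
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

class CombinerSketch {
    // Immutable accumulator: returns a new String and never modifies its inputs.
    static final BiFunction<String, String, String> ACC = String::concat;
    // A combiner that is compatible with ACC: it joins two partial results.
    static final BinaryOperator<String> COMBINER = String::concat;
    static final String IDENTITY = "";

    // Left-to-right fold, the way a sequential reduction could evaluate it.
    static String sequential() {
        return ACC.apply(ACC.apply(ACC.apply(ACC.apply(IDENTITY, "one"), "two"), "three"), "four");
    }

    // Two independent halves merged by the combiner, the way a parallel reduction could evaluate it.
    static String split() {
        String left = ACC.apply(ACC.apply(IDENTITY, "one"), "two");
        String right = ACC.apply(ACC.apply(IDENTITY, "three"), "four");
        return COMBINER.apply(left, right);
    }

    public static void main(String[] args) {
        System.out.println(sequential()); // prints "onetwothreefour"
        System.out.println(split());      // prints "onetwothreefour"
    }
}
```

Both halves start from the same identity value, which is only safe because "" can never be changed by the accumulator.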
Regarding your second question, I use a simplified accumulator to explain the problem:
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder.append(name);
According to the Javadoc for Stream::reduce, the accumulator has to be associative. In this case, that would imply that the following two expressions return the same result:
acc(acc(acc(identity, "one"), "two"), "three")
acc(acc(identity, "one"), acc(acc(identity, "two"), "three"))
That's not true for the above accumulator. The problem is that you are mutating the object referenced by identity. That's a bad idea for the reduce operation. Here are two alternative implementations which should work:
// identity = ""
BiFunction<String,String,String> accumulator = String::concat;
// identity = null
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder == null
? new StringBuilder(name) : builder.append(name);
nosid's answer got it mostly right (+1) but I wanted to amplify a particular point.
The identity parameter to reduce must be an identity value. It's ok if it's an object, but if it is, it should be immutable. If the "identity" object is mutated, it's no longer an identity! For more discussion of this point, see my answer to a related question.
It looks like this example originated from Example 5-19 of Richard Warburton, Java 8 Lambdas, O'Reilly 2014. If so, I shall have to have a word about this with the good Dr. Warburton.
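As an illustration of an immutable identity, here is a sketch of how the example from the question could be written with String instead of StringBuilder. This is my own rewrite, not the book's code; the class name and the exact lambdas are assumptions.

```java
import java.util.Arrays;
import java.util.List;

class ImmutableReduceSketch {
    static String misters(List<String> names) {
        return names.parallelStream()
                .filter(n -> n != null)
                .reduce("", // immutable identity: no lambda below can alter it
                        // accumulator: builds a new String for every element
                        (acc, name) -> acc.isEmpty() ? "Mister:" + name : acc + ",Mister:" + name,
                        // combiner: joins two partial results, skipping empty ones
                        (left, right) -> left.isEmpty() ? right
                                : right.isEmpty() ? left
                                : left + "," + right);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr");
        // prints "Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr"
        System.out.println(misters(names));
    }
}
```

Because every step returns a fresh String, the result is the same on every run, sequential or parallel. The price is a lot of intermediate String copies.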
Related
The question is straightforward: Why can't we use StringBuilder(...) as identity function in the reduce(...) operations in Java 8 streams, but string1.concat(string2) can be used as the identity function?
string1.concat(string2) can be seen as similar to builder.append(string) (though it is understood that there are a few differences between these operations), but I am not able to understand the difference in the reduce operation. Consider the following example:
List<String> list = Arrays.asList("1", "2", "3");
// Example using the string concatenation operation
System.out.println(list.stream().parallel()
.reduce("", (s1, s2) -> s1 + s2, (s1, s2)->s1 + s2));
// The same example, using the StringBuilder
System.out.println(list.stream() .parallel()
.reduce(new StringBuilder(""), (builder, s) -> builder
.append(s),(builder1, builder2) -> builder1
.append(builder2)));
// using the actual concat(...) method
System.out.println(list.stream().parallel()
.reduce("", (s1, s2) -> s1.concat(s2), (s1, s2)->s1.concat(s2)));
Here is the output after executing above lines:
123
321321321321 // output when StringBuilder() is used as Identity
123
builder.append(string) is an associative operation, just as str1.concat(str2) is. Why, then, does concat work while append doesn't?
Yes, append is indeed associative, but that is not the only requirement for the function passed as the accumulator and combiner. According to the docs, they have to be:
Associative
Non-interfering
Stateless
append is not stateless. It is stateful. When you do sb.append("Hello"), not only does it return a StringBuilder with Hello appended to the end, it also changes the contents (i.e. the state) of sb.
Also from the docs:
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline.
Because of this, new StringBuilder() is also no longer a valid identity once the accumulator or the combiner has been applied. Something would have been added to the empty string builder, and the following equation, which all identities must satisfy, is no longer satisfied:
combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t)
It is possible that the parallel stream makes use of the old string builders after calling the accumulators and/or combiners, and expects their contents to not be changed. However, the accumulators and combiners mutate the string builders, causing the stream to produce incorrect results.
On the other hand, concat satisfies all three of the above. It is stateless because it does not change the string on which it is called. It just returns a new, concatenated string. (String is immutable anyway and can't be changed :D)
Anyway, this is a use case of mutable reduction with collect:
System.out.println((StringBuilder)list.stream().parallel()
.collect(
StringBuilder::new,
StringBuilder::append,
StringBuilder::append
)
);
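To tie this back to a comma-separated result, here is a sketch of the same mutable reduction with a delimiter and a prefix per element. The "Mister:" prefix, the class name and the method name are only illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

class CollectSketch {
    static String misters(List<String> names) {
        StringBuilder result = names.parallelStream()
                .filter(n -> n != null)
                .collect(
                        StringBuilder::new, // supplier: a fresh container for every partial result
                        (sb, name) -> {     // accumulator: mutates its own container only
                            if (sb.length() > 0) sb.append(',');
                            sb.append("Mister:").append(name);
                        },
                        (left, right) -> {  // combiner: merges the right container into the left one
                            if (left.length() > 0 && right.length() > 0) left.append(',');
                            left.append(right);
                        });
        return result.toString();
    }

    public static void main(String[] args) {
        // prints "Mister:John Lennon,Mister:Paul Mccartney"
        System.out.println(misters(Arrays.asList("John Lennon", "Paul Mccartney")));
    }
}
```

Unlike reduce, collect takes a supplier rather than a single identity object, so no container is ever shared between partial results.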
After reading the docs and running many tests, I think reduce works roughly as follows:
multiple threads perform the reduction, and each thread does a
partial reduce;
there is only one identity instance, and every accumulator call uses it;
first, each element is accumulated with the identity instance to get a
StringBuilder;
finally, all these StringBuilders are combined;
So the problem is that every accumulation of the identity instance with a string element changes the identity. After the first accumulation, the identity is no longer an identity.
For example, consider a list with 2 elements, {"1","2"}.
There will be 2 threads; each thread does one accumulation, and one of them does the final combine.
Thread A accumulates the identity with element "1". The result is a StringBuilder whose content is "1" (it is still the identity object, because StringBuilder.append returns the builder itself), which means the identity's content has also changed to "1". Then thread B accumulates the identity with element "2", and the result is "12", not "2".
The combine step then merges these two accumulation results, which are both the identity instance itself, so the result is "1212".
It is like the following code snippet:
StringBuilder identity = new StringBuilder();
StringBuilder accumulate1 = identity.append("1");
StringBuilder accumulate2 = identity.append("2");
StringBuilder combine = accumulate1.append(accumulate2);
// combine and accumulate1 and accumulate2 are all identity instance and result is "1212"
return combine;
For more elements, because the threads are scheduled unpredictably, the result will differ from run to run.
Now that we know the reason, we can fix the accumulator as follows:
new StringBuilder(builder).append(s)
and the full line of code looks like this:
System.out.println(list.stream().parallel().reduce(new StringBuilder(), (builder, s) -> new StringBuilder(builder).append(s),
(builder1, builder2) -> new StringBuilder(builder1).append(builder2)));
Then there is no issue any more, because the accumulator no longer changes the identity instance and returns a new StringBuilder every time. But it is not worth doing this, as it has no benefit over the String concat approach.
Edit: Thanks to Holger's example, it seems that if there is a filter function, some accumulator calls may be skipped, so the combiner function also needs to be changed to
new StringBuilder(builder1).append(builder2)
Don't use .reduce() when there is already an implementation for the task (or write your own .collect(), as in Sweeper's answer).
List<String> list = Arrays.asList("1", "2", "3");
// Example using the string concatenation operation
System.out.println(list.stream()
.parallel()
.collect(Collectors.joining())
);
// prints "123"
Edit (this will not work for parallel streams)
Depends on the implementation of .joining():
final List<String> list = Arrays.asList("1", "2", "3");
System.out.println(list.stream().reduce(new StringBuilder(),
StringBuilder::append,
StringBuilder::append)
.toString()
);
// prints "123"
List<Boolean> results = new ArrayList<>();
results.add(true);
results.add(true);
results.add(true);
results.add(false);
if (results.contains(false)) {
System.out.println(false);
} else {
System.out.println(true);
}
System.out.println(results.stream().reduce((a, b) -> a && b).get());
//System.out.println(!results.stream().anyMatch(a -> a == false));
System.out.println(!results.stream().anyMatch(a -> !a));
OUTPUT:
false
false
false
FYI, results is the output of a map + collect operation:
List<Job> jobs;
List<Boolean> results = jobs.stream().map(j -> j.ready()).collect(Collectors.toList());
If I choose either reduce or anyMatch, I don't have to collect the results from map operation.
From results which is a list of boolean, I just want to return false if there is at least one false.
I can do it via reduce or anyMatch. I kinda don't like Optional from reduce, and I kinda don't like that I have to negate anyMatch
Are there any pros/cons for using either?
It appears that the only reason you are collecting the booleans into the list is so you can check if some are false:
If I choose either reduce or anyMatch, I don't have to collect the results from map operation [...] I just want to return false if there is at least one false.
If this is the case, then you definitely should consider the straightforward stream-based approach:
return jobs.stream().allMatch(Job::ready);
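A minimal runnable sketch of that approach follows. The Job class here is a hypothetical stand-in, since the question does not show the real one; only its ready() method is taken from the question.

```java
import java.util.Arrays;
import java.util.List;

class AllMatchSketch {
    // Hypothetical stand-in for the Job type from the question.
    static final class Job {
        private final boolean ready;
        Job(boolean ready) { this.ready = ready; }
        boolean ready() { return ready; }
    }

    // True only if every job is ready; stops at the first job that is not.
    static boolean allReady(List<Job> jobs) {
        return jobs.stream().allMatch(Job::ready);
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(new Job(true), new Job(true), new Job(false));
        System.out.println(allReady(jobs)); // prints "false"
    }
}
```

Note that allMatch returns true for an empty stream, so decide whether "no jobs" should count as "all ready" in your case.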
You ask about pros and cons. contains is the fastest and simplest. reduce is the most cumbersome and complicated here. But your task is very simple, so does it really matter? Maybe the better criterion for choosing an approach would be: "Which one is more readable, understandable and maintainable?" This clean-code view is usually more important in practical software development than hunting microseconds of runtime or lines of source code. But then again, I would say contains is the best here.
System.out.println(!results.contains(false));
Your anyMatch(a -> !a) is effectively the same as contains, and I would definitely prefer it over reduce for this concrete task. But again, the real difference is very small, and I would be more concerned with readability and understandability for a future maintainer of this software.
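If the Optional and the negation are the main annoyances, both can be avoided with standard Stream methods. The sketch below uses the two-argument reduce with true as the identity, and noneMatch in place of a negated anyMatch; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;

class BooleanReduceSketch {
    // reduce with an identity returns a plain value instead of an Optional.
    static boolean viaReduce(List<Boolean> results) {
        return results.stream().reduce(true, (a, b) -> a && b);
    }

    // noneMatch(b -> !b) says "no element is false" without an outer negation.
    static boolean viaNoneMatch(List<Boolean> results) {
        return results.stream().noneMatch(b -> !b);
    }

    public static void main(String[] args) {
        List<Boolean> results = Arrays.asList(true, true, true, false);
        System.out.println(viaReduce(results));    // prints "false"
        System.out.println(viaNoneMatch(results)); // prints "false"
    }
}
```

One difference worth knowing: reduce always visits every element, while noneMatch can stop at the first false.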
System.out.println(results.stream().reduce((a, b) -> a && b).get());
This will always return false, as the list (results) has at least one false.
&& requires both values to be true for the result to be true.
System.out.println(!results.stream().anyMatch(a -> !a));
Stream anyMatch(Predicate predicate) returns whether any elements of this stream (results) match the provided predicate (a -> !a). Since the list contains a false, anyMatch returns true, and the leading ! negates that, so the final result is false.
I came across the following Apache Spark code snippet:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). Looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends the values of the same key to call() in pairs. But that is not really clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Function signatures can sometimes be confusing, as they tend to be generic.
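The grouping and pairwise reduction described above can be imitated in plain Java, without Spark, to see how a two-argument function ends up producing one value per key. This is only an emulation of the idea with Map.merge, not how Spark implements reduceByKey; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class ReduceByKeySketch {
    // For each line, emit (line, 1) and fold it into the running value for that key.
    // func is called with (valueSoFar, newValue), two values in and one value out.
    static Map<String, Integer> reduceByKey(List<String> lines, BinaryOperator<Integer> func) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, func); // first occurrence stores 1, later ones call func
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Mahesh", "Mahesh", "Ganesh", "Ashok", "Abnave", "Ganesh", "Mahesh");
        // prints "{Mahesh=3, Ganesh=2, Ashok=1, Abnave=1}"
        System.out.println(reduceByKey(lines, (a, b) -> a + b));
    }
}
```

For "Mahesh" the adder is called twice: first with (1, 1), then with (2, 1). For "Ashok" it is never called, because there is only one value.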
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
ReduceByKey does the same as a groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as needed.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
For a more detailed description, I can recommend this article :)
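The map-side reduce described above can be simulated in plain Java with two lists standing in for partitions on different machines. This is a toy model under that assumption, not Spark code; all names are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapSideReduceSketch {
    // Step 1: each "executor" sums the values for its own keys locally.
    static Map<String, Integer> localReduce(List<String> partition) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : partition) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Step 2: only the partial sums are "shuffled" and merged into the final result.
    static Map<String, Integer> mergePartials(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, value) -> total.merge(key, value, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = localReduce(Arrays.asList("Mahesh", "Mahesh", "Ganesh", "Ashok"));
        Map<String, Integer> p2 = localReduce(Arrays.asList("Abnave", "Ganesh", "Mahesh"));
        // prints "{Abnave=1, Ashok=1, Ganesh=2, Mahesh=3}"
        System.out.println(mergePartials(Arrays.asList(p1, p2)));
    }
}
```

In this toy run the first partition sends one (Mahesh, 2) pair instead of two (Mahesh, 1) pairs. With many repeated keys per partition the saving grows accordingly.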
I am a beginner in Java 8.
Non-interference is important to have consistent Java stream behaviour.
Imagine we are process a large stream of data and during the process
the source is changed. The result will be unpredictable. This is
irrespective of the processing mode of the stream parallel or
sequential.
The source can be modified till the statement terminal operation is
invoked. Beyond that the source should not be modified till the stream
execution completes. So handling the concurrent modification in stream
source is critical to have a consistent stream performance.
The above quotations are taken from here.
Can someone give a simple example that sheds light on why mutating the stream source causes such big problems?
Well, the Oracle example is self-explanatory here. The first one is this:
List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(Collectors.joining(" "));
If you change l by adding one more element to it before you call the terminal operation (collect), you are fine; but notice that the Stream then consists of three elements, not the two that were in the list when you created the Stream via l.stream().
On the other hand doing this:
List<String> list = new ArrayList<>();
list.add("test");
list.forEach(x -> list.add(x));
will throw a ConcurrentModificationException, since you can't change the source while it is being traversed.
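The same thing happens when the modification goes through a stream's terminal operation. The sketch below catches the exception and then shows one non-interfering alternative, which is to compute the additions first and modify the list afterwards. The names are made up, and the exception is what the OpenJDK ArrayList implementations I know of do; the fail-fast check is documented as best effort only.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

class InterferenceSketch {
    // Returns true if modifying the source during the terminal operation was detected.
    static boolean interferes() {
        List<String> list = new ArrayList<>();
        list.add("test");
        try {
            // The lambda modifies the very list the stream is traversing.
            list.stream().forEach(x -> list.add(x));
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Non-interfering variant: finish the stream first, then change the source.
    static List<String> duplicateSafely() {
        List<String> list = new ArrayList<>();
        list.add("test");
        List<String> additions = list.stream().collect(Collectors.toList());
        list.addAll(additions);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(interferes());      // prints "true"
        System.out.println(duplicateSafely()); // prints "[test, test]"
    }
}
```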
And now suppose you have an underlying source that can handle concurrent adds:
ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
cMap.put("one", 1);
cMap.forEach((key, value) -> cMap.put(key + key, value + value));
System.out.println(cMap);
What should the output be here? When I run this it is:
{oneoneoneoneoneoneoneone=8, one=1, oneone=2, oneoneoneone=4}
Changing the key to zx (cMap.put("zx", 1)), the result is now:
{zxzx=2, zx=1}
The result is not consistent.
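One way to get a predictable result, assuming the intent is to double only the keys that existed before the loop, is to iterate over a snapshot of the map and write into the original. A minimal sketch with made-up names:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotSketch {
    static Map<String, Integer> doubleKeys(String key, int value) {
        ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
        cMap.put(key, value);
        // Traverse a copy, so the puts below can never feed back into the traversal.
        Map<String, Integer> snapshot = new HashMap<>(cMap);
        snapshot.forEach((k, v) -> cMap.put(k + k, v + v));
        return cMap;
    }

    public static void main(String[] args) {
        // Both maps end up with exactly two entries, whatever the key is.
        System.out.println(doubleKeys("one", 1).size()); // prints "2"
        System.out.println(doubleKeys("zx", 1).size());  // prints "2"
    }
}
```

The traversal now sees a fixed set of entries, so the outcome no longer depends on where the new keys happen to land in the hash table.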