I was playing with Spark. I am bit confused with the working of aggregateby key function.
If I provide non-zero initial value. It is adding 2*initial value in the total.
Following is the code snippet:
JavaPairRDD<String, Integer> mapToPair = rdd.mapToPair(message -> new Tuple2<String, Integer>(message.split(",")[0], Integer.parseInt(message.split(",")[1])))
Function2<Integer, Integer, Integer> mergeValue =(v1, v2) -> v1+v2; Function2<Integer, Integer, Integer> mergeCombiners =(v1, v2) -> v1+v2;
JavaPairRDD<String, Integer> aggregateByKey = mapToPair.aggregateByKey(1, mergeValue, mergeCombiners);
System.out.println("Aggregate by key "+ aggregateByKey.collect());
Following is my input rdd:
hello,1
hello,1
hello,1
hello,1
Output I am getting is
Aggregate by key [(hello,6)]
Please explain its working
zeroValue is added every time new key is seen on current partition so it can be added as many times as many partitions you have and shouldn't change the result of merge and seq ops. This is why 0 is valid for addition but 1 is not.
I agree with #LostInOverflow and here is the explanation why Spark has a zeroValue as first arugment in place in aggregateByKey:
Both 'merging values within a partition' (argument 2) and 'merging values betweeen partitions' (argument 3) functions read and update the first argument (zeroValue) and return it instead of creating a new return value to avoid extra memory allocation. This could be negligible for small scale operations but will be a memory saving technique for a very large scale operations running on cluster(s) with hundreds of nodes
Hence it will be an arbitrary value chosen based on the kind of operation performed in merge and combine to not to effect the actual result (0 for addition (or) 1 for multiplication)
Related
List<String> lines = FileUtils.readLines(dataFile,"UTF-8");
lines.parallelStream().forEachOrdered(line->processRecord(line, dataFileName));
here process record will return string value how to store to any collection / file
could help me for this. thanks for advance.
Instead of using forEachOrdered you could use map to execute and store the result for each of the inputs. You can then collect these into a new list to store the results.
Example:
List<String> result = lines.parallelStream()
.map(line->line.toUpperCase())
.collect(Collectors.toCollection(ArrayList::new));
[Answer is based on comments]
Assuming, you want to process a list of String and also want to store the response of each processing with input and output, then you can follow the below approach.
List<String> lines = FileUtils.readLines(dataFile,"UTF-8");
Map<String, String> result = lines.parallelStream()
.collect(Collectors.toMap(line -> line, line -> processRecord(line, dataFileName)));
Here, the result (a simple Map<Key, Value>) will contain both.. the key as each single line, and value as each single response of THAT line returned by processRecord method.
how many threads will execute parallelly this method (from comments)
The method parallelStream() uses threads form DefaultForkJoinPool. And this pool has default size equal to the number of cores on your system.
I came across follow code snippet of Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how first line of output is obtained, I dont understand how second line is obtained, that is how JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. May be there is something more to the functionality of reduceByKey(). By looking at input and output to reduceByKey() and Function2.call(), I feel somehow reducebyKey() sends values of same keys to call() in pairs. But that simply does not sound clear. Can anyone explain what precisely how reduceByKey() and Function2.call() works together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Functions signatures can be sometimes confusing as they tend to be more generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
ReduceByKey does the same as a groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as needed.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be send between the different machines in the cluster and for this reason, reduceByKey is most often preferable to a groupByKey.
For a more detailed description, I can recommend this article :)
I am a beginner in Java 8.
Non-interference is important to have consistent Java stream behaviour.
Imagine we are process a large stream of data and during the process
the source is changed. The result will be unpredictable. This is
irrespective of the processing mode of the stream parallel or
sequential.
The source can be modified till the statement terminal operation is
invoked. Beyond that the source should not be modified till the stream
execution completes. So handling the concurrent modification in stream
source is critical to have a consistent stream performance.
The above quotations are taken from here.
Can someone do some simple example that would shed lights on why mutating the stream source would give such big problems?
Well the oracle example is self-explanatory here. First one is this:
List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = l.collect(Collectors.joining(" "));
If you change l by adding one more elements to it before you call the terminal operation (Collectors.joining) you are fine; but notice that the Stream consists of three elements, not two; at the time you created the Stream via l.stream().
On the other hand doing this:
List<String> list = new ArrayList<>();
list.add("test");
list.forEach(x -> list.add(x));
will throw a ConcurrentModificationException since you can't change the source.
And now suppose you have an underlying source that can handle concurrent adds:
ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
cMap.put("one", 1);
cMap.forEach((key, value) -> cMap.put(key + key, value + value));
System.out.println(cMap);
What should the output be here? When I run this it is:
{oneoneoneoneoneoneoneone=8, one=1, oneone=2, oneoneoneone=4}
Changing the key to zx (cMap.put("zx", 1)), the result is now:
{zxzx=2, zx=1}
The result is not consistent.
I am very new to Apache spark so this question might not be well to ask, but I am not getting the difference between combinebykey and aggregatebykey and when to use which operation.
aggregateByKey takes an initial accumulator, a first lambda function to merge a value to an accumulator and a second lambda function to merge two accumulators.
combineByKey is more general and adds an initial lambda function to create the initial accumulator
Here an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
("prova", 2), ("ciao", 4),
("prova", 3), ("ciao", 6)))
pairs.aggregateByKey(List[Any]())(
(aggr, value) => aggr ::: (value :: Nil),
(aggr1, aggr2) => aggr1 ::: aggr2
).collect().toMap
pairs.combineByKey(
(value) => List(value),
(aggr: List[Any], value) => aggr ::: (value :: Nil),
(aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect().toMap
combineByKey is more general then aggregateByKey. Actually, the implementation of aggregateByKey, reduceByKey and groupByKey is achieved by combineByKey. aggregateByKey is similar to reduceByKey but you can provide initial values when performing aggregation.
As the name suggests, aggregateByKey is suitable for compute aggregations for keys, example aggregations such as sum, avg, etc. The rule here is that the extra computation spent for map side combine can reduce the size sent out to other nodes and driver. If your func satisfies this rule, you probably should use aggregateByKey.
combineByKey is more general and you have the flexibility to specify whether you'd like to perform map side combine. However, it is more complex to use. At minimum, you need to implement three functions: createCombiner, mergeValue, mergeCombiners.
I am currently reading the O'Reilly Java 8 Lambdas, it is a really good book. I came across with a example like this.
I have a
private final BiFunction<StringBuilder,String,StringBuilder>accumulator=
(builder,name)->{if(builder.length()>0)builder.append(",");builder.append("Mister:").append(name);return builder;};
final Stream<String>stringStream = Stream.of("John Lennon","Paul Mccartney"
,"George Harrison","Ringo Starr");
final StringBuilder reduce = stringStream
.filter(a->a!=null)
.reduce(new StringBuilder(),accumulator,(left,right)->left.append(right));
System.out.println(reduce);
System.out.println(reduce.length());
this produce the right output.
Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr
My question is regarding the reduce method the last parameter which is a BinaryOperator.
Which this parameter is used for? If I change by
.reduce(new StringBuilder(),accumulator,(left,right)->new StringBuilder());
the output is the same; if I pass NULL then N.P.E is returned.
What is this parameter used for?
Update
Why if I run it on parallelStream I am receiving different results?
First run:
returned StringBuilder length = 420
Second run:
returned StringBuilder length = 546
Third run:
returned StringBuilder length = 348
and so on. Why is this - should it not return all the values at each iteration?
The method reduce in the interface Stream is overloaded. The parameters for the method with three arguments are:
identity
accumulator
combiner
The combiner supports parallel execution. Apparently, it is not used for sequential streams. However, there is no such guarantee. If you change your streams into parallel stream, I guess you will see a difference:
Stream<String>stringStream = Stream.of(
"John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr")
.parallel();
Here is an example of how the combiner can be used to transform a sequential reduction into a reduction, that supports parallel execution. There is a stream with four Strings and acc is used as an abbreviation for accumulator.apply. Then the result of the reduction can be computed as follows:
acc(acc(acc(acc(identity, "one"), "two"), "three"), "four");
With a compatible combiner, the above expression can be transformed into the following expression. Now it is possible to execute the two sub-expressions in different threads.
combiner.apply(
acc(acc(identity, "one"), "two"),
acc(acc(identity, "three"), "four"));
Regarding your second question, I use a simplified accumulator to explain the problem:
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder.append(name);
According to the Javadoc for Stream::reduce, the accumulator has to be associative. In this case, that would imply, that the following two expressions return the same result:
acc(acc(acc(identity, "one"), "two"), "three")
acc(acc(identity, "one"), acc(acc(identity, "two"), "three"))
That's not true for the above accumulator. The problem is, that you are mutating the object referenced by identity. That's a bad idea for the reduce operation. Here are two alternative implementations which should work:
// identity = ""
BiFunction<String,String,String> accumulator = String::concat;
// identity = null
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder == null
? new StringBulder(name) : builder.append(name);
nosid's answer got it mostly right (+1) but I wanted to amplify a particular point.
The identity parameter to reduce must be an identity value. It's ok if it's an object, but if it is, it should immutable. If the "identity" object is mutated, it's no longer an identity! For more discussion of this point, see my answer to a related question.
It looks like this example originated from Example 5-19 of Richard Warburton, Java 8 Lambdas, O'Reilly 2014. If so, I shall have to have a word about this with the good Dr. Warburton.