I am very new to Apache Spark, so this may be a basic question, but I do not understand the difference between combineByKey and aggregateByKey, or when to use which operation.
aggregateByKey takes an initial accumulator, a first lambda function to merge a value into an accumulator, and a second lambda function to merge two accumulators.
combineByKey is more general: instead of an initial accumulator, it takes a lambda function that creates the initial accumulator from the first value seen for a key.
Here is an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
("prova", 2), ("ciao", 4),
("prova", 3), ("ciao", 6)))
pairs.aggregateByKey(List[Any]())(
(aggr, value) => aggr ::: (value :: Nil),
(aggr1, aggr2) => aggr1 ::: aggr2
).collect().toMap
pairs.combineByKey(
(value) => List(value),
(aggr: List[Any], value) => aggr ::: (value :: Nil),
(aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect().toMap
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are all implemented on top of combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing per-key aggregations such as sum, avg, etc. The rule of thumb is that the extra computation spent on the map-side combine should reduce the amount of data sent to other nodes and to the driver. If your function satisfies this rule, you should probably use aggregateByKey.
combineByKey is more general, and you have the flexibility to specify whether you'd like to perform a map-side combine. However, it is more complex to use: at minimum, you need to implement three functions: createCombiner, mergeValue and mergeCombiners.
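The relationship between the two operations can be sketched without Spark. The following is only a plain-Java model, not Spark's implementation: the class ByKeySketch, its method signatures, and the representation of a partition as a list of entries are all assumptions made for illustration. It shows that an aggregateByKey-style call can be expressed through a combineByKey-style call by building the first accumulator from the zero value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java model of the two operations; no Spark classes are used.
// A "partition" is modelled as a list of key/value entries (an assumption).
class ByKeySketch {

    // createCombiner runs for the first value of a key within a partition,
    // mergeValue for later values of that key, mergeCombiners across partitions.
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<K, C> result = new LinkedHashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> local = new LinkedHashMap<>();
            for (Map.Entry<K, V> entry : partition) {
                K key = entry.getKey();
                C updated = local.containsKey(key)
                        ? mergeValue.apply(local.get(key), entry.getValue())
                        : createCombiner.apply(entry.getValue());
                local.put(key, updated);
            }
            local.forEach((k, combiner) -> result.merge(k, combiner, mergeCombiners));
        }
        return result;
    }

    // aggregateByKey as a special case: the first accumulator for a key is
    // obtained by merging the first value into the zero value.
    static <K, V, C> Map<K, C> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            C zeroValue,
            BiFunction<C, V, C> seqOp,
            BinaryOperator<C> combOp) {
        return combineByKey(partitions, v -> seqOp.apply(zeroValue, v), seqOp, combOp);
    }

    public static void main(String[] args) {
        // Same pairs as in the Scala example, split over two hypothetical partitions.
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
                List.of(Map.entry("prova", 1), Map.entry("ciao", 2), Map.entry("prova", 2)),
                List.of(Map.entry("ciao", 4), Map.entry("prova", 3), Map.entry("ciao", 6)));

        Map<String, Integer> sums =
                aggregateByKey(partitions, 0, (acc, v) -> acc + v, Integer::sum);
        Map<String, Integer> sameSums =
                combineByKey(partitions, v -> v, (acc, v) -> acc + v, Integer::sum);
        System.out.println(sums);     // {prova=6, ciao=12}
        System.out.println(sameSums); // {prova=6, ciao=12}
    }
}
```

In this model the only difference is how the first accumulator of a key is produced, which matches the description above: a fixed zero value for aggregateByKey, a function of the first value for combineByKey.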
I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result 97.0.
This is quite counter-intuitive compared to the Scala version of fold
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
which gives the expected answer 13.0.
It seems quite likely that I have made some subtle mistake in the code due to a lack of understanding. I have read that the function used in RDD.fold() should be commutative, otherwise the result may depend on the partitioning, etc. For example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code will give me 169.0 on my machine!
Can someone explain what exactly is happening here?
Well, it is actually pretty well explained by the official documentation:
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
To illustrate what is going on, let's simulate it step by step:
val rdd = sc.parallelize(Array(2., 3.))
val byPartition = rdd.mapPartitions(
iter => Array(iter.fold(0.0)((p, v) => (p + v * v))).toIterator).collect()
It gives us something similar to this: Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0), and
byPartition.reduce((p, v) => (p + v * v))
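returns 97.0, because the same function is applied again to the per-partition results, so 4.0 and 9.0 are squared a second time (16 + 81 = 97).
An important thing to note is that results can differ from run to run, depending on the order in which partitions are combined.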
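The two-level fold can also be reproduced outside Spark. The sketch below is a plain-Java simulation, not Spark code: the class FoldSimulation and the method foldByPartition are hypothetical names, partitions are modelled as arrays, and the partial results are merged strictly in order (Spark makes no such ordering promise). It reproduces the numbers from the question and shows one way to avoid the problem, namely squaring first and folding with plain addition.

```java
import java.util.function.DoubleBinaryOperator;

// Simulates a fold that is applied per partition and then again
// to the per-partition results, each time starting from zero.
class FoldSimulation {

    static double foldByPartition(double[][] partitions, double zero, DoubleBinaryOperator op) {
        double result = zero;
        for (double[] partition : partitions) {
            double local = zero;
            for (double v : partition) {
                local = op.applyAsDouble(local, v);   // fold inside the partition
            }
            result = op.applyAsDouble(result, local); // fold the partial results
        }
        return result;
    }

    public static void main(String[] args) {
        DoubleBinaryOperator sumOfSquares = (p, v) -> p + v * v;

        double[][] onePartition = {{2.0, 3.0}};
        double[][] eightPartitions = {{}, {}, {}, {2.0}, {}, {}, {}, {3.0}};
        System.out.println(foldByPartition(onePartition, 0.0, sumOfSquares));    // 169.0
        System.out.println(foldByPartition(eightPartitions, 0.0, sumOfSquares)); // 97.0

        // Squaring first (a "map" step) leaves plain addition for the fold,
        // which is associative and commutative.
        double[][] squared = {{}, {}, {}, {4.0}, {}, {}, {}, {9.0}};
        System.out.println(foldByPartition(squared, 0.0, Double::sum));          // 13.0
    }
}
```

With one partition the partial result 13.0 is squared once more in the final merge (0 + 13 * 13 = 169), which is where the 169.0 in the question comes from.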
I came across the following Apache Spark code snippet:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). Looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs. But that does not sound clear. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. Conceptually, the ByKey part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Function signatures can sometimes be confusing, as they tend to be more generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
reduceByKey produces the same result as a groupByKey followed by a reduction of each group, BUT it performs what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled, the longer our computations will take, so it's a good idea to shuffle as little data as needed.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data (a single value per key instead of a list of values) needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
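The map-side reduce described above can be sketched in plain Java. This is a simplified model rather than Spark's implementation: the class ReduceByKeySketch and its methods are hypothetical, and the split of the seven names over two partitions is an assumption chosen for illustration. The two-argument function plays the role of Function2.call(): it is applied pairwise, first inside each partition and then to the partial results.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

class ReduceByKeySketch {

    // Pair every name with 1, as mapToPair does in the question.
    static List<Map.Entry<String, Integer>> toPairs(String... names) {
        return Arrays.stream(names)
                .map(name -> Map.entry(name, 1))
                .collect(Collectors.toList());
    }

    // Map-side step: reduce the values of each key inside one partition.
    static <K, V> Map<K, V> reduceWithinPartition(
            List<Map.Entry<K, V>> partition, BinaryOperator<V> func) {
        Map<K, V> local = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : partition) {
            local.merge(pair.getKey(), pair.getValue(), func); // func(a, b) on two values
        }
        return local;
    }

    // "Shuffle" step: only one partial value per key and partition is merged.
    static <K, V> Map<K, V> reduceByKey(
            List<List<Map.Entry<K, V>>> partitions, BinaryOperator<V> func) {
        Map<K, V> result = new LinkedHashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            reduceWithinPartition(partition, func)
                    .forEach((key, value) -> result.merge(key, value, func));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical split of data.txt over two partitions.
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
                toPairs("Mahesh", "Mahesh", "Ganesh", "Ashok"),
                toPairs("Abnave", "Ganesh", "Mahesh"));

        Map<String, Integer> firstPartition =
                reduceWithinPartition(partitions.get(0), Integer::sum);
        Map<String, Integer> counts = reduceByKey(partitions, Integer::sum);
        System.out.println(firstPartition); // {Mahesh=2, Ganesh=1, Ashok=1}
        System.out.println(counts);         // {Mahesh=3, Ganesh=2, Ashok=1, Abnave=1}
    }
}
```

In this model the first partition sends three partial values instead of four pairs; on real data with many repeated keys the saving is correspondingly larger.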
For a more detailed description, I can recommend this article :)
I have this function in Haskell, and I am wondering how it can be converted to Java, especially using streams:
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280], ((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]
(I'm not a Haskell expert, but I know enough to be dangerous.)
The example code given has several Haskell constructs that map reasonably
well into Java constructs:
A Haskell list is lazy, so it corresponds to a Java Stream.
The ranges used are of integers, so they correspond to IntStream.
For example, [240..1280] corresponds to IntStream.rangeClosed(240, 1280).
A range with a step has no direct correspondence in Java, but it can easily
be computed; you just have to do a bit of arithmetic and then map the values
from the sequential range to the one with steps. For example, [2, 4..20]
can be written as
IntStream.rangeClosed(1, 10).map(i -> 2 * i)
A condition on a list comprehension corresponds to filtering a stream through
a predicate.
A comprehension with multiple generators corresponds to flatmapping
of nested streams.
There is no general way to represent tuples in Java. Various third party
libraries provide tuple implementations with varying tradeoffs regarding
generics and boxing. Or, you can just write your own class with the fields
you want. (This can be quite tedious if you use lots of different kinds of
tuples, though.) In this case, the tuple is simply four ints, so it's easily
represented using an int array with four elements.
Putting it all together, we get the following.
static Stream<int[]> build() {
return IntStream.rangeClosed(240, 1280).boxed()
.flatMap(w -> IntStream.rangeClosed(1, 10).map(m -> 2 * m).boxed()
.flatMap(m -> IntStream.rangeClosed(2, 100).boxed()
.flatMap(n -> IntStream.rangeClosed(240, 1280)
.filter(g -> ((w - 2*m - n*g) % (n+1) == 0))
.filter(g -> n*g+2*m <= w)
.filter(g -> n*g <= w)
.mapToObj(g -> new int[] { w, m, n, g }))));
}
This is clearly quite verbose compared to the original Haskell, but you can easily see where the Haskell constructs have ended up in the Java code. I believe this is correct, as it seems to generate the same output as the Haskell code.
Note that we are generating values using IntStream but we want the flatmap to give a stream of arrays (which are objects), whereas IntStream.flatMap returns an IntStream. Perhaps ideally there would be a flatMapToObj operation, but there isn't, so we must box the int value into an Integer object and then call Stream.flatMap on it.
One could assign the stream pipeline to a variable of type Stream, but this wouldn't be very convenient, as Java streams can be used at most once. Since constructing such a stream is cheap (compared to evaluating it) it's reasonable to write a function build() that returns a freshly created stream, ready to be evaluated by the caller.
When the following Java code is run,
System.out.println(build().count());
System.out.println(build().findFirst().map(Arrays::toString).orElse("not found"));
System.out.println(build().reduce((a, b) -> b).map(Arrays::toString).orElse("not found"));
the result is:
654559
[484, 2, 2, 240]
[1280, 20, 5, 248]
Running the following Haskell code (the definition of build is copied from the question)
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280],
((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]
main = do
print (length build)
print (head build)
print (last build)
gives the following output:
654559
(484,2,2,240)
(1280,20,5,248)
So the transliteration appears correct to my eye.
Times for the head (in Java, findFirst) and last (in Java, reduce((a, b) -> b)) operations were as follows: (updated using GHC 7.6.3 -O2)
        head    last
GHC     8s      36s
JDK     3s      9s
This at least shows that both systems provide laziness, as the computation is short-circuited after the first element is found, whereas finding the last element requires all to be computed.
Interestingly, in Haskell, calling all three of length, head, and last doesn't take any more time than just calling last (around 36s) presumably because of memoization. There's no memoization in Java, but of course you could explicitly store the results in an array or List and process that multiple times.
Overall, though, I was startled at how much faster the Java implementation is. I don't really understand Haskell performance, so I'll leave it to Haskell experts to comment on that. It's quite possible I've done something wrong, though mostly I just copied the function from the question into a file and compiled it using GHC.
My environment:
JDK 9, GHC 7.6.3 -O2, MacBook Pro mid 2014 2-core 3GHz Intel Core i7
I was playing with Spark. I am a bit confused about how the aggregateByKey function works.
If I provide a non-zero initial value, it adds 2 * the initial value to the total.
Following is the code snippet:
JavaPairRDD<String, Integer> mapToPair = rdd.mapToPair(message -> new Tuple2<String, Integer>(message.split(",")[0], Integer.parseInt(message.split(",")[1])));
Function2<Integer, Integer, Integer> mergeValue = (v1, v2) -> v1 + v2;
Function2<Integer, Integer, Integer> mergeCombiners = (v1, v2) -> v1 + v2;
JavaPairRDD<String, Integer> aggregateByKey = mapToPair.aggregateByKey(1, mergeValue, mergeCombiners);
System.out.println("Aggregate by key "+ aggregateByKey.collect());
Following is my input rdd:
hello,1
hello,1
hello,1
hello,1
Output I am getting is
Aggregate by key [(hello,6)]
Please explain how it works.
zeroValue is added every time a new key is seen on the current partition, so it can be added as many times as you have partitions. It therefore must not change the result of the merge and seq ops. This is why 0 is valid for addition but 1 is not. In your case the four values were presumably split over two partitions, which gives 4 + 2 * 1 = 6.
I agree with @LostInOverflow, and here is an explanation of why Spark has a zeroValue as the first argument of aggregateByKey:
Both the 'merging values within a partition' (argument 2) and the 'merging values between partitions' (argument 3) functions read and update their first argument (the accumulator, which starts out as zeroValue) and return it instead of creating a new return value, to avoid extra memory allocation. This may be negligible for small-scale operations, but it is a memory-saving technique for very large-scale operations running on clusters with hundreds of nodes.
Hence the zero value should be chosen, based on the kind of operation performed in merge and combine, so that it does not affect the actual result (0 for addition, 1 for multiplication).
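The effect of the zero value can be reproduced with a small plain-Java model. This is not Spark code: the class ZeroValueSketch and the method aggregate are hypothetical, only a single key is modelled, and the split of the four values over two partitions is an assumption that happens to reproduce the result from the question.

```java
import java.util.function.IntBinaryOperator;

// Models aggregateByKey for one key whose values are spread over partitions.
class ZeroValueSketch {

    static int aggregate(int[][] partitions, int zeroValue,
                         IntBinaryOperator seqOp, IntBinaryOperator combOp) {
        Integer result = null;
        for (int[] partition : partitions) {
            if (partition.length == 0) {
                continue; // key never seen on this partition, zeroValue not used
            }
            int local = zeroValue; // one fresh accumulator per partition with the key
            for (int value : partition) {
                local = seqOp.applyAsInt(local, value);
            }
            result = (result == null) ? local : combOp.applyAsInt(result, local);
        }
        if (result == null) {
            throw new IllegalArgumentException("no values for the key");
        }
        return result;
    }

    public static void main(String[] args) {
        IntBinaryOperator add = (a, b) -> a + b;
        int[][] twoPartitions = {{1, 1}, {1, 1}}; // assumed split of the four records
        int[][] onePartition = {{1, 1, 1, 1}};

        System.out.println(aggregate(twoPartitions, 1, add, add)); // 6
        System.out.println(aggregate(onePartition, 1, add, add));  // 5
        System.out.println(aggregate(twoPartitions, 0, add, add)); // 4
        System.out.println(aggregate(onePartition, 0, add, add));  // 4
    }
}
```

Only the neutral element 0 gives a result that is independent of how the records are partitioned.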
I am currently reading O'Reilly's Java 8 Lambdas; it is a really good book. I came across an example like this.
I have the following code:
private final BiFunction<StringBuilder, String, StringBuilder> accumulator =
    (builder, name) -> {
        if (builder.length() > 0) builder.append(",");
        builder.append("Mister:").append(name);
        return builder;
    };
final Stream<String>stringStream = Stream.of("John Lennon","Paul Mccartney"
,"George Harrison","Ringo Starr");
final StringBuilder reduce = stringStream
.filter(a->a!=null)
.reduce(new StringBuilder(),accumulator,(left,right)->left.append(right));
System.out.println(reduce);
System.out.println(reduce.length());
This produces the right output:
Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr
My question is about the last parameter of the reduce method, which is a BinaryOperator.
What is this parameter used for? If I change the call to
.reduce(new StringBuilder(),accumulator,(left,right)->new StringBuilder());
the output is the same; if I pass null, a NullPointerException is thrown.
What is this parameter used for?
Update
Why do I receive different results when I run it on a parallelStream?
First run:
returned StringBuilder length = 420
Second run:
returned StringBuilder length = 546
Third run:
returned StringBuilder length = 348
and so on. Why is this? Should it not return all the values on every run?
The method reduce in the interface Stream is overloaded. The parameters for the method with three arguments are:
identity
accumulator
combiner
The combiner supports parallel execution. Apparently, it is not used for sequential streams. However, there is no such guarantee. If you change your stream into a parallel stream, I guess you will see a difference:
Stream<String>stringStream = Stream.of(
"John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr")
.parallel();
Here is an example of how the combiner can be used to transform a sequential reduction into a reduction that supports parallel execution. There is a stream with four Strings, and acc is used as an abbreviation for accumulator.apply. Then the result of the reduction can be computed as follows:
acc(acc(acc(acc(identity, "one"), "two"), "three"), "four");
With a compatible combiner, the above expression can be transformed into the following expression. Now it is possible to execute the two sub-expressions in different threads.
combiner.apply(
acc(acc(identity, "one"), "two"),
acc(acc(identity, "three"), "four"));
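The two expressions above can be checked with a concrete, side-effect-free reduction. The sketch below is only an illustration under assumed choices: the identity 0, an accumulator that adds the length of each string, and Integer::sum as a compatible combiner (the class and method names are hypothetical). Stream.reduce with three arguments is the standard JDK method discussed here.

```java
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.stream.Stream;

class CombinerSketch {
    // Assumed example reduction: total length of all strings.
    static final Integer IDENTITY = 0;
    static final BiFunction<Integer, String, Integer> ACCUMULATOR = (sum, s) -> sum + s.length();
    static final BinaryOperator<Integer> COMBINER = Integer::sum;

    // acc is an abbreviation for accumulator.apply, as in the text.
    static Integer acc(Integer partial, String s) {
        return ACCUMULATOR.apply(partial, s);
    }

    // One chain over all four elements.
    static int sequential() {
        return acc(acc(acc(acc(IDENTITY, "one"), "two"), "three"), "four");
    }

    // Two independent halves, joined by the combiner.
    static int split() {
        return COMBINER.apply(
                acc(acc(IDENTITY, "one"), "two"),
                acc(acc(IDENTITY, "three"), "four"));
    }

    // The same reduction, with the stream library deciding how to split the work.
    static int viaParallelStream() {
        return Stream.of("one", "two", "three", "four")
                .parallel()
                .reduce(IDENTITY, ACCUMULATOR, COMBINER);
    }

    public static void main(String[] args) {
        System.out.println(sequential());        // 15
        System.out.println(split());             // 15
        System.out.println(viaParallelStream()); // 15
    }
}
```

Because nothing is mutated and the combiner agrees with the accumulator, all three ways of evaluating the reduction give the same value.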
Regarding your second question, I use a simplified accumulator to explain the problem:
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder.append(name);
According to the Javadoc for Stream::reduce, the accumulator has to be associative. In this case, that would imply that the following two expressions return the same result:
acc(acc(acc(identity, "one"), "two"), "three")
acc(acc(identity, "one"), acc(acc(identity, "two"), "three"))
That's not true for the above accumulator. The problem is that you are mutating the object referenced by identity. That's a bad idea for the reduce operation. Here are two alternative implementations which should work:
// identity = ""
BiFunction<String,String,String> accumulator = String::concat;
// identity = null
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder == null
? new StringBuilder(name) : builder.append(name);
nosid's answer got it mostly right (+1) but I wanted to amplify a particular point.
The identity parameter to reduce must be an identity value. It's OK if it's an object, but if it is, it should be immutable. If the "identity" object is mutated, it's no longer an identity! For more discussion of this point, see my answer to a related question.
It looks like this example originated from Example 5-19 of Richard Warburton, Java 8 Lambdas, O'Reilly 2014. If so, I shall have to have a word about this with the good Dr. Warburton.
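When a mutable container such as a StringBuilder is really wanted, the stream API offers collect (mutable reduction) instead of reduce: it asks for a supplier, so every thread gets its own fresh container rather than sharing one "identity" object. The following is a sketch under my own assumptions (class and method names are hypothetical, and the separator handling in the combiner is my choice); Collectors.joining and the three-argument Stream.collect are standard JDK methods.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class MutableReductionSketch {

    // Simplest option: map each name, then let a built-in collector join them.
    static String joinWithCollector(List<String> names) {
        return names.parallelStream()
                .filter(Objects::nonNull)
                .map(name -> "Mister:" + name)
                .collect(Collectors.joining(","));
    }

    // Explicit mutable reduction: a new StringBuilder is supplied per chunk,
    // so no builder is ever shared between threads.
    static String joinWithStringBuilder(List<String> names) {
        return names.parallelStream()
                .filter(Objects::nonNull)
                .collect(
                        StringBuilder::new,
                        (StringBuilder builder, String name) -> {
                            if (builder.length() > 0) builder.append(",");
                            builder.append("Mister:").append(name);
                        },
                        (StringBuilder left, StringBuilder right) -> {
                            if (left.length() > 0 && right.length() > 0) left.append(",");
                            left.append(right);
                        })
                .toString();
    }

    public static void main(String[] args) {
        List<String> names =
                List.of("John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr");
        // Both print:
        // Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr
        System.out.println(joinWithCollector(names));
        System.out.println(joinWithStringBuilder(names));
    }
}
```

Both variants should give the same string on every run, sequential or parallel, because the source list is ordered and the combiner only ever joins separately built containers.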