I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result 97.0.
This is quite counter-intuitive compared to the Scala version of fold
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
which gives the expected answer 13.0.
It seems quite likely that I have made some subtle mistake in the code due to a lack of understanding. I have read that the function passed to RDD.fold() should be commutative, otherwise the result may depend on the partitioning. For example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code will give me 169.0 on my machine!
Can someone explain what exactly is happening here?
Well, it is actually pretty well explained by the official documentation:
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
To illustrate what is going on, let's simulate it step by step:
val rdd = sc.parallelize(Array(2., 3.))
val byPartition = rdd.mapPartitions(
  iter => Array(iter.fold(0.0)((p, v) => (p + v * v))).toIterator).collect()
It gives us something similar to this Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0) and
byPartition.reduce((p, v) => (p + v * v))
returns 97.0.
The important thing to note is that results can differ from run to run, depending on the order in which the partitions are combined.
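The two-level fold can also be reproduced without Spark. The following is only a sketch in plain Python: simulated_fold is a hypothetical helper, the partition layout is copied from the collect() output above, and partition results are always combined left to right, which a real cluster does not guarantee.

```python
from functools import reduce

def op(p, v):
    # The function from the question: neither commutative nor associative.
    return p + v * v

def simulated_fold(partitions, zero, func):
    # Step 1: fold every partition on its own, starting from the zero value.
    per_partition = [reduce(func, part, zero) for part in partitions]
    # Step 2: fold the per-partition results, again starting from the zero value.
    return reduce(func, per_partition, zero)

# Eight partitions, two of them non-empty, as in the collect() output above.
eight = [[], [], [], [2.0], [], [], [], [3.0]]
print(simulated_fold(eight, 0.0, op))          # 97.0

# A single partition: 0 + 2*2 + 3*3 = 13, then 0 + 13*13 = 169.
print(simulated_fold([[2.0, 3.0]], 0.0, op))   # 169.0

# Squaring first and folding with plain addition does not depend on the layout.
squared = [[v * v for v in part] for part in eight]
print(simulated_fold(squared, 0.0, lambda a, b: a + b))   # 13.0
```

The last call shows why addition is a safe choice: it is associative and commutative, and 0.0 is a true neutral element for it.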
I came across the following Apache Spark code snippet:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). Looking at the input and output of reduceByKey() and Function2.call(), I get the feeling that reduceByKey() somehow sends the values of the same key to call() in pairs, but that is not clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Function signatures can sometimes be confusing because they tend to be more generic.
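The grouping and reducing described above can be sketched in plain Python. This is an illustration only, not how Spark is implemented: reduce_by_key is a hypothetical helper, and the input list is the first line of output from the question.

```python
from functools import reduce

pairs = [("Mahesh", 1), ("Mahesh", 1), ("Ganesh", 1), ("Ashok", 1),
         ("Abnave", 1), ("Ganesh", 1), ("Mahesh", 1)]

def reduce_by_key(pairs, func):
    # The "ByKey" part: collect the values that belong to each key.
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # The "reduce" part: func is called with two values at a time and returns
    # one, until a single value per key is left. A key with only one value is
    # returned unchanged and func is never called for it.
    return {key: reduce(func, values) for key, values in grouped.items()}

counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)   # {'Mahesh': 3, 'Ganesh': 2, 'Ashok': 1, 'Abnave': 1}
```

The two-argument lambda plays the role of Function2.call(v1, v2): it only ever sees two values of the same key and returns their combination.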
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
ReduceByKey does the same as a groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as needed.
In a map-side reduce, Spark first sums all the values for a given key locally on the executors before it sends (shuffles) the results around for the final sum to be computed. This means that much less data, a single value per key instead of a list of values, needs to be sent between the machines in the cluster, and for this reason reduceByKey is usually preferable to groupByKey.
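To make the saving concrete, here is a small sketch of a map-side reduce in plain Python. The split of the records into two partitions is an assumption made for illustration, and local_reduce is a hypothetical helper; Spark's actual shuffle machinery is far more involved.

```python
def add(a, b):
    return a + b

def local_reduce(records, func):
    # Reduce the values per key within one collection of records.
    out = {}
    for key, value in records:
        out[key] = func(out[key], value) if key in out else value
    return out

# Assumed layout: the seven records spread over two partitions.
partitions = [
    [("Mahesh", 1), ("Mahesh", 1), ("Ganesh", 1), ("Ashok", 1)],
    [("Abnave", 1), ("Ganesh", 1), ("Mahesh", 1)],
]

# Map side: each partition is reduced locally before anything is sent.
combined = [local_reduce(part, add) for part in partitions]

# Shuffle: only the partial results travel across the network.
shuffled = [item for part in combined for item in part.items()]
print(len(shuffled))   # 6 records instead of the original 7

# Reduce side: the partial results are merged with the same function.
final = local_reduce(shuffled, add)
print(final)   # {'Mahesh': 3, 'Ganesh': 2, 'Ashok': 1, 'Abnave': 1}
```

With only seven records the saving is small, but the number of shuffled records is bounded by (keys × partitions) rather than by the number of input records.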
For a more detailed description, I can recommend this article :)
I have this function in Haskell, and I am wondering how it can be converted to Java, especially using streams:
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280], ((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]
(I'm not a Haskell expert, but I know enough to be dangerous.)
The example code given has several Haskell constructs that map reasonably
well into Java constructs:
A Haskell list is lazy, so it corresponds to a Java Stream.
The ranges used are of integers, so they correspond to IntStream.
For example, [240..1280] corresponds to IntStream.rangeClosed(240, 1280).
A range with a step has no direct correspondence in Java, but it can easily
be computed; you just have to do a bit of arithmetic and then map the values
from the sequential range to the one with steps. For example, [2, 4..20]
can be written as
IntStream.rangeClosed(1, 10).map(i -> 2 * i)
A condition on a list comprehension corresponds to filtering a stream through
a predicate.
A comprehension with multiple generators corresponds to flatmapping
of nested streams.
There is no general way to represent tuples in Java. Various third party
libraries provide tuple implementations with varying tradeoffs regarding
generics and boxing. Or, you can just write your own class with the fields
you want. (This can be quite tedious if you use lots of different kinds of
tuples, though.) In this case, the tuple is simply four ints, so it's easily
represented using an int array with four elements.
Putting it all together, we get the following.
static Stream<int[]> build() {
    return IntStream.rangeClosed(240, 1280).boxed()
        .flatMap(w -> IntStream.rangeClosed(1, 10).map(m -> 2 * m).boxed()
        .flatMap(m -> IntStream.rangeClosed(2, 100).boxed()
        .flatMap(n -> IntStream.rangeClosed(240, 1280)
            .filter(g -> ((w - 2*m - n*g) % (n+1) == 0))
            .filter(g -> n*g+2*m <= w)
            .filter(g -> n*g <= w)
            .mapToObj(g -> new int[] { w, m, n, g }))));
}
This is clearly quite verbose compared to the original Haskell, but you can easily see where the Haskell constructs have ended up in the Java code. I believe this is correct, as it seems to generate the same output as the Haskell code.
Note that we are generating values using IntStream but we want the flatmap to give a stream of arrays (which are objects), whereas IntStream.flatMap returns an IntStream. Perhaps ideally there would be a flatMapToObj operation, but there isn't, so we must box the int value into an Integer object and then call Stream.flatMap on it.
One could assign the stream pipeline to a variable of type Stream, but this wouldn't be very convenient, as Java streams can be used at most once. Since constructing such a stream is cheap (compared to evaluating it) it's reasonable to write a function build() that returns a freshly created stream, ready to be evaluated by the caller.
When the following Java code is run,
System.out.println(build().findFirst().map(Arrays::toString).orElse("not found"));
System.out.println(build().reduce((a, b) -> b).map(Arrays::toString).orElse("not found"));
the result is:
[484, 2, 2, 240]
[1280, 20, 5, 248]
Running the following Haskell code (the definition of build is copied from the question)
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280],
         ((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]

main = do
    print (length build)
    print (head build)
    print (last build)
gives the following output:
So the transliteration appears correct to my eye.
Times for the head (in Java, findFirst) and last (in Java, reduce((a, b) -> b)) operations were as follows: (updated using GHC 7.6.3 -O2)
        head    last
GHC     8s      36s
JDK     3s      9s
This at least shows that both systems provide laziness, as the computation is short-circuited after the first element is found, whereas finding the last element requires all to be computed.
Interestingly, in Haskell, calling all three of length, head, and last doesn't take any more time than just calling last (around 36s) presumably because of memoization. There's no memoization in Java, but of course you could explicitly store the results in an array or List and process that multiple times.
Overall, though, I was startled at how much faster the Java implementation is. I don't really understand Haskell performance, so I'll leave it to Haskell experts to comment on that. It's quite possible I've done something wrong, though mostly I just copied the function from the question into a file and compiled it using GHC.
My environment:
JDK 9, GHC 7.6.3 -O2, MacBook Pro mid 2014 2-core 3GHz Intel Core i7
I am very new to Apache Spark, so this may not be a good question to ask, but I don't understand the difference between combineByKey and aggregateByKey, or when to use which operation.
aggregateByKey takes an initial accumulator, a first lambda function to merge a value to an accumulator and a second lambda function to merge two accumulators.
combineByKey is more general and adds an initial lambda function to create the initial accumulator
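Here is an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
                                ("prova", 2), ("ciao", 4),
                                ("prova", 3), ("ciao", 6)))

// aggregateByKey: the initial accumulator is passed as a value.
pairs.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),
  (aggr1, aggr2) => aggr1 ::: aggr2
).collect()

// combineByKey: the initial accumulator is created by the first function.
pairs.combineByKey(
  (value) => List(value),
  (aggr: List[Any], value) => aggr ::: (value :: Nil),
  (aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect()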
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are all implemented on top of combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial value for the aggregation.
As the name suggests, aggregateByKey is suitable for computing per-key aggregations such as sum, avg, etc. The rule here is that the extra computation spent on the map-side combine should reduce the amount of data sent to other nodes and to the driver. If your function satisfies this rule, you should probably use aggregateByKey.
combineByKey is more general and you have the flexibility to specify whether you'd like to perform map side combine. However, it is more complex to use. At minimum, you need to implement three functions: createCombiner, mergeValue, mergeCombiners.
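How the three functions fit together, and how aggregateByKey can be expressed through combineByKey, can be sketched in plain Python. This is a simplified model, not Spark's code: combine_by_key and aggregate_by_key are hypothetical helpers, and the split of the example pairs into two partitions is assumed.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Map side: build one combiner per key inside each partition.
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)   # key seen before
            else:
                acc[key] = create_combiner(value)         # first value for key
        per_partition.append(acc)
    # Reduce side: merge the combiners that different partitions produced.
    result = {}
    for acc in per_partition:
        for key, comb in acc.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # The first value of a key is merged into the zero value; the rest is
    # identical to combine_by_key. The zero value must not be mutated.
    return combine_by_key(partitions, lambda v: seq_op(zero, v), seq_op, comb_op)

# Assumed layout of the example pairs over two partitions.
partitions = [
    [("prova", 1), ("ciao", 2), ("prova", 2)],
    [("ciao", 4), ("prova", 3), ("ciao", 6)],
]

lists = combine_by_key(partitions, lambda v: [v],
                       lambda acc, v: acc + [v], lambda a, b: a + b)
print(lists)   # {'prova': [1, 2, 3], 'ciao': [2, 4, 6]}

# (sum, count) per key, the usual building block for an average.
stats = aggregate_by_key(partitions, (0, 0),
                         lambda acc, v: (acc[0] + v, acc[1] + 1),
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(stats)   # {'prova': (6, 3), 'ciao': (12, 3)}
```

The only difference between the two helpers is where the first accumulator comes from: a function of the first value for combine_by_key, a fixed zero value for aggregate_by_key.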
I followed the Xtend tutorial and the Movies example. At the end of this tutorial, you can find this question:
@Test def void sumOfVotesOfTop2() {
  val long sum = movies.sortBy[ -rating ].take(2).map[ numberOfVotes ].reduce[ a, b | a + b ]
  assertEquals(47_229L, sum)
}
First the movies are sorted by rating, then we take the best two. Next the list of movies is turned into a list of their numberOfVotes using the map function. Now we have a List which can be reduced to a single Long by adding the values.
You could also use reduce instead of map and reduce. Do you know how?
My question is: what is the best answer to that last question?
I found a way to compute the same "sum" value without using the map() extension method, but it seems awful to me. Here is my solution:
assertEquals(47229, this.movies.sortBy[ -rating ].take(2).reduce[m1, m2 | new Movie('', 0, 0.0, m1.numberOfVotes + m2.numberOfVotes,null)].numberOfVotes)
Is there a better (and cleaner) way to do that ?
You could use fold(R seed, (R,T)=>R function) instead of reduce((T,T)=>T):
movies.sortBy[ -rating ].take(2).fold(0L) [ result, movie | result + movie.numberOfVotes ]
Please note that map((T)=>R) does not perform any eager computation but is evaluated on demand, so performance should not be a concern for a solution that uses the map function. Nevertheless, fold is quite handy if you need to accumulate a result over a set of values where the result type is different from the element type.
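The same contrast between map + reduce and fold can be sketched in Python, where functools.reduce with an initial value behaves like a fold. The Movie tuple and its three sample entries are made up for illustration and are not the tutorial's data.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical stand-in for the tutorial's Movie class.
Movie = namedtuple("Movie", ["title", "rating", "number_of_votes"])
movies = [Movie("A", 8.9, 100), Movie("B", 7.5, 40), Movie("C", 9.1, 250)]

top2 = sorted(movies, key=lambda m: -m.rating)[:2]

# map + reduce: first turn movies into numbers, then add numbers to numbers.
via_map_reduce = reduce(lambda a, b: a + b,
                        map(lambda m: m.number_of_votes, top2))

# fold: the seed 0 fixes the result type, so the function can take a number
# and a movie and no intermediate Movie object has to be constructed.
via_fold = reduce(lambda result, movie: result + movie.number_of_votes, top2, 0)

print(via_map_reduce, via_fold)   # 350 350
```

Without the seed, the reducing function would have to return a Movie, which is exactly what forces the awkward dummy object in the question.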
Consider the following simple Python function by way of example:
from math import floor  # needed by the default quantizer

def quantize(data, nlevels, quantizer=lambda x, d: int(floor(x/d))):
    llim = min(data)
    delta = (max(data) - llim)/(nlevels - 1) # last level x == max(data) only
    y = type(data)
    if delta == 0:
        return y([0] * len(data))
    return y([quantizer(x - llim, delta) for x in data])
And here it is in action:
>>> from random import random
>>> data = [10*random() for _ in range(10)]
>>> data
[6.6181668777075018, 9.0511321773967737, 1.8967672216187881, 7.3396890304913951,
4.0566699095012835, 2.3589022034131069, 0.76888247730320769, 8.994874996737197,
7.1717500363578246, 2.887112256757157]
>>> quantize(data, nlevels=5)
[2, 4, 0, 3, 1, 0, 0, 3, 3, 1]
>>> quantize(tuple(data), nlevels=5)
(2, 4, 0, 3, 1, 0, 0, 3, 3, 1)
>>> from math import floor
>>> quantize(data, nlevels=5, quantizer=lambda x, d: (floor(x/d) + 0.5))
[2.5, 4.5, 0.5, 3.5, 1.5, 0.5, 0.5, 3.5, 3.5, 1.5]
This function certainly has flaws --for one thing, it does not validate arguments, and it should be smarter about how it sets the type of the returned value--, but it has the virtue that it will work whether the elements in data are integers or floats or some other numeric type. Also, by default it returns a list of ints, though, by passing a suitable function as the optional quantizer argument, this type can be changed to something else. Furthermore, if the data parameter is a list, the returned value will be a list; if data is a tuple, the returned value will be a tuple. (This last feature is certainly the weakest one, but it is also the one that I'm least interested in replicating in Java, so I did not bother to make it more robust.)
I would like to write an efficient Java equivalent of this function, which means figuring out how to get around Java's typing. Since I learned Java (aeons ago), generics were introduced into the language. I've tried learning about Java generics, but find them pretty incomprehensible. I don't know if this is due to early-onset senility, or because of the sheer growth in Java's complexity since I last programmed in it (ca. 2001), but every page I find on this topic is more confusing than the previous one. I'd really appreciate it if someone could show me how to do this in Java.
One solution to the input/output type question might be to use the Number class and its subclasses along with wildcards. If you want to accept any type of numerical argument, you can specify the input type as either Number or ? extends Number. If the input is a list, the latter form has an advantage: a List<? extends Number> parameter accepts a List<Integer> or a List<Double>, whereas a List<Number> parameter accepts neither. The ? is known as a wildcard; written as ? extends Number it is a "bounded wildcard" and may only refer to Number or one of its subtypes.
public List<Number> func(List<? extends Number> data, Number nlevels)
This would take a List of a specific subclass of Number, a Number for the nlevels parameter, and return a List of Numbers
As for the function input parameter, it would be possible to input a Method, though the type checking at this point gets difficult, as you will be passing data of a bounded unknown parameter to a Method object. I am not exactly sure how that would work.
As for the return type, it would be possible to specify another parameter, a class object (likely a ? extends Number again) that the list elements would be cast (or converted) to.
public List<? extends Number> quantize(List<? extends Number> data,
Number nlevels,
Method quantizer,
Class<? extends Number> returnType)
That is an attempt at a possible declaration for your function in Java. The implementation, however, is somewhat more complex.
It's not quite what you asked, but may I suggest trying out Jython? You'd be able to take your Python code and compile directly to Java bytecode. As you haven't used Java since 2001 and you seem to be using Python these days, you may find Jython much easier to work with than having to bring yourself up to speed on all the changes in Java beforehand.