Understanding JavaPairRDD.reduceByKey function - java

I came across the following Apache Spark code snippet:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs. But that simply is not clear. Can anyone explain precisely how reduceByKey() and Function2.call() work together?

As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok, 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
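The group-then-reduce behaviour described above can be sketched in plain Java. This is a local simulation, not Spark's actual distributed implementation; `Map.merge` applies the reducer pairwise exactly the way `Function2.call()` is applied, one pair of values at a time:

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class ReduceByKeyDemo {
    // Local (non-distributed) simulation of reduceByKey: every time a key
    // recurs, the reducer is called with the running value and the new value.
    static Map<String, Integer> reduceByKey(List<String> keys,
                                            BinaryOperator<Integer> reducer) {
        Map<String, Integer> result = new TreeMap<>();
        for (String key : keys) {
            // merge puts 1 for a new key, otherwise calls reducer(oldValue, 1)
            result.merge(key, 1, reducer);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Mahesh", "Mahesh", "Ganesh", "Ashok",
                                     "Abnave", "Ganesh", "Mahesh");
        System.out.println(reduceByKey(lines, (a, b) -> a + b));
        // {Abnave=1, Ashok=1, Ganesh=2, Mahesh=3}
    }
}
```

Note that the reducer never sees a whole set at once; it only ever combines two values into one, which is why `Function2<V,V,V>` is all Spark needs.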
Function signatures can sometimes be confusing as they tend to be more generic.

I'm assuming you already understand groupByKey? groupByKey groups all values for a given key into a list (or iterable) so that you can then do something with them - like, say, sum (or count) the values. Basically, what sum does is reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value, and that is what Function2 needs to do when you write your own: it takes in two values and returns one value.
reduceByKey does the same as groupByKey, BUT it performs what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled, the longer our computations take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on each executor before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
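The saving can be made concrete with a plain-Java simulation. The two-partition split below is hypothetical and the code is not Spark's machinery; it only illustrates that each partition ships one (key, partialSum) record per key instead of one record per input line:

```java
import java.util.*;

public class MapSideReduceDemo {
    // Locally combine one partition's records: one (key, partialSum) per key.
    static Map<String, Integer> combineLocally(List<String> partition) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : partition) partial.merge(key, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        // Hypothetical split of the question's data across two partitions.
        List<String> p1 = List.of("Mahesh", "Mahesh", "Ganesh", "Ashok");
        List<String> p2 = List.of("Abnave", "Ganesh", "Mahesh");

        Map<String, Integer> local1 = combineLocally(p1); // 3 records shuffled, not 4
        Map<String, Integer> local2 = combineLocally(p2); // 3 records shuffled

        // Final merge after the shuffle.
        Map<String, Integer> counts = new TreeMap<>(local1);
        local2.forEach((k, v) -> counts.merge(k, v, Integer::sum));
        System.out.println(counts); // {Abnave=1, Ashok=1, Ganesh=2, Mahesh=3}
    }
}
```

With many duplicate keys per partition the reduction in shuffled records grows much larger than in this tiny example.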
For a more detailed description, I can recommend this article :)

Related

Difference between combineByKey and aggregateByKey

I am very new to Apache Spark, so this question might not be well posed, but I am not getting the difference between combineByKey and aggregateByKey and when to use which operation.
aggregateByKey takes an initial accumulator, a first lambda function to merge a value into an accumulator, and a second lambda function to merge two accumulators.
combineByKey is more general and instead takes an initial lambda function that creates the initial accumulator.
Here is an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
  ("prova", 2), ("ciao", 4),
  ("prova", 3), ("ciao", 6)))

pairs.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),
  (aggr1, aggr2) => aggr1 ::: aggr2
).collect().toMap

pairs.combineByKey(
  (value) => List(value),
  (aggr: List[Any], value) => aggr ::: (value :: Nil),
  (aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect().toMap
combineByKey is more general than aggregateByKey. In fact, aggregateByKey, reduceByKey and groupByKey are all implemented via combineByKey. aggregateByKey is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing aggregations by key, such as sum, avg, etc. The rule here is that the extra computation spent on the map-side combine can reduce the amount of data sent out to other nodes and the driver. If your function satisfies this rule, you probably should use aggregateByKey.
combineByKey is more general, and you have the flexibility to specify whether you'd like to perform a map-side combine. However, it is more complex to use. At minimum, you need to implement three functions: createCombiner, mergeValue, mergeCombiners.
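The roles of createCombiner and mergeValue can be sketched in plain Java. This is a local, single-partition simulation whose names mirror Spark's callbacks, not Spark code itself; in real Spark, mergeCombiners would additionally merge the per-partition maps:

```java
import java.util.*;
import java.util.function.*;

public class CombineByKeyDemo {
    // Local, single-partition simulation of combineByKey's first two callbacks.
    static <K, V, C> Map<K, C> combineByKey(List<Map.Entry<K, V>> pairs,
                                            Function<V, C> createCombiner,
                                            BiFunction<C, V, C> mergeValue) {
        Map<K, C> combiners = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : pairs) {
            C c = combiners.get(e.getKey());
            combiners.put(e.getKey(),
                    c == null ? createCombiner.apply(e.getValue())  // first value for this key
                              : mergeValue.apply(c, e.getValue())); // every later value
        }
        // In Spark, mergeCombiners would then merge such maps across partitions.
        return combiners;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("prova", 1), Map.entry("ciao", 2),
                Map.entry("prova", 2), Map.entry("ciao", 4));
        Map<String, List<Integer>> grouped = combineByKey(pairs,
                v -> new ArrayList<>(List.of(v)),            // createCombiner
                (list, v) -> { list.add(v); return list; }); // mergeValue
        System.out.println(grouped); // {prova=[1, 2], ciao=[2, 4]}
    }
}
```

The key point the Scala example above makes is visible here too: because the combiner type C can differ from the value type V, combineByKey can build lists from integers, which reduceByKey cannot.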

Apache Spark: how aggregateByKey works

I was playing with Spark. I am a bit confused about the working of the aggregateByKey function.
If I provide a non-zero initial value, it adds 2 * (initial value) to the total.
Following is the code snippet:
JavaPairRDD<String, Integer> mapToPair = rdd.mapToPair(message -> new Tuple2<String, Integer>(message.split(",")[0], Integer.parseInt(message.split(",")[1])));
Function2<Integer, Integer, Integer> mergeValue = (v1, v2) -> v1 + v2;
Function2<Integer, Integer, Integer> mergeCombiners = (v1, v2) -> v1 + v2;
JavaPairRDD<String, Integer> aggregateByKey = mapToPair.aggregateByKey(1, mergeValue, mergeCombiners);
System.out.println("Aggregate by key "+ aggregateByKey.collect());
Following is my input rdd:
hello,1
hello,1
hello,1
hello,1
Output I am getting is
Aggregate by key [(hello,6)]
Please explain how this works.
zeroValue is added every time a new key is seen on the current partition, so it can be added as many times as you have partitions. It therefore shouldn't change the result of the merge and seq ops. This is why 0 is valid for addition but 1 is not.
I agree with @LostInOverflow, and here is the explanation of why Spark has zeroValue as the first argument of aggregateByKey:
Both the 'merging values within a partition' function (argument 2) and the 'merging values between partitions' function (argument 3) read and update the first argument (zeroValue) and return it, instead of creating a new return value, to avoid extra memory allocation. This may be negligible for small-scale operations, but it is a memory-saving technique for very large-scale operations running on clusters with hundreds of nodes.
Hence, zeroValue should be a value chosen, based on the kind of operation performed in merge and combine, so as not to affect the actual result (0 for addition, 1 for multiplication).
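The per-partition behaviour can be sketched in plain Java. The two-partition split below is hypothetical; with zeroValue 1 the seed is folded in once per partition, which is exactly why the question above gets 6 instead of 4:

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class ZeroValueDemo {
    // Simulates aggregateByKey for a single key: fold each partition
    // starting from zeroValue, then merge the per-partition results.
    static int aggregate(List<List<Integer>> partitions, int zeroValue,
                         BinaryOperator<Integer> seqOp, BinaryOperator<Integer> combOp) {
        return partitions.stream()
                .map(p -> p.stream().reduce(zeroValue, seqOp)) // zeroValue once per partition
                .reduce(combOp)
                .orElse(zeroValue);
    }

    public static void main(String[] args) {
        // Four (hello, 1) values split across two hypothetical partitions.
        List<List<Integer>> partitions = List.of(List.of(1, 1), List.of(1, 1));
        System.out.println(aggregate(partitions, 1, Integer::sum, Integer::sum)); // 6
        System.out.println(aggregate(partitions, 0, Integer::sum, Integer::sum)); // 4
    }
}
```

With a different number of partitions the result with zeroValue 1 would change again (one extra 1 per partition), while zeroValue 0 always yields 4.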

How to do a counter of specific words in Java?

Hey guys, I am developing a project with 4 questions that someone can rate as great, good, regular, or poor. Afterwards I need to check, for each of the 4 questions, how many people voted great, good, regular, and poor. So I would like to read the .txt and count how many times each of the words (great, good, regular, and poor) appears in it. I was trying to do it as in Python, where you only need a dictionary (or a counter) and can simply do something like:
dict["great"] += 1
However, it isn't possible to do so in Java. Does anyone know a method that would be similar to this one in Java, or another simple way to do it (without having to create a lot of variables to save each question's answers)?
Thank you very much for your help.
In Java 8 the compute method was added to the Map interface. It may be a bit more complicated than in Python, but it's probably the closest it gets to the Python code:
Map<String, Integer> map = new HashMap<>();
String rating = ...
map.compute(rating, (key, oldValue) -> ((oldValue == null) ? 1 : oldValue+1));
The lambda expression passed as the second parameter to compute receives, as its second parameter, the old value the key was mapped to, or null if there was no mapping.
This is 100% possible in Java.
Use a HashMap to store the values.
For example:
HashMap<String, Integer> counts = new HashMap<>();
counts.put("great", 0);
counts.put("good", 0);
counts.put("regular", 0);
counts.put("poor", 0);
Now, suppose you read in a string input.
To increase the counter, do:
counts.put(input, counts.get(input) + 1);
This will increase the counter in that position by 1.
Use counts.get(input) to get the counter of input string.
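Putting the two answers together, here is a complete sketch using Map.merge, which behaves like the compute call above but is a little terser. The list of answers is a hypothetical stand-in for the contents of the .txt file:

```java
import java.util.*;

public class RatingCounter {
    static Map<String, Integer> countRatings(List<String> ratings) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String rating : ratings) {
            // merge inserts 1 for a new key, otherwise adds 1 to the old value
            counts.merge(rating, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical answers read from the .txt file.
        List<String> answers = List.of("great", "good", "great", "poor", "regular", "great");
        System.out.println(countRatings(answers)); // {good=1, great=3, poor=1, regular=1}
    }
}
```

Unlike the get-then-put version, merge needs no pre-seeded zero counts: keys appear in the map only once they have actually been voted for.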

Difference between traditional imperative style of programming and functional style of programming

I have a problem statement here
What I need to do is iterate over a list, find the first integer which is greater than 3 and is even, then double it and return it.
These are some methods to check how many operations get performed:
public static boolean isGreaterThan3(int number) {
    System.out.println("WhyFunctional.isGreaterThan3 " + number);
    return number > 3;
}

public static boolean isEven(int number) {
    System.out.println("WhyFunctional.isEven " + number);
    return number % 2 == 0;
}

public static int doubleIt(int number) {
    System.out.println("WhyFunctional.doubleIt " + number);
    return number << 1;
}
With Java 8 streams I can do it like:
List<Integer> integerList = Arrays.asList(1, 2, 3, 5, 4, 6, 7, 8, 9, 10);
integerList.stream()
    .filter(WhyFunctional::isGreaterThan3)
    .filter(WhyFunctional::isEven)
    .map(WhyFunctional::doubleIt)
    .findFirst();
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
Optional[8]
So in total, 8 operations.
And in imperative style, before Java 8, I could code it like:
for (Integer integer : integerList) {
    if (isGreaterThan3(integer)) {
        if (isEven(integer)) {
            System.out.println(doubleIt(integer));
            break;
        }
    }
}
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
8
and the operations are the same. So my question is: what difference does it make if I use streams rather than a traditional for loop?
The Stream API introduces the new idea of streams, which allows you to decouple the task in a new way. For example, based on your task, it's possible that you want to do different things with the doubled even numbers greater than three. In one place you want to find the first one, in another place you need 10 such numbers, in a third place you want to apply more filtering. You can encapsulate the algorithm of finding such numbers like this:
static IntStream numbers() {
    return IntStream.range(1, Integer.MAX_VALUE)
        .filter(WhyFunctional::isGreaterThan3)
        .filter(WhyFunctional::isEven)
        .map(WhyFunctional::doubleIt);
}
Here it is. You've just created an algorithm to generate such numbers (without actually generating them) and you don't care how they will be used. One user might call:
int num = numbers().findFirst().get();
Another user might need to get 10 such numbers:
int[] tenNumbers = numbers().limit(10).toArray();
A third user might want to find the first matching number which is also divisible by 7:
int result = numbers().filter(n -> n % 7 == 0).findFirst().get();
It would be more difficult to encapsulate the algorithm in traditional imperative style.
In general the Stream API is not about the performance (though parallel streams may work faster than traditional solution). It's about the expressive power of your code.
The imperative style complects the computational logic with the mechanism used to achieve it (iteration). The functional style, on the other hand, decomplects the two. You code against an API to which you supply your logic and the API has the freedom to choose how and when to apply it.
In particular, the Streams API has two ways to apply the logic: either sequentially or in parallel. The latter is actually the driving force behind the introduction of both lambdas and the Streams API itself into Java.
The freedom to choose when to perform computation gives rise to laziness: whereas in the imperative style you have a concrete collection of data, in the functional style you can have a collection paired with logic to transform it. The logic can be applied "just in time", when you actually consume the data. This further allows you to spread the building up of computation: each method can receive a stream and apply a further step of computation on it, or it can consume it in different ways (by collecting into a list, by finding just the first item and never applying computation to the rest, but calculating an aggregate value, etc.).
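The "just in time" evaluation described above can be observed directly. This small sketch uses peek (not part of the original question's code) purely to trace which elements the pipeline actually pulls:

```java
import java.util.*;

public class LazinessDemo {
    // Traces which elements a short-circuiting stream actually pulls.
    static List<String> traceFirstEven(List<Integer> numbers) {
        List<String> trace = new ArrayList<>();
        numbers.stream()
                .peek(n -> trace.add("saw " + n)) // records every element pulled
                .filter(n -> n % 2 == 0)
                .findFirst();                     // short-circuits at the first match
        return trace;
    }

    public static void main(String[] args) {
        // Only 1 and 2 are ever pulled; 3, 4 and 5 are never touched.
        System.out.println(traceFirstEven(List.of(1, 2, 3, 4, 5))); // [saw 1, saw 2]
    }
}
```

Nothing is traversed until the terminal operation (findFirst) runs, and the traversal stops as soon as the result is known: the data and the logic meet only on demand.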
As a particular example of the new opportunities offered by laziness, I was able to write a Spring MVC controller which returned a Stream whose data source was a database—and at the time I return the stream, the data is still in the database. Only the View layer will pull the data, implicitly applying the transformation logic it has no knowledge of, never having to retain more than a single stream element in memory. This converted a solution which classically had O(n) space complexity into O(1), thus becoming insensitive to the size of the result set.
Using the Stream API you are describing an operation instead of implementing it. One commonly known advantage of letting the Stream API implement the operation is the option of using different execution strategies like parallel execution (as already said by others).
Another feature which seems to be a bit underestimated is the possibility to alter the operation itself in a way that is impossible to do in an imperative programming style as that would imply modifying the code:
IntStream is=IntStream.rangeClosed(1, 10).filter(i -> i > 4);
if(evenOnly) is=is.filter(i -> (i&1)==0);
if(doubleIt) is=is.map(i -> i<<1);
is.findFirst().ifPresent(System.out::println);
Here, the decision whether to filter out odd numbers or double the result is made before the terminal operation is commenced. In imperative programming, you either have to recheck the flags within the loop or code multiple alternative loops. It should be mentioned that checking such conditions within a loop isn't that bad on today's JVM, as the optimizer is capable of moving them out of the loop at runtime, so coding multiple loops is usually unnecessary.
But consider the following example:
Stream<String> s = Stream.of("java8 streams", "are cool");
if(singleWords) s=s.flatMap(Pattern.compile("\\s")::splitAsStream);
s.collect(Collectors.groupingBy(str->str.charAt(0)))
.forEach((k,v)->System.out.println(k+" => "+v));
Since flatMap is the equivalent of a nested loop, coding the same in an imperative style isn't that simple any more, as we have either a simple loop or a nested loop depending on a runtime value. Usually, you have to resort to splitting the code into multiple methods if you want to share it between both kinds of loops.
I already encountered a real-life example where the composition of a complex operation had multiple conditional flatMap steps. The equivalent imperative code is insane…
1) The functional approach allows a more declarative way of programming: you just provide a list of functions to apply and don't need to write the iterations manually, so your code is sometimes more concise.
2) If you switch to a parallel stream (https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html), it becomes possible to automatically parallelize your program and execute it faster. This is possible because you don't explicitly code the iteration, but just list the functions to apply, so the compiler/runtime can parallelize it.
In this simple example, there is little difference, and the JVM will try to do the same amount of work in each case.
Where you start to see a difference is in more complicated examples, like
integerList.parallelStream()
Making the code concurrent for a loop is much harder. Note: you wouldn't actually do this here, as the overhead would be too high and you only want the first element.
BTW, the first example returns the result and the second prints it.

Xtend "Movies example" best answer

I followed the Xtend tutorial and the Movies example. At the end of this tutorial, you can find this question:
@Test def void sumOfVotesOfTop2() {
    val long sum = movies.sortBy[ -rating ].take(2).map[ numberOfVotes ].reduce[ a, b | a + b ]
    assertEquals(47_229L, sum)
}
First the movies are sorted by rating, then we take the best two. Next the list of movies is turned into a list of their numberOfVotes using the map function. Now we have a List which can be reduced to a single Long by adding the values.
You could also use reduce instead of map and reduce. Do you know how?
My question is: What is the best answer for the last question ?
I found a way to compute the same "sum" value without using map() extension method, but it seems awful for me. Here is my solution:
assertEquals(47229, this.movies.sortBy[ -rating ].take(2).reduce[m1, m2 | new Movie('', 0, 0.0, m1.numberOfVotes + m2.numberOfVotes,null)].numberOfVotes)
Is there a better (and cleaner) way to do that ?
You could use fold(R seed, (R,T)=>R function) instead of reduce((T,T)=>T):
assertEquals(47229,
    movies
        .sortBy[rating]
        .reverseView
        .take(2)
        .fold(0L) [ result, movie | result + movie.numberOfVotes ])
Please note that map((T)=>R) does not perform any eager computation but is evaluated on demand, so performance should not matter for a solution that uses the map function. Nevertheless, fold is quite handy if you need to accumulate a result over a set of values where the result type is different from the element type.
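For readers more familiar with Java, the same fold-versus-reduce distinction exists in the Stream API. The Movie record below is a hypothetical stand-in for the tutorial's Movie class (which is not shown here); reduce with an identity seed plays the role of Xtend's fold, so no dummy Movie has to be constructed:

```java
import java.util.*;

public class FoldVsReduce {
    // Hypothetical stand-in for the tutorial's Movie class.
    record Movie(double rating, long numberOfVotes) {}

    static long sumOfVotesOfTop2(List<Movie> movies) {
        return movies.stream()
                .sorted(Comparator.comparingDouble(Movie::rating).reversed())
                .limit(2)
                // a seeded reduce, like Xtend's fold(0L)[...]: the accumulator
                // is a long even though the stream elements are Movies
                .reduce(0L, (acc, m) -> acc + m.numberOfVotes(), Long::sum);
    }

    public static void main(String[] args) {
        // Invented ratings and vote counts, chosen to reproduce the 47229 total.
        List<Movie> movies = List.of(new Movie(9.0, 17_229),
                                     new Movie(8.8, 30_000),
                                     new Movie(7.5, 5_000));
        System.out.println(sumOfVotesOfTop2(movies)); // 47229
    }
}
```

As in the Xtend answer, the point is that a seeded fold lets the accumulator type differ from the element type, which a plain two-argument reduce cannot do.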