Say I have an input as follows:
(1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
The expected output is as follows:
(1,(2,3,4)) -> (1,3) // the second index is the total number of friends
(2,(1,3,4)) -> (2,3)
(3,(1,2)) -> (3,2)
(4,(1,2)) -> (4,2)
I know how to do this with a HashSet in Java, but I don't know how it works with the MapReduce model. Can anyone share any ideas or sample code for this problem? I would appreciate it.
------------------------------------------------------------------------------------
Here is my naive solution: one mapper, two reducers.
The mapper will take the input (1,2),(2,1),(1,3) and
organize its output as
*(1,hashset<2>), (2,hashset<1>), (1,hashset<2>), (2,hashset<1>), (1,hashset<3>), (3,hashset<1>)*
Reducer1:
takes the mapper's output as input and outputs:
*(1,hashset<2,3>), (3,hashset<1>) and (2,hashset<1>)*
Reducer2:
takes reducer1's output as input and outputs:
*(1,2), (3,1) and (2,1)*
This is only my naive solution. I'm not sure whether this can be implemented with Hadoop.
I think there should be an easy way to solve this problem.
Mapper Input: (1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Just emit two records for each pair like this:
Mapper Output/ Reducer Input:
Key => Value
1 => 2
2 => 1
2 => 1
1 => 2
1 => 3
3 => 1
3 => 2
2 => 3
2 => 4
4 => 2
4 => 1
1 => 4
At reducer side, you'll get 4 different groups like this:
Reducer Output:
Key => Values
1 => [2,3,4]
2 => [1,3,4]
3 => [1,2]
4 => [1,2]
Now, you are good to format your result as you want. :)
Let me know if anybody can see any issue with this approach.
1) Intro / Problem
Before going ahead with the job driver, it is important to understand that, in a simple-minded approach, the values of the reducers should be sorted in ascending order. The first thought is to pass the value list unsorted and do some sorting in the reducer per key. This has two disadvantages:
1) It is most probably not efficient for large value lists
and
2) How will the framework know if (1,4) is equal to (4,1) if these pairs are processed in different parts of the cluster?
2) Solution in theory
The way to do it in Hadoop is to "mock" the framework in a way by creating a synthetic key.
So our map function instead of the "conceptually more appropriate" (if I may say that)
map(k1, v1) -> list(k2, v2)
is the following:
map(k1, v1) -> list(ksynthetic, null)
As you can see, we discard the values (the reducer still gets a list of null values, but we don't really care about them). What happens here is that the values are actually encoded inside ksynthetic. Here is an example for the problem in question:
map(1, 2) -> list([1,2], null)
However, some more operations need to be done so that the keys are grouped and partitioned appropriately and we achieve the correct result in the reducer.
3) Hadoop Implementation
We will implement a class called FFGroupComparator and a class called FindFriendPartitioner.
Here is our FFGroupComparator:
public static class FFGroupComparator extends WritableComparator
{
    protected FFGroupComparator()
    {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(",");
        String[] t2Items = t2.toString().split(",");
        String t1Base = t1Items[0];
        String t2Base = t2Items[0];
        return t1Base.compareTo(t2Base); // compare using the "real" key part of our synthetic key
    }
}
This class will act as our grouping comparator class. It controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context). This is very important, as it ensures that each reducer gets the appropriate synthetic keys (judging by the real key).
Because Hadoop runs on a cluster with many nodes, it is also important to ensure that keys are partitioned by the real key (not the synthetic one), so that every synthetic key sharing the same real key ends up in the same reduce task. As usual, we do this with hash values: we compute the partition a synthetic key belongs to from the hash value of the real key (the part before the comma). So our FindFriendPartitioner is as follows:
public static class FindFriendPartitioner extends Partitioner<Text, NullWritable>
{
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions)
    {
        String[] keyItems = key.toString().split(",");
        String keyBase = keyItems[0];
        // Guard against negative hash codes before taking the modulo
        return (keyBase.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
So now we are all set to write the actual job and solve our problem.
I am assuming your input file looks like this:
1,2
2,1
1,3
3,2
2,4
4,1
We will use the TextInputFormat.
Here's the code for the job driver using Hadoop 1.0.4:
public class FindFriendTwo
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, NullWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            context.write(value, NullWritable.get());
            String[] tempStrings = value.toString().split(",");
            Text value2 = new Text(tempStrings[1] + "," + tempStrings[0]); // reverse relationship
            context.write(value2, NullWritable.get());
        }
    }
Notice that we also emit the reverse relationship in the map function.
For example, if the input string is (1,4), we must not forget (4,1).
    public static class FindFriendReducer extends Reducer<Text, NullWritable, IntWritable, IntWritable> {

        private Set<String> friendsSet;

        public void setup(Context context)
        {
            friendsSet = new LinkedHashSet<String>();
        }

        public void reduce(Text syntheticKey, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {

            String[] tempKeys = syntheticKey.toString().split(",");
            friendsSet.add(tempKeys[1]);

            if (friendsSet.size() == 2)
            {
                IntWritable key = new IntWritable(Integer.parseInt(tempKeys[0]));
                IntWritable value = new IntWritable(Integer.parseInt(tempKeys[1]));
                context.write(key, value);
            }
        }
    }
Finally, we must remember to include the following in our Main Class, so that the framework uses our classes.
job.setGroupingComparatorClass(FFGroupComparator.class);
job.setPartitionerClass(FindFriendPartitioner.class);
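For completeness, a driver for this job might look roughly like the following. This is only a sketch: the job name is arbitrary and the input/output paths are assumed to come from the command line.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "find friends"); // on newer Hadoop: Job.getInstance(conf, ...)
    job.setJarByClass(FindFriendTwo.class);

    job.setMapperClass(FindFriendMapper.class);
    job.setReducerClass(FindFriendReducer.class);

    // The map output types differ from the final output types.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);

    // Wire in the custom grouping comparator and partitioner described above.
    job.setGroupingComparatorClass(FFGroupComparator.class);
    job.setPartitionerClass(FindFriendPartitioner.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}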
I would approach this problem as follows.
Make sure we have all the relations and have them exactly once each.
Then simply count the relations per person.
Notes on my approach:
My notation for key-value pairs is: K -> V.
Both key and value are almost always a data structure (not just a string or int).
I never use the key for data. The key is ONLY there to control the flow from the mappers towards the right reducer. In all other places I do not look at the key at all. The framework does require a key everywhere, though. With '()' I mean that there is a key which I ignore completely.
The key point of my approach is that it never needs 'all friends' in memory at the same moment (so it also works in really big situations).
We start with a lot of
(x,y)
and we know that we do not necessarily have every relationship in both directions in the dataset.
Mapper: Create all relations
Input: () -> (x,y)
Output: (x,y) -> (x,y)
(y,x) -> (y,x)
Reducer: Remove duplicates (simply only output the first one from the iterator)
Input: (x,y) -> [(x,y),(x,y),(x,y),(x,y),.... ]
Output: () -> (x,y)
Mapper: "Wordcount"
Input: () -> (x,y)
Output: (x) -> (x,1)
Reducer: Count them
Input: (x) -> [(x,1),(x,1),(x,1),(x,1),.... ]
Output: () -> (x,N)
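A minimal sketch of these two jobs in the Java MapReduce API might look like the following. This is only an illustration of the flow above (the class names are mine, and for brevity I carry the relation in the key of job 1, which deviates slightly from the notation above):
// Job 1: emit both directions of every relation, then deduplicate in the reducer.
public static class RelationMapper extends Mapper<Object, Text, Text, NullWritable> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        context.write(new Text(parts[0] + "," + parts[1]), NullWritable.get());
        context.write(new Text(parts[1] + "," + parts[0]), NullWritable.get());
    }
}

public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // All duplicates of a relation arrive under the same key; write it once.
        context.write(key, NullWritable.get());
    }
}

// Job 2: a plain word count over the first element of each deduplicated relation.
public static class CountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String person = value.toString().split(",")[0];
        context.write(new Text(person), ONE);
    }
}

public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}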
With help from so many excellent engineers, I finally tried out a solution.
Only one Mapper and one Reducer. No combiner here.
Input of Mapper:
1,2
2,1
1,3
3,1
3,2
3,4
5,1
Output of Mapper:
1,2
2,1
2,1
1,2
1,3
3,1
3,1
1,3
3,2
2,3
3,4
4,3
5,1
1,5
Output Of Reducer:
1 3
2 2
3 3
4 1
5 1
The first column is the user, the second is the number of friends.
In the reducer stage, I use a HashSet to help deduplicate the values.
Thanks @Artem Tsikiridis @Ashish.
Your answer gave me a nice clue.
Edited:
Added Code:
//mapper
public static class TokenizerMapper extends
Mapper<Object, Text, Text, Text> {
private Text word1 = new Text();
private Text word2 = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line,",");
if(itr.hasMoreElements()){
word1.set(itr.nextToken().toLowerCase());
}
if(itr.hasMoreElements()){
word2.set(itr.nextToken().toLowerCase());
}
context.write(word1, word2);
context.write(word2, word1);
//
}
}
//reducer
public static class IntSumReducer extends
Reducer<Text, Text, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
// Store Strings rather than the Text objects themselves: Hadoop reuses the
// same Text instance for every value, so a HashSet<Text> does not deduplicate reliably.
HashSet<String> set = new HashSet<String>();
int sum = 0;
for (Text val : values) {
    if (set.add(val.toString())) {
        sum++;
    }
}
result.set(sum);
context.write(key, result);
}
}
Related
I am working on the WordCount problem with MapReduce. I have used a txt file of Lewis Carroll's famous Through the Looking-Glass. It's a pretty big file. I ran my MapReduce code and it's working fine. Now I need to find the top 10 most frequent words, excluding "the", "am", "is", and "are". I have no idea how to handle this.
Here is my code
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
/* job.setSortComparatorClass(Text.Comparator.class);*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm personally not going to write code until I see an attempt on your end that requires more effort than Wordcount.
You need a second mapper and reducer to perform a Top N operation. If you used a higher level language such as Pig, Hive, Spark, etc. that's what it would do.
For starters, you can at least filter out those words as they come from itr.nextToken(), to prevent the first mapper from ever seeing them.
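For example, a minimal tweak to the existing TokenizerMapper might look like this (the stop-word set here is just the four words you listed; it requires java.util.Arrays, java.util.HashSet and java.util.Set):
private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "am", "is", "are"));

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(
            value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
    while (itr.hasMoreTokens()) {
        String token = itr.nextToken();
        if (!STOP_WORDS.contains(token)) { // skip the excluded words
            word.set(token);
            context.write(word, one);
        }
    }
}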
Then, in the reducer, your output will be unsorted, but you are already getting the sum for every word into some output directory, which is a necessary first step towards getting the top words.
The solution to the problem requires you to create a new Job object to read that first output directory, write to a new output directory, and for each line of text in the mapper generate null, line as the output (use NullWritable and Text).
With this, in the reducer, all lines of text will be sent into one reducer iterator, so in order to get the Top N items, you can create a TreeMap<Integer, String> to sort words by the count (refer Sorting Descending order: Java Map). While inserting elements, larger values will automatically get pushed to the top of the tree. You can optionally optimize this by tracking the smallest element in the tree as well, and only inserting items larger than it, and/or track the tree size and only insert items larger than the N'th item (this helps if you potentially have hundreds of thousands of words).
After the loop that adds all elements to the tree, take all the top N of the string values and their counts (the tree is already sorted for you), and write them out from the reducer. With that, you should end up with the Top N items.
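A sketch of what that second job's reducer could look like, assuming the first job's output lines have the form word<TAB>count, the second mapper emits them as (NullWritable, Text), and N = 10 (requires java.util.TreeMap and java.util.Map). The class name and the tie-handling are mine; with a plain TreeMap<Integer, String>, words with equal counts overwrite each other, so a fuller implementation might use a TreeMap<Integer, List<String>> instead.
public static class TopNReducer extends Reducer<NullWritable, Text, Text, IntWritable> {

    private static final int N = 10;
    // Sorted by count; the smallest count sits at firstKey(), so trimming is cheap.
    private final TreeMap<Integer, String> topWords = new TreeMap<Integer, String>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text line : values) {
            String[] parts = line.toString().split("\t");
            topWords.put(Integer.parseInt(parts[1]), parts[0]);
            if (topWords.size() > N) {
                topWords.remove(topWords.firstKey()); // drop the current smallest count
            }
        }
        // Emit in descending order of count.
        for (Map.Entry<Integer, String> e : topWords.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}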
My mapper class will output key-value pairs like:
abc 1
abc 2
abc 1
I want to merge the values and count the occurrences of each value in my reducer class using a HashMap, so that the output looks like:
abc 1:2 2:1
But my output result is:
abc 1:2:1 2:1:1
It feels like additional Strings are being concatenated onto the output, but I don't know why.
Here is my code:
Text combiner = new Text();
StringBuilder strBuilder = new StringBuilder();
@Override
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
HashMap<Text, Integer> result = new HashMap<Text, Integer>();
for (Text val : values) {
if(result.containsKey(val)){
int newVal = result.get(val) + 1;
result.put(val, newVal);
}else{
result.put(val, 1);
}
}
for(Map.Entry<Text, Integer> entry: result.entrySet()){
strBuilder.append(entry.getKey().toString());
strBuilder.append(":");
strBuilder.append(entry.getValue());
strBuilder.append("\t");
}
combiner.set(strBuilder.toString());
context.write(key, combiner);
}
I tested this code and it looks OK. The most likely reason you're getting output like this is that you're running this reducer as your combiner as well, which would explain why you're getting three values: the combine does the first concatenation, followed by the reduce which does a second.
You need to make sure a combiner isn't being configured in your job setup.
I would also suggest you change your code to make sure you store new versions of the Text values in your HashMap, remember Hadoop will be reusing the objects. So you should really be doing something like:
result.put(new Text(val), newVal);
or change your HashMap to store Strings, which is safe since they're immutable.
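For example, a version of the reduce method that stores Strings could look roughly like this (it also creates the StringBuilder locally; in the original it is a field, so output from earlier keys accumulates across calls):
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Count occurrences per distinct value. String keys are safe because Strings
    // are immutable, unlike the Text instances Hadoop reuses while iterating.
    HashMap<String, Integer> result = new HashMap<String, Integer>();
    for (Text val : values) {
        String v = val.toString();
        Integer current = result.get(v);
        result.put(v, current == null ? 1 : current + 1);
    }

    // Build the output for this key only, using a fresh StringBuilder per call.
    StringBuilder strBuilder = new StringBuilder();
    for (Map.Entry<String, Integer> entry : result.entrySet()) {
        strBuilder.append(entry.getKey()).append(":").append(entry.getValue()).append("\t");
    }
    context.write(key, new Text(strBuilder.toString()));
}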
Say I have an input file as below:
dept_id emp_id salary
1 13611 1234
2 13609 3245
3 13612 3251
2 13623 1232
1 13619 6574
3 13421 234
Now I want to find the average salary of each department, as in the following Hive query:
SELECT dept_id, avg(salary) FROM dept GROUP BY dept_id
This will return the output:
dept_id avg_sal
----------------
1 3904.0
2 2238.5
3 1742.5
Now, what I want to do is generate the same output, but using the MapReduce framework. How would I write it? Thanks in advance!
IMPORTANT:
Before attempting to implement this, first try some basic MapReduce examples, like a word count program, to understand the logic; and even before that, read a book or a tutorial about how MapReduce works.
The idea of aggregating stuff (like finding the average) is that you group by key (department id) in the map phase and then you reduce all the salaries of a specific department in the reduce phase.
More formally:
MAP:
input: a line representing a salary record (i.e., dep_id, emp_id, salary)
output (key,value): (dep_id, salary)
REDUCE:
input (key, values): (dep_id, salaries:list of salary values having this dep_id)
output (key, value): (dep_id, avg(salaries))
This way, all the salaries that belong to the same department will be handled by the same reducer. All you have to do in the reducer, is find the average of the input values.
Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AverageSalary {
public static class AvgMapper
extends Mapper<Object, Text, Text, FloatWritable>{
private Text dept_id = new Text();
private FloatWritable salary = new FloatWritable();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String values[] = value.toString().split("\t");
dept_id.set(values[0]);
salary.set(Float.parseFloat(values[2]));
context.write(dept_id, salary);
}
}
public static class AvgReducer
extends Reducer<Text,FloatWritable,Text,FloatWritable> {
private FloatWritable result = new FloatWritable();
public void reduce(Text key, Iterable<FloatWritable> values,
Context context
) throws IOException, InterruptedException {
float sum = 0;
float count = 0;
for (FloatWritable val : values) {
sum += val.get();
count++;
}
result.set(sum/count);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "average salary");
job.setJarByClass(AverageSalary.class);
job.setMapperClass(AvgMapper.class);
// Note: do not set AvgReducer as the combiner; an average of partial averages
// is not, in general, the overall average.
// job.setCombinerClass(AvgReducer.class);
job.setReducerClass(AvgReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path("/home/kishore/Data/mapreduce.txt")); // input path
FileOutputFormat.setOutputPath(job, new Path("/home/kishore/Data/map3")); // output path
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
1 3904.0
2 2238.5
3 1742.5
If you have not gone through any training program yet, watch the free videos on YouTube by Edureka to better understand the concepts: Map Reduce
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
Working example in Apache Hadoop website about : Word Count example
For your use case, a simple word count example will not suffice.
You will likely want to use a combiner and a custom partitioner with your mapper, since you are doing a group by. Visit this video: Advanced Map Reduce
My job contains a mapper and a reducer. The reducer computes the GPA and emits key-value pairs where the key is the name of the student and the value is the GPA. How can I make the reducer output sorted by the value (GPA)?
Reducer code:
public class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int i = 0;
int total = 0;
for (IntWritable value : values) {
i++;
total = total + value.get();
}
context.write(key, new IntWritable(total));
}
}
One way of doing this is using a secondary sort; see here. The idea is to add the value into the reducer key as well (a composite key) and let Hadoop do the sorting on the map output. This requires more changes to your existing design.
Another way (maybe easier) is, once your current job is done, to feed the output of the first job into a second job that swaps the key and value in its map. Since the sorting happens during the shuffle, the second job needs at least an identity reducer; its output will then be sorted by GPA. Any repeated students with the same GPA will come as a list for that specific GPA.
You can also try to sort the output in the reducer's cleanup method.
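A sketch of the second (swap) job's mapper, assuming the first job's output lines look like name<TAB>gpa; an identity reducer then writes the records back out, and the shuffle sorts them by GPA (ascending by default):
public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Input line: "name<TAB>gpa"; emit (gpa, name) so the shuffle sorts by GPA.
        String[] parts = line.toString().split("\t");
        context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}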
I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge; in fact, this is a triangle graph):
1 3
3 1
3 2
2 3
Now, I want to use MapReduce to count the triangles in this graph (obviously there is one). It is still a work in progress; in the first phase, I am trying to get a list of all neighbors for each vertex.
My main class looks like the following:
public class TriangleCount {
public static void main( String[] args ) throws Exception {
// remove the old output directory
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path("output/"), true);
JobConf firstPhaseJob = new JobConf(FirstPhase.class);
firstPhaseJob.setOutputKeyClass(IntWritable.class);
firstPhaseJob.setOutputValueClass(IntWritable.class);
firstPhaseJob.setMapperClass(FirstPhase.Map.class);
firstPhaseJob.setCombinerClass(FirstPhase.Reduce.class);
firstPhaseJob.setReducerClass(FirstPhase.Reduce.class);
FileInputFormat.setInputPaths(firstPhaseJob, new Path("input/"));
FileOutputFormat.setOutputPath(firstPhaseJob, new Path("output/"));
JobClient.runJob(firstPhaseJob);
}
}
My Mapper and Reducer implementations look like this; they are both very simple:
public class FirstPhase {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
@Override
public void map(LongWritable longWritable, Text graphLine, OutputCollector<IntWritable, IntWritable> outputCollector, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(graphLine.toString());
int n1 = Integer.parseInt(tokenizer.nextToken());
int n2 = Integer.parseInt(tokenizer.nextToken());
if(n1 > n2) {
System.out.println("emitting (" + new IntWritable(n1) + ", " + new IntWritable(n2) + ")");
outputCollector.collect(new IntWritable(n1), new IntWritable(n2));
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, Text> {
@Override
public void reduce(IntWritable key, Iterator<IntWritable> iterator, OutputCollector<IntWritable, Text> outputCollector, Reporter reporter) throws IOException {
List<IntWritable> nNodes = new ArrayList<>();
while(iterator.hasNext()) {
nNodes.add(iterator.next());
}
System.out.println("key: " + key + ", list: " + nNodes);
// create pairs and emit these
for(IntWritable n1 : nNodes) {
for(IntWritable n2 : nNodes) {
outputCollector.collect(key, new Text(n1.toString() + " " + n2.toString()));
}
}
}
}
}
I've added some logging to the program. In the map phase, I print which pairs I'm emitting. In the reduce phase, I print the input of the reduce. I get the following output:
emitting (3, 1)
emitting (3, 2)
key: 3, list: [1, 1]
The input for the reduce function is not what I expect. I expect it to be [1, 2] and not [1, 1]. I believe that Hadoop automatically combines all my emitted pairs from the output of the map phase but am I missing something here? Any help or explanation would be appreciated.
This is a typical problem for people beginning with Hadoop MapReduce.
The problem is in your reducer. When looping through the given Iterator<IntWritable>, the same IntWritable instance is re-used, so only one instance is kept around at any given time.
That means that when you call iterator.next(), your previously saved IntWritable instance is set to the new value.
You can read more about this problem here
https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
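The usual fix is to copy each value before storing it. Only the loop in the Reduce class above needs to change, for example:
List<IntWritable> nNodes = new ArrayList<>();
while (iterator.hasNext()) {
    // Copy the value: the iterator re-uses a single IntWritable instance.
    nNodes.add(new IntWritable(iterator.next().get()));
}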