I'm trying to use the basic word count as defined here. Is it possible that when IntSumReducer does context.write, that output could be passed to a second reducer or output class that would reduce the final list given by IntSumReducer down to a single largest frequency?
I am quite new to Hadoop/MapReduce and the concept of jobs in Java, so I'm uncertain how exactly I would need to modify the default WordCount to make that possible. Can I write a second Reducer function and place it inside the same job? How would I do that? How would I signal that there is another reducer to be run after IntSumReducer?
Base WordCount:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
What you're looking for is called a Combiner in Hadoop, which does some semi-reduction on the map output before it is handed to the final reducer class. See the Hadoop documentation on combiners for more detail.
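If the goal is to end up with just the single largest count, one pattern that is often paired with this is a second job that reads the word-count output with a single reduce task and keeps only the maximum. A minimal sketch under that assumption; the MaxFrequencyReducer class, the KeyValueTextInputFormat choice, and the paths are illustrative, not part of the original answer:
// Second-stage reducer: with a single reduce task it sees every (word, count)
// pair written by IntSumReducer and emits only the largest one in cleanup().
public static class MaxFrequencyReducer extends Reducer<Text, Text, Text, IntWritable> {
    private String maxWord = null;
    private int maxCount = Integer.MIN_VALUE;
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            int count = Integer.parseInt(val.toString());
            if (count > maxCount) {
                maxCount = count;
                maxWord = key.toString();
            }
        }
    }
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (maxWord != null) {
            context.write(new Text(maxWord), new IntWritable(maxCount));
        }
    }
}
// Driver for the second job; it reads the tab-separated output of the first job
// (requires org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat).
Job maxJob = Job.getInstance(conf, "max frequency");
maxJob.setJarByClass(WordCount.class);
maxJob.setInputFormatClass(KeyValueTextInputFormat.class); // splits each line on the tab
maxJob.setMapperClass(Mapper.class);                       // identity mapper, passes (word, count) through
maxJob.setReducerClass(MaxFrequencyReducer.class);
maxJob.setNumReduceTasks(1);                               // one reducer instance sees every pair
maxJob.setMapOutputKeyClass(Text.class);
maxJob.setMapOutputValueClass(Text.class);
maxJob.setOutputKeyClass(Text.class);
maxJob.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(maxJob, new Path("/wordcount/output"));   // output dir of the first job
FileOutputFormat.setOutputPath(maxJob, new Path("/wordcount/max"));
System.exit(maxJob.waitForCompletion(true) ? 0 : 1);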
Related
I have a problem where I need to chain
Mapper >> Reducer >> Reducer
This is my data:
Dpt.csv
EmpNo1,DeptNo1
EmpNo2,DeptNo2
EmpNo3,DeptNo1
EmpNo4,DeptNo2
EmpNo5,DeptNo2
EmpNo6,DeptNo1
Emp.csv
EmpNo1,10000
EmpNo2,4675432
EmpNo3,76568658
EmpNo4,241423
EmpNo5,75756
EmpNo6,9796854
And finally I want something like this:
Dept1 >> Total_Salary_Dept_1
One major issue is that my first reducer is not getting called when I use multiple files as input.
The second issue is that I can't pass that output to next reducer. (ChainReducer can't chain 2 reducers)
I was using this as a reference but quickly realized it won't help.
I found this link where, in one of the comments the author says this: "In Hadoop 2.X series, internally you can chain mappers before reducer with ChainMapper and chain Mappers after reducer with ChainReducer."
Does this mean I will have a structure like this:
ChainMapper (mapper 1) --> ChainReducer (reducer 1) --> ChainMapper (unnecessary mapper) --> ChainReducer (reducer 2)
And if this is the case then how exactly is the data handed off from Reducer 1 to Mapper 2?
Can someone help me out?
This is my code so far.
Thanks.
package Aggregate;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Sales extends Configured implements Tool{
public static class CollectionMapper extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] vals = value.toString().split(",");
context.write(new Text(vals[0]), new Text(vals[1]));
}
}
public static class DeptSalaryJoiner extends Reducer<Text, Text, Text, Text>{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> DeptSal = new ArrayList<>();
for (Text val : values) {
DeptSal.add(val.toString());
}
context.write(new Text(DeptSal.get(0)), new Text(DeptSal.get(1)));
}
}
public static class SalaryAggregator extends Reducer<Text, Text, Text, IntWritable>{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Integer totalSal = 0;
for (Text val : values) {
Integer salary = new Integer(val.toString());
totalSal += salary;
}
context.write(key, new IntWritable(totalSal));
}
}
public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new Sales(), args);
System.exit(exitFlag);
}
@Override
public int run(String[] args) throws Exception {
String input1 = "./emp.csv";
String input2 = "./dept.csv";
String output = "./DeptAggregate";
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());
Configuration mapConf = new Configuration(false);
ChainMapper.addMapper(job, CollectionMapper.class, LongWritable.class, Text.class, Text.class, Text.class, mapConf);
Configuration reduce1Conf = new Configuration(false);
ChainReducer.setReducer(job, DeptSalaryJoiner.class, Text.class, Text.class, Text.class, Text.class, reduce1Conf);
Configuration reduce2Conf = new Configuration(false);
ChainReducer.setReducer(job, SalaryAggregator.class, Text.class, Text.class, Text.class, IntWritable.class, reduce2Conf);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(input1));
FileInputFormat.addInputPath(job, new Path(input2));
try {
File f = new File(output);
FileUtils.forceDelete(f);
} catch (Exception e) {
}
FileOutputFormat.setOutputPath(job, new Path(output));
return job.waitForCompletion(true) ? 0 : 1;
}
}
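For reference, since ChainReducer only supports a single reducer per job, the handoff between two reduce stages is normally done through the file system: the first job writes to an intermediate directory and the second job reads that directory as its input. A rough driver sketch along those lines, reusing the classes above; the intermediate path, the identity mapper, and the KeyValueTextInputFormat choice are assumptions:
// Job 1: join employees to departments with CollectionMapper + DeptSalaryJoiner.
Job joinJob = Job.getInstance(getConf(), "dept salary join");
joinJob.setJarByClass(getClass());
joinJob.setMapperClass(CollectionMapper.class);
joinJob.setReducerClass(DeptSalaryJoiner.class);
joinJob.setOutputKeyClass(Text.class);
joinJob.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(joinJob, new Path("./emp.csv"));
FileInputFormat.addInputPath(joinJob, new Path("./dept.csv"));
FileOutputFormat.setOutputPath(joinJob, new Path("./join-intermediate")); // must not exist yet
if (!joinJob.waitForCompletion(true)) {
    return 1;
}
// Job 2: aggregate salary per department, reading the intermediate directory.
// Assumes the join wrote lines of the form dept<TAB>salary.
Job aggJob = Job.getInstance(getConf(), "salary aggregate");
aggJob.setJarByClass(getClass());
aggJob.setInputFormatClass(KeyValueTextInputFormat.class); // dept becomes the key, salary the value
aggJob.setMapperClass(Mapper.class);                       // identity mapper
aggJob.setReducerClass(SalaryAggregator.class);
aggJob.setMapOutputKeyClass(Text.class);
aggJob.setMapOutputValueClass(Text.class);
aggJob.setOutputKeyClass(Text.class);
aggJob.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(aggJob, new Path("./join-intermediate"));
FileOutputFormat.setOutputPath(aggJob, new Path("./DeptAggregate"));
return aggJob.waitForCompletion(true) ? 0 : 1;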
I have successfully installed Hadoop 3.0.0 standalone to run on Ubuntu 16.04.
I created a jar using the following code from the Apache Hadoop tutorial.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WDCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WDCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Creating WDCount.jar was successful with no errors.
Then I created Input and Output folders and made a text file with a phrase in it, saved as fileo1.txt in the Input folder.
I created this command to run Hadoop on WDCount.jar:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/WDCount.jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Output
When I run the command I get this message:
Exception in thread "main" java.lang.ClassNotFoundException: /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.run(RunJar.java:232)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Can anyone tell me what is wrong?
Include the name of the class containing the main method after the jar name:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/WDCount.jar WDCount /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Output
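In general the syntax is:
hadoop jar <path-to-jar> <main-class> <input-path> <output-path>
If no class name is given, hadoop assumes the first argument after the JAR is the class to run (unless the JAR's manifest declares a Main-Class), which is why the input path shows up in the ClassNotFoundException above.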
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CommonFriends {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private IntWritable friend = new IntWritable();
private Text friends = new Text();
public void map(Object key, Text value, Context context ) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(),"\n");
while (itr.hasMoreTokens()) {
String[] line = itr.nextToken().split(" ");
if(line.length > 2 ){
int person = Integer.parseInt(line[0]);
for(int i=1; i<line.length;i++){
int ifriend = Integer.parseInt(line[i]);
friends.set((person < ifriend ? person+"-"+ifriend : ifriend+"-"+person));
for(int j=1; j< line.length; j++ ){
if( i != j ){
friend.set(Integer.parseInt(line[j]));
context.write(friends, friend);
}
}
}
}
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
HashSet<IntWritable> duplicates = new HashSet();
ArrayList<Integer> tmp = new ArrayList();
for (IntWritable val : values) {
if(duplicates.contains(val))
tmp.add(val.get());
else
duplicates.add(val);
}
result.set(tmp.toString());
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Common Friends");
job.setJarByClass(CommonFriends.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:194)
at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1350)
at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1667)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at CommonFriends$IntSumReducer.reduce(CommonFriends.java:51)
at CommonFriends$IntSumReducer.reduce(CommonFriends.java:38)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1688)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1637)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1489)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:723)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
This is my code, and the error message is shown above. Any ideas?
I think the problem is in the configuration of the output classes of the mapper and the reducer.
The input files are lists of numbers.
Some more details can be provided if needed.
The program finds the common friends between friends.
Removing job.setCombinerClass(IntSumReducer.class); from your code should solve this problem.
Just had a look at your code; it seems you are using the reducer code as the combiner code.
One thing you need to check:
Your combiner takes input in the form <Text, IntWritable>, but its output is in <Text, Text> format.
Then the input to your reducer would be in the format <Text, Text>, but you have specified the input to the reducer as <Text, IntWritable>, so it is throwing the error.
Two things can be done:
1) You might consider changing the output type of the reducer.
2) You might consider writing a separate combiner.
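For option 2, the constraint is that a combiner's output key and value classes must match the map output classes, here <Text, IntWritable>, because the real reducer still runs on whatever the combiner writes. A minimal type-correct sketch; this hypothetical FriendPairCombiner just re-emits the map output unchanged, so it only illustrates the required signature rather than doing useful pre-aggregation:
// A combiner for this job has to be a Reducer<Text, IntWritable, Text, IntWritable>:
// same output types as the mapper, since the reducer consumes its output next.
public static class FriendPairCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable val : values) {
            context.write(key, val); // pass-through: keeps every value for the real reducer
        }
    }
}
// registered in the driver instead of the reducer class:
// job.setCombinerClass(FriendPairCombiner.class);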
I'm trying to make it so that when a user enters a word the program will go through the txt file and count all the instances of that word.
I'm using MapReduce and I'm new at it.
I know there is a really simple way to do this, and I've been trying to figure it out for a while.
In this code I'm trying to make it so that it would ask for the user input and the program would go through the file and find instances.
I've seen some codes on stack overflow and someone mentioned that setting the configuration to conf.set("userinput","Data") would help somehow.
I've also heard there is a more up-to-date way to take the user input.
The if statement in my program is an example of when the user word is entered it only finds that word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
//So I've seen that this is the correct way of setting it up.
// However I've heard that there are more efficient ways of setting it up as well.
/*
public void setup(Context context) {
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
word.set(wordstring);
}
*/
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
if(word=="userinput") {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm not sure about the setup method, but you pass the input at the command line as an argument.
conf.set("mapper.word",args[0]);
Job job =...
// Notice you now need 3 arguments to run this
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
In the mapper or reducer, you can get the string
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
And you need to get the string from the tokenizer before you can compare it. You also need to compare strings, not a String to a Text object.
String wordstring=config.get("mapper.word");
while (itr.hasMoreTokens()) {
String token = itr.nextToken();
if(wordstring.equals(token)) {
word.set(token);
context.write(word, one);
}
}
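With those changes, the job is launched with three arguments, the search word first; the jar name and paths here are just placeholders:
hadoop jar WordCount.jar WordCount <word-to-count> <input-path> <output-path>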
I was trying a MapReduce word count program in Hadoop, but the reducer class is never called and the program terminates after running the mapper class.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
I have even overridden the methods as required.
IDE : Eclipse Luna
Hadoop : version 2.5
A Job object forms the specification of the job and gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster).
Rather than explicitly specify the name of the JAR file, we can pass a class in the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.
I do not see the statement in the main method. Hence, include this and then compile and run the code.
job.setJarByClass(WordCount.class);
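In the posted main method it goes alongside the other job settings, for example:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class); // tells Hadoop which JAR to ship to the cluster
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);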