I am running the MapReduce code below to calculate the total number of words and the average word length for each starting letter of the English alphabet.
For example, if the document contains only the word 'and' 5 times:
letter | total words | average length
a      | 5           | 3
The MapReduce program is as follows:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LetterWiseAvgLengthV1
{
public static class TokenizerMapper
extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException
{
String st [] = value.toString().split("\\s+");
for(String word : st) {
String wordnew=word.replaceAll("[^a-zA-Z]","");
String firstLetter = wordnew.substring(0, 1);
if(!wordnew.isEmpty()){
// write ('a',3) if the word is and
context.write(new Text(firstLetter), new Text(String.valueOf(wordnew.length())));
}
else continue;
}
}
}
public static class IntSumReducer
extends Reducer<Text,Text,Text,Text>
{
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
int sum=0,count=0;
for (Text val : values)
{
sum += Integer.parseInt(val.toString());
count+= 1;
}
float avg=(sum/(float)count);
String op="Average length of " + count + " words = " + avg;
context.write(new Text(key), new Text(op));
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordLenAvgCombiner");
job.setJarByClass(LetterWiseAvgLengthV1.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I execute the program above on a text document, it creates an empty output directory in HDFS. There are no failures during execution, but the output folder is always empty.
I am new to MapReduce and Hadoop (Hadoop 3.2.3 and Java 8).
I am trying to separate lines based on a symbol in each line.
Example: "q1,a,q0," should return ('a', "q1,a,q0,") as (key, value).
My dataset contains ten (10) lines: five (5) for key 'a' and five for key 'b'.
I expect to get 5 lines for each key, but I always get five for 'a' and 10 for 'b'.
Data
A,q0,a,q1;A,q0,b,q0;A,q1,a,q1;A,q1,b,q2;A,q2,a,q1;A,q2,b,q0;B,s0,a,s0;B,s0,b,s1;B,s1,a,s1;B,s1,b,s0
Mapper class:
import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMapper extends Mapper<LongWritable, Text, ByteWritable ,Text>{
private ByteWritable key1 = new ByteWritable();
//private int n ;
private int count =0 ;
private Text wordObject = new Text();
@Override
public void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
String ftext = value.toString();
for (String line: ftext.split(";")) {
wordObject = new Text();
if (line.split(",")[2].equals("b")) {
key1.set((byte) 'b');
wordObject.set(line) ;
context.write(key1,wordObject);
continue ;
}
key1.set((byte) 'a');
wordObject.set(line) ;
context.write(key1,wordObject);
}
}
}
Reducer class:
import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class MyReducer extends Reducer<ByteWritable, Text, ByteWritable ,Text>{
private Integer count=0 ;
@Override
public void reduce(ByteWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text val : values ) {
count++ ;
}
Text symb = new Text(count.toString()) ;
context.write(key , symb);
}
}
Driver class:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: %s [generic options] <inputdir> <outputdir>\n", getClass().getSimpleName());
return -1;
}
@SuppressWarnings("deprecation")
Job job = new Job(getConf());
job.setJarByClass(MyDriver.class);
job.setJobName("separation ");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(ByteWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(ByteWritable.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
System.exit(exitCode);
}
}
The problem was solved by declaring the variable "count" inside the reduce() function.
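For reference, a minimal sketch of that fix, keeping the rest of the class as posted (count becomes a local variable, so it resets for every key):

import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<ByteWritable, Text, ByteWritable, Text> {
    @Override
    public void reduce(ByteWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Local count: starts at 0 for every key instead of carrying over
        // the total from keys already processed by the same reducer task.
        int count = 0;
        for (Text val : values) {
            count++;
        }
        context.write(key, new Text(Integer.toString(count)));
    }
}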
Does your input contain more than one line, with 5 more b's on another line? I cannot reproduce the problem with that single line, but your code can be cleaned up.
For the following code, I get this output:
a 5
b 5
static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, ByteWritable, Text> {
final ByteWritable keyOut = new ByteWritable();
final Text valueOut = new Text();
@Override
protected void map(LongWritable key, Text value, org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, ByteWritable, Text>.Context context) throws IOException, InterruptedException {
String line = value.toString();
if (line.isEmpty()) {
return;
}
StringTokenizer tokenizer = new StringTokenizer(line, ";");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
String[] parts = token.split(",");
String keyStr = parts[2];
if (keyStr.matches("[ab]")) {
keyOut.set((byte) keyStr.charAt(0));
valueOut.set(token);
context.write(keyOut, valueOut);
}
}
}
}
static class Reducer extends org.apache.hadoop.mapreduce.Reducer<ByteWritable, Text, Text, LongWritable> {
static final Text keyOut = new Text();
static final LongWritable valueOut = new LongWritable();
@Override
protected void reduce(ByteWritable key, Iterable<Text> values, org.apache.hadoop.mapreduce.Reducer<ByteWritable, Text, Text, LongWritable>.Context context)
throws IOException, InterruptedException {
keyOut.set(new String(new byte[]{key.get()}, StandardCharsets.UTF_8));
valueOut.set(StreamSupport.stream(values.spliterator(), true)
.mapToLong(v -> 1).sum());
context.write(keyOut, valueOut);
}
}
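One note on wiring this into the driver from the question: this cleaned-up Reducer emits Text keys and LongWritable values, so the job's output classes would need to be adjusted to match. A sketch of the relevant settings, assuming the rest of MyDriver stays as posted:

// Map output types are unchanged:
job.setMapOutputKeyClass(ByteWritable.class);
job.setMapOutputValueClass(Text.class);
// Changed from ByteWritable/Text to match the new Reducer output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);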
I am trying to build a project that finds the maximum of the average temperature for each month. Here is my code:
File Map.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class Map extends Mapper<LongWritable, Text, Text, FloatWritable> {
private FloatWritable average = new FloatWritable();
private float maxFloat, minFloat, averageFloat;
private Text word = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer line = new StringTokenizer(value.toString(), ",");
if (line.countTokens() > 0) {
word.set(line.nextToken().substring(2,8));
if (line.hasMoreTokens()) {
maxFloat = Float.parseFloat(line.nextToken());
}
if (line.hasMoreTokens()) {
minFloat = Float.parseFloat(line.nextToken());
}
averageFloat = (minFloat + maxFloat) / 2;
average.set(averageFloat);
context.write(word, average);
}
}
}
File Reduce.java
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.Iterator;
public class Reduce extends Reducer<Text, FloatWritable, Text, FloatWritable> {
private float max_temp = Float.MIN_VALUE;
private float temp = 0;
@Override
protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
Iterator<FloatWritable> itr = values.iterator();
while (itr.hasNext()) {
temp = itr.next().get();
if (temp > max_temp) {
max_temp = temp;
}
}
context.write(key, new FloatWritable(max_temp));
}
}
File MaxTempDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTempDriver {
public static void main(String[] args) throws Exception {
// Create a new job
Job job = new Job();
// Set job name to locate it in the distributed environment
job.setJarByClass(MaxTempDriver.class);
job.setJobName("Max Temperature");
// Set input and output Path, note that we use the default input format
// which is TextInputFormat (each record is a line of input)
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Set Mapper and Reducer class
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
// Set Output key and value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Now I don't know how to compile these 3 files. I have read some tutorials on the internet, but they only had 1 file, with the Map and Reduce classes in the same file. How do I compile these files?
Hi, I am trying to find the average of a few numbers using MapReduce in standalone mode. I have two input files. They contain the values file1: 25 25 25 25 25 and file2: 15 15 15 15 15.
My program runs fine, but the output file contains the mapper output instead of the reducer output.
Here is my code :
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.Writable;
import java.io.*;
public class Average {
public static class SumCount implements Writable {
public int sum;
public int count;
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(sum);
out.writeInt(count);
}
@Override
public void readFields(DataInput in) throws IOException {
sum = in.readInt();
count =in.readInt();
}
}
public static class TokenizerMapper extends Mapper<Object, Text, Text, Object>{
private final static IntWritable valueofkey = new IntWritable();
private Text word = new Text();
SumCount sc=new SumCount();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
int sum=0;
int count=0;
int v;
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
v=Integer.parseInt(word.toString());
count=count+1;
sum=sum+v;
}
word.set("average");
sc.sum=sum;
sc.count=count;
context.write(word,sc);
}
}
public static class IntSumReducer extends Reducer<Text,Object,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<SumCount> values,Context context) throws IOException, InterruptedException {
int sum = 0;
int count=0;
int wholesum=0;
int wholecount=0;
for (SumCount val : values) {
wholesum=wholesum+val.sum;
wholecount=wholecount+val.count;
}
int res=wholesum/wholecount;
result.set(res);
context.write(key, result );
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "");
job.setJarByClass(Average.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(SumCount.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
After I run the program, my output file looks like this:
average Average$SumCount@434ba039
average Average$SumCount@434ba039
You can't use your Reducer class IntSumReducer as a combiner. A combiner must receive and emit the same Key/Value types.
So I would remove job.setCombinerClass(IntSumReducer.class);.
Remember the output from the combiner is the input to the reducer, so writing out Text and IntWritable isn't going to work.
If your output files looked like part-m-xxxxx, then the above issue could mean it only ran the map phase and stopped. Your counters would confirm this.
You also have Reducer<Text,Object,Text,IntWritable>, which should be Reducer<Text,SumCount,Text,IntWritable>.
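Putting those two changes together, here is a minimal sketch of the corrected reducer, based on the code in the question (with job.setCombinerClass(IntSumReducer.class); removed from the driver):

public static class IntSumReducer extends Reducer<Text, SumCount, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<SumCount> values, Context context) throws IOException, InterruptedException {
        int wholeSum = 0;
        int wholeCount = 0;
        // Each SumCount carries one mapper's sum and count, so the reducer
        // combines them into a single overall average.
        for (SumCount val : values) {
            wholeSum += val.sum;
            wholeCount += val.count;
        }
        result.set(wholeSum / wholeCount);
        context.write(key, result);
    }
}

With @Override in place, the compiler would also have flagged the original mismatch, since reduce(Text, Iterable<SumCount>, Context) does not override anything in Reducer<Text,Object,Text,IntWritable>.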
Here is my sample program joining 2 datasets.
The program has 2 mappers and 1 reducer; the reducer joins the values obtained from the 2 different mappers, which take 2 different files as input.
I am getting an error in the hadoop jar command.
command:
hadoop jar /home/rahul/Downloads/testjars/datajoin.jar DataJoin
/user/rahul/cust.txt /user/rahul/delivery.txt /user/rahul/output
Error: Invalid number of arguments Datajoin
It is actually expecting only 1 input path and 1 output path whereas in my command I have 2 inputs for 2 different mappers and 1 output.
Can anyone help me out ?
Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class DataJoin {
public static class TokenizerMapper1 extends Mapper<Object, Text, Text, Text> {
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String itr[] = value.toString().split("::");
word.set(itr[0].trim());
context.write(word, new Text("CD~" + itr[1]));
}
}
public static class TokenizerMapper2 extends Mapper<Object, Text, Text, Text> {
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String itr[] = value.toString().split("::");
word.set(itr[0].trim());
context.write(word, new Text("DD~" + itr[1]));
}
}
public static class IntSumReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String sum = "";
for (Text val : values) {
sum += val.toString();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: DataJoin ");
System.exit(2);
}
Job job = new Job(conf, "Data Join");
job.setJarByClass(DataJoin.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(otherArgs[0]),
TextInputFormat.class, TokenizerMapper1.class);
MultipleInputs.addInputPath(job, new Path(otherArgs[1]),
TextInputFormat.class, TokenizerMapper2.class);
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You have an error in this portion:
if (otherArgs.length != 2) {
System.err.println("Usage: DataJoin ");
System.exit(2);
}
Your argument array has length 3: 2 inputs and 1 output.
The length counts elements starting from 1, 2, ..., whereas indexes start from 0, 1, ....
Change to
if (otherArgs.length != 3) {
System.err.println("Usage: DataJoin ");
System.exit(0);
}
This solves your issue.
This program is supposed to chain two MapReduce jobs: the output of the first job has to be taken as the input of the second job.
When I run it, I get two errors:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException
The map phase runs to 100%, but the reducer does not run.
Here's my code:
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.LongWritable;
public class MaxPubYear {
public static class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Text word = new Text();
String delim = ";";
Integer year = 0;
String tokens[] = value.toString().split(delim);
if (tokens.length >= 4) {
year = TryParseInt(tokens[3].replace("\"", "").trim());
if (year > 0) {
word = new Text(year.toString());
context.write(word, new IntWritable(1));
}
}
}
}
public static class FrequencyReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
public static class MaxPubYearMapper extends
Mapper<LongWritable, Text, IntWritable, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String delim = "\t";
Text valtosend = new Text();
String tokens[] = value.toString().split(delim);
if (tokens.length == 2) {
valtosend.set(tokens[0] + ";" + tokens[1]);
context.write(new IntWritable(1), valtosend);
}
}
}
public static class MaxPubYearReducer extends
Reducer<IntWritable, Text, Text, IntWritable> {
public void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
int maxiValue = Integer.MIN_VALUE;
String maxiYear = "";
for (Text value : values) {
String token[] = value.toString().split(";");
if (token.length == 2
&& TryParseInt(token[1]).intValue() > maxiValue) {
maxiValue = TryParseInt(token[1]);
maxiYear = token[0];
}
}
context.write(new Text(maxiYear), new IntWritable(maxiValue));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Frequency");
job.setJarByClass(MaxPubYear.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(FrequencyMapper.class);
job.setCombinerClass(FrequencyReducer.class);
job.setReducerClass(FrequencyReducer.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
int exitCode = job.waitForCompletion(true) ? 0 : 1;
if (exitCode == 0) {
Job SecondJob = new Job(conf, "Maximum Publication year");
SecondJob.setJarByClass(MaxPubYear.class);
SecondJob.setOutputKeyClass(Text.class);
SecondJob.setOutputValueClass(IntWritable.class);
SecondJob.setMapOutputKeyClass(IntWritable.class);
SecondJob.setMapOutputValueClass(Text.class);
SecondJob.setMapperClass(MaxPubYearMapper.class);
SecondJob.setReducerClass(MaxPubYearReducer.class);
FileInputFormat.addInputPath(SecondJob, new Path(args[1] + "_temp"));
FileOutputFormat.setOutputPath(SecondJob, new Path(args[1]));
System.exit(SecondJob.waitForCompletion(true) ? 0 : 1);
}
}
public static Integer TryParseInt(String trim) {
// TODO Auto-generated method stub
return(0);
}
}
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException
A MapReduce job does not overwrite the contents of an existing directory. The output path given to an MR job must be a directory that does not exist; the job will create that directory and write its output files inside it.
In your code:
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
Make sure this path does not exist when you run the MR job.
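If you want to rerun the job without deleting the directory by hand, one option is to remove a stale output directory from the driver before submitting the job. A minimal sketch using the standard FileSystem API (the helper name deleteIfExists is just for illustration):

// Hypothetical helper for the driver class: removes an output directory if it already exists.
// Requires org.apache.hadoop.fs.FileSystem in addition to the imports already in the question.
private static void deleteIfExists(Configuration conf, String dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(dir);
    if (fs.exists(path)) {
        fs.delete(path, true);   // true = delete recursively
    }
}

// In main(), before each waitForCompletion call:
// deleteIfExists(conf, args[1] + "_temp");
// deleteIfExists(conf, args[1]);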