Reducer Writing Mapper Output into Output file - java

I am learning Hadoop and tried executing my MapReduce program. All map and reduce tasks complete fine, but the reducer just writes the mapper output into the output file, which means the reduce function is never invoked. My sample input is like below
1,a
1,b
1,c
2,s
2,d
and the expected output is:
1 a,b,c
2 s,d
Below is my program:
package patentcitation;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyJob
{
public static class Mymapper extends Mapper <Text, Text, Text, Text>
{
public void map (Text key, Text value, Context context) throws IOException, InterruptedException
{
context.write(key, value);
}
}
public static class Myreducer extends Reducer<Text,Text,Text,Text>
{
StringBuilder str = new StringBuilder();
public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException
{
for(Text x : value)
{
if(str.length() > 0)
{
str.append(",");
}
str.append(x.toString());
}
context.write(key, new Text(str.toString()));
}
}
public static void main(String args[]) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "PatentCitation");
FileSystem fs = FileSystem.get(conf);
job.setJarByClass(MyJob.class);
FileInputFormat.addInputPath(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Mymapper.class);
job.setReducerClass(Myreducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
if(fs.exists(new Path(args[1]))){
//If exist delete the output path
fs.delete(new Path(args[1]),true);
}
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The same question is asked here; I used an Iterable value in my reduce function as the answer in that thread suggested, but that doesn't fix the issue. I cannot comment there since my reputation score is low, so I created a new thread.
Kindly help me figure out where I am going wrong.

You have made a few mistakes in your program. They are the following:
In the driver, the following statement should be called before instantiating the Job class, because Job.getInstance() takes a copy of the Configuration and does not see properties set afterwards:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
In the reducer, the StringBuilder should be declared inside the reduce() function; as an instance field it keeps accumulating values across keys, so each key's output also contains the values of previously processed keys.
I have modified your code as below and I got this output:
E:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop fs -cat /out/part-r-00000
1 c,b,a
2 d,s
Modified code:
package patentcitation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyJob
{
public static class Mymapper extends Mapper <Text, Text, Text, Text>
{
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
context.write(key, value);
}
}
public static class Myreducer extends Reducer<Text,Text,Text,Text>
{
public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException
{
StringBuilder str = new StringBuilder();
for(Text x : value)
{
if(str.length() > 0)
{
str.append(",");
}
str.append(x.toString());
}
context.write(key, new Text(str.toString()));
}
}
public static void main(String args[]) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
Job job = Job.getInstance(conf, "PatentCitation");
FileSystem fs = FileSystem.get(conf);
job.setJarByClass(MyJob.class);
FileInputFormat.addInputPath(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Mymapper.class);
job.setReducerClass(Myreducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
/*if(fs.exists(new Path(args[1]))){
//If exist delete the output path
fs.delete(new Path(args[1]),true);
}*/
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Related

Mapreduce how to chain Mapper >> Reducer >> Reducer

I have a problem where I need to chain
Mapper >> Reducer >> Reducer
This is my data:
Dpt.csv
EmpNo1,DeptNo1
EmpNo2,DeptNo2
EmpNo3,DeptNo1
EmpNo4,DeptNo2
EmpNo5,DeptNo2
EmpNo6,DeptNo1
Emp.csv
EmpNo1,10000
EmpNo2,4675432
EmpNo3,76568658
EmpNo4,241423
EmpNo5,75756
EmpNo6,9796854
And finally I want something like this:
Dept1 >> Total_Salary_Dept_1
One major issue is that my first reducer is not getting called when I use multiple files as input.
The second issue is that I can't pass that output to the next reducer (ChainReducer can't chain two reducers).
I was using this as a reference but quickly realized it won't help.
I found this link where, in one of the comments the author says this: "In Hadoop 2.X series, internally you can chain mappers before reducer with ChainMapper and chain Mappers after reducer with ChainReducer."
Does this mean I will have a structure like this:
ChainMapper(mapper 1) --> ChainReducer(reducer 1) --> ChainMapper(unnecessary mapper) --> ChainReducer(reducer 2)
And if this is the case then how exactly is the data handed off from Reducer 1 to Mapper 2?
Can someone help me out?
This is my code so far.
Thanks.
package Aggregate;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Sales extends Configured implements Tool{
public static class CollectionMapper extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] vals = value.toString().split(",");
context.write(new Text(vals[0]), new Text(vals[1]));
}
}
public static class DeptSalaryJoiner extends Reducer<Text, Text, Text, Text>{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> DeptSal = new ArrayList<>();
for (Text val : values) {
DeptSal.add(val.toString());
}
context.write(new Text(DeptSal.get(0)), new Text(DeptSal.get(1)));
}
}
public static class SalaryAggregator extends Reducer<Text, Text, Text, IntWritable>{
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Integer totalSal = 0;
for (Text val : values) {
Integer salary = new Integer(val.toString());
totalSal += salary;
}
context.write(key, new IntWritable(totalSal));
}
}
public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new Sales(), args);
System.exit(exitFlag);
}
@Override
public int run(String[] args) throws Exception {
String input1 = "./emp.csv";
String input2 = "./dept.csv";
String output = "./DeptAggregate";
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());
Configuration mapConf = new Configuration(false);
ChainMapper.addMapper(job, CollectionMapper.class, LongWritable.class, Text.class, Text.class, Text.class, mapConf);
Configuration reduce1Conf = new Configuration(false);
ChainReducer.setReducer(job, DeptSalaryJoiner.class, Text.class, Text.class, Text.class, Text.class, reduce1Conf);
Configuration reduce2Conf = new Configuration(false);
ChainReducer.setReducer(job, SalaryAggregator.class, Text.class, Text.class, Text.class, IntWritable.class, reduce2Conf);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(input1));
FileInputFormat.addInputPath(job, new Path(input2));
try {
File f = new File(output);
FileUtils.forceDelete(f);
} catch (Exception e) {
}
FileOutputFormat.setOutputPath(job, new Path(output));
return job.waitForCompletion(true) ? 0 : 1;
}
}
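There does not seem to be a way to run two reduce phases in a single job (as noted above, ChainReducer can only set one reducer), so a common workaround is to run two separate jobs and hand the data off through an intermediate HDFS directory: job 1 runs CollectionMapper and DeptSalaryJoiner and writes its output to a temporary path, and job 2 reads that path back in and runs SalaryAggregator. Below is a rough sketch of what run() could look like under that approach, reusing the classes from the code above; the intermediate path is an illustrative assumption, it assumes DeptSalaryJoiner emits (department, salary) pairs, and it needs an extra import of org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.
public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path intermediate = new Path("./DeptSalaryJoined"); // assumed intermediate directory
    Path output = new Path("./DeptAggregate");
    // Job 1: join the two CSV inputs by employee number and write
    // tab-separated (department, salary) lines to the intermediate directory.
    Job join = Job.getInstance(conf, "DeptSalaryJoin");
    join.setJarByClass(Sales.class);
    join.setMapperClass(CollectionMapper.class);
    join.setReducerClass(DeptSalaryJoiner.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(join, new Path("./emp.csv"));
    FileInputFormat.addInputPath(join, new Path("./dept.csv"));
    FileOutputFormat.setOutputPath(join, intermediate);
    if (!join.waitForCompletion(true)) {
        return 1;
    }
    // Job 2: read the (department, salary) lines back in and sum per department.
    // KeyValueTextInputFormat splits each line on the first tab, and the default
    // identity Mapper passes the (Text, Text) pairs straight to SalaryAggregator.
    Job aggregate = Job.getInstance(conf, "SalaryAggregate");
    aggregate.setJarByClass(Sales.class);
    aggregate.setInputFormatClass(KeyValueTextInputFormat.class);
    aggregate.setReducerClass(SalaryAggregator.class);
    aggregate.setMapOutputKeyClass(Text.class);
    aggregate.setMapOutputValueClass(Text.class);
    aggregate.setOutputKeyClass(Text.class);
    aggregate.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(aggregate, intermediate);
    FileOutputFormat.setOutputPath(aggregate, output);
    return aggregate.waitForCompletion(true) ? 0 : 1;
}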

Hadoop run command java.lang.ClassNotFoundException

I have successfully installed Hadoop 3.0.0 standalone on Ubuntu 16.04.
I created a jar using the following code from the Apache Hadoop tutorial.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WDCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WDCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Creating WDCount.jar was successful with no errors
Then I created Input and Output folders and made a text file with a phrase in it, saved as fileo1.txt in the Input folder.
I ran this command to run Hadoop on WDCount.jar:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/WDCount.jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Output
When I run the command I get this message:
Exception in thread "main" java.lang.ClassNotFoundException: /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.run(RunJar.java:232)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Can anyone tell me what is wrong?
Include the name of the class containing the main method (fully qualified with its package, if it has one) after the jar name:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/WDCount.jar WDCount /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Input /usr/local/hadoop/share/hadoop/mapreduce/Wordcount/Output

Error opening job jar: file in hdfs

I have been trying to fix this one but am not sure what mistake I am making here. Can you please help me with this? Thanks a lot in advance!
My program:
package hadoopbook;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
//Mapper
public static class WcMapperDemo extends Mapper<LongWritable, Text, Text, IntWritable>{
Text MapKey = new Text();
IntWritable MapValue = new IntWritable();
public void map(LongWritable key, Text Value, Context Context) throws IOException, InterruptedException{
String Record = Value.toString();
String[] Words = Record.split(",");
for (String Word:Words){
MapKey.set(Word);
MapValue.set(1);
Context.write(MapKey, MapValue);
}
}
}
//Reducer
public static class WcReducerDemo extends Reducer<Text, IntWritable, Text, IntWritable>{
IntWritable RedValue = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> Values, Context Context) throws IOException, InterruptedException{
int sum = 0;
for (IntWritable Value:Values){
sum = sum + Value.get();
}
RedValue.set(sum);
Context.write(key, RedValue);
}
}
//Driver
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration Conf = new Configuration();
Job Job = new Job(Conf, "Word Count Job");
Job.setJarByClass(WordCount.class);
Job.setMapperClass(WcMapperDemo.class);
Job.setReducerClass(WcReducerDemo.class);
Job.setMapOutputKeyClass(Text.class);
Job.setMapOutputValueClass(IntWritable.class);
Job.setOutputKeyClass(Text.class);
Job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(Job, new Path (args[0]));
FileOutputFormat.setOutputPath(Job, new Path (args[1]));
System.exit(Job.waitForCompletion(true) ? 0:1);
}
}
Jar file is placed on hdfs in the below location:
/user/cloudera/Programs/WordCount.jar
Permissions are:
rw-rw-rw-
Input file is placed in below location:
/user/cloudera/Input/Words.txt
Permissions are:
rw-rw-rw-
Output folder is as below:
/user/cloudera/Output
When I am trying to run this:
[cloudera@localhost ~]$ hadoop jar /user/cloudera/Programs/WordCount.jar hadoopbook.WordCount /user/cloudera/Input/Words.txt /user/cloudera/Output
After this I get an error and I am stuck:
Exception in thread "main" java.io.IOException: Error opening job jar: /user/cloudera/Programs/WordCount.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:135)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:127)
at java.util.jar.JarFile.<init>(JarFile.java:135)
at java.util.jar.JarFile.<init>(JarFile.java:72)
at org.apache.hadoop.util.RunJar.main(RunJar.java:133)
The jar needs to be present in the local file system (it should not be in HDFS), and you need to give the fully qualified name (with the package) of the main class.
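For example (the local destination path below is just an illustration), you could copy the jar out of HDFS and then run it from the local path:
hadoop fs -get /user/cloudera/Programs/WordCount.jar /home/cloudera/WordCount.jar
hadoop jar /home/cloudera/WordCount.jar hadoopbook.WordCount /user/cloudera/Input/Words.txt /user/cloudera/Output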

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB. In that file, a single record consists of 4 lines, and the processing for every record is the same, but within every record each of the four lines is processed differently.
I'm new to Hadoop, so I'd like some guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format:
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
In your case, you can configure it to send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}

getJobStatus error in Hadoop

I am trying to run an example program from the Hadoop in Action book (example 4-1). It is just a simple MR program that produces comma-separated key and value pairs.
I am getting an error with the JobClient.runJob() method. I am not sure where I made a mistake; it is just what is given in the book. Any help is greatly appreciated.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapred.LocalJobRunner;
public class MyJob extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
output.collect(value, key);
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String csv = "";
while (values.hasNext()) {
if (csv.length() > 0) csv += ",";
csv += values.next().toString();
}
output.collect(key, new Text(csv));
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, MyJob.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("MyJob");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MyJob(), args);
System.exit(res);
}
}
Error:
Exception in thread "main" java.lang.VerifyError: (class: org/apache/hadoop/mapred/LocalJobRunner, method: getJobStatus signature: (Lorg/apache/hadoop/mapreduce/JobID;)Lorg/apache/hadoop/mapreduce/JobStatus;) Wrong return type in function
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:520)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1411)
at MyJob.run(MyJob.java:71)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at MyJob.main(MyJob.java:77)
I came across the same issue here. Just to leave a note: the problem is that both the MR1 and YARN (MR2) jars are present on the classpath, and the classes are getting mixed together.
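A rough way to confirm this (assuming the hadoop command is on your PATH; wildcard entries may need to be expanded by hand) is to print the classpath and look for both the old MR1 jar (hadoop-core-*.jar) and the YARN-era jars (hadoop-mapreduce-client-*.jar):
hadoop classpath | tr ':' '\n' | grep -i -e hadoop-core -e mapreduce-client
If both show up, keep only one MapReduce implementation on the classpath (or fix whichever dependency pulls the other one in).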
