When to use NLineInputFormat in Hadoop Map-Reduce?

When to use NLineInputFormat in Hadoop Map-Reduce? - java

I have a Text based input file of size around 25 GB. And in that file one single record consists of 4 lines. And the processing for every record is the same. But inside every record,each of the four lines are processed differently.
I'm new to Hadoop so I wanted a guidance that whether to use NLineInputFormat in this situation or use the default TextInputFormat ? Thanks in advance !

Assuming you have the text file in the following format :
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
In your case you can use to input 4 lines per mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
#Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}

Related

How do i create custom mapper class in mapreduce

I am having unique requirement where i have to pass the zip shell command from text file and mapper will process the script that will create zip files in parallel fashion using mapper only. I am thinking to execute shell command using exec in java. I am bit stuck on how to implement the custom mapper as my output would be compressed format.
Below is my mapper class -
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class Map extends Mapper<LongWritable, Text, Text, NullWritable>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line= value.toString();
StringTokenizer tokenizer= new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
value.set(tokenizer.nextToken());
context.write(value,NullWritable.get());
}
}
}
Processor class -
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
public class ZipProcessor extends Configured implements Tool {
public static void main(String [] args) throws Exception{
int exitCode = ToolRunner.run(new ZipProcessor(), args);
System.exit(exitCode);
}
public int run(String[] args) throws Exception {
if(args.length!=2){
System.err.printf("Usage: %s needs two arguments, input and output files\n", getClass().getSimpleName());
return -1;
}
Configuration conf=new Configuration();
Job job = Job.getInstance(conf,"zipping");
job.setJarByClass(ZipProcessor.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(Map.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
Sample mapr.txt
zip -r "/folder1/file.zip" "sourceFolder"
zip -r "/folder2/file.zip" "sourceFolder"
zip -r "/folder3/file.zip" "sourceFolder"

Mapreduce Mapper explaination

There is an NCDC weather data set example in Hadoop definite guide.
The Mapper class code is as follows
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
And the driver code is:
Example 2-5. Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm not able to understand since we pass a file containing multiple lines why there is no iteration on lines. The code seems as if it is processing on a single line.

The book explains what Mapper<LongWritable, Text,means... The key is the offset of the file. The value is a line.
It also mentions that TextInputFormat is the default mapreduce input type, which is a type of FileInputFormat
public class TextInputFormat
extends FileInputFormat<LongWritable,Text>
And therefore, the default input types must be Long, Text pairs
As the JavaDoc says
Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text..
The book also has sections on defining custom RecordReaders
You need to call job.setInputFormatClass to change it to read anything other than single lines

Hadoop Jar runs but no output. Driver, mapper and reduce compiles successfully in namenode

I'm a newbie to Hadoop Programming and I have started learning by setting up Hadoop 2.7.1 on a three node cluster. I have tried running helloworld jars that comes out of the box in Hadoop and it ran fine with success but I wrote my own driver code in my local machine and bundled it into a jar and executed it this way but it fails with NO error messages.
Here is my code and this is what I did.
WordCountMapper.java
package mot.com.bin.test;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text Value,
OutputCollector<Text, IntWritable> opc, Reporter r)
throws IOException {
String s = Value.toString();
for (String word :s.split(" ")) {
if( word.length() > 0) {
opc.collect(new Text(word), new IntWritable(1));
}
}
}
}
WordCountReduce.java
package mot.com.bin.test;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordCountReduce extends MapReduceBase implements Reducer < Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> opc, Reporter r)
throws IOException {
// TODO Auto-generated method stub
int i = 0;
while (values.hasNext()) {
IntWritable in = values.next();
i+=in.get();
}
opc.collect(key, new IntWritable (i));
}
WordCount.java
/**
* **DRIVER**
*/
package mot.com.bin.test;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.io.Text;
//import com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider.Text;
/**
* #author rgb764
*
*/
public class WordCount extends Configured implements Tool{
/**
* #param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
}
public int run(String[] arg0) throws Exception {
if (arg0.length < 2) {
System.out.println("Need input file and output directory");
return -1;
}
JobConf conf = new JobConf();
FileInputFormat.setInputPaths(conf, new Path( arg0[0]));
FileOutputFormat.setOutputPath(conf, new Path( arg0[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WordCountMapper.class);
conf.setReducerClass(WordCountReduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
}
First I tried extracting it as a jar from eclipse and run it in my hadoop cluster. No errors yet no success as well. Then moved my individual java files to my NameNode and compiled each java files and then created the jar file there, still hadoop command returns no results but no errors as well. Kindly help me on this.
hadoop jar WordCout.jar mot.com.bin.test.WordCount /karthik/mytext.txt /tempo
Extracted all dependent jar files using Maven and I added them into the classpath in my name node. Help me figure what and where am I going wrong.

IMO you are missing the code in your main method which instantiate the Tool implementation ( WordCount in your case) and runs the same.
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(res);
}
Refer this.

Reducer Writing Mapper Output into Output file

I am learning Hadoop and tried executing my Mapreduce program. All Map tasks and Reducer tasks are completed fine, but Reducer Writing Mapper Output into Output file. It means Reduce function not at all invoked. My sample input is like below
1,a
1,b
1,c
2,s
2,d
and the expected output is like below
1 a,b,c
2 s,d
Below is my Program.
package patentcitation;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyJob
{
public static class Mymapper extends Mapper <Text, Text, Text, Text>
{
public void map (Text key, Text value, Context context) throws IOException, InterruptedException
{
context.write(key, value);
}
}
public static class Myreducer extends Reducer<Text,Text,Text,Text>
{
StringBuilder str = new StringBuilder();
public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException
{
for(Text x : value)
{
if(str.length() > 0)
{
str.append(",");
}
str.append(x.toString());
}
context.write(key, new Text(str.toString()));
}
}
public static void main(String args[]) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "PatentCitation");
FileSystem fs = FileSystem.get(conf);
job.setJarByClass(MyJob.class);
FileInputFormat.addInputPath(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Mymapper.class);
job.setReducerClass(Myreducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
if(fs.exists(new Path(args[1]))){
//If exist delete the output path
fs.delete(new Path(args[1]),true);
}
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Same question is asked here, I used the Iterable value in my reduce function as the answer suggested in that thread. But that doesnt fix the issue. I cannot comment there since my reputation score is low. So created the new Thread
Kindly help me where am doing wrong.

You have made few mistakes in your program. Following are the mistakes:
In the driver, following statement should be called before instantiating the Job class:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
In reducer, you should put the StringBuilder inside the reduce() function.
I have modified your code as below and I got the output:
E:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop fs -cat /out/part-r-00000
1 c,b,a
2 d,s
Modified code:
package patentcitation;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyJob
{
public static class Mymapper extends Mapper <Text, Text, Text, Text>
{
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
context.write(key, value);
}
}
public static class Myreducer extends Reducer<Text,Text,Text,Text>
{
public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException
{
StringBuilder str = new StringBuilder();
for(Text x : value)
{
if(str.length() > 0)
{
str.append(",");
}
str.append(x.toString());
}
context.write(key, new Text(str.toString()));
}
}
public static void main(String args[]) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",",");
Job job = Job.getInstance(conf, "PatentCitation");
FileSystem fs = FileSystem.get(conf);
job.setJarByClass(MyJob.class);
FileInputFormat.addInputPath(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Mymapper.class);
job.setReducerClass(Myreducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
/*if(fs.exists(new Path(args[1]))){
//If exist delete the output path
fs.delete(new Path(args[1]),true);
}*/
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

How to override the default sorting of Hadoop

I have a map-reduce job in which the keys are numbers from 1-200. My intended output was (number,value) in the number order.
But I'm getting the output as :
1 value
10 value
11 value
:
:
2 value
20 value
:
:
3 value
I know this is due to the default behavior of Map-Reduce to sort keys in ascending order.
I want my keys to be sorted in numerical order only. How can I achieve this?

If I had to take a guess, I'd say that you are storing your numbers as Text objects and not IntWritable objects.
Either way, once you have more than one reducer, only the items within a reducer will be sorted, but it won't be totally sorted.

The default WritableComparator in MapReduce framework would normally handle your numerical ordering if the key was IntWritable. I suspect it's getting a Text key thus resulting in lexicographical ordering in your case. Please have a look at the sample code which uses IntWritable key to emit the values:
1) Mapper Implementaion
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SourceFileMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
private static final String DEFAULT_DELIMITER = "\t";
private IntWritable keyToEmit = new IntWritable();
private Text valueToEmit = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
keyToEmit.set(Integer.parseInt(line.split(DEFAULT_DELIMITER)[0]));
valueToEmit.set(line.split(DEFAULT_DELIMITER)[1]);
context.write(keyToEmit, valueToEmit);
}
}
2) Reducer Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SourceFileReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
for (Text value : values) {
context.write(key, value);
}
}
}
3) Driver Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class SourceFileDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Path inputPath = new Path(args[0]);
Path outputDir = new Path(args[1]);
// Create configuration
Configuration conf = new Configuration(true);
// Create job
Job job = new Job(conf, "SourceFileDriver");
job.setJarByClass(SourceFileDriver.class);
// Setup MapReduce
job.setMapperClass(SourceFileMapper.class);
job.setReducerClass(SourceFileReducer.class);
job.setNumReduceTasks(1);
// Specify key / value
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
// Input
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);
// Output
FileOutputFormat.setOutputPath(job, outputDir);
job.setOutputFormatClass(TextOutputFormat.class);
// Delete output if exists
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(outputDir))
hdfs.delete(outputDir, true);
// Execute job
int code = job.waitForCompletion(true) ? 0 : 1;
System.exit(code);
}
}
Thank you!

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

When to use NLineInputFormat in Hadoop Map-Reduce? - java

Related

How do i create custom mapper class in mapreduce

Mapreduce Mapper explaination

Hadoop Jar runs but no output. Driver, mapper and reduce compiles successfully in namenode

Reducer Writing Mapper Output into Output file

How to override the default sorting of Hadoop

Categories

Resources