KeyValueTextInputFormat in Driver class - java

In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than the default tab to separate the key and value?
Sample Input :
106298345|Surender,Raja,CTS,50000,Chennai
106297845|Murali,Bala,TCS,60000,Chennai
106291271|Rajagopal,Ravi,CTS,50000,Chennai
106298616|Vikram,Darma,TCS,70000,Chennai
106299100|Kumar,Selvam,TCS,90000,Chennai
106299288|Sandeep,Krishna,CTS,10000,Chennai
106290071|Vimal,Pillai,TCS,20000,Chennai
I am specifying KeyValueTextInputFormat as :
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "|");
Job myhadoopJob = new Job(conf);
My mapper code is below:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class KeyValueMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        String mapOutPutValue = "";
        String line = value.toString();
        String[] details = line.split(",");
        for (int i = 0; i < details.length; i++) {
            if (details[i].equalsIgnoreCase("TCS")) {
                mapOutPutValue = line;
            }
        }
        if (!mapOutPutValue.isEmpty()) {
            context.write(key, new Text(mapOutPutValue));
        }
    }
}
But my mapper class is printing every record of my input file; it is not filtering the input according to the logic in the map method.
Can someone help me?

Please try the below option in driver code.
conf.set("key.value.separator.in.input.line", "|");

Related

Hadoop WordCount code, the following errors are shown

I was reading and implementing this tutorial. At the end, I implemented the three classes: Mapper, Reducer, and Driver. I copied the exact code given on the webpage for all three classes, but the following two errors didn't go away:
Mapper Class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase // Here WordCountMapper was underlined as error source by Eclipse
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(WritableComparable key, Writable value,
            OutputCollector output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
The error was:
The type WordCountMapper must implement the inherited abstract method
Mapper.map(LongWritable, Text,
OutputCollector, Reporter)
Driver Class (WordCount.java)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class WordCount {

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount.class);

        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // specify input and output dirs
        FileInputPath.addInputPath(conf, new Path("input")); //////////FileInputPath was underlined
        FileOutputPath.addOutputPath(conf, new Path("output")); ////////FileOutputPath as underlined

        // specify a mapper
        conf.setMapperClass(WordCountMapper.class);

        // specify a reducer
        conf.setReducerClass(WordCountReducer.class);
        conf.setCombinerClass(WordCountReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The errors were:
FileInputPath cannot be resolved
FileOutputPath cannot be resolved
Use this
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
or
FileInputFormat.addInputPath(conf, new Path("inputfile.txt"));
FileOutputFormat.setOutputPath(conf,new Path("outputfile.txt"));
instead of this
// specify input and output dirs
FileInputPath.addInputPath(conf, new Path("input")); //////////FileInputPath was underlined
FileOutputPath.addOutputPath(conf, new Path("output")); ////////FileOutputPath as underlined
These are the classes to import in this case:
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
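As for the first error ("must implement the inherited abstract method"): in the old API the Mapper interface is generic, so the map method has to be declared with the concrete type parameters rather than the raw WritableComparable/Writable/OutputCollector types. A minimal sketch of the corrected signature, matching the class declaration Mapper<LongWritable, Text, Text, IntWritable> (my adjustment, not the tutorial's exact code):
public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
    }
}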

How to override the default sorting of Hadoop

I have a map-reduce job in which the keys are numbers from 1-200. My intended output was (number, value) in numerical order.
But I'm getting the output as:
1 value
10 value
11 value
:
:
2 value
20 value
:
:
3 value
I know this is due to the default behavior of Map-Reduce to sort keys in ascending order.
I want my keys to be sorted in numerical order only. How can I achieve this?
If I had to take a guess, I'd say that you are storing your numbers as Text objects and not IntWritable objects.
Either way, once you have more than one reducer, only the items within each reducer will be sorted; the output will not be totally sorted across reducers.
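If a totally sorted result across several reducers is actually required, the usual tool is TotalOrderPartitioner (in org.apache.hadoop.mapreduce.lib.partition) together with InputSampler. A rough sketch under the assumption that the map input keys are already IntWritable (for example a SequenceFile input), so that the sampler sees the same key type the partitioner will range-partition on:
// Added to the driver after the mapper/reducer classes are configured.
job.setNumReduceTasks(4);
job.setPartitionerClass(TotalOrderPartitioner.class);

// The partitioner reads its range boundaries from this file at run time.
Path partitionFile = new Path("/tmp/partitions.lst");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

// Sample 10% of the records (up to 1000 samples) to choose the boundaries.
InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 1000);
InputSampler.writePartitionFile(job, sampler);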
The default WritableComparator in the MapReduce framework would normally handle your numerical ordering if the key were an IntWritable. I suspect it's getting a Text key, resulting in lexicographical ordering in your case. Please have a look at the sample code below, which uses an IntWritable key to emit the values:
1) Mapper Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SourceFileMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final String DEFAULT_DELIMITER = "\t";

    private IntWritable keyToEmit = new IntWritable();
    private Text valueToEmit = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        keyToEmit.set(Integer.parseInt(line.split(DEFAULT_DELIMITER)[0]));
        valueToEmit.set(line.split(DEFAULT_DELIMITER)[1]);
        context.write(keyToEmit, valueToEmit);
    }
}
2) Reducer Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SourceFileReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
3) Driver Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class SourceFileDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = new Job(conf, "SourceFileDriver");
        job.setJarByClass(SourceFileDriver.class);

        // Setup MapReduce
        job.setMapperClass(SourceFileMapper.class);
        job.setReducerClass(SourceFileReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
Thank you!

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB, in which a single record consists of 4 lines. The processing for every record is the same, but within each record, the four lines are processed differently.
I'm new to Hadoop, so I wanted guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format :
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
In your case, you can use it to send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.out
                    .printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}
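As a side note, instead of setting the property by name, recent releases of the new-API NLineInputFormat also expose a convenience setter. A small sketch, equivalent to the configuration line in the driver above:
// Same effect as mapreduce.input.lineinputformat.linespermap = 4
NLineInputFormat.setNumLinesPerSplit(job, 4);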

Eclipse - Run on Hadoop does not prompt anything

I'm trying to build a simple WordCount Hadoop project (https://developer.yahoo.com/hadoop/tutorial/module3.html#running), but when I click "Run on Hadoop" there is no action at all; in fact, nothing is displayed in the console.
Here is my project structure (screenshot not included).
Here is my wordcount job file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class WordCount {

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addResource(new Path("/HADOOP_HOME/conf/hadoop-default.xml"));
        config.addResource(new Path("/HADOOP_HOME/conf/hadoop-site.xml"));

        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount.class);

        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // specify input and output dirs
        FileInputPath.addInputPath(conf, new Path("input"));
        FileOutputPath.addOutputPath(conf, new Path("output"));

        // specify a mapper
        conf.setMapperClass(WordCountMapper.class);

        // specify a reducer
        conf.setReducerClass(WordCountReducer.class);
        conf.setCombinerClass(WordCountReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I think the problem is with the jar files you used for the Hadoop client-to-server connection, so what happens is that it gets stuck at the following line while trying to reach the server:
Configuration config = new Configuration();
Try to debug it and let us know if you face any more problems.
If not, try the following: have you tried running the program from Eclipse by pointing it at your core-site.xml and hdfs-site.xml files?
config.addResource(new Path("path-to-your-core-site.xml file"));
config.addResource(new Path("path-to-your-hdfs-site.xml file"));
and
FileInputFormat.addInputPath(conf, new Path("hdfs path to your input file"));
FileOutputFormat.setOutputPath(conf, new Path("hdfs path to your output file"));
See whether it works and get back to us.
Try this:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<Object, Text, Text, IntWritable> {

        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            System.out.println(line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                output.collect(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception, IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path("/home/user17/test.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("hdfs://localhost:9000/out2"));

        JobClient.runJob(conf);
    }
}
I had the exact same problem, and I have just figured it out.
Add parameters in the run configurations.
Right Click WordCount.java > Run As > Run Configurations > Java Application > Word Count > Arguments
Enter this hdfs://hadoop:9000/ hdfs://hadoop:9000/
Click Apply and Finish, then run again.
After running, refresh the project and the result is in the output folder.

Map Reduce: WordCount doesn't produce anything

I want to write my own word count example using MapReduce and Hadoop v1.0.3 (I'm on macOS), but I don't understand why it doesn't work.
Sharing my code:
main:
package org.myorg;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);

        // set job name, mapper, combiner, and reducer classes
        conf.setJobName("WordCount");

        // set input and output formats
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // set input and output paths
        //FileInputFormat.setInputPaths(conf, new Path(input));
        //FileOutputFormat.setOutputPath(conf, new Path(output));
        FileOutputFormat.setCompressOutput(conf, false);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(org.myorg.Map.class);
        conf.setReducerClass(org.myorg.Reduce.class);

        String host = args[0];
        String input = host + "/" + args[1];
        String output = host + "/" + args[2];

        // set input and output paths
        FileInputFormat.addInputPath(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        JobClient j = new JobClient(conf);
        (j.submitJob(conf)).waitForCompletion();
    }
}
Mapper:
package org.myorg;
import java.io.IOException;
import java.util.HashMap;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.Vector;
import java.util.Map.Entry;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        MapWritable hs = new MapWritable();
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            //hs.put(word, one);
            output.collect(word, one);
        }
        // TODO Auto-generated method stub
    }
}
Reducer:
package org.myorg;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.math.RoundingMode;
import java.net.URI;
import java.net.URISyntaxException;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.Map.Entry;
import java.util.Vector;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Reducer.Context;
//public class Reduce extends MapReduceBase implements Reducer<Text, MapWritable, Text, Text> {
public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, Text> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        String host = "hdfs://localhost:54310/";
        String tmp = host + "Temporany/output.txt";
        FileSystem srcFS;
        try {
            srcFS = FileSystem.get(new URI(tmp), new JobConf());
            srcFS.delete(new Path(tmp), true);
            BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(
                    srcFS.create(new Path(tmp))));
            wr.write(key.toString() + ":" + sum);
            wr.close();
            srcFS.close();
        } catch (URISyntaxException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        // context.write(key, new IntWritable(sum));
    }
}
The job starts and ends with no error but doesn't write the output file.
I launch the jar with this command:
./Hadoop jar /Users/User/Desktop/hadoop/wordcount.jar hdfs://localhost:54310 /In/testo.txt /Out/wordcount17
This is the output:
2014-03-03 17:56:22.063 java[6365:1203] Unable to load realm info from SCDynamicStore
14/03/03 17:56:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/03 17:56:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/03 17:56:23 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/03 17:56:23 INFO mapred.FileInputFormat: Total input paths to process : 1
I suppose the problem is "unable to load native-hadoop library", but that works fine for other jars.
Q: The job starts and ends with no error but doesn't write the output file?
Ans: I am not sure that the job really ends successfully without error.
Problems :
Job Configuration:
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same. If they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass().
The output classes in your case are:
Map key: Text
Map value: IntWritable
Reduce key: Text
Reduce value: Text
which will result in a type mismatch exception.
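If the map and reduce output types really are meant to differ, the old API lets you declare the map output types separately in the driver; a small sketch of the relevant JobConf calls (my illustration, not the original code):
// Map emits (Text, IntWritable); reduce emits (Text, Text).
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);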
Reduce
I am not sure why you are using the HDFS API to write your output to a file; you should use output.collect(key, value).
In case of multiple reducers, are you handling the simultaneous write operations?
And I wonder what context.write would do with the old API (it's commented out).
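For a plain word count, the simpler fix is to let the reducer emit (Text, IntWritable) and write through the OutputCollector instead of the HDFS API. A minimal sketch under that assumption (the driver would then keep conf.setOutputValueClass(IntWritable.class)):
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Let the framework write the result; no manual FileSystem handling needed.
        output.collect(key, new IntWritable(sum));
    }
}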
You can use the following for more information when debugging your map-reduce job:
Counters
JobTracker web interface
Q. What is the difference between submitJob() and waitForCompletion()?
Ans: submitJob() submits the job and returns immediately.
waitForCompletion() submits the job and prints the status of the job on the console until it completes.
So waitForCompletion() is submitJob() plus status updates of the job until it is complete.
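In old-API terms the difference can also be seen directly in code (a small sketch using org.apache.hadoop.mapred.JobClient and RunningJob):
// Non-blocking: submit the job and get a handle back immediately.
RunningJob running = new JobClient(conf).submitJob(conf);
// Block until the job finishes; JobClient.runJob(conf) bundles submit + wait + progress printing.
running.waitForCompletion();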
Please also read the Apache MapReduce tutorial and the word count example.
You can also find hadoop-examples-X.X.X.jar in your installation folder.
Go through $HADOOP_HOME/src/examples/ for the source code ($HADOOP_HOME = Hadoop installation folder).
