Issue with specifying the separator for KeyValueTextInputFormat - java

I was running a MapReduce job to count the number of citations of patents in a dataset. The input format is KeyValueTextInputFormat and the separator is a comma. I am using the new API. However, specifying the separator in the Configuration with mapreduce.input.keyvaluelinerecordreader.key.value.separator does not work, while key.value.separator.in.input.line (which is meant for the old API) does.
When I go to the declaration of the KeyValueTextInputFormat class, open the KeyValueLineRecordReader class it uses, and check the KEY_VALUE_SEPERATOR constant, it is set to "mapreduce.input.keyvaluelinerecordreader.key.value.separator". So setting this key in the Configuration should have worked, but it doesn't.
Here's the driver code:
package stubs.reverse_index;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // THIS DOESN'T WORK
        // conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        // THIS WORKS
        conf.set("key.value.separator.in.input.line", ",");

        Job job = new Job(conf);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(InverseMapper.class);
        job.setReducerClass(CsvReducer.class);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        boolean success = job.waitForCompletion(true);
        return (success ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}
What could be the issue here? As I understand it, key.value.separator.in.input.line belongs to the old API and should not have had any effect here.
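Until the root cause is found, one pragmatic workaround is to set both the old and the new property names before constructing the Job, so whichever key the Hadoop release actually running the job reads is covered. The helper below is only a sketch of that workaround (the class name is illustrative, not a Hadoop type), not an explanation of why the new key is ignored.

import org.apache.hadoop.conf.Configuration;

// Illustrative helper (not part of Hadoop): sets the comma separator under both
// the old-API and new-API property names, so whichever one the running Hadoop
// version reads takes effect. Call it on the Configuration before creating the Job.
public class SeparatorConfig {

    public static void setCommaSeparator(Configuration conf) {
        conf.set("key.value.separator.in.input.line", ",");                              // old property name
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");   // new property name
    }
}

One guess, and only a guess, is that the Hadoop jars on the job's classpath are an older release than the source being browsed, so the record reader still looks up the old property name; checking the jar versions on the classpath would confirm or rule that out.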

Related

How do I create a custom mapper class in MapReduce

I have a unique requirement where I have to read zip shell commands from a text file and have the mapper run them, so that the zip files are created in parallel using mappers only. I am thinking of executing the shell commands with exec in Java (see the sketch after the sample input below). I am a bit stuck on how to implement the custom mapper, since my output would be in a compressed format.
Below is my mapper class:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, NullWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, NullWritable.get());
        }
    }
}
Processor class:
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;

public class ZipProcessor extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ZipProcessor(), args);
        System.exit(exitCode);
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s needs two arguments, input and output files\n", getClass().getSimpleName());
            return -1;
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "zipping");
        job.setJarByClass(ZipProcessor.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(Map.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        int returnValue = job.waitForCompletion(true) ? 0 : 1;
        if (job.isSuccessful()) {
            System.out.println("Job was successful");
        } else {
            System.out.println("Job was not successful");
        }
        return returnValue;
    }
}
Sample mapr.txt
zip -r "/folder1/file.zip" "sourceFolder"
zip -r "/folder2/file.zip" "sourceFolder"
zip -r "/folder3/file.zip" "sourceFolder"
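Since each line of mapr.txt is already a complete shell command, one way to follow the exec idea is a mapper that runs the incoming line through a shell and emits the command together with its exit code. The sketch below is illustrative only: it assumes the zip binary is installed on every worker node and that the source and destination paths in mapr.txt are valid on each node's local filesystem (the class name is made up, not a Hadoop type).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper sketch: runs each input line as a shell command and emits
// (command, exit code). Assumes the zip binary and the paths named in mapr.txt
// exist on the local filesystem of every node that runs a map task.
public class ZipCommandMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String command = value.toString().trim();
        if (command.isEmpty()) {
            return; // skip blank lines
        }
        // Run the line through a shell so the quoting in mapr.txt is honoured.
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", command);
        pb.redirectErrorStream(true);
        Process process = pb.start();
        int exitCode = process.waitFor();
        context.write(new Text(command), new IntWritable(exitCode));
    }
}

Pairing this with NLineInputFormat set to one line per split would give one command per map task, but that is a design choice rather than a requirement.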

Not able to use CompositeInputFormat in a map-side join

I am trying to implement a map-side join using CompositeInputFormat, but I am getting errors in the MapReduce job that I am unable to resolve. In the code below, the call to the compose method fails to compile, and setting the input format class also produces an error. The compiler error says:
The method compose(String, Class, Path...) in the type CompositeInputFormat is not applicable for the arguments (String, Class, Path[])
Can someone please help?
package Hadoop.MR.Practice;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapJoinJob implements Tool {

    private Configuration conf;

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapSideJoinJob");
        job.setJarByClass(this.getClass());

        Path[] inputs = new Path[] { new Path(args[0]), new Path(args[1]) };
        String join = CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, inputs);
        job.getConfiguration().set("mapreduce.join.expr", join);
        job.setInputFormatClass(CompositeInputFormat.class);

        job.setMapperClass(MapJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Configuring reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(0);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MapJoinJob mjJob = new MapJoinJob();
        ToolRunner.run(conf, mjJob, args);
    }
}
I would say your problem is likely related to mixing Hadoop APIs. You can see that your imports mix mapred and mapreduce.
For example, you are trying to use org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat with org.apache.hadoop.mapred.join.CompositeInputFormat, which is unlikely to work.
You should choose one (probably mapreduce, I would say) and make sure everything uses the same API.
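Below is a minimal sketch of what the driver setup might look like when everything comes from the mapreduce packages, assuming a Hadoop 2.x release where the join classes also exist under org.apache.hadoop.mapreduce.lib.join; the class name and the omitted mapper are placeholders, not part of any Hadoop API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSideJoinDriverSketch {

    public static Job buildJob(Configuration conf, Path left, Path right, Path out) throws Exception {
        Job job = Job.getInstance(conf, "MapSideJoinJob");
        job.setJarByClass(MapSideJoinDriverSketch.class);

        // Everything here comes from the mapreduce (new) API, including the join classes.
        String joinExpr = CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class, left, right);
        job.getConfiguration().set("mapreduce.join.expr", joinExpr);
        job.setInputFormatClass(CompositeInputFormat.class);

        // A mapper (not shown) would receive Text keys and TupleWritable values from the join.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }
}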

Hadoop program (Java) to read a comma-separated input file

I have an input file that looks like this:
000001928162247ffaf63185cd8b2a244c78e7c6,2009324,abcat0101001,Sharp,"2011-09-05 12:25:37.42","2011-09-05 12:25:01.187"
0001be1731ee7d1c519bc7e87110c9eb880cb396,1649294,abcat0715001,"Gunnar eyewear","2011-09-23 17:13:36.175","2011-09-23 17:12:18.389"
0001bfa0c494c01f9f8c141c476c11bb4625a746,17240521,cat02015,refrigerator,"2011-10-19 23:43:51.71","2011-10-19 23:43:06.485"
0001fb09f03fea4d04e2267ed3194c806839d997,1271997,abcat0513004,Razer,"2011-09-07 09:20:07.11","2011-09-07 09:19:03.279"
0002965b083b6e508f7740c47c8f39e1072b4219,3562379,pcmcat209400050001,"I phone 4","2011-10-27 14:10:31.92","2011-10-27 14:09:33.327"
0002bb28a9ca07f5515b01996fd5d7ca84742e41,3230638,pcmcat177200050009,"hd antenna","2011-10-20 00:03:49.966","2011-10-20 00:02:01.458"
0002bd9c3d654698bb514194c4f4171ad6992266,9947181,pcmcat253300050012,printer,"2011-10-06 19:51:40.984","2011-10-06 19:47:13.803"
0002fee45e1c32eb94e82fc6c15c4db14e796248,3519969,pcmcat247400050000,vaio,"2011-10-19 23:31:51.015","2011-10-19 23:31:12.213"
00042033d355973baf9454b021a15c6b5b48f4a3,2677297,pcmcat212600050008,"desk top","2011-08-29 12:03:38.265","2011-08-29 12:03:12.348"
000433e0ef411c2cb8ee1727002d6ba15fe9426b,8959317,cat02015,"how i met your mother","2011-09-17 19:44:40.129","2011-09-17 19:43:37.564"
which contains the following fields:
user_id,product_id,category,query,click_time,query_time
I want to read this file in Hadoop and extract the user_id and the category (the 1st and 3rd fields). I have a basic Hadoop program, shown below, that I previously used for word count. In this task the fields are separated by commas, and I have to store them in an ArrayList.
Here is my starting program:
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;

public class popularcats extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());

        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new popularcats(), args);
        System.exit(exitCode);
    }
}
I suppose Hadoop must have some classes for reading a CSV file. I found the class CSVLineRecordReader at this address:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
So how should I process this file and extract the desired fields?
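Since none of the quoted fields in the sample contain embedded commas, a plain split on the comma is enough to pull out the 1st and 3rd columns; a real CSV reader such as the CSVLineRecordReader linked above is only needed if fields can contain commas inside quotes. Below is a minimal mapper sketch (class and field names are illustrative) that emits the category as the key and the user_id as the value.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper sketch: splits each comma-separated line and emits
// (category, user_id). Assumes no field contains an embedded comma, which
// holds for the sample shown above.
public class UserCategoryMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text category = new Text();
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return; // malformed line, skip it
        }
        userId.set(fields[0]);      // 1st field: user_id
        category.set(fields[2]);    // 3rd field: category
        context.write(category, userId);
    }
}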

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB in which a single record consists of 4 lines. Every record is processed the same way, but within each record the four lines are processed differently.
I'm new to Hadoop, so I would like some guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format:
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat you can specify exactly how many lines should go to each mapper.
In your case, you can send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

Error in setting job.setInputFormatClass in MapReduce

I am running a MapReduce program. I need to supply the input text file in the form of key/value pairs, so I write
job.setInputFormatClass(KeyValueTextInputFormat.class);
but the Eclipse compiler shows an error saying that I can't use this InputFormat.
I still need to set the input format to KeyValueTextInputFormat. How do I do this? Any ideas?
My code is:
package com.iot.dictionary;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import com.iot.dictionary.Dictionary.AllTranslationsReducer;
import com.iot.dictionary.Dictionary.WordMapper;

public class Driver2 {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "dictionary");
        System.out.println("Job-> " + job.toString());
        job.setJarByClass(Dictionary.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(AllTranslationsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
If you are using the new Hadoop API (Hadoop 0.20.2 and above), you have to import the KeyValueTextInputFormat class from the package org.apache.hadoop.mapreduce.lib.input, and if you are using the old Hadoop API, you have to import it from org.apache.hadoop.mapred.
You can see this line in your code:
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
Change it to
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
Hope this helps.
Thanks
