Run WordCount example MapReduce on AWS EMR - Java

I am trying to run the word count example on AWS EMR, but I am having a hard time deploying and running the jar on the cluster. It is a customized word count example in which I do some JSON parsing. The input is in my S3 bucket. When I try to run the job on the EMR cluster, I get an error saying that the main function was not found in my Mapper class. Everywhere on the internet, the code for the word count MapReduce job looks the same: three classes, a static Mapper class that extends Mapper, a Reducer that extends Reducer, and a main class that contains the job configuration, so I am not sure why I am seeing this error. I build my code with the Maven assembly plugin so that all the third-party dependencies are wrapped into my JAR. Here is the code I have written:
package com.amalwa.hadoop.MapReduce;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.google.gson.Gson;

public class ETL {

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: ETL <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "etl");
        job.setJarByClass(ETL.class);

        job.setMapperClass(JsonParserMapper.class);
        job.setReducerClass(JsonParserReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TweetArray.class);

        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }

    public static class JsonParserMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text mapperKey = null;
        private Text mapperValue = null;
        Date filterDate = getDate("Sun Apr 20 00:00:00 +0000 2014");

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String jsonString = value.toString();
            if (!jsonString.isEmpty()) {
                @SuppressWarnings("unchecked")
                Map<String, Object> tweetData = new Gson().fromJson(jsonString, HashMap.class);
                Date timeStamp = getDate(tweetData.get("created_at").toString());
                if (timeStamp.after(filterDate)) {
                    @SuppressWarnings("unchecked")
                    com.google.gson.internal.LinkedTreeMap<String, Object> userData = (com.google.gson.internal.LinkedTreeMap<String, Object>) tweetData.get("user");
                    mapperKey = new Text(userData.get("id_str") + "~" + tweetData.get("created_at").toString());
                    mapperValue = new Text(tweetData.get("text").toString() + " tweetId = " + tweetData.get("id_str"));
                    context.write(mapperKey, mapperValue);
                }
            }
        }

        public Date getDate(String timeStamp) {
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat("E MMM dd HH:mm:ss Z yyyy");
            Date date = null;
            try {
                date = simpleDateFormat.parse(timeStamp);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return date;
        }
    }

    public static class JsonParserReducer extends Reducer<Text, Text, Text, TweetArray> {
        private ArrayList<Text> tweetList = new ArrayList<Text>();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) {
                tweetList.add(new Text(val.toString()));
            }
            context.write(key, new TweetArray(Text.class, tweetList.toArray(new Text[tweetList.size()])));
        }
    }
}
Please, if someone could clarify this problem, it would be really nice. I have deployed this jar on my local machine, on which I installed Hadoop, and it works fine; but when I set up my cluster on AWS and provide the streaming job with all the parameters, it doesn't work. Here is a screenshot of my configuration:
The Mapper textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserMapper
The Reducer textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserReducer
Thanks and regards.

You need to select a custom JAR step instead of a streaming program.
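For example, adding the step from the AWS CLI instead of the console would look roughly like this. This is only a sketch: the cluster id, bucket, and paths are placeholders, and the main class is passed as the first element of Args, which works when the JAR's manifest does not name a main class.

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=ETL,ActionOnFailure=CONTINUE,Jar=s3://your-bucket/MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar,Args=[com.amalwa.hadoop.MapReduce.ETL,s3://your-bucket/input/,s3://your-bucket/output/]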

When you make the jar file (I usually do it using Eclipse or a custom Gradle build), check whether your main class is set to ETL. Apparently, that does not happen by default. Also check the Java version you are using on your system; I think AWS EMR works with up to Java 7.
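Since the JAR is built with the Maven assembly plugin, one way to pin the main class is through the plugin's manifest configuration. This is only a sketch of the usual setup, not the asker's actual pom:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
            <manifest>
                <mainClass>com.amalwa.hadoop.MapReduce.ETL</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>

With a Main-Class entry in the manifest, the custom JAR step can be run without spelling out the class name as an argument.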

Related

Not able to use CompositeInputFormat in map-side join

I am trying to implement a map-side join using CompositeInputFormat, but I am getting the following error in the MapReduce job, which I am unable to resolve. In the code below I get an error when using the compose method, and also an error when setting the input format class. The error says:
The method compose(String, Class, Path...) in the type CompositeInputFormat is not applicable for the arguments (String, Class, Path[])
Can someone please help?
package Hadoop.MR.Practice;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapJoinJob implements Tool {

    private Configuration conf;

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapSideJoinJob");
        job.setJarByClass(this.getClass());

        Path[] inputs = new Path[] { new Path(args[0]), new Path(args[1]) };
        String join = CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, inputs);
        job.getConfiguration().set("mapreduce.join.expr", join);
        job.setInputFormatClass(CompositeInputFormat.class);

        job.setMapperClass(MapJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Configuring reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(0);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MapJoinJob mjJob = new MapJoinJob();
        ToolRunner.run(conf, mjJob, args);
    }
}
I would say your problem is likely related to mixing Hadoop APIs. You can see that your imports are mixing mapred and mapreduce.
For example, you're trying to use org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat with org.apache.hadoop.mapred.join.CompositeInputFormat, which is unlikely to work.
You should choose one (probably mapreduce, I would say) and make sure everything uses the same API.
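For illustration, here is a minimal sketch of how the same job could be wired entirely against the new (mapreduce) API, assuming the two inputs are sorted and identically partitioned as a map-side join requires. Class names other than the Hadoop ones are made up for the example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NewApiMapJoin {

    // With CompositeInputFormat the mapper receives the join key and a
    // TupleWritable holding one value per joined input.
    public static class JoinMapper extends Mapper<Text, TupleWritable, Text, Text> {
        @Override
        protected void map(Text key, TupleWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(value.get(0) + "\t" + value.get(1)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "new-api map-side join");
        job.setJarByClass(NewApiMapJoin.class);

        // Build the join expression from the new-API input format only.
        job.getConfiguration().set("mapreduce.join.expr",
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        new Path(args[0]), new Path(args[1])));
        job.setInputFormatClass(CompositeInputFormat.class);

        job.setMapperClass(JoinMapper.class);
        job.setNumReduceTasks(0); // map-side join, no reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}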

MapReduce - Reduce is not called

I've been trying to run this project, which I found on the internet and altered for my purposes.
The map function is called and works properly; I checked the results from the console. But reduce is not getting called.
The first two digits are the key and the rest is the value.
I've checked the match between the map output and reduce input key/value pairs; I've changed them many times and tried different things, but couldn't get a solution.
Since I'm a beginner in this topic there is probably a small mistake. I wrote another project and had the same problem again: "reduce is not called".
I also tried changing the output value class of the reduce to IntWritable or Text instead of MedianStdDevTuple and configured the job accordingly, but nothing changed.
I don't need only the solution; I want to know the reason as well. Thanks.
Here is the code:
package usercommend;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.htrace.commons.logging.LogFactory;

import usercommend.starter.map;

public class starter extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new starter(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "starter");
        job.setJarByClass(this.getClass());
        job.setMapperClass(map.class);
        job.setReducerClass(reduces.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(MedianStdDevTuple.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private IntWritable outHour = new IntWritable();
        private IntWritable outCommentLength = new IntWritable();
        private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

        @SuppressWarnings("deprecation")
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //System.err.println(value.toString()+"vv");
            Map<String, String> parsed = transforXmlToMap1(value.toString());
            //System.err.println("1");
            String strDate = parsed.get("CreationDate");
            //System.err.println(strDate);
            String text = parsed.get("Text");
            //System.err.println(text);
            Date creationDate = new Date();
            try {
                // System.err.println("basla");
                creationDate = frmt.parse(strDate);
                outHour.set(creationDate.getHours());
                outCommentLength.set(text.length());
                System.err.println(outHour + "" + outCommentLength);
                context.write(outHour, outCommentLength);
            } catch (ParseException e) {
                // TODO Auto-generated catch block
                System.err.println("catch");
                e.printStackTrace();
                return;
            }
            //context.write(new IntWritable(2), new IntWritable(12));
        }

        public static Map<String, String> transforXmlToMap1(String xml) {
            Map<String, String> map = new HashMap<String, String>();
            try {
                String[] tokens = xml.trim().substring(5, xml.trim().length() - 3).split("\"");
                for (int i = 0; i < tokens.length - 1; i += 2) {
                    String key = tokens[i].trim();
                    String val = tokens[i + 1];
                    map.put(key.substring(0, key.length() - 1), val);
                    //System.err.println(val.toString());
                }
            } catch (StringIndexOutOfBoundsException e) {
                System.err.println(xml);
            }
            return map;
        }
    }

    public static class reduces extends Reducer<IntWritable, IntWritable, IntWritable, MedianStdDevTuple> {
        private MedianStdDevTuple result = new MedianStdDevTuple();
        private ArrayList<Float> commentLengths = new ArrayList<Float>();
        Log log = (Log) LogFactory.getLog(this.getClass());

        @Override
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            System.out.println("1");
            log.info("aa");
            float sum = 0;
            float count = 0;
            commentLengths.clear();
            result.setStdDev(0);
            for (IntWritable val : values) {
                commentLengths.add((float) val.get());
                sum += val.get();
                ++count;
            }
            Collections.sort(commentLengths);
            if (count % 2 == 0) {
                result.setMedian((commentLengths.get((int) count / 2 - 1) +
                        commentLengths.get((int) count / 2)) / 2.0f);
            } else {
                result.setMedian(commentLengths.get((int) count / 2));
            }
            double avg = sum / commentLengths.size();
            double totalSquare = 0;
            for (int i = 0; i < commentLengths.size(); i++) {
                double diff = commentLengths.get(i) - avg;
                totalSquare += (diff * diff);
            }
            double stdSapma = Math.sqrt(totalSquare / (commentLengths.size()));
            result.setStdDev(stdSapma);
            context.write(key, result);
        }
    }
}
sample input
<row Id="2" PostId="7" Score="0" Text="I see what you mean, but I've had Linux systems set up so that if the mouse stayed on a window for a certain time period (greater than zero), then that window became active. That would be one solution. Another would be to simply let clicks pass to whatever control they are over, whether it is in the currently active window or not. Is that doable?" CreationDate="2010-08-17T19:38:20.410" UserId="115" />
<row Id="3" PostId="13" Score="1" Text="I am using Iwork and OpenOffice right now But I need some features that just MS has it." CreationDate="2010-08-17T19:42:04.487" UserId="135" />
<row Id="4" PostId="17" Score="0" Text="I've been using that on my MacBook Pro since I got it, with no issues. Last week I got an iMac and immediately installed StartSound.PrefPane but it doesn't work -- any ideas why? The settings on the two machines are identical (except the iMac has v1.1b3 instead of v1.1b2), but one is silent at startup and the other isn't...." CreationDate="2010-08-17T19:42:15.097" UserId="115" />
<row Id="5" PostId="6" Score="0" Text="+agreed. I would add that I think you can choose to not clone everything so it takes less time to make a bootable volume" CreationDate="2010-08-17T19:44:00.270" UserId="2" />
<row Id="6" PostId="22" Score="2" Text="Applications are removed from memory by the OS at it's discretion. Just because they are in the 'task manager' does not imply they are running and in memory. I have confirmed this with my own apps.
After a reboot, these applications are not reloaded until launched by a user." CreationDate="2010-08-17T19:46:01.950" UserId="589" />
<row Id="7" PostId="7" Score="0" Text="Honestly, I don't know. It's definitely interesting though. I'm currently scouring Google, since it would save on input clicks. I'm just concerned that any solution might get a little "hack-y" and not behave consistently in all UI elements or applications. The last thing I'd want is to not know if I'm focusing a window or pressing a button :(" CreationDate="2010-08-17T19:50:00.723" UserId="421" />
<row Id="8" PostId="21" Score="3" Text="Could you expand on the features for those not familiar with ShakesPeer?" CreationDate="2010-08-17T19:51:11.953" UserId="581" />
<row Id="9" PostId="23" Score="1" Text="Apple's vernacular is Safe Sleep." CreationDate="2010-08-17T19:51:35.557" UserId="171" />
You took this code? I'm guessing the problem is that you did not set the correct inputs and outputs for the Job.
Here is what you are trying to do based on your class definitions.
Map Input: (Object, Text)
Map Output: (IntWritable, IntWritable)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (IntWritable, MedianStdDevTuple)
But, based on your Job configuration
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(MedianStdDevTuple.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
It thinks you want to do this
Map Input: (Object, Text) -- I think it's actually LongWritable instead of Object, though, for file-split locations
Map Output: (IntWritable,MedianStdDevTuple)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (Text,IntWritable)
Notice how those are different? Your reducer is expecting to read in IntWritable instead of MedianStdDevTuple, and the outputs are also of the incorrect class; therefore, it doesn't run.
To fix, change your job configuration like so
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(MedianStdDevTuple.class);
Edit: I got it to run fine; the only thing I really changed, outside of the code in the link above, was this method in the mapper class.
public static Map<String, String> transforXmlToMap1(String xml) {
    Map<String, String> map = new HashMap<String, String>();
    try {
        String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
                .split("\"");
        for (int i = 0; i < tokens.length - 1; i += 2) {
            String key = tokens[i].replaceAll("[= ]", "");
            String val = tokens[i + 1];
            map.put(key, val);
            // System.err.println(val.toString());
        }
    } catch (StringIndexOutOfBoundsException e) {
        System.err.println(xml);
    }
    return map;
}

In MapReduce, how to send an ArrayList as value from mapper to reducer [duplicate]

This question already has an answer here:
Output a list from a Hadoop Map Reduce job using custom writable
(1 answer)
Closed 7 years ago.
How can we pass an ArrayList as a value from the mapper to the reducer?
My code has certain rules to work with and creates new values (String) based on the rules. I maintain all the outputs (generated after the rule execution) in a list, and now need to send this output (the mapper value) to the reducer, but I do not have a way to do so.
Can someone please point me in the right direction?
Adding code:
package develop;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

import utility.RulesExtractionUtility;

public class CustomMap {

    public static class CustomerMapper extends Mapper<Object, Text, Text, Text> {
        private Map<String, String> rules;

        @Override
        public void setup(Context context) {
            try {
                URI[] cacheFiles = context.getCacheFiles();
                setupRulesMap(cacheFiles[0].toString());
            } catch (IOException ioe) {
                System.err.println("Error reading state file.");
                System.exit(1);
            }
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Map<String, String> rules = new LinkedHashMap<String, String>();
            // rules.put("targetcolumn[1]", "ASSIGN(source[0])");
            // rules.put("targetcolumn[2]", "INCOME(source[2]+source[3])");
            // rules.put("targetcolumn[3]", "ASSIGN(source[1]");
            // Above is the "rules", which would basically create some list values from the source file.

            String[] splitSource = value.toString().split(" ");
            List<String> lists = RulesExtractionUtility.rulesEngineExecutor(splitSource, rules);

            // lists would have values like (name, age) for each line from a huge text file,
            // which is what I want to write to the context and pass to the reducer.
            // As of now I haven't implemented the reducer code, as I'm stuck with passing the value from the mapper.

            // context.write(new Text(), lists); ---- I do not have a way of doing this
        }

        private void setupRulesMap(String filename) throws IOException {
            Map<String, String> rule = new LinkedHashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader(filename));
            String line = reader.readLine();
            while (line != null) {
                String[] split = line.split("=");
                rule.put(split[0], split[1]);
                line = reader.readLine();
                // rules logic
            }
            rules = rule;
        }
    }

    public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: customerMapper <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(CustomMap.class);
        job.setMapperClass(CustomerMapper.class);
        job.addCacheFile(new URI("Some HDFS location"));

        URI[] cacheFiles = job.getCacheFiles();
        if (cacheFiles != null) {
            for (URI cacheFile : cacheFiles) {
                System.out.println("Cache file ->" + cacheFile);
            }
        }
        // job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
To pass an ArrayList from the mapper to the reducer, it's clear that the objects must implement the Writable interface. Why don't you try this library?
<dependency>
    <groupId>org.apache.giraph</groupId>
    <artifactId>giraph-core</artifactId>
    <version>1.1.0-hadoop2</version>
</dependency>
It has an abstract class:
public abstract class ArrayListWritable<M extends org.apache.hadoop.io.Writable>
extends ArrayList<M>
implements org.apache.hadoop.io.Writable, org.apache.hadoop.conf.Configurable
You could create your own class by filling in the abstract methods and implementing the interface methods with your own code. For instance:
public class MyListWritable extends ArrayListWritable<Text>{
...
}
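If you'd rather avoid the extra dependency, the same idea works with Hadoop's built-in ArrayWritable; the subclass below is a sketch of the usual pattern (the class name is made up), not code from the question:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class TextArrayWritable extends ArrayWritable {
    // Hadoop needs the no-arg constructor to deserialize instances.
    public TextArrayWritable() {
        super(Text.class);
    }

    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}

The mapper would then write context.write(key, new TextArrayWritable(texts)) and the driver would declare job.setMapOutputValueClass(TextArrayWritable.class).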
A way to do that (probably not the only one, nor the best one) would be to:
- serialize your list into a string to pass it as the output value in the mapper, and
- deserialize and rebuild your list from the string when you read the input value in the reducer.
If you do so, then you should also get rid of all special symbols in the string containing the serialized list (symbols like \n or \t, for instance). An easy way to achieve that is to use base64-encoded strings.
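Here is a minimal sketch of that idea, assuming Java 8's java.util.Base64 is available; the class name and the comma delimiter are illustrative, not from the question:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class ListCodec {

    // Mapper side: Base64-encode each element so it cannot contain the delimiter,
    // then join everything into one String that fits in a single Text value.
    public static String encode(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(Base64.getEncoder().encodeToString(v.getBytes(StandardCharsets.UTF_8)));
        }
        return sb.toString();
    }

    // Reducer side: split on the delimiter and decode each element back.
    public static List<String> decode(String encoded) {
        List<String> out = new ArrayList<String>();
        if (encoded.isEmpty()) {
            return out;
        }
        for (String token : encoded.split(",")) {
            out.add(new String(Base64.getDecoder().decode(token), StandardCharsets.UTF_8));
        }
        return out;
    }
}

In the mapper you would write context.write(someKey, new Text(ListCodec.encode(lists))), and in the reducer rebuild the list with ListCodec.decode(value.toString()).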
You should send Text objects instead of String objects. Then you can use object.toString() in your Reducer. Be sure to configure your driver properly.
If you post your code we will help you further.

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB, in which a single record consists of 4 lines. The processing for every record is the same, but each of the four lines inside a record is processed differently.
I'm new to Hadoop, so I'd like guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format :
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat you can specify exactly how many lines should go to each mapper.
In your case, you would send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}
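If you prefer not to hard-code the property name, the same setting can be made through the input format's own helper; this one-liner is equivalent to the setInt call above:
NLineInputFormat.setNumLinesPerSplit(job, 4);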

How to read multiple types of Avro data in single MapReduce

I have two different types of Avro data which have some common fields. I want to read those common fields in the mapper, and I want to do this by spawning a single job on the cluster.
Below are the sample Avro schemas.
Schema 1:
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaOne","doc":"Avro storing with schema using MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"BB","type":["null","string"],"default":null},
{"name":"CC","type":["null","string"],"default":null}]}
Schema 2:
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaTwo","doc":"Avro storing with schema using MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"CC","type":["null","string"],"default":null},
{"name":"DD","type":["null","string"],"default":null}]}
Driver Class:
package com.mango.schema.aggrDaily;

import java.util.Date;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(super.getConf(), getClass());
        conf.setJobName("DF");

        args[0] = "hdfs://localhost:9999/home/hadoop/work/alok/aggrDaily/data/avro512MB/part-m-00000.avro";
        args[1] = "/home/hadoop/work/alok/tmp"; // temp location
        args[2] = "hdfs://localhost:9999/home/hadoop/work/alok/tmp/10";

        FileInputFormat.addInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        AvroJob.setInputReflect(conf);
        AvroJob.setMapperClass(conf, AvroMapper.class);
        AvroJob.setOutputSchema(
                conf,
                Pair.getPairSchema(Schema.create(Schema.Type.STRING),
                        Schema.create(Schema.Type.INT)));

        RunningJob job = JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        long startTime = new Date().getTime();
        System.out.println("Start Time :::::" + startTime);
        Configuration conf = new Configuration();
        int exitCode = ToolRunner.run(conf, new AvroDriver(), args);
        long endTime = new Date().getTime();
        System.out.println("End Time :::::" + endTime);
        System.out.println("Total Time Taken:::"
                + new Double((endTime - startTime) * 0.001) + "Sec.");
        System.exit(exitCode);
    }
}
Mapper class:
package com.mango.schema.aggrDaily;

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;

public class AvroMapper extends
        AvroMapper<GenericData, Pair<CharSequence, Integer>> {

    @Override
    public void map(GenericData record,
            AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
            throws IOException {
        System.out.println("record :: " + record);
    }
}
I am able to read Avro data with this code by setting the input schema:
AvroJob.setInputSchema(conf, new AggrDaily().getSchema());
Since Avro data has the schema built into the data itself, I don't want to pass the specific schema to the job explicitly. I achieved this in Pig, and now I want to achieve the same in MapReduce as well.
Can anybody help me achieve this through MR code, or let me know where I am going wrong?
By using the org.apache.hadoop.mapreduce.lib.input.MultipleInputs class, we can read multiple Avro inputs through a single MR job.
We cannot use org.apache.hadoop.mapreduce.lib.input.MultipleInputs to read multiple Avro inputs, because each of the Avro inputs has a schema associated with it and currently the context can store the schema for only one of the inputs, so the other mappers will not be able to read the data.
The same thing is true of HCatInputFormat (because every input has a schema associated with it). However, from HCatalog 0.14 onwards there is a provision for this.
AvroMultipleInputs can be used to accomplish this. It works only with Specific and Reflect mappings, and is available from version 1.7.7 onwards.
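For what it's worth, here is a rough sketch of how AvroMultipleInputs is typically wired with the old (mapred) API. The record and mapper class names are hypothetical, and the call assumes the addInputPath(JobConf, Path, Class, Schema) form as I recall it from Avro 1.7.7; verify against the javadoc for your version.

JobConf conf = new JobConf(getConf(), AvroDriver.class);
// One AvroMapper and one schema per input path. SchemaOne/SchemaTwo are
// hypothetical generated specific records; SchemaOneMapper/SchemaTwoMapper
// are hypothetical AvroMapper subclasses, one per schema.
AvroMultipleInputs.addInputPath(conf, new Path(args[0]),
        SchemaOneMapper.class, SchemaOne.getClassSchema());
AvroMultipleInputs.addInputPath(conf, new Path(args[1]),
        SchemaTwoMapper.class, SchemaTwo.getClassSchema());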
