I've been trying to run a project that I found on the internet and adapted for my purposes.
The map function is called and works properly (I checked its results on the console), but reduce is never called.
The first two digits are the key and the rest is the value.
I've checked that the map output and reduce input key/value pairs match, changed them many times, and tried different things, but I couldn't find a solution.
Since I'm a beginner in this topic, it's probably a small mistake. I wrote another project and ran into the same problem again: reduce is not called.
I also tried changing the output value class of reduce to IntWritable or Text instead of MedianStdDevTuple and configured the job accordingly, but nothing changed.
I don't just need the solution; I'd like to know the reason as well. Thanks.
Here is the code:
package usercommend;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.logging.Log;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.htrace.commons.logging.LogFactory;
import usercommend.starter.map;
public class starter extends Configured implements Tool {
public static void main (String[] args) throws Exception{
int res =ToolRunner.run(new starter(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Job job=Job.getInstance(getConf(),"starter");
job.setJarByClass(this.getClass());
job.setMapperClass(map.class);
job.setReducerClass(reduces.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(MedianStdDevTuple.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
public static class map extends Mapper<LongWritable, Text,IntWritable, IntWritable> {
private IntWritable outHour = new IntWritable();
private IntWritable outCommentLength = new IntWritable();
private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
@SuppressWarnings("deprecation")
@Override
public void map(LongWritable key , Text value,Context context) throws IOException, InterruptedException
{
//System.err.println(value.toString()+"vv");
Map<String, String> parsed = transforXmlToMap1(value.toString());
//System.err.println("1");
String strDate = parsed.get("CreationDate");
//System.err.println(strDate);
String text = parsed.get("Text");
//System.err.println(text);
Date creationDate=new Date();
try {
// System.err.println("basla");
creationDate = frmt.parse(strDate);
outHour.set(creationDate.getHours());
outCommentLength.set(text.length());
System.err.println(outHour+""+outCommentLength);
context.write(outHour, outCommentLength);
} catch (ParseException e) {
// TODO Auto-generated catch block
System.err.println("catch");
e.printStackTrace();
return;
}
//context.write(new IntWritable(2), new IntWritable(12));
}
public static Map<String,String> transforXmlToMap1(String xml) {
Map<String, String> map = new HashMap<String, String>();
try {
String[] tokens = xml.trim().substring(5, xml.trim().length()-3).split("\"");
for(int i = 0; i < tokens.length-1 ; i+=2) {
String key = tokens[i].trim();
String val = tokens[i+1];
map.put(key.substring(0, key.length()-1),val);
//System.err.println(val.toString());
}
} catch (StringIndexOutOfBoundsException e) {
System.err.println(xml);
}
return map;
}
}
public static class reduces extends Reducer<IntWritable, IntWritable, IntWritable, MedianStdDevTuple> {
private MedianStdDevTuple result = new MedianStdDevTuple();
private ArrayList<Float> commentLengths = new ArrayList<Float>();
Log log=(Log) LogFactory.getLog(this.getClass());
@Override
public void reduce(IntWritable key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
System.out.println("1");
log.info("aa");
float sum = 0;
float count = 0;
commentLengths.clear();
result.setStdDev(0);
for(IntWritable val : values) {
commentLengths.add((float)val.get());
sum+=val.get();
++count;
}
Collections.sort(commentLengths);
if(count % 2 ==0) {
result.setMedian((commentLengths.get((int)count / 2 -1)+
commentLengths.get((int) count / 2)) / 2.0f);
} else {
result.setMedian(commentLengths.get((int)count / 2));
}
double avg = sum/commentLengths.size();
double totalSquare = 0;
for(int i =0 ;i<commentLengths.size();i++) {
double diff = commentLengths.get(i)-avg;
totalSquare += (diff*diff);
}
double stdSapma= Math.sqrt(totalSquare/(commentLengths.size()));
result.setStdDev(stdSapma);
context.write(key, result);
}
}
}
Sample input:
<row Id="2" PostId="7" Score="0" Text="I see what you mean, but I've had Linux systems set up so that if the mouse stayed on a window for a certain time period (greater than zero), then that window became active. That would be one solution. Another would be to simply let clicks pass to whatever control they are over, whether it is in the currently active window or not. Is that doable?" CreationDate="2010-08-17T19:38:20.410" UserId="115" />
<row Id="3" PostId="13" Score="1" Text="I am using Iwork and OpenOffice right now But I need some features that just MS has it." CreationDate="2010-08-17T19:42:04.487" UserId="135" />
<row Id="4" PostId="17" Score="0" Text="I've been using that on my MacBook Pro since I got it, with no issues. Last week I got an iMac and immediately installed StartSound.PrefPane but it doesn't work -- any ideas why? The settings on the two machines are identical (except the iMac has v1.1b3 instead of v1.1b2), but one is silent at startup and the other isn't...." CreationDate="2010-08-17T19:42:15.097" UserId="115" />
<row Id="5" PostId="6" Score="0" Text="+agreed. I would add that I think you can choose to not clone everything so it takes less time to make a bootable volume" CreationDate="2010-08-17T19:44:00.270" UserId="2" />
<row Id="6" PostId="22" Score="2" Text="Applications are removed from memory by the OS at it's discretion. Just because they are in the 'task manager' does not imply they are running and in memory. I have confirmed this with my own apps.
After a reboot, these applications are not reloaded until launched by a user." CreationDate="2010-08-17T19:46:01.950" UserId="589" />
<row Id="7" PostId="7" Score="0" Text="Honestly, I don't know. It's definitely interesting though. I'm currently scouring Google, since it would save on input clicks. I'm just concerned that any solution might get a little "hack-y" and not behave consistently in all UI elements or applications. The last thing I'd want is to not know if I'm focusing a window or pressing a button :(" CreationDate="2010-08-17T19:50:00.723" UserId="421" />
<row Id="8" PostId="21" Score="3" Text="Could you expand on the features for those not familiar with ShakesPeer?" CreationDate="2010-08-17T19:51:11.953" UserId="581" />
<row Id="9" PostId="23" Score="1" Text="Apple's vernacular is Safe Sleep." CreationDate="2010-08-17T19:51:35.557" UserId="171" />
You took this code? I'm guessing the problem is that you did not set the correct inputs and outputs for the Job.
Here is what you are trying to do based on your class definitions.
Map Input: (Object, Text)
Map Output: (IntWritable, IntWritable)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (IntWritable, MedianStdDevTuple)
But, based on your Job configuration
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(MedianStdDevTuple.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
It thinks you want to do this
Map Input: (Object, Text) -- I think it's actually LongWritable instead of Object, though, for file-split offsets
Map Output: (IntWritable,MedianStdDevTuple)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (Text,IntWritable)
Notice how those are different? Your reducer is expecting to read in IntWritable instead of MedianStdDevTuple, and the outputs are also of the incorrect class; therefore, it doesn't run.
To fix, change your job configuration like so
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(MedianStdDevTuple.class);
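Note that for job.setOutputValueClass(MedianStdDevTuple.class) to work, MedianStdDevTuple also has to implement Writable so Hadoop can serialize it and TextOutputFormat can print it. A minimal sketch of such a class (the field names and the tab-separated toString format are my assumptions, not taken from your project):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MedianStdDevTuple implements Writable {
    private float median = 0;
    private float stdDev = 0;

    public void setMedian(float median) { this.median = median; }
    public void setStdDev(double stdDev) { this.stdDev = (float) stdDev; }
    public float getMedian() { return median; }
    public float getStdDev() { return stdDev; }

    public void readFields(DataInput in) throws IOException {
        median = in.readFloat();
        stdDev = in.readFloat();
    }

    public void write(DataOutput out) throws IOException {
        out.writeFloat(median);
        out.writeFloat(stdDev);
    }

    @Override
    public String toString() {
        // TextOutputFormat calls toString() on the value when writing the final output
        return median + "\t" + stdDev;
    }
}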
Edit: I got it to run fine; the only thing I really changed, apart from the code in the link above, was this method in the mapper class.
public static Map<String, String> transforXmlToMap1(String xml) {
Map<String, String> map = new HashMap<String, String>();
try {
String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
.split("\"");
for (int i = 0; i < tokens.length - 1; i += 2) {
String key = tokens[i].replaceAll("[= ]", "");
String val = tokens[i + 1];
map.put(key, val);
// System.err.println(val.toString());
}
} catch (StringIndexOutOfBoundsException e) {
System.err.println(xml);
}
return map;
}
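For a quick sanity check, you can call it directly from a main method or a unit test with one of the sample rows; with the cleaned-up keys, the lookups the mapper relies on now resolve:
Map<String, String> m = transforXmlToMap1(
        "<row Id=\"9\" PostId=\"23\" Score=\"1\" Text=\"Apple's vernacular is Safe Sleep.\" CreationDate=\"2010-08-17T19:51:35.557\" UserId=\"171\" />");
System.out.println(m.get("CreationDate"));   // 2010-08-17T19:51:35.557
System.out.println(m.get("Text").length());  // 33, the comment length the mapper emits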
Related
There is an NCDC weather data set example in Hadoop: The Definitive Guide.
The Mapper class code is as follows:
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
And the driver code is:
Example 2-5. Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm not able to understand why there is no iteration over the lines, given that we pass a file containing multiple lines. The code looks as if it processes only a single line.
The book explains what Mapper<LongWritable, Text, ...> means: the key is the offset within the file, and the value is a single line.
It also mentions that TextInputFormat is the default MapReduce input format, which is a subclass of FileInputFormat
public class TextInputFormat
extends FileInputFormat<LongWritable,Text>
And therefore, the default input types must be LongWritable, Text pairs.
As the JavaDoc says
Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
The book also has sections on defining custom RecordReaders
You need to call job.setInputFormatClass to change it to read anything other than single lines.
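For example, if your input were tab-separated key/value lines rather than plain lines, a sketch of the change in the driver would be (the mapper would then have to be declared as Mapper<Text, Text, ...> instead of Mapper<LongWritable, Text, ...>):
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

job.setInputFormatClass(KeyValueTextInputFormat.class);
Without such a call, TextInputFormat is used, each call to map() receives exactly one line of the split, and the framework itself loops over the lines; that is why MaxTemperatureMapper needs no explicit loop.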
Problem Statement:
Input
Monami 45000 A
Tarun 34000 B
Riju 25000 C
Rita 42000 A
Mithun 40000 A
Archana 21000 C
Shovik 32000 B
I want to use Custom Partitioner in Mapreduce to separate employee records with grade A, B and C in three different output files.
Output 1
Monami 45000 A
Rita 42000 A
Mithun 40000 A
Output 2
Tarun 34000 B
Shovik 32000 B
Output 3
Riju 25000 C
Archana 21000 C
Map Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class Map
extends Mapper<LongWritable,Text,Text,Text>
{
//private Text key1 = new Text();
//private Text value1 = new Text();
@Override
protected void map(LongWritable key,Text value,Context context)
throws IOException,InterruptedException
{
String line = value.toString();
String[] part = line.split("\t");
int len = part.length;
//System.out.println(len);
if (len == 3)
{
context.write(new Text(part[2]), new Text(part[0]+"\t"+part[1]));
//System.out.println(part[0]+part[1]+part[2]);
}
}
}
Partitioner Code
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class CustomPartitioner
extends Partitioner<Text,Text>
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks)
{
if(numReduceTasks==0)
return 0;
if(key.equals(new Text("A")))
return 0;
if(key.equals(new Text("B")))
return 1;
else
return 2;
}
}
Reduce Code
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class Reduce
extends Reducer<Text,Text,Text,Text>
{
@Override
protected void reduce(Text key,Iterable<Text> values,Context context)
throws IOException,InterruptedException
{
Iterator<Text> itr = values.iterator();
while(itr.hasNext())
{
context.write(new Text(itr.next().getBytes()),new Text(key));
}
}
}
Driver Class
import org.apache.hadoop.fs.Path;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MapReduceDriver
{
public static void main(String[] args) throws Exception
{
Job job = new Job();
job.setJarByClass(MapReduceDriver.class);
job.setJobName("Custom Partitioner");
FileInputFormat.addInputPath(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
job.setMapperClass(Map.class);
job.setPartitionerClass(CustomPartitioner.class);
job.setReducerClass(Reduce.class);
job.setNumReduceTasks(3);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true)?0:1);
}
}
The code runs without any errors, but the three reduce output files are empty. Also, when the job runs, it shows the map output bytes as zero, so I believe the map is not generating any key-value pairs, but I cannot find the reason. Can you help me find the mistake?
I also have one more point of confusion: in the Map class, when the variable len is checked for > 0, I get an ArrayIndexOutOfBoundsException, but it runs fine without any exception when checked with == 3. Why does it throw an exception with > 0?
The problem is that your input data (as pasted here) is not tab-separated, but space-separated. It should work fine if you replace this line:
String[] part = line.split("\t");
with this line:
String[] part = line.split(" ");
The reason you are getting an exception when you check for len > 0 is that your string is not split into any sub-parts, so len is 1. That satisfies the if condition, and the code then tries to access index 2 of part, which does not exist.
With the existing check, len is not 3, so the code never enters the if block, and hence no exception is thrown.
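You can see the difference with a tiny standalone check (the record is copied from the input above):
String line = "Monami 45000 A";
System.out.println(line.split("\t").length);  // 1 -> the whole line ends up in part[0]
System.out.println(line.split(" ").length);   // 3 -> {"Monami", "45000", "A"}
If the columns may be separated by a variable amount of whitespace, splitting on "\\s+" is a more forgiving choice.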
I'm trying to reduce a map like this:
01 true
01 true
01 false
02 false
02 false
where the first column is Text and the second is BooleanWritable. The aim is to keep only those keys that have nothing but false next to them, and then write the key's digits as a pair (so the output for the above input would be 0, 2). For this, I wrote the following reducer:
import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class BeadReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text _key, Iterable<BooleanWritable> values, Context context) throws IOException, InterruptedException {
// process values
boolean dontwrite= false;
for (BooleanWritable val : values) {
dontwrite = (dontwrite || val.get());
}
if (!dontwrite) {
context.write(new Text(_key.toString().substring(0,1)), new Text(_key.toString().substring(1,2)));
}
else {
context.write(new Text("not"), new Text("good"));
}
}
}
This, however, does nothing: it writes neither the pairs nor "not good", as if it never even enters the if-else branch. All I get is the mapped values (the mapping itself works as intended).
The driver:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BeadDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "task2");
job.setJarByClass(hu.pack.task2.BeadDriver.class);
// TODO: specify a mapper
job.setMapperClass(hu.pack.task2.BeadMapper.class);
// TODO: specify a reducer
job.setReducerClass(hu.pack.task2.BeadReducer.class);
// TODO: specify output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BooleanWritable.class);
// TODO: specify input and output DIRECTORIES (not files)
FileInputFormat.setInputPaths(job, new Path("local"));
FileOutputFormat.setOutputPath(job, new Path("outfiles"));
FileSystem fs;
try {
fs = FileSystem.get(conf);
if (fs.exists(new Path("outfiles")))
fs.delete(new Path("outfiles"),true);
} catch (IOException e1) {
e1.printStackTrace();
}
if (!job.waitForCompletion(true))
return;
}
}
The mapper:
import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class BeadMapper extends Mapper<LongWritable, Text, Text, BooleanWritable > {
private final Text wordKey = new Text("");
public void map(LongWritable ikey, Text value, Context context) throws IOException, InterruptedException {
String[] friend = value.toString().split(";");
String[] friendswith = friend[1].split(",");
for (String s : friendswith) {
wordKey.set(friend[0] + s);
context.write(wordKey, new BooleanWritable(true));
wordKey.set(s + friend[0]);
context.write(wordKey, new BooleanWritable(true));
}
if (friendswith.length > 0) {
for(int i = 0; i < friendswith.length-1; ++i) {
for(int j = i+1; j < friendswith.length; ++j) {
wordKey.set(friendswith[i] + friendswith[j]);
context.write(wordKey, new BooleanWritable(false));
}
}
}
}
}
I wonder what the problem is; what am I missing?
The output key and value types of the mapper must be the input types of the reducer; therefore, in your case, the reducer must inherit from
Reducer<Text, BooleanWritable, Text, BooleanWritable>
The setOutputKeyClass and setOutputValueClass set the types for the job output i.e. both map and reduce. If you want to specify a different type for mapper, you should use the methods setMapOutputKeyClass and setMapOutputValueClass.
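If you keep your reducer emitting (Text, Text), as your reduce method currently does, a sketch of the matching configuration would be:
// BeadReducer should then extend Reducer<Text, BooleanWritable, Text, Text>;
// its signature stays reduce(Text _key, Iterable<BooleanWritable> values, Context context)

job.setMapOutputKeyClass(Text.class);               // mapper emits (Text, BooleanWritable)
job.setMapOutputValueClass(BooleanWritable.class);
job.setOutputKeyClass(Text.class);                  // reducer emits (Text, Text)
job.setOutputValueClass(Text.class);
Note also that, as written, your reduce method does not match the generic parameters of Reducer<Text, Text, Text, Text>, so it never overrides Reducer.reduce; the default identity reduce runs instead, which is exactly why you only see the mapped values in the output.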
As a side note, if you don't want the true values in the output, why emit them from the mapper at all? Also, with the code below in the reducer,
for (BooleanWritable val : values) {
dontwrite = (dontwrite || val.get());
}
once dontwrite becomes true, it stays true until the end of the loop. You may want to change your logic to stop early as an optimization.
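For example, a minimal change that is functionally equivalent but stops at the first true:
for (BooleanWritable val : values) {
    if (val.get()) {       // a single true is enough to suppress this key
        dontwrite = true;
        break;             // no need to scan the remaining values
    }
}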
This question already has an answer here:
Output a list from a Hadoop Map Reduce job using custom writable
(1 answer)
Closed 7 years ago.
How can we pass an ArrayList as the value from the mapper to the reducer?
My code basically has certain rules to work with and creates new values (Strings) based on those rules. I keep all the outputs (generated after the rule execution) in a list and now need to send this output (the mapper value) to the reducer, but I don't have a way to do so.
Can someone please point me in the right direction?
Adding Code
package develop;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import utility.RulesExtractionUtility;
public class CustomMap{
public static class CustomerMapper extends Mapper<Object, Text, Text, Text> {
private Map<String, String> rules;
@Override
public void setup(Context context)
{
try
{
URI[] cacheFiles = context.getCacheFiles();
setupRulesMap(cacheFiles[0].toString());
}
catch (IOException ioe)
{
System.err.println("Error reading state file.");
System.exit(1);
}
}
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// Map<String, String> rules = new LinkedHashMap<String, String>();
// rules.put("targetcolumn[1]", "ASSIGN(source[0])");
// rules.put("targetcolumn[2]", "INCOME(source[2]+source[3])");
// rules.put("targetcolumn[3]", "ASSIGN(source[1]");
// Above is the "rules", which would basically create some list values from source file
String [] splitSource = value.toString().split(" ");
List<String>lists=RulesExtractionUtility.rulesEngineExecutor(splitSource,rules);
// lists would have values like (name, age) for each line of a huge text file, which is what I want to write to the context and pass to the reducer.
// As of now I haven't implemented the reducer code, as I'm stuck on passing the value from the mapper.
// context.write(new Text(), lists); ---- I do not have a way of doing this
}
private void setupRulesMap(String filename) throws IOException
{
Map<String, String> rule = new LinkedHashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader(filename));
String line = reader.readLine();
while (line != null)
{
String[] split = line.split("=");
rule.put(split[0], split[1]);
line = reader.readLine();
// rules logic
}
rules = rule;
}
}
public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: customerMapper <in> <out>");
System.exit(2);
}
Job job = Job.getInstance(conf);
job.setJarByClass(CustomMap.class);
job.setMapperClass(CustomerMapper.class);
job.addCacheFile(new URI("Some HDFS location"));
URI[] cacheFiles= job.getCacheFiles();
if(cacheFiles != null) {
for (URI cacheFile : cacheFiles) {
System.out.println("Cache file ->" + cacheFile);
}
}
// job.setReducerClass(Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To pass an ArrayList from the mapper to the reducer, it's clear that the objects must implement the Writable interface. Why don't you try this library?
<dependency>
<groupId>org.apache.giraph</groupId>
<artifactId>giraph-core</artifactId>
<version>1.1.0-hadoop2</version>
</dependency>
It has an abstract class:
public abstract class ArrayListWritable<M extends org.apache.hadoop.io.Writable>
extends ArrayList<M>
implements org.apache.hadoop.io.Writable, org.apache.hadoop.conf.Configurable
You could create your own class, filling in the abstract methods and implementing the interface methods with your own code. For instance:
public class MyListWritable extends ArrayListWritable<Text>{
...
}
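Then the mapper could emit the list directly. A sketch, assuming MyListWritable has a no-argument constructor that registers Text as the element class (as giraph's ArrayListWritable requires) and that the driver sets job.setMapOutputValueClass(MyListWritable.class); the choice of splitSource[0] as the key is only an illustration:
MyListWritable out = new MyListWritable();
for (String s : lists) {
    out.add(new Text(s));
}
context.write(new Text(splitSource[0]), out);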
A way to do that (probably neither the only nor the best one) would be to:
serialize your list into a string to pass it to the output value in the mapper
deserialize and rebuild your list from the string when you read the input value in the reducer
If you do so, then you should also get rid of all special symbols in the string containing the serialized list (symbols like \n or \t, for instance). An easy way to achieve that is to use base64-encoded strings, for example:
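A sketch of that round trip, assuming Java 8 (for String.join and java.util.Base64), a separator character that cannot occur in your data, and splitSource[0] as a purely illustrative key:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// mapper side: flatten the list into a single encoded token
String joined = String.join("\u0001", lists);
String encoded = Base64.getEncoder()
        .encodeToString(joined.getBytes(StandardCharsets.UTF_8));
context.write(new Text(splitSource[0]), new Text(encoded));

// reducer side: rebuild the list from each Text value
String decoded = new String(Base64.getDecoder().decode(value.toString()),
        StandardCharsets.UTF_8);
List<String> restored = Arrays.asList(decoded.split("\u0001"));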
You should send Text objects instead of String objects; then you can call toString() on them in your Reducer. Be sure to configure your driver properly.
If you post your code, we can help you further.
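For instance, still inside your map method (the comma separator is an assumption; pick one that cannot occur in your values):
context.write(new Text(splitSource[0]), new Text(String.join(",", lists)));
In the reducer, value.toString().split(",") gives you the strings back.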
I am trying to run the word count example on AWS EMR, but I am having a hard time deploying and running the jar on the cluster. It's a customized word count example in which I do some JSON parsing. The input is in my S3 bucket. When I try to run my job on the EMR cluster, I get an error saying that no main function was found in my Mapper class. Everywhere on the internet, the code for the word count MapReduce job has three classes: a static mapper class that extends Mapper, a reducer that extends Reducer, and a main class that contains the job configuration, so I am not sure why I am seeing this error. I build my code using the Maven assembly plugin so as to bundle all the third-party dependencies in my JAR. Here is the code I have written:
package com.amalwa.hadoop.MapReduce;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.google.gson.Gson;
public class ETL{
public static void main(String[] args) throws Exception{
if (args.length < 2) {
System.err.println("Usage: ETL <input path> <output path>");
System.exit(-1);
}
Configuration conf = new Configuration();
Job job = new Job(conf, "etl");
job.setJarByClass(ETL.class);
job.setMapperClass(JsonParserMapper.class);
job.setReducerClass(JsonParserReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(TweetArray.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
}
public static class JsonParserMapper extends Mapper<LongWritable, Text, Text, Text>{
private Text mapperKey = null;
private Text mapperValue = null;
Date filterDate = getDate("Sun Apr 20 00:00:00 +0000 2014");
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String jsonString = value.toString();
if(!jsonString.isEmpty()){
@SuppressWarnings("unchecked")
Map<String, Object> tweetData = new Gson().fromJson(jsonString, HashMap.class);
Date timeStamp = getDate(tweetData.get("created_at").toString());
if(timeStamp.after(filterDate)){
@SuppressWarnings("unchecked")
com.google.gson.internal.LinkedTreeMap<String, Object> userData = (com.google.gson.internal.LinkedTreeMap<String, Object>) tweetData.get("user");
mapperKey = new Text(userData.get("id_str") + "~" + tweetData.get("created_at").toString());
mapperValue = new Text(tweetData.get("text").toString() + " tweetId = " + tweetData.get("id_str"));
context.write(mapperKey, mapperValue);
}
}
}
public Date getDate(String timeStamp){
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("E MMM dd HH:mm:ss Z yyyy");
Date date = null;
try {
date = simpleDateFormat.parse(timeStamp);
} catch (ParseException e) {
e.printStackTrace();
}
return date;
}
}
public static class JsonParserReducer extends Reducer<Text, Text, Text, TweetArray> {
private ArrayList<Text> tweetList = new ArrayList<Text>();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text val : values) {
tweetList.add(new Text(val.toString()));
}
context.write(key, new TweetArray(Text.class, tweetList.toArray(new Text[tweetList.size()])));
}
}
}
If someone could clarify this problem, that would be really nice. I have deployed this jar on my local machine, where I installed Hadoop, and it works fine, but when I set up my cluster on AWS and give the streaming job all the parameters, it doesn't work. Here is my configuration:
The Mapper textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserMapper
The Reducer textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserReducer
Thanks and regards.
You need to select a custom JAR step instead of a streaming program.
When you make the jar file (I usually do it using Eclipse or a custom Gradle build), check whether your main class is set to ETL. Apparently, that does not happen by default. Also check the Java version you are using on your system; I think AWS EMR works with up to Java 7.