Hadoop reducer ArrayIndexOutOfBoundsException when passing values from mapper

I'm trying to pass two values from the mapper to the reducer as a single comma-separated string, but when I parse the string in the reducer I get an out-of-bounds error. I build the string in the mapper, so I'm sure it contains two values. What am I doing wrong? How can I pass two values from the mapper to the reducer? (Eventually I need to pass more variables to the reducer, but this makes the problem a bit simpler.)
This is the error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at TotalTime$TimeReducer.reduce(TotalTime.java:57)
at TotalTime$TimeReducer.reduce(TotalTime.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
And this is my code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TotalTime {
public static class TimeMapper extends Mapper<Object, Text, Text, Text> {
Text textKey = new Text();
Text textValue = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String data = value.toString();
String[] field = data.split(",");
if (null != field && field.length == 4) {
String strTimeIn[] = field[1].split(":");
String strTimeOout[] = field[2].split(":");
int timeOn = Integer.parseInt(strTimeIn[0]) * 3600 + Integer.parseInt(strTimeIn[1]) * 60 + Integer.parseInt(strTimeIn[2]);
int timeOff = Integer.parseInt(strTimeOout[0]) * 3600 + Integer.parseInt(strTimeOout[1]) * 60 + Integer.parseInt(strTimeOout[2]);
String v = String.valueOf(timeOn) + "," + String.valueOf(timeOff);
textKey.set(field[0]);
textValue.set(v);
context.write(textKey, textValue);
}
}
}
public static class TimeReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Text textValue = new Text();
int sumTime = 0;
for (Text val : values) {
String line = val.toString();
// Split the string by commas
String[] field = line.split(",");
int timeOn = Integer.parseInt(field[0]);
int timeOff = Integer.parseInt(field[1]);
int time = timeOff - timeOn;
sumTime += time;
}
String v = String.valueOf(sumTime);
textValue.set(v);
context.write(key, textValue);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "User Score");
job.setJarByClass(TotalTime.class);
job.setMapperClass(TimeMapper.class);
job.setCombinerClass(TimeReducer.class);
job.setReducerClass(TimeReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The input file looks like this:
ID2347,15:40:51,16:21:44,20
ID4568,14:27:57,14:58:04,72
ID8755,13:40:49,13:42:31,99
ID3258,13:12:48,13:37:11,73
ID9666,13:44:34,15:53:36,114
ID8755,09:43:59,10:47:52,123
ID3258,10:25:22,10:41:12,14
ID9666,09:40:10,11:44:01,15

It seems the combiner is what makes your code fail. Remember that a combiner is code that runs on the map output before the reducer sees it, and you have registered TimeReducer as the combiner. Now imagine this scenario.
Your mapper processes this line:
ID2347,15:40:51,16:21:44,20
and writes the following output to the context:
[ID2347, (56451,58904)]
Now the combiner comes into play, processes the mapper output before the reducer, and produces this:
[ID2347, 2453]
That record then goes to the reducer, which fails because your code assumes every value looks like val1,val2. Splitting "2453" on commas yields a single element, so field[1] is out of bounds.
If you want your code to work, just remove the combiner (or change your logic so that the combiner emits values in the same format it consumes).
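If you would rather keep a combiner, one possible approach is to have it emit values in the same "timeOn,timeOff" shape that the reducer parses. The sketch below models this with plain Java lists instead of Hadoop types so that it runs on its own. The class and method names (CombinerSafeTotals, combine, reduce) are made up for illustration, and encoding a partial sum as "0,<sum>" is only one convention you could choose; it is not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone model of the combiner/reducer contract (no Hadoop types).
class CombinerSafeTotals {

    // Reducer logic: sums (timeOff - timeOn) over values shaped like "timeOn,timeOff".
    static int reduce(List<String> values) {
        int sumTime = 0;
        for (String line : values) {
            String[] field = line.split(",");
            sumTime += Integer.parseInt(field[1]) - Integer.parseInt(field[0]);
        }
        return sumTime;
    }

    // Combiner logic: collapses several values into ONE value that keeps the
    // "timeOn,timeOff" shape by encoding the partial sum as "0,<sum>".
    static String combine(List<String> values) {
        return "0," + reduce(values);
    }

    public static void main(String[] args) {
        // The two ID8755 records from the sample input, converted to seconds.
        List<String> mapOutput = new ArrayList<>();
        mapOutput.add("49249,49351"); // 13:40:49 -> 13:42:31
        mapOutput.add("35039,38872"); // 09:43:59 -> 10:47:52

        List<String> combined = new ArrayList<>();
        combined.add(combine(mapOutput));

        // Both lines print the same total, with or without the combiner step.
        System.out.println(reduce(mapOutput));
        System.out.println(reduce(combined));
    }
}
```

Because the combined value still splits into two fields, the reducer can parse it whether or not the combiner ran, which matters since Hadoop may run a combiner zero, one, or several times.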

Related

Type mismatch error in hadoop program

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CommonFriends {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private IntWritable friend = new IntWritable();
private Text friends = new Text();
public void map(Object key, Text value, Context context ) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(),"\n");
while (itr.hasMoreTokens()) {
String[] line = itr.nextToken().split(" ");
if(line.length > 2 ){
int person = Integer.parseInt(line[0]);
for(int i=1; i<line.length;i++){
int ifriend = Integer.parseInt(line[i]);
friends.set((person < ifriend ? person+"-"+ifriend : ifriend+"-"+person));
for(int j=1; j< line.length; j++ ){
if( i != j ){
friend.set(Integer.parseInt(line[j]));
context.write(friends, friend);
}
}
}
}
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
HashSet<IntWritable> duplicates = new HashSet();
ArrayList<Integer> tmp = new ArrayList();
for (IntWritable val : values) {
if(duplicates.contains(val))
tmp.add(val.get());
else
duplicates.add(val);
}
result.set(tmp.toString());
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Common Friends");
job.setJarByClass(CommonFriends.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:194)
at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1350)
at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1667)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at CommonFriends$IntSumReducer.reduce(CommonFriends.java:51)
at CommonFriends$IntSumReducer.reduce(CommonFriends.java:38)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1688)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1637)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1489)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:723)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Above are my code and the error message it produces.
Any ideas?
I think the problem is in the configuration of the output classes of the mapper and the reducer.
The input files are lists of numbers.
I can provide more details if needed.
The program finds the common friends between pairs of friends.

Removing job.setCombinerClass(IntSumReducer.class); from your code should solve this problem.

I just had a look at your code; it seems you are using the reducer class as the combiner as well.
Here is what you need to check.
Your combiner takes input of the form <Text, IntWritable>, and because it is IntSumReducer its output is <Text, Text>.
Combiner output is written back as map output, which the job declares as <Text, IntWritable> (and which the reducer expects as its input), so writing a Text value there throws the error. The stack trace agrees: the exception is raised under Task$NewCombinerRunner.combine.
Two things can be done:
1) You might consider changing the output type of the reducer so that it matches the map output types.
2) You might consider writing a separate combiner whose input and output types are both <Text, IntWritable>.
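The type rule behind this can be sketched in plain Java generics, without Hadoop: the combiner must map values of the map output type to values of the same type, while only the reducer is free to change the type. Everything in this sketch (CombinerTypeContract, runReduceSide, commonFriends) is a hypothetical stand-in, and the pass-through combiner is merely the simplest thing that satisfies the contract. The duplicate check reflects my reading of what the question's reducer is meant to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical model: the combiner is V -> V, only the reducer may be V -> OUT.
class CombinerTypeContract {

    // The signature forces the combiner to return the same value type it receives.
    static <V, OUT> OUT runReduceSide(List<V> mapOutput,
                                      Function<List<V>, List<V>> combiner,
                                      Function<List<V>, OUT> reducer) {
        return reducer.apply(combiner.apply(mapOutput));
    }

    // Reducer logic: a friend id seen a second time is common to both people.
    static String commonFriends(List<Integer> friends) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> common = new ArrayList<>();
        for (Integer friend : friends) {
            // Set.add returns false when the id was already present.
            if (!seen.add(friend)) {
                common.add(friend);
            }
        }
        return common.toString();
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = Arrays.asList(3, 4, 5, 4, 6, 3);
        // A pass-through combiner keeps the Integer value type intact.
        String result = runReduceSide(mapOutput, values -> values,
                CombinerTypeContract::commonFriends);
        System.out.println(result);
    }
}
```

Passing commonFriends as the combiner argument would not compile here, which is the compile-time analogue of the runtime "wrong value class" error in the question.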

How to get user input in Hadoop 2.7.5?

I'm trying to make it so that when a user enters a word, the program goes through the txt file and counts all the instances of that word.
I'm using MapReduce and I'm new to it.
I know there is a really simple way to do this, and I've been trying to figure it out for a while.
In this code I'm trying to ask for the user input and have the program go through the file and find the instances.
I've seen some code on Stack Overflow, and someone mentioned that setting the configuration with conf.set("userinput","Data") would help somehow.
There may also be a more up-to-date way to get the user input.
The if statement in my program is an example: when the user's word is entered, it should only find that word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// So I've seen that this is the correct way of setting it up.
// However, I've heard that there are more efficient ways of setting it up as well.
/*
public void setup(Context context) {
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
word.setAccessibleHelp(wordstring);
}
*/
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
if(word=="userinput") {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

I'm not sure about the setup method, but you pass the input at the command line as an argument.
conf.set("mapper.word",args[0]);
Job job =...
// Notice you now need 3 arguments to run this
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
In the mapper or reducer, you can get the string
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
You also need to get the string from the tokenizer before you can compare it, and you need to compare strings, not a string to a Text object.
String wordstring=config.get("mapper.word");
while (itr.hasMoreTokens()) {
String token = itr.nextToken();
if(wordstring.equals(token)) {
word.set(token);
context.write(word, one);
}
}
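The comparison logic can be tried outside Hadoop first. The following is a minimal sketch with a hypothetical class and method name (TargetWordCounter, countMatches); in the real job the target word would come from config.get("mapper.word") and each match would be written to the context instead of being counted locally.

```java
import java.util.StringTokenizer;

// Hypothetical standalone version of the mapper's filtering loop.
class TargetWordCounter {

    // Counts tokens in the line that are exactly equal to the target word.
    static int countMatches(String line, String target) {
        int count = 0;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Read the token first, then compare String to String with equals().
            String token = itr.nextToken();
            if (target.equals(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("the cat and the hat", "the"));
    }
}
```

Note that equals() is case-sensitive and that StringTokenizer splits on whitespace only, so "The" or "the," would not match "the" in this sketch.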

Hadoop Map Reduce program for hashing

I have written a MapReduce program in Hadoop that hashes every record of a file, appends the hash value to each record as an additional attribute, and writes the output to the Hadoop file system.
This is the code I have written:
public class HashByMapReduce
{
public static class LineMapper extends Mapper<Text, Text, Text, Text>
{
private Text word = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
key.set("single");
String line = value.toString();
word.set(line);
context.write(key, word);
}
}
public static class LineReducer
extends Reducer<Text,Text,Text,Text>
{
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
String translations = "";
for (Text val : values)
{
translations = val.toString()+","+String.valueOf(hash64(val.toString())); //Point of Error
result.set(translations);
context.write(key, result);
}
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "Hashing");
job.setJarByClass(HashByMapReduce.class);
job.setMapperClass(LineMapper.class);
job.setReducerClass(LineReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The logic is that each line is read by the map method, which assigns every value to a single key so that they all pass to the same reduce method, which in turn passes each value to the hash64() function.
But I see it is passing a null (empty) value to the hash function. I am unable to figure out why. Thanks in advance.

The cause of the problem is most probably the use of KeyValueTextInputFormat. From the Yahoo tutorial:
InputFormat | Description | Key | Value
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents
KeyValueInputFormat | Parses lines into key, val pairs | Everything up to the first tab character | The remainder of the line
It splits each input line at the first tab character. I suppose there is no tab in your lines. As a result, the key in LineMapper is the whole line and nothing is passed as the value (I'm not sure whether it is null or empty).
Judging from your code, you would be better off using TextInputFormat as your input format; it produces the line offset as the key and the complete line as the value. This should solve your problem.
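The splitting rule from the table can be modelled in a few lines of plain Java. This is only a sketch: TabSplitSketch and splitKeyValue are hypothetical names, not Hadoop API, and it assumes that a line without a tab yields an empty value, which is how I understand the key-value record reader to behave.

```java
// Hypothetical model of "key = text before the first tab, value = the rest".
class TabSplitSketch {

    // Returns a two-element array: {key, value}.
    static String[] splitKeyValue(String line) {
        int pos = line.indexOf('\t');
        if (pos == -1) {
            // No tab: the whole line becomes the key (assumed: value is empty).
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] noTab = splitKeyValue("a,b,c");
        System.out.println("key=[" + noTab[0] + "] value=[" + noTab[1] + "]");

        String[] withTab = splitKeyValue("k\tv1\tv2");
        System.out.println("key=[" + withTab[0] + "] value=[" + withTab[1] + "]");
    }
}
```

With comma-separated input and no tabs, every record lands in the first case, which would explain why the hash function in the question only ever sees an empty value.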
EDIT: I ran your code with the following changes, and it seems to work fine:
1) Changed the input format to TextInputFormat and changed the declaration of the Mapper accordingly.
2) Added proper setMapOutputKeyClass and setMapOutputValueClass calls to the job. These are not mandatory, but leaving them out often causes problems at run time.
3) Removed your key.set("single") and added a private outKey field to the Mapper.
4) Since you provided no details of the hash64 method, I used String.toUpperCase for testing.
If the issue persists, then your hash method most likely doesn't handle null well.
Full code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class HashByMapReduce {
public static class LineMapper extends
Mapper<LongWritable, Text, Text, Text> {
private Text word = new Text();
private Text outKey = new Text("single");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
word.set(line);
context.write(outKey, word);
}
}
public static class LineReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String translations = "";
for (Text val : values) {
translations = val.toString() + ","
+ val.toString().toUpperCase(); // Point of Error
result.set(translations);
context.write(key, result);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Hashing");
job.setJarByClass(HashByMapReduce.class);
job.setMapperClass(LineMapper.class);
job.setReducerClass(LineReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Wordcount on (Key,Value) outputs from a Map Reduce

I have several (title, text) ordered pairs obtained as output from a MapReduce application in Hadoop using Java.
Now I would like to implement word count on the text field of these ordered pairs.
So my final output should look like:
(title-a , word-a-1 , count-a-1 , word-a-2 , count-a-2 ....)
(title-b , word-b-1, count-b-1 , word-b-2 , count-b-2 ....)
.
.
.
.
(title-x , word-x-1, count-x-1 , word-x-2 , count-x-2 ....)
To summarize, I want to implement wordcount separately on the output records from the first MapReduce job. Can someone suggest a good way to do it, or how I can chain a second MapReduce job to create the above output or format it better?
The following is the code; I borrowed it from GitHub and made some changes.
package com.org;
import javax.xml.stream.XMLStreamConstants;//XMLInputFactory;
import java.io.*;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import javax.xml.stream.*;
public class XmlParser11
{
public static class XmlInputFormat1 extends TextInputFormat {
public static final String START_TAG_KEY = "xmlinput.start";
public static final String END_TAG_KEY = "xmlinput.end";
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) {
return new XmlRecordReader();
}
/**
* XMLRecordReader class to read through a given xml document to output
* xml blocks as records as specified by the start tag and end tag
*
*/
// @Override
public static class XmlRecordReader extends
RecordReader<LongWritable, Text> {
private byte[] startTag;
private byte[] endTag;
private long start;
private long end;
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable key = new LongWritable();
private Text value = new Text();
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
FileSplit fileSplit = (FileSplit) split;
// open the file and seek to the start of the split
start = fileSplit.getStart();
end = start + fileSplit.getLength();
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
fsin = fs.open(fileSplit.getPath());
fsin.seek(start);
}
@Override
public boolean nextKeyValue() throws IOException,
InterruptedException {
if (fsin.getPos() < end) {
if (readUntilMatch(startTag, false)) {
try {
buffer.write(startTag);
if (readUntilMatch(endTag, true)) {
key.set(fsin.getPos());
value.set(buffer.getData(), 0,
buffer.getLength());
return true;
}
} finally {
buffer.reset();
}
}
}
return false;
}
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException,
InterruptedException {
return value;
}
@Override
public void close() throws IOException {
fsin.close();
}
@Override
public float getProgress() throws IOException {
return (fsin.getPos() - start) / (float) (end - start);
}
private boolean readUntilMatch(byte[] match, boolean withinBlock)
throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
// end of file:
if (b == -1)
return false;
// save to buffer:
if (withinBlock)
buffer.write(b);
// check if we're matching:
if (b == match[i]) {
i++;
if (i >= match.length)
return true;
} else
i = 0;
// see if we've passed the stop point:
if (!withinBlock && i == 0 && fsin.getPos() >= end)
return false;
}
}
}
}
public static class Map extends Mapper<LongWritable, Text,Text, Text> {
@Override
protected void map(LongWritable key, Text value,
Mapper.Context context)
throws
IOException, InterruptedException {
String document = value.toString();
System.out.println("'" + document + "'");
try {
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(new
ByteArrayInputStream(document.getBytes()));
String propertyName = "";
String propertyValue = "";
String currentElement = "";
while (reader.hasNext()) {
int code = reader.next();
switch (code) {
case XMLStreamConstants.START_ELEMENT: //START_ELEMENT:
currentElement = reader.getLocalName();
break;
case XMLStreamConstants.CHARACTERS: //CHARACTERS:
if (currentElement.equalsIgnoreCase("title")) {
propertyName += reader.getText();
//System.out.println(propertyName);
} else if (currentElement.equalsIgnoreCase("text")) {
propertyValue += reader.getText();
//System.out.println(propertyValue);
}
break;
}
}
reader.close();
context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
}
catch(Exception e){
throw new IOException(e);
}
}
}
public static class Reduce
extends Reducer<Text, Text, Text, Text> {
@Override
protected void setup(
Context context)
throws IOException, InterruptedException {
context.write(new Text("<Start>"), null);
}
@Override
protected void cleanup(
Context context)
throws IOException, InterruptedException {
context.write(new Text("</Start>"), null);
}
private Text outputKey = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context)
throws IOException, InterruptedException {
for (Text value : values) {
outputKey.set(constructPropertyXml(key, value));
context.write(outputKey, null);
}
}
public static String constructPropertyXml(Text name, Text value) {
StringBuilder sb = new StringBuilder();
sb.append("<property><name>").append(name)
.append("</name><value>").append(value)
.append("</value></property>");
return sb.toString();
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<page>");
conf.set("xmlinput.end", "</page>");
Job job = new Job(conf);
job.setJarByClass(XmlParser11.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(XmlParser11.Map.class);
job.setReducerClass(XmlParser11.Reduce.class);
job.setInputFormatClass(XmlInputFormat1.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The wordcount code we find online counts the words across all files and outputs the result. I want to do the word count for each text field separately. The above mapper is used to pull the title and text from an XML document. Is there any way I can do the word count in the same mapper? If I do that, my next question is how to pass the counts along with the already existing key-value pairs (title, text) to the reducer. Sorry, I am not able to phrase my question properly, but I hope the reader gets the idea.

I'm not sure I have understood the question properly, so my answer comes with several questions of its own.
First of all, whoever wrote this code was probably trying to show how to write a custom InputFormat to process XML data using MR. I don't see how it relates to your problem.
"To summarize, I want to implement wordcount separately on the output records from first mapreduce. Can someone suggest me a good way to do it"
Read the output file generated by the first MR job and run the word count on it.
"or how I can chain a second map reduce job to create the above output or format it better?"
You can definitely chain jobs together in this fashion by writing multiple driver methods, one for each job. See this for more details and this for an example.
"I want to do the wordcount for each of text fields separately."
What do you mean by separately? In the traditional wordcount program, the count of each word is calculated independently of the others.
"Is there any way I can do the wordcount in the same mapper."
I hope you have understood the wordcount program properly. In the traditional wordcount program you read the input file one line at a time, split the line into words, and then emit each word as the key with 1 as the value. All of this already happens inside a single Mapper. The total count for each word is then determined in the Reducer part of your job. If you wish to emit the words with their total counts from the mapper itself, you have to read the whole file in the Mapper and do the counting there. For that, override isSplitable in your InputFormat to return false, so that each input file is read as a whole and goes to just one Mapper.
When you emit something from the Mapper and the job is not a map-only job, the Mapper output automatically goes to the Reducer. Do you need something else?
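If "separately" means one set of counts per (title, text) record, the counting can be done record by record, since each record already carries its own title. Here is a minimal sketch in plain Java, under that assumption; PerTitleWordCount and countWords are hypothetical names, and the output format simply imitates the one requested in the question. In a real job the same loop could run inside map() on each record's text.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical per-record word count: one output line per (title, text) pair.
class PerTitleWordCount {

    // Builds "(title, word-1, count-1, word-2, count-2, ...)" for one record.
    static String countWords(String title, String text) {
        // TreeMap keeps the words sorted, so the output order is deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder("(").append(title);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            sb.append(", ").append(entry.getKey())
              .append(", ").append(entry.getValue());
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(countWords("title-a", "to be or not to be"));
    }
}
```

This keeps each title's counts together without a second job, at the cost of holding one record's word map in memory at a time.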

I suggest using regular expressions to perform the mapping and grouping.
The Hadoop examples jar provides a Grep class; with it you can map your HDFS data using a regular expression and then group the mapped data.

Unexpected output from Hadoop word count

I modified the code below to output words which occurred at least ten times. But it does not work -- the output file does not change at all. What do I have to do to make it work?
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
// ...
public class WordCount extends Configured implements Tool {
// ...
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// This is the part I modified, but it is not working: the output file didn't change
if(sum >= 10)
{
context.write(key, new IntWritable(sum));
}
}
}
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
//job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new WordCount(), args);
System.exit(ret);
}
}

The code looks completely valid. Could it be that your dataset is big enough that every word happens to appear more than 10 times?
Please also make sure that you are indeed looking at the new results.

You can check the default Hadoop counters to get an idea of what's happening.

The code looks valid.
To be able to help you, we need at least the command line you used to run this. It would also help if you could post the actual output when you feed it a file like this:
one
two two
three three three
and so on, up to 20.
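A test file of that shape, and the result the modified reducer should produce for it, can be worked out locally before touching the cluster. This is a rough sketch with made-up names (ThresholdCheck, buildInput, countAtLeast); it uses tokens "w1" to "w20" instead of number words, so that line n repeats "w<n>" exactly n times.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local check of the "at least ten occurrences" filter.
class ThresholdCheck {

    // Line n contains the token "w<n>" repeated n times.
    static String buildInput(int maxRepeats) {
        StringBuilder sb = new StringBuilder();
        for (int n = 1; n <= maxRepeats; n++) {
            for (int i = 0; i < n; i++) {
                sb.append("w").append(n).append(' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Counts every token, then keeps only words seen at least `threshold` times.
    static Map<String, Integer> countAtLeast(String input, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(input);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < threshold);
        return counts;
    }

    public static void main(String[] args) {
        // Only w10 through w20 should survive the filter.
        System.out.println(countAtLeast(buildInput(20), 10));
    }
}
```

If the job's output for the same input still lists w1 through w9, the cluster is most likely running an old jar or you are reading an old output directory.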

The code is definitely correct. Maybe you are reading output that was generated before you modified the code, or maybe you did not rebuild the jar file after modifying the code?
