I am trying to reverse the contents of a file word by word. The program runs fine, but the output I am getting looks like this:
1 dwp
2 seviG
3 eht
4 tnerruc
5 gnikdrow
6 yrotcerid
7 ridkm
8 desU
9 ot
10 etaerc
I want the output to be something like this:
dwp seviG eht tnerruc gnikdrow yrotcerid ridkm desU
ot etaerc
The code I am working with:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Reproduce {
public static int temp =0;
public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text>{
private Text word = new Text();
@Override
public void map(LongWritable arg0, Text value,
OutputCollector<IntWritable, Text> output, Reporter reporter)
throws IOException {
String line = value.toString().concat("\n");
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
temp++;
output.collect(new IntWritable(temp),word);
}
}
}
public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>{
@Override
public void reduce(IntWritable arg0, Iterator<Text> arg1,
OutputCollector<IntWritable, Text> arg2, Reporter arg3)
throws IOException {
String word = arg1.next().toString();
Text word1 = new Text();
word1.set(word);
arg2.collect(arg0, word1);
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(ReproduceMap.class);
conf.setReducerClass(ReproduceReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
How do I modify my output here, instead of writing another Java program to do it?
Thanks in advance.
Here is a simple example demonstrating the use of a custom FileOutputFormat:
public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
@Override
public org.apache.hadoop.mapreduce.RecordWriter<Text, List<IntWritable>> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
//get the current path
Path path = FileOutputFormat.getOutputPath(arg0);
//create the full path with the output directory plus our filename
Path fullPath = new Path(path, "result.txt");
//create the file in the file system
FileSystem fs = path.getFileSystem(arg0.getConfiguration());
FSDataOutputStream fileOut = fs.create(fullPath, arg0);
//create our record writer with the new file
return new MyCustomRecordWriter(fileOut);
}
}
public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
private DataOutputStream out;
public MyCustomRecordWriter(DataOutputStream stream) {
out = stream;
try {
out.writeBytes("results:\r\n");
}
catch (Exception ex) {
}
}
@Override
public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
//close our file
out.close();
}
@Override
public void write(Text arg0, List<IntWritable> arg1) throws IOException, InterruptedException {
//write out our key
out.writeBytes(arg0.toString() + ": ");
//loop through all values associated with our key and write them with commas between
for (int i=0; i<arg1.size(); i++) {
if (i>0)
out.writeBytes(",");
out.writeBytes(String.valueOf(arg1.get(i)));
}
out.writeBytes("\r\n");
}
}
Finally, we need to tell our job about our output format and the output path before running it:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));
We can customize the output by writing a custom FileOutputFormat class.
You can use NullWritable as the output value. NullWritable is just a placeholder, since you don't want the number to be displayed as part of your output. Here is the modified reducer class. Note: you need to add an import statement for NullWritable.
public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, Text, NullWritable>{
@Override
public void reduce(IntWritable arg0, Iterator<Text> arg1,
OutputCollector<Text, NullWritable> arg2, Reporter arg3)
throws IOException {
String word = arg1.next().toString();
Text word1 = new Text();
word1.set(word);
arg2.collect(word1, NullWritable.get());
}
}
And change the driver class (main method) accordingly:
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(NullWritable.class);
In the mapper, the key temp is incremented for each word, so each word is processed as a separate key-value pair.
The steps below should solve the problem (see the sketch after this list):
1) In the mapper, just remove temp++, so that all the reversed words share the key 0 (temp = 0).
2) The reducer then receives the key 0 and the list of reversed strings. In the reducer, set the output key to NullWritable and write the output.
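As a rough illustration of those two steps, using the same old mapred API as the question (only the NullWritable import is new, and joining the words into one line is an assumption based on the desired output shown above):

// Mapper: no temp++, every reversed word is emitted under the constant key 0
public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
    private final Text word = new Text();
    private final IntWritable zero = new IntWritable(0);

    @Override
    public void map(LongWritable arg0, Text value,
            OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
            output.collect(zero, word);
        }
    }
}

// Reducer: receives all reversed words under key 0, joins them, and writes them under a NullWritable key
public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
            joined.append(values.next().toString()).append(' ');
        }
        output.collect(NullWritable.get(), new Text(joined.toString().trim()));
    }
}

The driver would then also need conf.setMapOutputKeyClass(IntWritable.class), conf.setMapOutputValueClass(Text.class), conf.setOutputKeyClass(NullWritable.class) and conf.setOutputValueClass(Text.class).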
What you can try is to take one constant key (or simply NullWritable) and pass it as the key, with your complete line as the value (you can reverse it in the mapper class, or you can reverse it in the reducer class as well). Your reducer will then receive a constant key (or a placeholder, if you used NullWritable as the key) and the complete line. Now you can simply reverse the line and write it to the output file. By not using temp as a key you avoid writing unwanted numbers to your output file. A rough sketch of this variant follows.
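A minimal sketch of that whole-line variant, assuming the reversal is done in the mapper (the class name is illustrative; NullWritable must be imported):

// Mapper: reverses every word of the line and emits the whole reversed line under a NullWritable key
public static class ReverseLineMapper extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text> {
    private final Text reversedLine = new Text();

    @Override
    public void map(LongWritable offset, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
        StringBuilder sb = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            sb.append(new StringBuffer(tokenizer.nextToken()).reverse()).append(' ');
        }
        reversedLine.set(sb.toString().trim());
        output.collect(NullWritable.get(), reversedLine);
    }
}

The reducer can then simply forward each value it receives, again using NullWritable as the output key, so no numbers ever reach the output file.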
Related
As is quite obvious from the title, my aim is to use the Mapper's counter in the reduce phase, before the particular job finishes.
I have come across a few questions that are highly related to this one, but none of them solved all my problems.
(Accessing a mapper's counter from a reducer,
Hadoop, MapReduce Custom Java Counters Exception in thread "main" java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING,
etc.)
@Override
public void setup(Context context) throws IOException, InterruptedException{
Configuration conf = context.getConfiguration();
Cluster cluster = new Cluster(conf);
Job currentJob = cluster.getJob(context.getJobID());
mapperCounter = currentJob.getCounters().findCounter(COUNTER_NAME).getValue();
}
My problem is that the cluster does not contain any job history.
This is how I call the MapReduce job:
private void firstFrequents(String outpath) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Cluster cluster = new Cluster(conf);
conf.setInt("minFreq", MIN_FREQUENCY);
Job job = Job.getInstance(conf, "APR");
// Counters counters = job.getCounters();
job.setJobName("TotalTransactions");
job.setJarByClass(AssociationRules.class);
job.setMapperClass(FirstFrequentsMapper.class);
job.setReducerClass(CandidateReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path(outpath));
job.waitForCompletion(true);
}
Mapper:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class FirstFrequentsMapper extends
Mapper<Object, Text, Text, IntWritable> {
public enum Counters {
TotalTransactions
}
private IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] line = value.toString().split("\\\t+|,+");
int iter = 0;
for (String string : line) {
context.write(new Text(line[iter]), one);
iter++;
}
context.getCounter(Counters.TotalTransactions).increment(1);
}
}
Reducer
public class CandidateReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
private int minFrequency;
private long totalTransactions;
@Override
public void setup(Context context) throws IOException, InterruptedException{
Configuration conf = context.getConfiguration();
minFrequency = conf.getInt("minFreq", 1);
Cluster cluster = new Cluster(conf);
Job currentJob = cluster.getJob(context.getJobID());
totalTransactions = currentJob.getCounters().findCounter(FirstFrequentsMapper.Counters.TotalTransactions).getValue();
System.out.print(totalTransactions);
}
public void reduce(Text key, Iterable<IntWritable> values, Context context)throws IOException, InterruptedException {
int counter = 0;
for (IntWritable val : values) {
counter+=val.get();
}
/* Item frequency calculated*/
/* Write it to output if it is frequent */
if (counter>= minFrequency) {
context.write(key,new IntWritable(counter));
}
}
}
The correct setup() (or reduce()) implementation for getting the value of a counter is exactly the one shown in the post that you mention:
Counter counter = context.getCounter(CounterMapper.TestCounters.TEST);
long counterValue = counter.getValue();
where TEST is the name of the counter, which is declared in an enum TestCounters.
I don't see the reason for declaring a Cluster variable, though.
Also, in the code that you mention in your comments, you should store the result returned by getValue() in a variable, like the counterValue variable above.
Perhaps, you will find this post useful, as well.
UPDATE: Based on your edit, I believe that all you are looking for is the number of MAP_INPUT_RECORDS, which is a default counter, so you don't need to re-implement it.
To get the value of a counter from the Driver class, you can use (taken from this post):
job.getCounters().findCounter(COUNTER_NAME).getValue();
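For example, after waitForCompletion(true) returns in the firstFrequents() driver shown above, both the custom counter and the built-in map-input-records counter could be read roughly like this (a sketch; TaskCounter is the Hadoop 2.x enum of built-in counters, so verify the name for your version):

long totalTransactions = job.getCounters()
        .findCounter(FirstFrequentsMapper.Counters.TotalTransactions).getValue();
long mapInputRecords = job.getCounters()
        .findCounter(org.apache.hadoop.mapreduce.TaskCounter.MAP_INPUT_RECORDS).getValue();
System.out.println(totalTransactions + " transactions, " + mapInputRecords + " map input records");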
I have written a MapReduce program in Hadoop that hashes all the records of a file, appends the hashed value as an additional attribute to each record, and then writes the output to the Hadoop file system.
This is the code I have written:
public class HashByMapReduce
{
public static class LineMapper extends Mapper<Text, Text, Text, Text>
{
private Text word = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
key.set("single");
String line = value.toString();
word.set(line);
context.write(key, word);
}
}
public static class LineReducer
extends Reducer<Text,Text,Text,Text>
{
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
String translations = "";
for (Text val : values)
{
translations = val.toString()+","+String.valueOf(hash64(val.toString())); //Point of Error
result.set(translations);
context.write(key, result);
}
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "Hashing");
job.setJarByClass(HashByMapReduce.class);
job.setMapperClass(LineMapper.class);
job.setReducerClass(LineReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I have written this code with the logic that each line is read by the map method, which assigns every value to a single key; everything is then passed to the same reducer, which in turn passes each value to the hash64() function.
But I see it is passing a null (empty) value to the hash function. I am unable to figure out why. Thanks in advance.
The cause of the problem is most probably the use of KeyValueTextInputFormat. From the Yahoo tutorial:
InputFormat:           Description:                                 Key:                                       Value:
TextInputFormat        Default format; reads lines of text files    The byte offset of the line                The line contents
KeyValueInputFormat    Parses lines into key, val pairs             Everything up to the first tab character   The remainder of the line
It breaks your input lines at the tab character. I suppose there is no tab in your lines; as a result, the key in LineMapper is the whole line, while nothing is passed as the value (not sure whether it is null or empty).
From your code, I think you would be better off using TextInputFormat as your input format, which produces the line offset as the key and the complete line as the value. This should solve your problem.
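Alternatively, if your records really are key/value pairs separated by something other than a tab, you could keep KeyValueTextInputFormat and change the separator through configuration. A sketch, assuming comma-separated records and the Hadoop 2.x property name (verify it for your version):

Configuration conf = new Configuration();
// assumption: key and value are separated by a comma in each input line
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = new Job(conf, "Hashing");
job.setInputFormatClass(KeyValueTextInputFormat.class);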
EDIT: I ran your code with the following changes, and it seems to work fine:
1) Changed the input format to TextInputFormat and changed the declaration of the Mapper accordingly.
2) Added the proper setMapOutputKeyClass and setMapOutputValueClass calls to the job. These are not mandatory, but their absence often creates problems at run time.
3) Removed your key.set("single") and added a private outKey to the Mapper.
4) Since you provided no details of the hash64 method, I used String.toUpperCase for testing.
If the issue persists, then I'm sure your hash method doesn't handle nulls well.
Full code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class HashByMapReduce {
public static class LineMapper extends
Mapper<LongWritable, Text, Text, Text> {
private Text word = new Text();
private Text outKey = new Text("single");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
word.set(line);
context.write(outKey, word);
}
}
public static class LineReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String translations = "";
for (Text val : values) {
translations = val.toString() + ","
+ val.toString().toUpperCase(); // Point of Error
result.set(translations);
context.write(key, result);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Hashing");
job.setJarByClass(HashByMapReduce.class);
job.setMapperClass(LineMapper.class);
job.setReducerClass(LineReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I get the following error when I execute my alphabet count program.
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1014)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.example.AlphabetCount$Map.map(AlphabetCount.java:40)
Command used to run: ./bin/hadoop jar /home/ubuntu/Documents/AlphabetCount.jar input output
I have browsed and checked the first eight links that come up when I google the error message. I have implemented their advice, and yet the error still appears. Can you help me out, please?
Code:
package com.example;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AlphabetCount {
public static class Map1 extends
Mapper<LongWritable, Text, Text, IntWritable> {
private Text alphabet = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
byte[] byteArray = line.getBytes();
int sum = 0;
alphabet.set("a");
for (int i = 0; i < byteArray.length; i++) {
if ((byteArray[i] == 'a') || (byteArray[i] == 'A')) {
sum += 1;
}
}
context.write(alphabet, new IntWritable(sum));
}
}
public static class Reduce1 extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> value,
Context context) throws IOException, InterruptedException {
final Text alphabet = new Text();
alphabet.set(key);
int sum = 0;
while (value.hasNext()) {
sum = sum + value.next().get();
}
context.write(alphabet, new IntWritable(sum));
}
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(AlphabetCount.class);
job.setMapperClass(Map1.class);
job.setCombinerClass(Reduce1.class);
job.setReducerClass(Reduce1.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Update (solution): the above code works! I was getting the error because the jar I was executing was different from the jar I was updating with the above code. I had initially exported the jar (with the erroneous code) from Eclipse to location x, and I was subsequently updating the code in location y while still executing the jar from location x! Damn!
Try specifying your input and output format classes in your main method, as well as the input key type of your Mapper. You should have something similar to this:
public class AlphabetCount {
public static class Map1 extends
Mapper<Text, Text, Text, IntWritable> {
...
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
...
}
}
public static class Reduce1 extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> value,
Context context) throws IOException, InterruptedException {
...
}
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf);
...
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
...
}
}
I have several (title, text) ordered pairs obtained as output from a MapReduce application in Hadoop, written in Java.
Now I would like to implement Word Count on the text field of these ordered pairs.
So my final output should look like :
(title-a , word-a-1 , count-a-1 , word-a-2 , count-a-2 ....)
(title-b , word-b-1, count-b-1 , word-b-2 , count-b-2 ....)
.
.
.
.
(title-x , word-x-1, count-x-1 , word-x-2 , count-x-2 ....)
To summarize, I want to implement word count separately on the output records from the first MapReduce job. Can someone suggest a good way to do it, or how I can chain a second MapReduce job to create the above output or format it better?
The following is the code; I borrowed it from GitHub and made some changes:
package com.org;
import javax.xml.stream.XMLStreamConstants;//XMLInputFactory;
import java.io.*;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import javax.xml.stream.*;
public class XmlParser11
{
public static class XmlInputFormat1 extends TextInputFormat {
public static final String START_TAG_KEY = "xmlinput.start";
public static final String END_TAG_KEY = "xmlinput.end";
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) {
return new XmlRecordReader();
}
/**
* XMLRecordReader class to read through a given xml document to output
* xml blocks as records as specified by the start tag and end tag
*
*/
// @Override
public static class XmlRecordReader extends
RecordReader<LongWritable, Text> {
private byte[] startTag;
private byte[] endTag;
private long start;
private long end;
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable key = new LongWritable();
private Text value = new Text();
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
FileSplit fileSplit = (FileSplit) split;
// open the file and seek to the start of the split
start = fileSplit.getStart();
end = start + fileSplit.getLength();
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
fsin = fs.open(fileSplit.getPath());
fsin.seek(start);
}
@Override
public boolean nextKeyValue() throws IOException,
InterruptedException {
if (fsin.getPos() < end) {
if (readUntilMatch(startTag, false)) {
try {
buffer.write(startTag);
if (readUntilMatch(endTag, true)) {
key.set(fsin.getPos());
value.set(buffer.getData(), 0,
buffer.getLength());
return true;
}
} finally {
buffer.reset();
}
}
}
return false;
}
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException,
InterruptedException {
return value;
}
@Override
public void close() throws IOException {
fsin.close();
}
@Override
public float getProgress() throws IOException {
return (fsin.getPos() - start) / (float) (end - start);
}
private boolean readUntilMatch(byte[] match, boolean withinBlock)
throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
// end of file:
if (b == -1)
return false;
// save to buffer:
if (withinBlock)
buffer.write(b);
// check if we're matching:
if (b == match[i]) {
i++;
if (i >= match.length)
return true;
} else
i = 0;
// see if we've passed the stop point:
if (!withinBlock && i == 0 && fsin.getPos() >= end)
return false;
}
}
}
}
public static class Map extends Mapper<LongWritable, Text,Text, Text> {
@Override
protected void map(LongWritable key, Text value,
Mapper.Context context)
throws
IOException, InterruptedException {
String document = value.toString();
System.out.println("'" + document + "'");
try {
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(new
ByteArrayInputStream(document.getBytes()));
String propertyName = "";
String propertyValue = "";
String currentElement = "";
while (reader.hasNext()) {
int code = reader.next();
switch (code) {
case XMLStreamConstants.START_ELEMENT: //START_ELEMENT:
currentElement = reader.getLocalName();
break;
case XMLStreamConstants.CHARACTERS: //CHARACTERS:
if (currentElement.equalsIgnoreCase("title")) {
propertyName += reader.getText();
//System.out.println(propertyName);
} else if (currentElement.equalsIgnoreCase("text")) {
propertyValue += reader.getText();
//System.out.println(propertyValue);
}
break;
}
}
reader.close();
context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
}
catch(Exception e){
throw new IOException(e);
}
}
}
public static class Reduce
extends Reducer<Text, Text, Text, Text> {
@Override
protected void setup(
Context context)
throws IOException, InterruptedException {
context.write(new Text("<Start>"), null);
}
@Override
protected void cleanup(
Context context)
throws IOException, InterruptedException {
context.write(new Text("</Start>"), null);
}
private Text outputKey = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context)
throws IOException, InterruptedException {
for (Text value : values) {
outputKey.set(constructPropertyXml(key, value));
context.write(outputKey, null);
}
}
public static String constructPropertyXml(Text name, Text value) {
StringBuilder sb = new StringBuilder();
sb.append("<property><name>").append(name)
.append("</name><value>").append(value)
.append("</value></property>");
return sb.toString();
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<page>");
conf.set("xmlinput.end", "</page>");
Job job = new Job(conf);
job.setJarByClass(XmlParser11.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(XmlParser11.Map.class);
job.setReducerClass(XmlParser11.Reduce.class);
job.setInputFormatClass(XmlInputFormat1.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The word count code we find online counts words across all files and outputs the totals. I want to do the word count for each text field separately. The above mapper is used to pull the title and text from an XML document. Is there any way I can do the word count in the same mapper? If I do that, my next doubt is how to pass it along with the already existing key-value pairs (title, text) to the reducer. Sorry, I am not able to phrase my question properly, but I guess the reader has got some idea.
I'm not sure I have understood it properly, so I have many questions along with my answer.
First of all, whoever wrote this code is probably trying to show how to write a custom InputFormat to process XML data using MapReduce. I don't know how that relates to your problem.
To summarize , I want to implement wordcount separately on the output records from first mapreduce. Can someone suggest me a good way to do it
Read the output file generated by the first MR job and do it there.
or how I can chain a second map reduce job to create the above output or format it better ?
You can definitely chain jobs together in this fashion by writing multiple driver methods, one for each job; a rough sketch is shown below. See this for more details and this for an example.
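A rough sketch of such a chained driver, where the second job reads the first job's output directory (the intermediate path, job names, and the second job's TitleWordCountMapper/TitleWordCountReducer are placeholders for whatever word count implementation you write):

Configuration conf = new Configuration();
conf.set("xmlinput.start", "<page>");
conf.set("xmlinput.end", "</page>");

// Job 1: the XML parsing job from the question, writing (title, text) pairs to an intermediate directory
Job parseJob = new Job(conf, "xml-parse");
parseJob.setJarByClass(XmlParser11.class);
parseJob.setMapperClass(XmlParser11.Map.class);
parseJob.setReducerClass(XmlParser11.Reduce.class);
parseJob.setOutputKeyClass(Text.class);
parseJob.setOutputValueClass(Text.class);
parseJob.setInputFormatClass(XmlParser11.XmlInputFormat1.class);
FileInputFormat.addInputPath(parseJob, new Path(args[0]));
FileOutputFormat.setOutputPath(parseJob, new Path("intermediate"));
if (!parseJob.waitForCompletion(true)) {
    System.exit(1); // stop if the first job failed
}

// Job 2: word count over the records produced by job 1
Job countJob = new Job(new Configuration(), "per-title-wordcount");
countJob.setJarByClass(XmlParser11.class);
countJob.setMapperClass(TitleWordCountMapper.class);   // hypothetical
countJob.setReducerClass(TitleWordCountReducer.class); // hypothetical
countJob.setOutputKeyClass(Text.class);
countJob.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(countJob, new Path("intermediate"));
FileOutputFormat.setOutputPath(countJob, new Path(args[1]));
System.exit(countJob.waitForCompletion(true) ? 0 : 1);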
I want to do the wordcount for each of text fields separately.
What do you mean by separately? In the traditional word count program, the count of each word is calculated independently of the others.
Is there any way I can do the wordcount in the same mapper.
I hope you have understood the word count program properly. In the traditional word count program you read the input file one line at a time, split the line into words, and then emit each word as the key with 1 as the value. All of this happens inside the Mapper, which is essentially the same Mapper. The total count for each word is then determined in the Reducer part of your job. If you wish to emit the words with their total counts from the mapper itself, you have to read the whole file in the Mapper and do the counting there. For that you need to make isSplitable() in your InputFormat return false, so that your input file is read as a whole and goes to just one Mapper (a sketch of such an input format is shown below).
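For reference, "reading the whole file in one mapper" is normally achieved by overriding isSplitable() in the input format, roughly like this sketch (new-API classes; the class name is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// An input format whose files are never split, so each file goes to exactly one mapper
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

and registering it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class);.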
When you emit something from the Mapper and it is not a map-only job, the output of your Mapper automatically goes to the Reducer. Do you need anything else?
I suggest you could go with regular expressions and perform the mapping and grouping that way. The Hadoop examples jar file provides a Grep class; using it, you can map your HDFS data with a regular expression and group the mapped data. An example invocation is shown below.
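If you want to try that route, the Grep example ships with the Hadoop distribution and can be run roughly like this (the jar name varies between Hadoop versions, so treat this as an illustration):

./bin/hadoop jar hadoop-mapreduce-examples-*.jar grep <input-dir> <output-dir> '<regex>'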
I modified the code below to output only words that occur at least ten times, but it does not work -- the output file does not change at all. What do I have to do to make it work?
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
// ...
public class WordCount extends Configured implements Tool {
// ...
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// this is where I made my change, but it's not working: the output file didn't change
if(sum >= 10)
{
context.write(key, new IntWritable(sum));
}
}
}
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
//job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new WordCount(), args);
System.exit(ret);
}
}
The code looks completely valid. I suspect that your dataset is big enough that every word happens to appear more than 10 times?
Please also make sure that you are indeed looking at the new results.
You can check the default Hadoop counters to get an idea of what is happening.
The code looks valid.
To be able to help you, we need at least the command line you used to run this. It would also help if you could post the actual output you get when you feed it a file like this:
one
two two
three three three
etc., up to 20.
The code is definitely correct. Maybe you are reading output that was generated before you modified the code, or maybe you did not rebuild the jar file after modifying the code?