WordCount MapReduce is giving unexpected result - java

I am trying this Java code for word count in MapReduce, and after the reduce method completes I want to emit only the word that occurs the maximum number of times.
For that I have created some class-level variables named myoutput, mykey and completeSum.
I write this data in the close method, but I am getting an unexpected result at the end.
public class WordCount {
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
static int completeSum = -1;
static OutputCollector<Text, IntWritable> myoutput;
static Text mykey = new Text();
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (completeSum < sum) {
completeSum = sum;
myoutput = output;
mykey = key;
}
}
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
super.close();
myoutput.collect(mykey, new IntWritable(completeSum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
// conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Input file data:
one
three three three
four four four four
six six six six six six six six six six six six six six six six six six
five five five five five
seven seven seven seven seven seven seven seven seven seven seven seven seven
The result should come out as:
six 18
However, I am getting this result:
three 18
From the result I can see that the sum is correct but the key is not.
If someone can give a good reference on these map and reduce methods, that would be very helpful.

The problem you are observing is due to reference aliasing. The object referenced by key is reused with new contents across invocations, so mykey, which references the same object, keeps changing. It ends up holding the last reduced key. This can be avoided by copying the object, as in:
mykey = new Text(key);
However, you should get the result only from the output file, as static variables cannot be shared by different nodes in a distributed cluster. This sort of works only in standalone mode, defeating the purpose of map-reduce.
Finally, using global variables, even in standalone mode, will most likely lead to races if parallel local tasks are used (see MAPREDUCE-1367 and MAPREDUCE-434).
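For illustration, the reduce method with that copy applied might look like this (a minimal sketch; it still relies on static state, so as explained above it only behaves in standalone mode):
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    if (completeSum < sum) {
        completeSum = sum;
        myoutput = output;
        // copy the key: Hadoop reuses the same Text instance across calls
        mykey = new Text(key);
    }
}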

Related

Find top 10 most frequent words excluding “the”, “am”, “is”, and “are” in Hadoop MapReduce?

I am working on a WordCount problem with MapReduce. I have used a txt file of Lewis Carroll’s famous Through the Looking-Glass. It’s a pretty big file. I ran my MapReduce code and it’s working fine. Now I need to find the top 10 most frequent words, excluding “the”, “am”, “is”, and “are”. I have no idea how to handle this.
Here is my code:
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
/* job.setSortComparatorClass(Text.Comparator.class);*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm personally not going to write code until I see an attempt on your end that requires more effort than Wordcount.
You need a second mapper and reducer to perform a Top N operation. If you used a higher level language such as Pig, Hive, Spark, etc. that's what it would do.
For starters, you can at least filter out the words from itr.nextToken() to prevent the first mapper from ever seeing them.
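A sketch of that filter inside the existing map method (the stop-word set here is just the four words from the question; it needs java.util.Set, HashSet and Arrays imports):
private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "am", "is", "are"));

public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
    while (itr.hasMoreTokens()) {
        String token = itr.nextToken();
        if (STOPWORDS.contains(token)) {
            continue; // never emit the excluded words
        }
        word.set(token);
        context.write(word, one);
    }
}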
Then, in the reducer, your output will be unsorted, but you are already writing the sum for every word to some output directory, which is a necessary first step towards getting the top words.
The solution then requires you to create a new Job object that reads that first output directory and writes to a new output directory, and for each line of text the mapper emits (null, line) as the output (use NullWritable and Text).
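For illustration, a minimal sketch of that second-job mapper (the class name is made up; it just forwards each "word<TAB>count" line under a single NullWritable key):
public static class LineMapper extends Mapper<Object, Text, NullWritable, Text> {
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        // every line of the first job's output goes to one reducer group
        context.write(NullWritable.get(), value);
    }
}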
With this, in the reducer, all lines of text will be sent into one reducer iterator, so in order to get the Top N items, you can create a TreeMap<Integer, String> to sort words by the count (refer Sorting Descending order: Java Map). While inserting elements, larger values will automatically get pushed to the top of the tree. You can optionally optimize this by tracking the smallest element in the tree as well, and only inserting items larger than it, and/or track the tree size and only insert items larger than the N'th item (this helps if you potentially have hundreds of thousands of words).
After the loop that adds all elements to the tree, take the top N string values and their counts (the tree is already sorted for you) and write them out from the reducer. With that, you should end up with the Top N items.
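A rough sketch of the matching Top-N reducer (N = 10; the class name is made up, and keying the TreeMap by count means words with equal counts overwrite each other in this simple version; it needs java.util.TreeMap and java.util.Map imports):
public static class TopNReducer extends Reducer<NullWritable, Text, Text, IntWritable> {
    private static final int N = 10;

    public void reduce(NullWritable key, Iterable<Text> values, Context context
                       ) throws IOException, InterruptedException {
        // TreeMap keeps entries sorted ascending by count
        TreeMap<Integer, String> topWords = new TreeMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split("\t"); // "word<TAB>count"
            topWords.put(Integer.parseInt(parts[1]), parts[0]);
            if (topWords.size() > N) {
                topWords.remove(topWords.firstKey()); // drop the current smallest
            }
        }
        // write out largest counts first
        for (Map.Entry<Integer, String> entry : topWords.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
        }
    }
}
In the driver for that second job you would also set NullWritable.class and Text.class as the map output key and value classes.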

Result of Hadoop MapReduce isn't writing any data to output file

I've been trying to debug this error for a while now. Basically, I've confirmed that my reduce class is writing the correct output to its context, but for some reason I'm always getting a zero-byte output file.
My mapper class:
public class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Document t = Jsoup.parse(value.toString());
String text = t.body().text();
String[] content = text.split(" ");
for (String s : content) {
context.write(new Text(s), new IntWritable(1));
}
}
}
My reducer class:
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int n = 0;
for (IntWritable i : values) {
n++;
}
if (n > 5) { // Do we need this check?
context.write(key, new IntWritable(n));
System.out.println("<" + key + ", " + n + ">");
}
}
}
and my driver:
public class FrequencyMain {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration(true);
// setup the job
Job job = Job.getInstance(conf, "FrequencyCount");
job.setJarByClass(FrequencyMain.class);
job.setMapperClass(FrequencyMapper.class);
job.setCombinerClass(FrequencyReducer.class);
job.setReducerClass(FrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And for some reason "reduce output records" is always 0:
Job complete: job_local805637130_0001
Counters: 17
Map-Reduce Framework
Spilled Records=250
Map output materialized bytes=1496
Reduce input records=125
Map input records=6
SPLIT_RAW_BYTES=1000
Map output bytes=57249
Reduce shuffle bytes=0
Reduce input groups=75
Combine output records=125
Reduce output records=0
Map output records=5400
Combine input records=5400
Total committed heap usage (bytes)=3606577152
File Input Format Counters
Bytes Read=509446
FileSystemCounters
FILE_BYTES_WRITTEN=385570
FILE_BYTES_READ=2909134
File Output Format Counters
Bytes Written=8
(Assuming that your goal is to print the words whose frequency is > 5.)
The current implementation of the combiner totally breaks the semantics of your program. You need either to remove it or reimplement it:
Currently it passes to the reducer only those words whose per-mapper frequency is greater than 5. A combiner works per mapper; this means, for example, that if only a single document is scheduled onto some mapper, then that mapper/combiner won't emit words whose frequency in this document is less than 6 (even if other documents in other mappers have lots of occurrences of these words). You need to remove the n > 5 check in the combiner (but not in the reducer).
Because the reducer input values are then not necessarily all "ones", you should increment n by the value amount (n += i.get()) instead of n++.
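For illustration, a minimal sketch of that split (FrequencySumCombiner is an assumed extra class name; the existing FrequencyReducer keeps the filter but now sums the actual values):
// safe to use as the combiner: it only sums, no filtering
public static class FrequencySumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable i : values) {
            n += i.get();
        }
        context.write(key, new IntWritable(n));
    }
}

// final reducer: values may already be partial sums, so add them up, then filter
public static class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable i : values) {
            n += i.get();
        }
        if (n > 5) {
            context.write(key, new IntWritable(n));
        }
    }
}
The driver would then use job.setCombinerClass(FrequencySumCombiner.class); instead of reusing the reducer as the combiner.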

Input of the reduce phase is not what I expect in Hadoop (Java)

I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge; in fact, this is a triangle graph):
1 3
3 1
3 2
2 3
Now, I want to use MapReduce to count the triangles in this graph (obviously one). It is still a work in progress; in the first phase, I try to get a list of all neighbors for each vertex.
My main class looks like the following:
public class TriangleCount {
public static void main( String[] args ) throws Exception {
// remove the old output directory
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path("output/"), true);
JobConf firstPhaseJob = new JobConf(FirstPhase.class);
firstPhaseJob.setOutputKeyClass(IntWritable.class);
firstPhaseJob.setOutputValueClass(IntWritable.class);
firstPhaseJob.setMapperClass(FirstPhase.Map.class);
firstPhaseJob.setCombinerClass(FirstPhase.Reduce.class);
firstPhaseJob.setReducerClass(FirstPhase.Reduce.class);
FileInputFormat.setInputPaths(firstPhaseJob, new Path("input/"));
FileOutputFormat.setOutputPath(firstPhaseJob, new Path("output/"));
JobClient.runJob(firstPhaseJob);
}
}
My Mapper and Reducer implementations look like this, they are both very easy:
public class FirstPhase {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
@Override
public void map(LongWritable longWritable, Text graphLine, OutputCollector<IntWritable, IntWritable> outputCollector, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(graphLine.toString());
int n1 = Integer.parseInt(tokenizer.nextToken());
int n2 = Integer.parseInt(tokenizer.nextToken());
if(n1 > n2) {
System.out.println("emitting (" + new IntWritable(n1) + ", " + new IntWritable(n2) + ")");
outputCollector.collect(new IntWritable(n1), new IntWritable(n2));
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, Text> {
@Override
public void reduce(IntWritable key, Iterator<IntWritable> iterator, OutputCollector<IntWritable, Text> outputCollector, Reporter reporter) throws IOException {
List<IntWritable> nNodes = new ArrayList<>();
while(iterator.hasNext()) {
nNodes.add(iterator.next());
}
System.out.println("key: " + key + ", list: " + nNodes);
// create pairs and emit these
for(IntWritable n1 : nNodes) {
for(IntWritable n2 : nNodes) {
outputCollector.collect(key, new Text(n1.toString() + " " + n2.toString()));
}
}
}
}
}
I've added some logging to the program. In the map phase, I print which pairs I'm emitting. In the reduce phase, I print the input of the reduce. I get the following output:
emitting (3, 1)
emitting (3, 2)
key: 3, list: [1, 1]
The input for the reduce function is not what I expect. I expect it to be [1, 2] and not [1, 1]. I believe that Hadoop automatically combines all my emitted pairs from the output of the map phase but am I missing something here? Any help or explanation would be appreciated.
This is a typical problem for people beginning with Hadoop MapReduce.
The problem is in your reducer. When looping through the given Iterator<IntWritable>, the IntWritable instance is re-used; Hadoop only keeps one instance around at any given time.
That means that when you call iterator.next(), your previously saved IntWritable instance is set to the new value.
You can read more about this problem here:
https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
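For this reducer, a minimal fix is to copy each value before storing it, for example:
List<IntWritable> nNodes = new ArrayList<>();
while (iterator.hasNext()) {
    // copy: Hadoop re-fills the same IntWritable on every next() call
    nNodes.add(new IntWritable(iterator.next().get()));
}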

failed to set a KeyComparator function

I'm trying to sort the data by value.
The method I use is to combine the key and value into a composite key,
e.g. (key,value) -> ({key,value},value),
and define my own KeyComparator which compares the value part of the composite key.
My data is a paragraph whose words I should count,
and I run two jobs: the first one does the word count, but combines the key and value into a composite key in the reducer.
This is the result:
is,4 4
the,15 15
ECA,1 1
to,6 6
.....
In the second job, I try to use the composite key to sort by the value.
This is my mapper2:
public static class Map2 extends MapReduceBase
implements Mapper<LongWritable,Text,Text,IntWritable>{
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
String w1[] = line.split("\t");
word.set(w1[0]);
output.collect(word,new IntWritable(Integer.valueOf(w1[1])));
}
}
And here is my KeyComparator:
public static final class KeyComparator extends WritableComparator {
public KeyComparator(){
super(Text.class,true);
}
@Override
public int compare(WritableComparable tp1, WritableComparable tp2) {
Text t1 = (Text)tp1;
Text t2 = (Text)tp2;
String a[] = t1.toString().split(",");
String b[] = t2.toString().split(",");
return a[1].compareTo(b[1]);
}
}
This is my reducer2:
public static class Reduce2 extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
int sum=0;
while( values.hasNext()){
sum+= values.next().get();
}
//String cpKey[] = key.toString().split(",");
Text outputKey = new Text();
//outputKey.set(cpKey[0]);
output.collect(key, new IntWritable(sum));
}
}
Here is my main function:
public static void main(String[] args) throws Exception {
int reduceTasks = 1;
int mapTasks = 3;
System.out.println("1. New JobConf...");
JobConf conf = new JobConf(WordCountV2.class);
conf.setJobName("WordCount");
System.out.println("2. Setting output key and value...");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
System.out.println("3. Setting Mapper and Reducer classes...");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
// set numbers of reducers
System.out.println("4. Setting number of reduce and map tasks...");
conf.setNumReduceTasks(reduceTasks);
conf.setNumMapTasks(mapTasks);
System.out.println("5. Setting input and output formats...");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
System.out.println("6. Setting input and output paths...");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
String TempDir = "temp" + Integer.toString(new Random().nextInt(1000)+1);
FileOutputFormat.setOutputPath(conf, new Path(TempDir));
//FileOutputFormat.setOutputPath(conf,new Path(args[1]));
System.out.println("7. Running job...");
JobClient.runJob(conf);
JobConf sort = new JobConf(WordCountV2.class);
sort.setJobName("sort");
sort.setMapOutputKeyClass(Text.class);
sort.setMapOutputValueClass(IntWritable.class);
sort.setOutputKeyComparatorClass(KeyComparator.class);
sort.setMapperClass(Map2.class);
sort.setReducerClass(Reduce2.class);
sort.setNumReduceTasks(reduceTasks);
sort.setNumMapTasks(mapTasks);
sort.setInputFormat(TextInputFormat.class);
sort.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(sort,TempDir);
FileOutputFormat.setOutputPath(sort, new Path(args[1]));
JobClient.runJob(sort);
}
But the result is something like this:
is 13
the 32
ECA 21
to 14
.
.
.
and many words are lost.
But if I don't use my KeyComparator,
I get the unsorted result back, just like the first one I mentioned.
Any ideas how to solve this problem? Thanks!
I'm not sure where you are making a mistake.
But what you are trying to do is called a secondary sort: sorting based on the value.
It's not a trivial job to do; you need to create more classes for partitioning, aggregation and other stuff, which is clearly explained Here and Here.
Just following the instructions in those blogs will surely help you.
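As a rough sketch of the kind of extra class those posts describe (old mapred API; assuming the composite key is the "word,count" Text produced by your first job), a partitioner that partitions on the word part only might look like this:
public static class NaturalKeyPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) { }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // partition on the natural key (the word), not the whole composite key
        String word = key.toString().split(",")[0];
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
It would be registered with sort.setPartitionerClass(NaturalKeyPartitioner.class); the posts also add a grouping comparator via sort.setOutputValueGroupingComparator(...) so that records for the same word end up in the same reducer group.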

How to traverse an iterator of Text values twice in a Mapreduce program?

In my MapReduce program, I have a reducer function which counts the number of items in an Iterator of Text values and then, for each item in the iterator, outputs the item as the key and the count as the value. Thus I need to use the iterator twice. But once the iterator has reached the end, I cannot start iterating from the beginning again. How do I solve this problem?
I tried the following code for my reduce function:
public static class ReduceA extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text>output, Reporter reporter) throws IOException
{
Text t;
int count =0;
String[] attr = key.toString().split(",");
while(values.hasNext())
{
values.next();
count++;
}
//Maybe i need to reset my iterator here and start from the beginning but how do i do it?
String v=Integer.toString(count);
while(values.hasNext())
{
t=values.next();
output.collect(t,new Text(v));
}
}
}
The above code produced empty results. I tried inserting the values of the iterator into a list, but since I need to deal with many GBs of data, I am getting a Java heap space error when using the list. Please help me modify my code so that I can traverse the iterator twice.
You could always do it the simple way: declare a List and cache the values as you iterate through the first time. You can then iterate through your List and write out your output. You should have something similar to this:
public static class ReduceA extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text t;
int count = 0;
String[] attr = key.toString().split(",");
List<Text> cache = new ArrayList<Text>();
while (values.hasNext()) {
cache.add(new Text(values.next())); // copy the Text: Hadoop reuses the same instance on each next()
count++;
}
// Maybe i need to reset my iterator here and start from the beginning
// but how do i do it?
String v = Integer.toString(count);
for (Text text : cache) {
output.collect(text, new Text(v));
}
}
}
