I am working on the WordCount problem with MapReduce. I have used a txt file of Lewis Carroll's famous Through the Looking-Glass; it's a pretty big file. I ran my MapReduce code and it's working fine. Now I need to find the top 10 most frequent words, excluding "the", "am", "is", and "are". I have no idea how to handle this.
Here is my code:
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(
          value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    /* job.setSortComparatorClass(Text.Comparator.class); */
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
I'm personally not going to write code until I see an attempt on your end that requires more effort than Wordcount.
You need a second mapper and reducer to perform a Top N operation. If you used a higher-level tool such as Pig, Hive, or Spark, that's what it would do for you.
For starters, you can at least filter out those words as they come from itr.nextToken(), so the first mapper never emits them (see the sketch below).
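For example, here is a minimal sketch of that filter inside your TokenizerMapper.map(). The STOP_WORDS set (its name and contents) is mine, not part of your original code; it reuses your word and one fields and needs java.util.Arrays, java.util.HashSet and java.util.Set imports:

private final static Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "am", "is", "are"));

public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(
      value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
  while (itr.hasMoreTokens()) {
    String token = itr.nextToken();
    if (STOP_WORDS.contains(token)) {
      continue; // skip "the", "am", "is", "are"
    }
    word.set(token);
    context.write(word, one);
  }
}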
Then the output of that first job will be unsorted, but it already contains the summed count for every word in some output directory, which is a necessary first step toward getting the top words.
Next, create a new Job object that reads that first output directory and writes to a new output directory, and have its mapper emit (null, line) for each line of text (use NullWritable and Text).
With this, all lines of text are sent into one reducer iterator, so to get the Top N items you can create a TreeMap<Integer, String> that sorts words by their count (refer to Sorting Descending order: Java Map). While inserting elements, larger values automatically get pushed to the top of the tree. You can optionally optimize this by tracking the smallest element in the tree and only inserting items larger than it, and/or by tracking the tree size and only keeping the N largest entries (this helps if you potentially have hundreds of thousands of distinct words).
After the loop that adds all elements to the tree, take the top N string values and their counts (the tree is already sorted for you) and write them out from the reducer. With that, you should end up with the Top N items.
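Here is a rough sketch of what that second job's mapper and reducer could look like. It assumes the first job wrote plain text lines of the form word<TAB>count; the class names (TopTenMapper, TopTenReducer) and the constant N are mine, and ties on the count overwrite each other in this simple version:

public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Forward every "word<TAB>count" line under a single null key.
    context.write(NullWritable.get(), value);
  }
}

public static class TopTenReducer extends Reducer<NullWritable, Text, Text, IntWritable> {
  private static final int N = 10;

  public void reduce(NullWritable key, Iterable<Text> values, Context context
                     ) throws IOException, InterruptedException {
    // TreeMap keeps entries sorted by count (ascending).
    TreeMap<Integer, String> topWords = new TreeMap<Integer, String>();
    for (Text line : values) {
      String[] parts = line.toString().split("\t");
      int count = Integer.parseInt(parts[1]);
      topWords.put(count, parts[0]);
      if (topWords.size() > N) {
        topWords.remove(topWords.firstKey()); // drop the current smallest entry
      }
    }
    // Emit the largest counts first.
    for (Map.Entry<Integer, String> entry : topWords.descendingMap().entrySet()) {
      context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
    }
  }
}

In the driver for this second job you would also call job.setMapOutputKeyClass(NullWritable.class) and job.setMapOutputValueClass(Text.class) so the map output types match.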
My mapper class will output key-value pairs like:
abc 1
abc 2
abc 1
In my reducer class I want to merge the values and count how often each value occurs for the same key using a HashMap, so the output should look like:
abc 1:2 2:1
But my output result is:
abc 1:2:1 2:1:1
It feels like there are additional Strings concatenated with the output, but I don't know why.
Here is my code:
Text combiner = new Text();

@Override
public void reduce(Text key, Iterable<Text> values,
                   Context context
                   ) throws IOException, InterruptedException {
  HashMap<Text, Integer> result = new HashMap<Text, Integer>();
  for (Text val : values) {
    if (result.containsKey(val)) {
      int newVal = result.get(val) + 1;
      result.put(val, newVal);
    } else {
      result.put(val, 1);
    }
  }
  // Build the output string fresh for each key.
  StringBuilder strBuilder = new StringBuilder();
  for (Map.Entry<Text, Integer> entry : result.entrySet()) {
    strBuilder.append(entry.getKey().toString());
    strBuilder.append(":");
    strBuilder.append(entry.getValue());
    strBuilder.append("\t");
  }
  combiner.set(strBuilder.toString());
  context.write(key, combiner);
}
I tested this code and it looks OK. The most likely reason you're getting output like this is that you're running this reducer as your combiner as well, which would explain why you're getting three values: the combiner does the first concatenation, and the reducer then does a second one.
You need to make sure a combiner isn't being configured in your job setup.
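In other words, check your driver for a line like the following (the MyReducer name here is just a placeholder for your reducer class) and remove it:

// Running the same class as combiner and reducer concatenates the values twice:
job.setCombinerClass(MyReducer.class);   // remove this line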
I would also suggest you change your code to store new copies of the Text values in your HashMap; remember that Hadoop reuses these objects. So you should really be doing something like:
result.put(new Text(val), newVal);
or change your HashMap to store Strings, which is safe since they're immutable.
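For example, a minimal sketch of the String-based variant (the same counting loop as yours, just with each value copied out of the reused Text object):

HashMap<String, Integer> result = new HashMap<String, Integer>();
for (Text val : values) {
  String v = val.toString();                 // copies the bytes out of the reused Text
  Integer current = result.get(v);
  result.put(v, current == null ? 1 : current + 1);
}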
I've been trying to debug this error for a while now. Basically, I've confirmed that my reduce class is writing the correct output to its context, but for some reason I'm always getting a zero-byte output file.
My mapper class:
public class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Document t = Jsoup.parse(value.toString());
        String text = t.body().text();
        String[] content = text.split(" ");
        for (String s : content) {
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
My reducer class:
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable i : values) {
            n++;
        }
        if (n > 5) { // Do we need this check?
            context.write(key, new IntWritable(n));
            System.out.println("<" + key + ", " + n + ">");
        }
    }
}
and my driver:
public class FrequencyMain {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(true);

        // setup the job
        Job job = Job.getInstance(conf, "FrequencyCount");
        job.setJarByClass(FrequencyMain.class);
        job.setMapperClass(FrequencyMapper.class);
        job.setCombinerClass(FrequencyReducer.class);
        job.setReducerClass(FrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
And for some reason "Reduce output records" is always 0:
Job complete: job_local805637130_0001
Counters: 17
Map-Reduce Framework
Spilled Records=250
Map output materialized bytes=1496
Reduce input records=125
Map input records=6
SPLIT_RAW_BYTES=1000
Map output bytes=57249
Reduce shuffle bytes=0
Reduce input groups=75
Combine output records=125
Reduce output records=0
Map output records=5400
Combine input records=5400
Total committed heap usage (bytes)=3606577152
File Input Format Counters
Bytes Read=509446
FileSystemCounters
FILE_BYTES_WRITTEN=385570
FILE_BYTES_READ=2909134
File Output Format Counters
Bytes Written=8
(Assuming that your goal is to print the words whose frequency is greater than 5.)
The current implementation of your combiner totally breaks the semantics of your program. You need to either remove it or reimplement it:
Currently it passes on to the reducer only those words that already occur more than 5 times within a single mapper's input. The combiner runs per mapper, which means, for example, that if only a single document is scheduled onto some mapper, that mapper/combiner will never emit words occurring 5 or fewer times in that document (even if other documents on other mappers have lots of occurrences of those words). You need to remove the n > 5 check in the combiner (but keep it in the reducer).
Because the reducer's input values are then not necessarily all ones, you should increment n by the value's amount (n += i.get()) instead of n++.
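Here is a sketch of how the corrected pair could look; the FrequencySumCombiner name is mine, and you would point job.setCombinerClass(...) at it instead of FrequencyReducer:

// Combiner: only sums the partial counts, no filtering.
public static class FrequencySumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable i : values) {
      sum += i.get();       // add the partial count, not just 1 per element
    }
    context.write(key, new IntWritable(sum));
  }
}

// Reducer: sums the (possibly pre-combined) counts, then applies the > 5 filter.
public static class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int n = 0;
    for (IntWritable i : values) {
      n += i.get();
    }
    if (n > 5) {            // the threshold belongs here, after the full sum
      context.write(key, new IntWritable(n));
    }
  }
}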
In my MapReduce program, I have a reducer function which counts the number of items in an Iterator of Text values and then, for each item in the iterator, outputs the item as the key and the count as the value. Thus I need to use the iterator twice, but once the iterator has reached the end I cannot start again from the beginning. How do I solve this problem?
I tried the following code for my reduce function:
public static class ReduceA extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Text t;
        int count = 0;
        String[] attr = key.toString().split(",");
        while (values.hasNext()) {
            values.next();
            count++;
        }
        // Maybe I need to reset my iterator here and start from the beginning,
        // but how do I do it?
        String v = Integer.toString(count);
        while (values.hasNext()) {
            t = values.next();
            output.collect(t, new Text(v));
        }
    }
}
The above code produced empty results. I tried inserting the values of the iterator into a list, but since I need to deal with many GBs of data, I get a Java heap space error when using the list. Please help me modify my code so that I can traverse the iterator twice.
You could always do it the simple way: declare a List and cache the values as you iterate through the first time. You can then iterate through your List and write out your output. You should have something similar to this:
public static class ReduceA extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        int count = 0;
        String[] attr = key.toString().split(",");
        List<Text> cache = new ArrayList<Text>();
        while (values.hasNext()) {
            // Cache a copy of each value on the first pass
            // (the framework reuses the Text object behind the iterator).
            cache.add(new Text(values.next()));
            count++;
        }
        String v = Integer.toString(count);
        for (Text text : cache) {
            output.collect(text, new Text(v));
        }
    }
}
Say I have input as follows:
(1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
The expected output is as follows:
(1,(2,3,4)) -> (1,3) //second index is total friend #
(2,(1,3,4)) -> (2,3)
(3,(1,2)) -> (3,2)
(4,(1,2)) -> (4,2)
I know how to do this with a HashSet in Java, but I don't know how it works with the MapReduce model. Can anyone share any ideas or sample code for this problem? I would appreciate it.
------------------------------------------------------------------------------------
Here is my naive solution: one mapper, two reducers.
The mapper will take the input (1,2),(2,1),(1,3)
and organize its output as
(1,hashset<2>), (2,hashset<1>), (1,hashset<2>), (2,hashset<1>), (1,hashset<3>), (3,hashset<1>).
Reducer1:
take the mapper's output as input and produce:
(1,hashset<2,3>), (3,hashset<1>) and (2,hashset<1>)
Reducer2:
take reducer1's output as input and output as:
(1,2), (3,1) and (2,1)
This is only my naive solution; I'm not sure whether it can be implemented with Hadoop code.
I think there should be an easy way to solve this problem.
Mapper Input: (1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Just emit two records for each pair like this:
Mapper Output/ Reducer Input:
Key => Value
1 => 2
2 => 1
2 => 1
1 => 2
1 => 3
3 => 1
3 => 2
2 => 3
2 => 4
4 => 2
4 => 1
1 => 4
At reducer side, you'll get 4 different groups like this:
Reducer Output:
Key => Values
1 => [2,3,4]
2 => [1,3,4]
3 => [1,2]
4 => [1,2]
Now, you are good to format your result as you want. :)
Let me know if anybody can see any issue with this approach.
1) Intro / Problem
Before going ahead with the job driver, it is important to understand that, in a simple-minded approach, the values arriving at the reducers should be sorted in ascending order. The first thought is to pass the value list unsorted and do some sorting per key in the reducer. This has two disadvantages:
1) It is most probably not efficient for large Value Lists
and
2) How will the framework know if (1,4) is equal to (4,1) if these pairs are processed in different parts of the cluster?
2) Solution in theory
The way to do it in Hadoop is to "mock" the framework in a way by creating a synthetic key.
So our map function instead of the "conceptually more appropriate" (if I may say that)
map(k1, v1) -> list(k2, v2)
is the following:
map(k1, v1) -> list(ksynthetic, null)
As you notice we discard the usage of values (the reducer still gets a list of null values but we don't really care about them). What happens here is that these values are actually included in ksynthetic. Here is an example for the problem in question:
map(1, 2) -> list([1,2], null)
However, some more operations need to be done so that the keys are grouped and partitioned appropriately and we achieve the correct result in the reducer.
3) Hadoop Implementation
We will implement a class called FFGroupComparator and a class called FindFriendPartitioner.
Here is our FFGroupComparator:
public static class FFGroupComparator extends WritableComparator {

    protected FFGroupComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(",");
        String[] t2Items = t2.toString().split(",");
        String t1Base = t1Items[0];
        String t2Base = t2Items[0];
        // We compare using only the "real" key part of our synthetic key.
        int comp = t1Base.compareTo(t2Base);
        return comp;
    }
}
This class will act as our grouping comparator class. It controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context). This is very important, as it ensures that each reducer gets the appropriate synthetic keys (judging by the real key part).
Because Hadoop runs on a cluster with many nodes, it is important to ensure that all synthetic keys sharing the same real key (not the whole synthetic key) end up in the same partition and therefore at the same reduce task. We usually do this with hash values: in our case, we compute the partition a synthetic key belongs to from the hash of the real key (the part before the comma). So our FindFriendPartitioner is as follows:
public static class FindFriendPartitioner extends Partitioner<Text, NullWritable> {

    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
        String[] keyItems = key.toString().split(",");
        String keyBase = keyItems[0];
        // Mask the sign bit so the partition index is never negative.
        int part = (keyBase.hashCode() & Integer.MAX_VALUE) % numPartitions;
        return part;
    }
}
So now we are all set to write the actual job and solve our problem.
I am assuming your input file looks like this:
1,2
2,1
1,3
3,2
2,4
4,1
We will use the TextInputFormat.
Here's the code for the job classes, using Hadoop 1.0.4:
public class FindFriendTwo {

    public static class FindFriendMapper extends Mapper<Object, Text, Text, NullWritable> {

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the pair as given...
            context.write(value, NullWritable.get());
            // ...and also the reversed relationship.
            String[] tempStrings = value.toString().split(",");
            Text value2 = new Text(tempStrings[1] + "," + tempStrings[0]);
            context.write(value2, NullWritable.get());
        }
    }
Notice that we also emit the reverse relationships in the map function.
For example, if the input string is (1,4), we must not forget (4,1).
public static class FindFriendReducer extends Reducer<Text, NullWritable, IntWritable, IntWritable> {

    private Set<String> friendsSet;

    public void setup(Context context) {
        friendsSet = new LinkedHashSet<String>();
    }

    public void reduce(Text syntheticKey, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        String[] tempKeys = syntheticKey.toString().split(",");
        friendsSet.add(tempKeys[1]);
        if (friendsSet.size() == 2) {
            IntWritable key = new IntWritable(Integer.parseInt(tempKeys[0]));
            IntWritable value = new IntWritable(Integer.parseInt(tempKeys[1]));
            context.write(key, value);
        }
    }
}
Finally, we must remember to include the following in our Main Class, so that the framework uses our classes.
job.setGroupingComparatorClass(FFGroupComparator.class);
job.setPartitionerClass(FindFriendPartitioner.class);
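For completeness, here is a hedged sketch of how the rest of that driver might be wired up (this is not the author's exact code; on Hadoop 1.0.4 you would write new Job(conf, ...) instead of Job.getInstance, and the "find friends" name and argument handling are illustrative):

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "find friends");
  job.setJarByClass(FindFriendTwo.class);
  job.setMapperClass(FindFriendMapper.class);
  job.setReducerClass(FindFriendReducer.class);
  job.setGroupingComparatorClass(FFGroupComparator.class);
  job.setPartitionerClass(FindFriendPartitioner.class);
  // The map output types differ from the final output types, so set both.
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NullWritable.class);
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}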
I would approach this problem as follows.
Make sure we have all the relations and have them exactly once each.
Then simply count the relations per person.
Notes on my approach:
My notation for key-value pairs is: K -> V
Both the key and the value are almost always a data structure (not just a string or an int).
I never use the key for data. The key is ONLY there to control the flow from the mappers towards the right reducer. In all other places I do not look at the key at all, although the framework does require a key everywhere. With '()' I mean a key that I ignore completely.
The key point of my approach is that it never needs 'all friends' in memory at the same moment (so it also works in really big situations).
We start with a lot of
(x,y)
and we know that we do not have all relationships in the dataset.
Mapper: Create all relations
Input: () -> (x,y)
Output: (x,y) -> (x,y)
(y,x) -> (y,x)
Reducer: Remove duplicates (simply only output the first one from the iterator)
Input: (x,y) -> [(x,y),(x,y),(x,y),(x,y),.... ]
Output: () -> (x,y)
Mapper: "Wordcount"
Input: () -> (x,y)
Output: (x) -> (x,1)
Reducer: Count them
Input: (x) -> [(x,1),(x,1),(x,1),(x,1),.... ]
Output: () -> (x,N)
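A minimal sketch of those two jobs in Hadoop terms follows; all class names are mine, and it assumes plain TextInputFormat/TextOutputFormat, so the first job's output lines look like x,y:

// Job 1 mapper: emit both directions of every relation as the key; the value is ignored.
public static class RelationMapper extends Mapper<Object, Text, Text, NullWritable> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] p = value.toString().split(",");
    context.write(new Text(p[0] + "," + p[1]), NullWritable.get());
    context.write(new Text(p[1] + "," + p[0]), NullWritable.get());
  }
}

// Job 1 reducer: each distinct (x,y) key arrives once per group, so writing the key
// a single time removes the duplicates.
public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
  public void reduce(Text pair, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(pair, NullWritable.get());
  }
}

// Job 2 mapper: "Wordcount" on the first field of each deduplicated pair.
public static class CountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String x = value.toString().split(",")[0];
    context.write(new Text(x), one);
  }
}

// Job 2 reducer: sum the ones to get (x, N).
public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text x, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int n = 0;
    for (IntWritable v : values) {
      n += v.get();
    }
    context.write(x, new IntWritable(n));
  }
}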
With the help of so many excellent engineers, I finally worked out a solution.
Only one Mapper and one Reducer. No combiner here.
input of Mapper:
1,2
2,1
1,3
3,1
3,2
3,4
5,1
Output of Mapper:
1,2
2,1
1,2
2,1
1,3
3,1
1,3
3,1
3,2
2,3
4,3
3,4
1,5
5,1
Output Of Reducer:
1 3
2 2
3 3
4 1
5 1
The first column is the user; the second is the number of friends.
In the reduce stage, I use a HashSet to assist the analysis.
Thanks @Artem Tsikiridis, @Ashish.
Your answer gave me a nice clue.
Edited:
Added Code:
//mapper
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {

    private Text word1 = new Text();
    private Text word2 = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, ",");
        if (itr.hasMoreElements()) {
            word1.set(itr.nextToken().toLowerCase());
        }
        if (itr.hasMoreElements()) {
            word2.set(itr.nextToken().toLowerCase());
        }
        context.write(word1, word2);
        context.write(word2, word1);
    }
}
//reducer
public static class IntSumReducer extends Reducer<Text, Text, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
                       Context context) throws IOException, InterruptedException {
        HashSet<Text> set = new HashSet<Text>();
        int sum = 0;
        for (Text val : values) {
            if (!set.contains(val)) {
                // Store a copy: Hadoop reuses the Text object behind this iterator.
                set.add(new Text(val));
                sum++;
            }
        }
        result.set(sum);
        context.write(key, result);
    }
}