I am working on a word count problem with MapReduce. I have used a txt file of Lewis Carroll's famous Through the Looking-Glass. It's a pretty big file. I ran my MapReduce code and it's working fine. Now I need to find out the top 10 most frequent words excluding "the", "am", "is", and "are". I have no idea how to handle this.
Here is my code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(
                    value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        /* job.setSortComparatorClass(Text.Comparator.class); */
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I'm personally not going to write code until I see an attempt on your end that requires more effort than Wordcount.
You need a second mapper and reducer to perform a Top N operation. If you used a higher level language such as Pig, Hive, Spark, etc. that's what it would do.
For starters, you can at least filter out the words from itr.nextToken() to prevent the first mapper from ever seeing them.
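For illustration only, here is a minimal sketch of what that filtering could look like inside a mapper (the class name FilteringTokenizerMapper and the hard-coded stop-word set are my own assumptions, not something from your code; adjust the word list as needed):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilteringTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Illustrative stop-word list; extend it with whatever you need to exclude.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "am", "is", "are"));

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(
                value.toString().replaceAll("[^a-zA-Z0-9]", " ").trim().toLowerCase());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (STOP_WORDS.contains(token)) {
                continue; // stop word: never emitted, so the reducer never sees it
            }
            word.set(token);
            context.write(word, one);
        }
    }
}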
Then, in the reducer, your output will be unsorted, but you are already writing the sum for all words to some output directory, which is a necessary first step towards getting the top words.
The solution to the problem requires you to create a new Job object to read that first output directory, write to a new output directory, and for each line of text in the mapper emit (null, line) as the output (use NullWritable and Text).
With this, in the reducer, all lines of text will be sent into one reducer iterator, so in order to get the Top N items you can create a TreeMap<Integer, String> that sorts words by their count (see "Sorting in descending order: Java Map"). While inserting elements, larger values automatically end up at the top of the tree. You can optionally optimize this by also tracking the smallest element in the tree and only inserting items larger than it, and/or by tracking the tree size and only inserting items larger than the N'th item (this helps if you potentially have hundreds of thousands of words).
After the loop that adds all elements to the tree, take all the top N of the string values and their counts (the tree is already sorted for you), and write them out from the reducer. With that, you should end up with the Top N items.
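To make the shape of that second job concrete, here is a rough sketch (the class names, the tab-separated parsing of the first job's output, and N = 10 are my assumptions; the driver would set this job's number of reduce tasks to 1):

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNWords {

    private static final int N = 10;

    // Send every line of the first job's output to a single reducer.
    public static class LineMapper extends Mapper<Object, Text, NullWritable, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    // The lone reducer sees every "word<TAB>count" line and keeps only the top N.
    public static class TopNReducer extends Reducer<NullWritable, Text, Text, IntWritable> {
        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sorted ascending by count; words with equal counts overwrite each other
            // here -- switch to a TreeMap<Integer, List<String>> if ties matter.
            TreeMap<Integer, String> topWords = new TreeMap<Integer, String>();
            for (Text line : values) {
                String[] parts = line.toString().split("\t");
                topWords.put(Integer.parseInt(parts[1]), parts[0]);
                if (topWords.size() > N) {
                    topWords.remove(topWords.firstKey()); // drop the current smallest count
                }
            }
            // Emit the largest counts first.
            for (Map.Entry<Integer, String> e : topWords.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }
}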
I'm trying to create a plugin where I'm storing some Minecraft items' data along with some properties.
This is my YAML file's content:
rates:
- 391:
    mul: 10000
    store: 5000
- 392:
    mul: 9000
    store: 5000
So it's basically a list of maps of maps (I think so, at least).
This is my Java code where I'm trying to access the key 'mul' of '391':
List<Map<?,?>> rates;
rates = getConfig().getMapList("rates");
for (Map<?,?> mp : rates) {
    Map<?,?> test = (Map<?,?>) mp.get("" + item);
    player.sendMessage(test.toString()); // HERE I get a NullPointerException (and on the following lines if this line weren't here in the first place)
    player.sendMessage("Mul is: " + test.get("mul"));
    player.sendMessage("Store is: " + test.get("store"));
}
As per the suggested answer, here is my test code, where I still get a NullPointerException:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Map;

import net.sourceforge.yamlbeans.YamlException;
import net.sourceforge.yamlbeans.YamlReader;

public class Test {
    public static void main(String[] args) throws FileNotFoundException, YamlException {
        YamlReader reader = new YamlReader(new FileReader("config.yml"));
        Map map = (Map) reader.read();
        Map itemMap = (Map) map.get("391");
        System.out.println(itemMap.get("mul")); // This is where I get the exception now
        System.out.println(itemMap.get("store"));
    }
}
Parsing YAML by hand can be tedious and error-prone. It might be easier to use a library like yamlbeans.
http://yamlbeans.sourceforge.net/
package com.jbirdvegas.q41267676;

import com.esotericsoftware.yamlbeans.YamlReader;

import java.io.StringReader;
import java.util.List;
import java.util.Map;

public class YamlExample {

    public static void main(String[] args) throws Exception {
        String yamlInput =
                "rates:\n" +
                "- 391:\n" +
                "    mul: 10000\n" +
                "    store: 5000\n" +
                "- 392:\n" +
                "    mul: 9000\n" +
                "    store: 5000";

        YamlReader reader = new YamlReader(new StringReader(yamlInput));
        Map map = (Map) reader.read();
        // rates is a list
        List<Map> rates = (List<Map>) map.get("rates");
        // each item in the list is a map
        for (Map itemMap : rates) {
            // each map contains a single entry whose key is [ 391 | 392 ]
            itemMap.forEach((key, value) -> {
                System.out.println("Key: " + key);
                // the value in this map is itself a map
                Map embeddedMap = (Map) value;
                // this map contains the values you want
                System.out.println(embeddedMap.get("mul"));
                System.out.println(embeddedMap.get("store"));
            });
        }
    }
}
This prints:
Key: 391
10000
5000
Key: 392
9000
5000
This is a simple use case, but yamlbeans also provides GSON-like reflective class-model population if that would be better suited to your needs.
Your assumption that the YAML is a list of maps is wrong. At the top level it is a map with a single key-value pair, the key of which is rates and the value of which is a sequence with two elements.
Each of those elements is a mapping with a single key-value pair, the key of which is a number (391, 392) and the value of which is a mapping with two key-value pairs each.
That there is a dash (-) at the left-hand side doesn't imply that at the top level there is a sequence (there are no lists in a YAML file; a list is a construct in your programming language). If a sequence is the value for a specific key, the sequence elements can be at the same indentation level as the key, as in your YAML file.
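Applied to your Test class, that means the lookup has to go through "rates" first. A minimal sketch of that (I am assuming the com.esotericsoftware.yamlbeans package here, which is where yamlbeans' YamlReader lives, and that the scalars come back as strings):

import java.io.FileReader;
import java.util.List;
import java.util.Map;

import com.esotericsoftware.yamlbeans.YamlReader;

public class Test {
    public static void main(String[] args) throws Exception {
        YamlReader reader = new YamlReader(new FileReader("config.yml"));

        // Top level: a map with the single key "rates".
        Map root = (Map) reader.read();

        // "rates" is a sequence; each element is a map with one key (the item id)
        // whose value is the mul/store map.
        List<Map> rates = (List<Map>) root.get("rates");
        for (Map entry : rates) {
            Map props = (Map) entry.get("391");
            if (props != null) {
                System.out.println(props.get("mul"));   // 10000
                System.out.println(props.get("store")); // 5000
            }
        }
    }
}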
I've been trying to debug this error for a while now. Basically, I've confirmed that my reduce class is writing the correct output to its context, but for some reason I'm always getting a zero-byte output file.
My mapper class:
public class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Document t = Jsoup.parse(value.toString());
        String text = t.body().text();

        String[] content = text.split(" ");
        for (String s : content) {
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
My reducer class:
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable i : values) {
            n++;
        }
        if (n > 5) { // Do we need this check?
            context.write(key, new IntWritable(n));
            System.out.println("<" + key + ", " + n + ">");
        }
    }
}
and my driver:
public class FrequencyMain {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(true);

        // setup the job
        Job job = Job.getInstance(conf, "FrequencyCount");
        job.setJarByClass(FrequencyMain.class);

        job.setMapperClass(FrequencyMapper.class);
        job.setCombinerClass(FrequencyReducer.class);
        job.setReducerClass(FrequencyReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
And for some reason "Reduce output records" is always 0:
Job complete: job_local805637130_0001
Counters: 17
Map-Reduce Framework
Spilled Records=250
Map output materialized bytes=1496
Reduce input records=125
Map input records=6
SPLIT_RAW_BYTES=1000
Map output bytes=57249
Reduce shuffle bytes=0
Reduce input groups=75
Combine output records=125
Reduce output records=0
Map output records=5400
Combine input records=5400
Total committed heap usage (bytes)=3606577152
File Input Format Counters
Bytes Read=509446
FileSystemCounters
FILE_BYTES_WRITTEN=385570
FILE_BYTES_READ=2909134
File Output Format Counters
Bytes Written=8
(Assuming that your goal is to print the frequencies of words which occur more than 5 times.)
The current implementation of the combiner totally breaks the semantics of your program. You need to either remove it or reimplement it:
Currently it only passes words with a frequency greater than 5 on to the reducer. The combiner works per mapper, which means, for example, that if only a single document is scheduled into some mapper, then this mapper/combiner won't emit words which occur fewer than 6 times in that document (even if other documents in other mappers have lots of occurrences of these words). You need to remove the n > 5 check in the combiner (but not in the reducer).
Because the reducer's input values are then not necessarily all "ones", you should increment n by the value's amount instead of n++.
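A rough sketch of what that separation could look like (FrequencyCombiner is my name for the new class; it would be registered with job.setCombinerClass(FrequencyCombiner.class), and the two classes go in separate source files):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: only pre-sums partial counts, no filtering.
public class FrequencyCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Reducer: sums the (possibly pre-combined) counts and only then filters.
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable val : values) {
            n += val.get(); // values may be partial sums, not just 1s
        }
        if (n > 5) {
            context.write(key, new IntWritable(n));
        }
    }
}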
My job contains a mapper and a reducer. The reducer computes the gpa and emits key-value pairs where the key is the name of the student and the value is the gpa. How can I make the reducer's output sorted on the value (gpa)?
Reducer code:
public class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        int total = 0;
        for (IntWritable value : values) {
            i++;
            total = total + value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
One way of doing this is to use a secondary sort (see here). The idea is to add the value into the reducer key as well (a composite key), and let Hadoop do the sorting at the output of the map. This requires more changes to your existing design.
Another way (perhaps easier) is, once your current job is done, to feed the output of the first job into a second job and interchange the key and value. In this case the second job only needs a mapper that swaps key and value (with the default identity reducer, so that the shuffle actually sorts), and the output will come out sorted by gpa; see the sketch after the next paragraph. Any repeated students with the same gpa will come out as a list for that gpa.
You can also try to sort the output in the reducer's cleanup method.
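A minimal sketch of such a swap mapper (assuming the first job wrote tab-separated name/gpa lines and that the gpa is an integer, as in your IntWritable output; keys sort ascending by default, so use job.setSortComparatorClass with a descending comparator if you want the highest gpa first):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second job: read "name<TAB>gpa" lines from the first job's output and swap
// them, so the shuffle sorts the records on gpa before they reach the reducer.
public class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String name = parts[0];
        int gpa = Integer.parseInt(parts[1]);
        context.write(new IntWritable(gpa), new Text(name));
    }
}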
Say I have a input as follows:
(1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Output is expected as follows:
(1,(2,3,4)) -> (1,3) //second index is total friend #
(2,(1,3,4)) -> (2,3)
(3,(1,2)) -> (3,2)
(4,(1,2)) -> (4,2)
I know how to do this with a HashSet in Java, but I don't know how it works with the MapReduce model. Can anyone throw any ideas or sample code at this problem? I would appreciate it.
------------------------------------------------------------------------------------
Here is my naive solution: one mapper, two reducers.
The mapper will organize the input (1,2),(2,1),(1,3);
and produce output as
(1,hashset<2>),(2,hashset<1>),(1,hashset<2>),(2,hashset<1>),(1,hashset<3>),(3,hashset<1>).
Reducer1:
takes the mapper's output as input and outputs:
(1,hashset<2,3>), (3,hashset<1>) and (2,hashset<1>)
Reducer2:
takes reducer1's output as input and outputs:
(1,2), (3,1) and (2,1)
This is only my naive solution. I'm not sure if this can be done with Hadoop's code.
I think there should be an easy way to solve this problem.
Mapper Input: (1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Just emit two records for each pair like this:
Mapper Output/ Reducer Input:
Key => Value
1 => 2
2 => 1
2 => 1
1 => 2
1 => 3
3 => 1
3 => 2
2 => 3
2 => 4
4 => 2
4 => 1
1 => 4
At reducer side, you'll get 4 different groups like this:
Reducer Output:
Key => Values
1 => [2,3,4]
2 => [1,3,4]
3 => [1,2]
4 => [1,2]
Now, you are good to format your result as you want. :)
Let me know if anybody sees any issue with this approach.
1) Intro / Problem
Before going ahead with the job driver, it is important to understand that in a simple-minded approach, the values of the reducers should be sorted in an ascending order. The first thought is to pass the value list unsorted and do some sorting in the reducer per key. This has two disadvantages:
1) It is most probably not efficient for large Value Lists
and
2) How will the framework know if (1,4) is equal to (4,1) if these pairs are processed in different parts of the cluster?
2) Solution in theory
The way to do it in Hadoop is to "mock" the framework in a way by creating a synthetic key.
So our map function instead of the "conceptually more appropriate" (if I may say that)
map(k1, v1) -> list(k2, v2)
is the following:
map(k1, v1) -> list(ksynthetic, null)
As you notice we discard the usage of values (the reducer still gets a list of null values but we don't really care about them). What happens here is that these values are actually included in ksynthetic. Here is an example for the problem in question:
map(1, 2) -> list([1,2], null)
However, some more operations need to be done so that the keys are grouped and partitioned appropriately and we achieve the correct result in the reducer.
3) Hadoop Implementation
We will implement a class called FFGroupComparator and a class called FindFriendPartitioner.
Here is our FFGroupComparator:
public static class FFGroupComparator extends WritableComparator
{
    protected FFGroupComparator()
    {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(",");
        String[] t2Items = t2.toString().split(",");
        String t1Base = t1Items[0];
        String t2Base = t2Items[0];
        int comp = t1Base.compareTo(t2Base); // We compare using the "real" key part of our synthetic key
        return comp;
    }
}
This class will act as our grouping comparator class. It controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context). This is very important, as it ensures that each reducer gets the appropriate synthetic keys (judging by the real key).
Due to the fact that Hadoop runs in a cluster with many nodes, it is important to ensure that there are as many reduce tasks as partitions. Their number should be the same as the number of real keys (not synthetic ones). Usually we do this with hash values. In our case, what we need to do is compute the partition a synthetic key belongs to based on the hash value of the real key (before the comma). So our FindFriendPartitioner is as follows:
public static class FindFriendPartitioner extends Partitioner<Text, NullWritable>
{
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions)
    {
        String[] keyItems = key.toString().split(",");
        String keyBase = keyItems[0];
        // Mask the sign bit so the partition index is never negative.
        int part = (keyBase.hashCode() & Integer.MAX_VALUE) % numPartitions;
        return part;
    }
}
So now we are all set to write the actual job and solve our problem.
I am assuming your input file looks like this:
1,2
2,1
1,3
3,2
2,4
4,1
We will use the TextInputFormat.
Here's the code for the job driver using Hadoop 1.0.4:
public class FindFriendTwo
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, NullWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            context.write(value, NullWritable.get());
            String[] tempStrings = value.toString().split(",");
            Text value2 = new Text(tempStrings[1] + "," + tempStrings[0]); // reverse relationship
            context.write(value2, NullWritable.get());
        }
    }
Notice that we also passed the reverse relationships in the map function.
For example if the input string is (1,4) we must not forget (4,1).
    public static class FindFriendReducer extends Reducer<Text, NullWritable, IntWritable, IntWritable> {

        private Set<String> friendsSet;

        @Override
        public void setup(Context context)
        {
            friendsSet = new LinkedHashSet<String>();
        }

        public void reduce(Text syntheticKey, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException
        {
            String[] tempKeys = syntheticKey.toString().split(",");
            friendsSet.add(tempKeys[1]);

            if (friendsSet.size() == 2)
            {
                IntWritable key = new IntWritable(Integer.parseInt(tempKeys[0]));
                IntWritable value = new IntWritable(Integer.parseInt(tempKeys[1]));
                context.write(key, value);
            }
        }
    }
Finally, we must remember to include the following in our Main Class, so that the framework uses our classes.
jobConf.setGroupingComparatorClass(FFGroupComparator.class);
jobConf.setPartitionerClass(FindFriendPartitioner.class);
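For completeness, a driver along these lines might look like the following sketch (I am assuming all four classes above are nested inside FindFriendTwo, as in the snippets; the job name and paths are illustrative, and the old-style Job constructor matches Hadoop 1.0.4):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FindFriendDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "find friends");
        job.setJarByClass(FindFriendTwo.class);

        job.setMapperClass(FindFriendTwo.FindFriendMapper.class);
        job.setReducerClass(FindFriendTwo.FindFriendReducer.class);

        // Wire in the synthetic-key machinery described above.
        job.setGroupingComparatorClass(FindFriendTwo.FFGroupComparator.class);
        job.setPartitionerClass(FindFriendTwo.FindFriendPartitioner.class);

        // The map output types differ from the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}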
I would approach this problem as follows.
Make sure we have all the relations and have them exactly once each.
Then simply count the friends of each person.
Notes on my approach:
My notation for key value pairs is : K -> V
Both key and value are almost always a data structure (not just a string or an int).
I never use the key for data. The key is ONLY there to control the flow from mappers towards the right reducer. In all other places I do not look at the key at all. The framework does require a key everywhere. With '()' I mean to say that there is a key that I ignore completely.
The key point of my approach is that it never needs 'all friends' in memory at the same moment (so it also works in the really big situations).
We start with a lot of
(x,y)
and we know that we do not have all relationships in the dataset.
Mapper: Create all relations
Input: () -> (x,y)
Output: (x,y) -> (x,y)
(y,x) -> (y,x)
Reducer: Remove duplicates (simply only output the first one from the iterator)
Input: (x,y) -> [(x,y),(x,y),(x,y),(x,y),.... ]
Output: () -> (x,y)
Mapper: "Wordcount"
Input: () -> (x,y)
Output: (x) -> (x,1)
Reducer: Count them
Input: (x) -> [(x,1),(x,1),(x,1),(x,1),.... ]
Output: () -> (x,N)
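A rough sketch of what the first (deduplication) job could look like (class names are mine; I key on the pair itself and use NullWritable for the value, since the pair is already in the key; the second job is then an ordinary word-count over the first field of each deduplicated pair):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupRelations {

    // Create all relations: emit every pair in both directions, keyed on the pair.
    public static class RelationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] xy = value.toString().split(",");
            context.write(new Text(xy[0] + "," + xy[1]), NullWritable.get());
            context.write(new Text(xy[1] + "," + xy[0]), NullWritable.get());
        }
    }

    // Remove duplicates: all copies of a pair arrive in one group; keep exactly one.
    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        public void reduce(Text pair, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(pair, NullWritable.get());
        }
    }
}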
Having been helped by so many excellent engineers, I finally tried out the solution.
Only one Mapper and one Reducer. No combiner here.
input of Mapper:
1,2
2,1
1,3
3,1
3,2
3,4
5,1
Output of Mapper:
1,2
2,1
1,2
2,1
1,3
3,1
1,3
3,1
3,2
2,3
4,3
3,4
1,5
5,1
Output Of Reducer:
1 3
2 2
3 3
4 1
5 1
The first column is the user, the second is the number of friends.
At the reducer stage, I added a HashSet to assist the analysis.
Thanks @Artem Tsikiridis @Ashish
Your answers gave me a nice clue.
Edited:
Added Code:
//mapper
public static class TokenizerMapper extends
        Mapper<Object, Text, Text, Text> {

    private Text word1 = new Text();
    private Text word2 = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, ",");
        if (itr.hasMoreElements()) {
            word1.set(itr.nextToken().toLowerCase());
        }
        if (itr.hasMoreElements()) {
            word2.set(itr.nextToken().toLowerCase());
        }
        context.write(word1, word2);
        context.write(word2, word1);
    }
}
//reducer
public static class IntSumReducer extends
        Reducer<Text, Text, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        // Hadoop reuses the Text instance while iterating over the values,
        // so store the String contents in the set rather than the Text object itself.
        HashSet<String> set = new HashSet<String>();
        int sum = 0;
        for (Text val : values) {
            if (!set.contains(val.toString())) {
                set.add(val.toString());
                sum++;
            }
        }
        result.set(sum);
        context.write(key, result);
    }
}