I have a Mapper<AvroKey<Email>, NullWritable, Text, Text> which effectively takes in an Email and multiple times spits out a key of an email address and a value of the field it was found on (from, to, cc, etc).
Then I have a Reducer<Text, Text, NullWritable, Text> that takes in the email address and field name. It spits out a NullWritable key and a count of how many times the address is present in each field, e.g.:
{
"address": "joe.bloggs@gmail.com",
"toCount": 12,
"fromCount": 4
}
I'm using FileUtil.copyMerge to conflate the output from the jobs but (obviously) the results from different reducers aren't merged, so in practice I see:
{
"address": "joe.bloggs@gmail.com",
"toCount": 12,
"fromCount": 0
}, {
"address": "joe.bloggs@gmail.com",
"toCount": 0,
"fromCount": 4
}
Is there a more sensible way of approaching this problem so I can get a single result per email address? (I gather a combiner run in the pre-reduce phase only sees a subset of the data and isn't guaranteed to give the results I want.)
Edit:
Reducer code would be something like:
public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {
    private static final ObjectMapper mapper = new ObjectMapper();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Map<String, Object>> results = new HashMap<>();
        for (Text value : values) {
            if (!results.containsKey(value.toString())) {
                Map<String, Object> result = new HashMap<>();
                result.put("address", key.toString());
                result.put("to", 0);
                result.put("from", 0);
                results.put(value.toString(), result);
            }
            Map<String, Object> result = results.get(value.toString());
            switch (value.toString()) {
                case "TO":
                    result.put("to", ((int) result.get("to")) + 1);
                    break;
                case "FROM":
                    result.put("from", ((int) result.get("from")) + 1);
                    break;
            }
        }
        results.values().forEach(result -> {
            context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
        });
    }
}
Each input key of the reducer corresponds to a unique email address, so you don't need the results collection. Each time the reduce method is called, it is for a distinct email address, so my suggestion is:
public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {
    private static final ObjectMapper mapper = new ObjectMapper();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Object> result = new HashMap<>();
        result.put("address", key.toString());
        result.put("to", 0);
        result.put("from", 0);
        for (Text value : values) {
            switch (value.toString()) {
                case "TO":
                    result.put("to", ((int) result.get("to")) + 1);
                    break;
                case "FROM":
                    result.put("from", ((int) result.get("from")) + 1);
                    break;
            }
        }
        context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
    }
}
I am not sure what the ObjectMapper class does, but I suppose that you need it to format the output. Otherwise, I would print the input key as the output key (i.e., the email address) and two concatenated counts for the "from" and "to" fields of each email address.
If your input is a data collection (i.e., not streams, or something similar), then you should get each email address only once. If your input arrives in streams and you need to incrementally build your final output, then the output of one job can be the input of another. If that is the case, I suggest using MultipleInputs, in which one Mapper is the one that you described earlier, and another, an IdentityMapper, forwards the output of a previous job to the Reducer. This way, again, the same email address is handled by the same reduce task.
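For the MultipleInputs approach, the driver wiring would be something like the fragment below. The paths and the AvroEmailMapper/ForwardingMapper class names are placeholders I've made up, not names from the original code:

```java
// Register two sources: fresh Avro emails go through the original mapper,
// while the previous job's output goes through an identity-style mapper,
// so both streams for an address meet in the same reduce task.
MultipleInputs.addInputPath(job, new Path("fresh-emails/"),
        AvroKeyInputFormat.class, AvroEmailMapper.class);
MultipleInputs.addInputPath(job, new Path("previous-output/"),
        KeyValueTextInputFormat.class, ForwardingMapper.class);
```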
Related
I've been trying to debug this error for a while now. Basically, I've confirmed that my reduce class is writing the correct output to its context, but for some reason I'm always getting a zero-byte output file.
My mapper class:
public class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Document t = Jsoup.parse(value.toString());
String text = t.body().text();
String[] content = text.split(" ");
for (String s : content) {
context.write(new Text(s), new IntWritable(1));
}
}
}
My reducer class:
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int n = 0;
for (IntWritable i : values) {
n++;
}
if (n > 5) { // Do we need this check?
context.write(key, new IntWritable(n));
System.out.println("<" + key + ", " + n + ">");
}
}
}
and my driver:
public class FrequencyMain {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration(true);
// setup the job
Job job = Job.getInstance(conf, "FrequencyCount");
job.setJarByClass(FrequencyMain.class);
job.setMapperClass(FrequencyMapper.class);
job.setCombinerClass(FrequencyReducer.class);
job.setReducerClass(FrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And for some reason "Reduce output records" is always 0:
Job complete: job_local805637130_0001
Counters: 17
Map-Reduce Framework
Spilled Records=250
Map output materialized bytes=1496
Reduce input records=125
Map input records=6
SPLIT_RAW_BYTES=1000
Map output bytes=57249
Reduce shuffle bytes=0
Reduce input groups=75
Combine output records=125
Reduce output records=0
Map output records=5400
Combine input records=5400
Total committed heap usage (bytes)=3606577152
File Input Format Counters
Bytes Read=509446
FileSystemCounters
FILE_BYTES_WRITTEN=385570
FILE_BYTES_READ=2909134
File Output Format Counters
Bytes Written=8
(Assuming that your goal is to print word frequencies which have frequencies > 5)
The current implementation of the combiner totally breaks the semantics of your program. You need either to remove it or reimplement it:
Currently it passes on to the reducer only those words which have frequencies of more than 5. The combiner works per-mapper; this means, for example, that if only a single document is scheduled into some mapper, then that mapper/combiner won't emit words which have frequencies in this document of less than 6 (even if other documents in other mappers have lots of occurrences of these words). You need to remove the check n > 5 in the combiner (but not in the reducer).
Because the reducer's input values are then not necessarily all ones (the combiner emits partial counts), you should increment n by the value's amount (n += i.get()) instead of n++.
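The distinction can be sketched outside Hadoop; the class and method names below are mine, not from the job above. Once a combiner runs, the reducer receives partial counts (e.g. 3 and 4 for one word) rather than a stream of ones, so counting the values gives the wrong answer while summing them gives the right one:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSemantics {
    // The buggy version: counts how many values arrived (n++).
    static int countValues(List<Integer> values) {
        int n = 0;
        for (int ignored : values) n++;
        return n;
    }

    // The corrected version: sums the partial counts (n += value).
    static int sumValues(List<Integer> values) {
        int n = 0;
        for (int v : values) n += v;
        return n;
    }

    public static void main(String[] args) {
        // two combiners each pre-aggregated occurrences of the same word
        List<Integer> partials = Arrays.asList(3, 4);
        System.out.println(countValues(partials)); // 2 -- wrong frequency
        System.out.println(sumValues(partials));   // 7 -- true frequency
    }
}
```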
I need to find the most common key emitted by Mapper in the Reducer. My reducer works fine in this way:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points= new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#"); //format of value from mapper: "Key#1.2345"
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value)); //finds the K smallest distances
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
for (Text t : k_closest_points.values()) //it perfectly emits the K smallest distances and keys
context.write(NullWritable.get(), t);
}
}
It finds the K instances with the smallest distances and writes to the output file. But I need to find the most common key in my TreeMap. So I'm trying it like below:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value));
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String[] tmp = value.toString().split("#");
if (class_counts.containsKey(tmp[0]))
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
else
class_counts.put(tmp[0], 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Then I get this error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at KNN$MyReducer.reduce(KNN.java:108)
at KNN$MyReducer.reduce(KNN.java:98)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
Can you please help me to fix this?
A few things... first, your problem is here:
double distance = Double.parseDouble(v[1]);
You're splitting on "#" and it may not be in the string. If it's not, it will throw the OutOfBoundsException. I would add a clause like:
if(v.length < 2)
continue;
Second, there's a parenthesis issue: tmp is a String[], and yet here you're actually concatenating '1' onto the string inside the get, so the lookup key is wrong and the count is never incremented:
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
It should be:
class_counts.put(tmp[0], class_counts.get(tmp[0]) + 1);
It's also expensive to look the key up twice in a potentially large Map. Here's how I'd re-write your reducer based on what you've given us (this is totally untested):
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
if(v.length < 2)
continue; // consider adding an enum counter
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(v[0])); // you've already split once, why do it again later?
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
// exit early if nothing found
if(k_closest_points.isEmpty())
return;
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String tmp = value.toString();
Integer current_count = class_counts.get(tmp);
if (null != current_count) // avoid second lookup
class_counts.put(tmp, current_count + 1);
else
class_counts.put(tmp, 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Next, and more semantically, you're performing a KNN operation using a TreeMap as your data structure of choice. While this makes sense in that it internally stores keys in sorted order, it doesn't make sense to use a Map for an operation that will almost certainly be required to break ties. Here's why:
int k = 2;
TreeMap<Double, Text> map = new TreeMap<>();
map.put(1.0, new Text("close"));
map.put(1.0, new Text("equally close"));
map.put(1500.0, new Text("super far"));
// ... your popping logic...
Which are the two closest points you've retained? "equally close" and "super far". This is because a Map can't hold two entries with the same key: the second put overwrites the first. Thus, your algorithm is incapable of breaking ties. There are a few things you could do to fix that:
First, if you're set on performing this operation in the Reducer and you know your incoming data will not cause an OutOfMemoryError, consider using a different sorted structure, like a TreeSet and build a custom Comparable object that it will sort:
static class KNNEntry implements Comparable<KNNEntry> {
final Text text;
final Double dist;
KNNEntry(Text text, Double dist) {
this.text = text;
this.dist = dist;
}
@Override
public int compareTo(KNNEntry other) {
int comp = this.dist.compareTo(other.dist);
if(0 == comp)
return this.text.compareTo(other.text);
return comp;
}
}
And then instead of your TreeMap, use a TreeSet<KNNEntry>, which will keep itself sorted based on the compareTo logic we just built above. Then, after you've gone through all the values, just iterate through the first k, retaining them in order. This has a drawback, though: if your data is truly big, you can overflow the heap space by loading all of the values from the reducer into memory.
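Here's a quick self-contained check of the tie-breaking behavior, using plain String labels in place of Text and a hypothetical kNearest helper of my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class KnnTieBreak {
    // Comparable entry: ties on distance are broken by label, so two
    // equally-distant neighbors can coexist in a TreeSet.
    static class KNNEntry implements Comparable<KNNEntry> {
        final String label;
        final double dist;
        KNNEntry(String label, double dist) { this.label = label; this.dist = dist; }
        @Override
        public int compareTo(KNNEntry other) {
            int comp = Double.compare(this.dist, other.dist);
            return comp != 0 ? comp : this.label.compareTo(other.label);
        }
    }

    // Keep the k nearest entries, dropping the farthest as we go.
    static List<String> kNearest(int k, KNNEntry... entries) {
        TreeSet<KNNEntry> set = new TreeSet<>();
        for (KNNEntry e : entries) {
            set.add(e);
            if (set.size() > k) set.pollLast();
        }
        List<String> labels = new ArrayList<>();
        for (KNNEntry e : set) labels.add(e.label);
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(kNearest(2,
                new KNNEntry("close", 1.0),
                new KNNEntry("equally close", 1.0), // no longer overwritten
                new KNNEntry("super far", 1500.0)));
        // prints [close, equally close]
    }
}
```

Unlike the TreeMap version, both points at distance 1.0 survive, and "super far" is correctly evicted.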
Second option: make the KNNEntry we built above implement WritableComparable, and emit that from your Mapper, then use secondary sorting to handle the sorting of your entries. This gets a bit more hairy, as you'd have to use lots of mappers and then only one reducer to capture the first k. If your data is small enough, try the first option to allow for tie breaking.
But, back to your original question, you're getting an OutOfBoundsException because the index you're trying to access does not exist, i.e., there is no "#" in the input String.
I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge; in fact, this is a triangle graph):
1 3
3 1
3 2
2 3
Now, I want to use MapReduce to count the triangles in this graph (obviously one). It is still work in progress and in the first phase, I try to get a list of all neighbors for each vertex.
My main class looks like the following:
public class TriangleCount {
public static void main( String[] args ) throws Exception {
// remove the old output directory
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path("output/"), true);
JobConf firstPhaseJob = new JobConf(FirstPhase.class);
firstPhaseJob.setOutputKeyClass(IntWritable.class);
firstPhaseJob.setOutputValueClass(IntWritable.class);
firstPhaseJob.setMapperClass(FirstPhase.Map.class);
firstPhaseJob.setCombinerClass(FirstPhase.Reduce.class);
firstPhaseJob.setReducerClass(FirstPhase.Reduce.class);
FileInputFormat.setInputPaths(firstPhaseJob, new Path("input/"));
FileOutputFormat.setOutputPath(firstPhaseJob, new Path("output/"));
JobClient.runJob(firstPhaseJob);
}
}
My Mapper and Reducer implementations look like this, they are both very easy:
public class FirstPhase {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
@Override
public void map(LongWritable longWritable, Text graphLine, OutputCollector<IntWritable, IntWritable> outputCollector, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(graphLine.toString());
int n1 = Integer.parseInt(tokenizer.nextToken());
int n2 = Integer.parseInt(tokenizer.nextToken());
if(n1 > n2) {
System.out.println("emitting (" + new IntWritable(n1) + ", " + new IntWritable(n2) + ")");
outputCollector.collect(new IntWritable(n1), new IntWritable(n2));
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, Text> {
@Override
public void reduce(IntWritable key, Iterator<IntWritable> iterator, OutputCollector<IntWritable, Text> outputCollector, Reporter reporter) throws IOException {
List<IntWritable> nNodes = new ArrayList<>();
while(iterator.hasNext()) {
nNodes.add(iterator.next());
}
System.out.println("key: " + key + ", list: " + nNodes);
// create pairs and emit these
for(IntWritable n1 : nNodes) {
for(IntWritable n2 : nNodes) {
outputCollector.collect(key, new Text(n1.toString() + " " + n2.toString()));
}
}
}
}
}
I've added some logging to the program. In the map phase, I print which pairs I'm emitting. In the reduce phase, I print the input of the reduce. I get the following output:
emitting (3, 1)
emitting (3, 2)
key: 3, list: [1, 1]
The input for the reduce function is not what I expect. I expect it to be [1, 2] and not [1, 1]. I believe that Hadoop automatically combines all my emitted pairs from the output of the map phase but am I missing something here? Any help or explanation would be appreciated.
This is a typical problem for people beginning with Hadoop MapReduce.
The problem is in your reducer. When looping through the given Iterator<IntWritable>, each IntWritable instance is re-used, so it only keeps one instance around at a given time.
That means when you call iterator.next() your first saved IntWritable instance is set with the new value.
You can read more about this problem here
https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
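The pitfall can be reproduced without Hadoop. The toy class below (names are my own) models a framework that hands back the same mutable instance on every next(): stored references all show the last value unless each one is copied, which is why saving `new IntWritable(i.get())` (or `new Text(value)`) fixes it:

```java
import java.util.ArrayList;
import java.util.List;

public class ReusePitfall {
    static class MutableInt { int value; }

    // Simulates a reducer saving what the "iterator" yields.
    // copy=false stores the shared reference (the pitfall);
    // copy=true stores a defensive copy of the value (the fix).
    static List<Integer> collect(int[] data, boolean copy) {
        MutableInt shared = new MutableInt();        // one reused instance
        List<MutableInt> saved = new ArrayList<>();
        for (int d : data) {
            shared.value = d;                        // next() overwrites it
            if (copy) {
                MutableInt fresh = new MutableInt(); // like new IntWritable(i.get())
                fresh.value = shared.value;
                saved.add(fresh);
            } else {
                saved.add(shared);                   // same reference every time
            }
        }
        List<Integer> out = new ArrayList<>();
        for (MutableInt m : saved) out.add(m.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(new int[]{1, 2}, false)); // [2, 2] -- the pitfall
        System.out.println(collect(new int[]{1, 2}, true));  // [1, 2] -- copies fix it
    }
}
```

This is exactly the [1, 1] symptom above: the reducer saved the same IntWritable twice, and both references ended up showing the last value read.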
After a map-side join, the data I am getting in the Reducer is:
key------ book
values
6
eraser=>book 2
pen=>book 4
pencil=>book 5
What I basically want to do is
eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6
What I initially did is like
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    System.out.println("key------ " + key);
    System.out.println("Values");
    for (Text value : values) {
        System.out.println("\t" + value.toString());
        String v = value.toString();
        double BsupportCnt = 0;
        double UsupportCnt = 0;
        double res = 0;
        if (!v.contains("=>")) {
            BsupportCnt = Double.parseDouble(v);
        } else {
            String parts[] = v.split(" ");
            UsupportCnt = Double.parseDouble(parts[1]);
        }
        // calculate here
        res = UsupportCnt / BsupportCnt;
    }
}
If the incoming data is as above, this works fine.
But if the incoming data from the mapper is:
key------ book
values
eraser=>book 2
pen=>book 4
pencil=>book 5
6
This won't work.
Otherwise I need to store all the "=>" values in a List (if the incoming data is large, the list may cause a heap space error), and do the calculation once I get the number.
UPDATE
As Vefthym suggested, I did a secondary sort of the values before they reach the reducer.
I used htuple to do this.
I referred to this link.
Mapper 1 emits eraser=>book 2 as the value:
public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configurations and other stuffs
//allWords is an ArrayList
if(allWords.size()<=2)
{
Tuple outputKey = new Tuple();
String LHS1 = allWords.get(1);
String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
outputKey.set(TupleFields.ALPHA, RHS1);
context.write(new Text(LHS1), outputKey);
}
//other stuffs
Mapper 2 emits numbers as values:
public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
Text valEmit = new Text();
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configuration and other stuffs
if(cnt != supCnt && cnt < supCnt){
System.out.println("emit");
Tuple outputKey = new Tuple();
outputKey.set(TupleFields.NUMBER, value);
System.out.println("v---"+value);
System.out.println("outputKey.toString()---"+outputKey.toString());
context.write(key, outputKey);
}
In the Reducer I simply tried to print the key and values, but this threw an error:
Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
The error is at context.write(new Text(LHS1), outputKey); from AprioriItemMapper1.java:49, but the printed details above are from Mapper 2.
Is there any better way to do this? Please suggest.
I would suggest using secondary sorting, which would guarantee that the first value (sorted lexicographically) is a numeric one, supposing there are no words starting with a number.
If this cannot work, then, bearing in mind the scalability limitations that you mention, I would store the reducer's values in a HashMap<String,Double> buffer, with keys being the left parts of "=>" and values being their numeric counts.
You can store the values until you get the value of the denominator BsupportCnt. Then you can emit all the buffer's contents with the correct score, and all the remaining values as they come, one by one, without needing the buffer again (since you now know the denominator). Something like this:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
Map<String,Double> buffer = new HashMap<>();
double BsupportCnt = 0;
double UsupportCnt;
double res;
for(Text value : values){
String v = value.toString();
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
} else {
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
if (BsupportCnt != 0) { //no need to add things to the buffer any more
res = UsupportCnt/BsupportCnt;
context.write(new Text(v), new DoubleWritable(res));
} else {
buffer.put(parts[0], UsupportCnt);
}
}
}
//now emit the buffer's contents
for (Map.Entry<String,Double> entry : buffer.entrySet()) {
context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue()/BsupportCnt));
}
}
You could gain some more space by storing only the left part of "=>" as keys of the HashMap, as the right part is always the reducer's input key.
In my MapReduce program, I have a reducer function which counts the number of items in an Iterator of Text values and then, for each item in the iterator, outputs the item as key and the count as value. Thus I need to use the iterator twice. But once the iterator has reached the end, I cannot start iterating from the beginning again. How do I solve this problem?
I tried the following code for my reduce function:
public static class ReduceA extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text>output, Reporter reporter) throws IOException
{
Text t;
int count =0;
String[] attr = key.toString().split(",");
while(values.hasNext())
{
values.next();
count++;
}
//Maybe i need to reset my iterator here and start from the beginning but how do i do it?
String v=Integer.toString(count);
while(values.hasNext())
{
t=values.next();
output.collect(t,new Text(v));
}
}
}
The above code produced empty results. I tried inserting the values of the iterator into a list, but since I need to deal with many GBs of data, I get a Java heap space error from using the list. Please help me modify my code so that I can traverse the iterator twice.
You could always do it the simple way: declare a List and cache the values as you iterate through the first time, then iterate through your List and write out your output. You should have something similar to this:
public static class ReduceA extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text t;
int count = 0;
String[] attr = key.toString().split(",");
List<Text> cache = new ArrayList<Text>();
while (values.hasNext()) {
cache.add(values.next());
count++;
}
// Maybe i need to reset my iterator here and start from the beginning
// but how do i do it?
String v = Integer.toString(count);
for (Text text : cache) {
output.collect(text, new Text(v));
}
}
}