I'm trying to make two keys from my dataset, which has two columns of numbers separated by a tab. I know how to make one key/value pair, but I'm not sure how to make a second one. In essence, I want to emit a key/value pair for each of the two columns, and then in the reducer take the difference of the counts for each key.
Here's what I have for the mapper part:
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable one = new IntWritable(1);
    private Text nodeX = new Text();
    private Text nodeY = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        String[] data = value.toString().split("\\t");
        String node0 = data[0];
        String node1 = data[1];
        nodeX.set(node0);
        context.write(nodeX, one);
        nodeY.set(node1);
        context.write(nodeY, one);
    }
}
Here's the reducer:
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
        int sum0 = 0;
        for (IntWritable val : values) {
            sum0 += val.get();
        }
        int sum1 = 0;
        for (IntWritable val : values) {
            sum1 += val.get();
        }
        int diff = sum0 - sum1;
        result.set(diff);
        context.write(key, result);
    }
}
I think I did something wrong in the part where the data is passed from the mapper to the reducer; maybe I need two keys. I'm new to Java and not sure how to get this right.
My input data looks like this:
123 543
123 234
543 135
135 123
And I would like the output to be the following, where I'm taking the difference between the number of occurrences of each key in column 1 and its number of occurrences in column 2:
123 1
543 0
135 0
234 -1
I think you want to split each line into words, treat each word as a number, and then calculate the difference. You can use NLineInputFormat, where the key is the row number, then split the value and calculate. Otherwise, you can define a static long field to track the row number.
public static class TokenizerMapper extends
Mapper<LongWritable, Text, LongWritable, IntWritable>
{
private IntWritable diffen = new IntWritable();
private static long row_num= 0;
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("\\t");
String node0 = data[0];
String node1 = data[1];
int dif = Integer.parseInt(node1)-Integer.parseInt(node0);
diffen.set(dif);
row_num++;
context.write(new LongWritable(row_num), diffen);
}
}
You can also write the whole value to the reducer, split it into two parts there, and calculate the difference. Either way works.
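Another way to get the per-key difference the original question asks for is to have the mapper emit a signed count: +1 when a node appears in the first column and -1 when it appears in the second, so a plain summing reducer yields the difference directly. A minimal, untested sketch (class names are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColumnCountDifference {

    public static class SignedCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final IntWritable plusOne = new IntWritable(1);
        private final IntWritable minusOne = new IntWritable(-1);
        private final Text node = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] data = value.toString().split("\\t");
            node.set(data[0]);
            context.write(node, plusOne);   // appearance in column 1 counts +1
            node.set(data[1]);
            context.write(node, minusOne);  // appearance in column 2 counts -1
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // (count in column 1) - (count in column 2)
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
With this shape, the reducer never has to iterate the values twice, and for the sample input above it would produce 123 1, 543 0, 135 0, 234 -1 (possibly in a different order).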
I need to find the most common key emitted by Mapper in the Reducer. My reducer works fine in this way:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points= new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#"); //format of value from mapper: "Key#1.2345"
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value)); //finds the K smallest distances
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
for (Text t : k_closest_points.values()) //it perfectly emits the K smallest distances and keys
context.write(NullWritable.get(), t);
}
}
It finds the K instances with the smallest distances and writes to the output file. But I need to find the most common key in my TreeMap. So I'm trying it like below:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value));
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String[] tmp = value.toString().split("#");
if (class_counts.containsKey(tmp[0]))
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
else
class_counts.put(tmp[0], 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Then I get this error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at KNN$MyReducer.reduce(KNN.java:108)
at KNN$MyReducer.reduce(KNN.java:98)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
Can you please help me to fix this?
A few things... first, your problem is here:
double distance = Double.parseDouble(v[1]);
You're splitting on "#" and it may not be in the string. If it's not, it will throw the ArrayIndexOutOfBoundsException. I would add a clause like:
if(v.length < 2)
continue;
Second, there's a parenthesis issue: tmp[0] is a String, and you're concatenating "1" to it inside the get call, so you look up a key that doesn't exist instead of adding 1 to the count:
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
It should be:
class_counts.put(tmp[0], class_counts.get(tmp[0]) + 1);
It's also expensive to look the key up twice in a potentially large Map. Here's how I'd re-write your reducer based on what you've given us (this is totally untested):
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
if(v.length < 2)
continue; // consider adding an enum counter
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(v[0])); // you've already split once, why do it again later?
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
// exit early if nothing found
if(k_closest_points.isEmpty())
return;
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String tmp = value.toString();
Integer current_count = class_counts.get(tmp);
if (null != current_count) // avoid second lookup
class_counts.put(tmp, current_count + 1);
else
class_counts.put(tmp, 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Next, and more semantically, you're performing a KNN operation using a TreeMap as your data structure of choice. While this makes sense in that it internally stores keys in sorted order, it doesn't make sense to use a Map for an operation that will almost certainly need to break ties. Here's why:
int k = 2;
TreeMap<Double, Text> map = new TreeMap<>();
map.put(1.0, new Text("close"));
map.put(1.0, new Text("equally close"));
map.put(1500.0, new Text("super far"));
// ... your popping logic...
Which are the two closest points you've retained? "equally close" and "super far". This is because a Map can't hold two entries with the same key. Thus, your algorithm is incapable of breaking ties. There are a few things you could do to fix that:
First, if you're set on performing this operation in the Reducer and you know your incoming data will not cause an OutOfMemoryError, consider using a different sorted structure, like a TreeSet and build a custom Comparable object that it will sort:
static class KNNEntry implements Comparable<KNNEntry> {
final Text text;
final Double dist;
KNNEntry(Text text, Double dist) {
this.text = text;
this.dist = dist;
}
@Override
public int compareTo(KNNEntry other) {
int comp = this.dist.compareTo(other.dist);
if(0 == comp)
return this.text.compareTo(other.text);
return comp;
}
}
And then instead of your TreeMap, use a TreeSet<KNNEntry>, which will internally sort itself based on the compareTo logic we just built above. Then, after you've gone through all the keys, just iterate through the first k, retaining them in order. This has a drawback, though: if your data is truly big, you can overflow the heap space by loading all of the values from the reducer into memory.
Second option: make the KNNEntry we built above implement WritableComparable, and emit that from your Mapper, then use secondary sorting to handle the sorting of your entries. This gets a bit more hairy, as you'd have to use lots of mappers and then only one reducer to capture the first k. If your data is small enough, try the first option to allow for tie breaking.
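A minimal, untested sketch of what that entry might look like as a WritableComparable (names are illustrative; a complete secondary-sort setup would also need a partitioner and a grouping comparator, as described above):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class KNNEntryWritable implements WritableComparable<KNNEntryWritable> {
    private Text text = new Text();
    private DoubleWritable dist = new DoubleWritable();

    public KNNEntryWritable() {}                    // Hadoop needs a no-arg constructor

    public KNNEntryWritable(String text, double dist) {
        this.text.set(text);
        this.dist.set(dist);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        dist.write(out);
        text.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        dist.readFields(in);
        text.readFields(in);
    }

    @Override
    public int compareTo(KNNEntryWritable other) {
        int comp = Double.compare(dist.get(), other.dist.get()); // primary: distance
        return comp != 0 ? comp : text.compareTo(other.text);    // tie-break: the text
    }
}
If this is used as a map output key, you would also want hashCode and equals consistent with compareTo so the default partitioner and grouping behave sensibly.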
But, back to your original question, you're getting an OutOfBoundsException because the index you're trying to access does not exist, i.e., there is no "#" in the input String.
I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge; in fact, this is a triangle graph):
1 3
3 1
3 2
2 3
Now, I want to use MapReduce to count the triangles in this graph (obviously there is one). It is still a work in progress; in the first phase, I try to get a list of all neighbors for each vertex.
My main class looks like the following:
public class TriangleCount {
public static void main( String[] args ) throws Exception {
// remove the old output directory
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path("output/"), true);
JobConf firstPhaseJob = new JobConf(FirstPhase.class);
firstPhaseJob.setOutputKeyClass(IntWritable.class);
firstPhaseJob.setOutputValueClass(IntWritable.class);
firstPhaseJob.setMapperClass(FirstPhase.Map.class);
firstPhaseJob.setCombinerClass(FirstPhase.Reduce.class);
firstPhaseJob.setReducerClass(FirstPhase.Reduce.class);
FileInputFormat.setInputPaths(firstPhaseJob, new Path("input/"));
FileOutputFormat.setOutputPath(firstPhaseJob, new Path("output/"));
JobClient.runJob(firstPhaseJob);
}
}
My Mapper and Reducer implementations look like this, they are both very easy:
public class FirstPhase {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
@Override
public void map(LongWritable longWritable, Text graphLine, OutputCollector<IntWritable, IntWritable> outputCollector, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(graphLine.toString());
int n1 = Integer.parseInt(tokenizer.nextToken());
int n2 = Integer.parseInt(tokenizer.nextToken());
if(n1 > n2) {
System.out.println("emitting (" + new IntWritable(n1) + ", " + new IntWritable(n2) + ")");
outputCollector.collect(new IntWritable(n1), new IntWritable(n2));
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, Text> {
@Override
public void reduce(IntWritable key, Iterator<IntWritable> iterator, OutputCollector<IntWritable, Text> outputCollector, Reporter reporter) throws IOException {
List<IntWritable> nNodes = new ArrayList<>();
while(iterator.hasNext()) {
nNodes.add(iterator.next());
}
System.out.println("key: " + key + ", list: " + nNodes);
// create pairs and emit these
for(IntWritable n1 : nNodes) {
for(IntWritable n2 : nNodes) {
outputCollector.collect(key, new Text(n1.toString() + " " + n2.toString()));
}
}
}
}
}
I've added some logging to the program. In the map phase, I print which pairs I'm emitting. In the reduce phase, I print the input of the reduce. I get the following output:
emitting (3, 1)
emitting (3, 2)
key: 3, list: [1, 1]
The input for the reduce function is not what I expect. I expect it to be [1, 2] and not [1, 1]. I believe that Hadoop automatically combines all my emitted pairs from the output of the map phase but am I missing something here? Any help or explanation would be appreciated.
This is a typical problem for people beginning with Hadoop MapReduce.
The problem is in your reducer. When looping through the given Iterator<IntWritable>, the IntWritable instance is re-used, so only one instance is kept around at any given time.
That means when you call iterator.next() your first saved IntWritable instance is set with the new value.
You can read more about this problem here
https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
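As a sketch of how that fix would look in the reducer from the question (untested): copy each value into a fresh IntWritable before caching it, so the cached entries are not all overwritten by the single reused instance.
@Override
public void reduce(IntWritable key, Iterator<IntWritable> iterator,
                   OutputCollector<IntWritable, Text> outputCollector, Reporter reporter) throws IOException {
    List<IntWritable> nNodes = new ArrayList<>();
    while (iterator.hasNext()) {
        // copy the value: the iterator hands back the same reused IntWritable each time
        nNodes.add(new IntWritable(iterator.next().get()));
    }
    for (IntWritable n1 : nNodes) {
        for (IntWritable n2 : nNodes) {
            outputCollector.collect(key, new Text(n1.toString() + " " + n2.toString()));
        }
    }
}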
I'm trying to sort the data by value.
The method I use is to combine the key and value into a composite key,
e.g. (key,value) -> ({key,value},value),
and define my own KeyComparator, which compares the value part of the key.
My data is a paragraph whose words I need to count.
I ran two jobs: the first one does the word count, but combines the key into a composite key in the reducer.
this is the result
is,4 4
the,15 15
ECA,1 1
to,6 6
.....
and in the second job, I try to use the composite key to sort by the value
this is my mapper2
public static class Map2 extends MapReduceBase
implements Mapper<LongWritable,Text,Text,IntWritable>{
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
String w1[] = line.split("\t");
word.set(w1[0]);
output.collect(word,new IntWritable(Integer.valueOf(w1[1])));
}
}
And here is my KeyComparator:
public static final class KeyComparator extends WritableComparator {
public KeyComparator(){
super(Text.class,true);
}
@Override
public int compare(WritableComparable tp1, WritableComparable tp2) {
Text t1 = (Text)tp1;
Text t2 = (Text)tp2;
String a[] = t1.toString().split(",");
String b[] = t2.toString().split(",");
return a[1].compareTo(b[1]);
}
}
this is my reducer2
public static class Reduce2 extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
int sum=0;
while( values.hasNext()){
sum+= values.next().get();
}
//String cpKey[] = key.toString().split(",");
Text outputKey = new Text();
//outputKey.set(cpKey[0]);
output.collect(key, new IntWritable(sum));
}
}
here is my main function
public static void main(String[] args) throws Exception {
int reduceTasks = 1;
int mapTasks = 3;
System.out.println("1. New JobConf...");
JobConf conf = new JobConf(WordCountV2.class);
conf.setJobName("WordCount");
System.out.println("2. Setting output key and value...");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
System.out.println("3. Setting Mapper and Reducer classes...");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
// set numbers of reducers
System.out.println("4. Setting number of reduce and map tasks...");
conf.setNumReduceTasks(reduceTasks);
conf.setNumMapTasks(mapTasks);
System.out.println("5. Setting input and output formats...");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
System.out.println("6. Setting input and output paths...");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
String TempDir = "temp" + Integer.toString(new Random().nextInt(1000)+1);
FileOutputFormat.setOutputPath(conf, new Path(TempDir));
//FileOutputFormat.setOutputPath(conf,new Path(args[1]));
System.out.println("7. Running job...");
JobClient.runJob(conf);
JobConf sort = new JobConf(WordCountV2.class);
sort.setJobName("sort");
sort.setMapOutputKeyClass(Text.class);
sort.setMapOutputValueClass(IntWritable.class);
sort.setOutputKeyComparatorClass(KeyComparator.class);
sort.setMapperClass(Map2.class);
sort.setReducerClass(Reduce2.class);
sort.setNumReduceTasks(reduceTasks);
sort.setNumMapTasks(mapTasks);
sort.setInputFormat(TextInputFormat.class);
sort.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(sort,TempDir);
FileOutputFormat.setOutputPath(sort, new Path(args[1]));
JobClient.runJob(sort);
}
But the result looks something like this:
is 13
the 32
ECA 21
to 14
.
.
.
and many words are lost.
But if I don't use my KeyComparator,
it returns the unsorted result, just like the first one I mentioned.
Any ideas how to solve this problem? Thanks!
I'm not sure where you are making the mistake.
But what you are trying to do is called secondary sort: sorting based on value.
It's not a trivial job; you need to create additional classes for partitioning, grouping, and other steps, which is clearly explained Here and Here.
Just following the instructions in those blogs will surely help you.
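As one illustration of a piece that usually needs attention (a sketch under assumptions, not the complete secondary-sort solution): the comparator has to compare the count numerically, whereas the KeyComparator above compares strings, so "15" sorts before "4". Assuming composite keys of the form "word,count":
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public final class NumericValueComparator extends WritableComparator {
    public NumericValueComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // composite keys look like "word,count"; compare the count part as a number
        int countA = Integer.parseInt(a.toString().split(",")[1]);
        int countB = Integer.parseInt(b.toString().split(",")[1]);
        return Integer.compare(countB, countA); // descending by count
    }
}
On its own this is not enough: as the linked posts explain, you also need a partitioner and a grouping comparator so that composite keys belonging to the same word are handled consistently.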
After a map-side join, the data I am getting in the reducer is:
key------ book
values
6
eraser=>book 2
pen=>book 4
pencil=>book 5
What I basically want to do is
eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6
What I initially did is something like this:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
System.out.println("key------ "+key);
System.out.println("Values");
for(Text value : values){
System.out.println("\t"+value.toString());
String v = value.toString();
double BsupportCnt = 0;
double UsupportCnt = 0;
double res = 0;
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
}
else{
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
}
// calculate here
res = UsupportCnt/BsupportCnt;
}
If the incoming data arrives in the order shown above, this works fine.
But if the incoming data from the mapper is:
key------ book
values
eraser=>book 2
pen=>book 4
pencil=>book 5
6
this won't work.
Otherwise I would need to store all the "=>" values in a List (if the incoming data is large, the list may exhaust the heap space) and do the calculation once I get the number.
UPDATE
As Vefthym suggested, I tried secondary sorting the values before they reach the reducer.
I used htuple to do this.
I referred to this link.
Mapper1 emits eraser=>book 2 as the value.
So:
public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configurations and other stuffs
//allWords is an ArrayList
if(allWords.size()<=2)
{
Tuple outputKey = new Tuple();
String LHS1 = allWords.get(1);
String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
outputKey.set(TupleFields.ALPHA, RHS1);
context.write(new Text(LHS1), outputKey);
}
//other stuffs
Mapper2 emits numbers as the value:
public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
Text valEmit = new Text();
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configuration and other stuffs
if(cnt != supCnt && cnt < supCnt){
System.out.println("emit");
Tuple outputKey = new Tuple();
outputKey.set(TupleFields.NUMBER, value);
System.out.println("v---"+value);
System.out.println("outputKey.toString()---"+outputKey.toString());
context.write(key, outputKey);
}
In the reducer, I simply tried to print the key and values.
But this threw an error:
Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
The error is at context.write(new Text(LHS1), outputKey); from AprioriItemMapper1.java:49,
but the printed details above are from Mapper 2.
Is there a better way to do this?
Please suggest.
I would suggest using secondary sorting, which would guarantee that the first value (sorted lexicographically) is a numeric one, supposing there are no words starting with a number.
If this cannot work, then, bearing in mind the scalability limitations that you mention, I would store the reducer's values in a HashMap<String,Double> buffer, with keys being the left parts of "=>" and values being their numeric values.
You can store the values until you get the value of the denominator BsupportCnt. Then you can emit all the buffer's contents with the correct score, as well as all the remaining values as they come one by one, without the need to use the buffer again (since you now know the denominator). Something like this:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
Map<String,Double> buffer = new HashMap<>();
double BsupportCnt = 0;
double UsupportCnt;
double res;
for(Text value : values){
String v = value.toString();
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
} else {
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
if (BsupportCnt != 0) { //no need to add things to the buffer any more
res = UsupportCnt/BsupportCnt;
context.write(new Text(v), new DoubleWritable(res));
} else {
buffer.put(parts[0], UsupportCnt);
}
}
}
//now emit the buffer's contents
for (Map.Entry<String,Double> entry : buffer.entrySet()) {
context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue()/BsupportCnt));
}
}
You could gain some more space by storing only the left part of "=>" as keys of the HashMap, as the right part is always the reducer's input key.
In my MapReduce program, I have a reducer function which counts the number of items in an Iterator of Text values and then, for each item in the iterator, outputs the item as the key and the count as the value. Thus I need to use the iterator twice, but once the iterator has reached the end, I cannot start iterating from the beginning again. How do I solve this problem?
I tried the following code for my reduce function:
public static class ReduceA extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text>output, Reporter reporter) throws IOException
{
Text t;
int count =0;
String[] attr = key.toString().split(",");
while(values.hasNext())
{
values.next();
count++;
}
//Maybe i need to reset my iterator here and start from the beginning but how do i do it?
String v=Integer.toString(count);
while(values.hasNext())
{
t=values.next();
output.collect(t,new Text(v));
}
}
}
The above code produced empty results. I had tried inserting the values of the iterator into a list, but since I need to deal with many GBs of data, I get a Java heap space error when using the list. Please help me modify my code so that I can traverse the iterator twice.
You could always do it the simple way: declare a List and cache the values as you iterate through the first time. You can subsequently iterate through your List and write out your output. You should have something similar to this:
public static class ReduceA extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text t;
int count = 0;
String[] attr = key.toString().split(",");
List<Text> cache = new ArrayList<Text>();
while (values.hasNext()) {
cache.add(new Text(values.next())); // copy the value: Hadoop reuses the same Text instance across next() calls
count++;
}
// Maybe i need to reset my iterator here and start from the beginning
// but how do i do it?
String v = Integer.toString(count);
for (Text text : cache) {
output.collect(text, new Text(v));
}
}
}