How to manage Joins in hadoop - MultipleInputPath

After a map-side join, the data I get in the reducer is:
key------ book
values
6
eraser=>book 2
pen=>book 4
pencil=>book 5
What I want to compute is:
eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6
My first attempt was:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    System.out.println("key------ " + key);
    System.out.println("Values");
    // Declared outside the loop so the denominator survives from one value to the next.
    double BsupportCnt = 0;
    double UsupportCnt = 0;
    double res = 0;
    for (Text value : values) {
        System.out.println("\t" + value.toString());
        String v = value.toString();
        if (!v.contains("=>")) {
            BsupportCnt = Double.parseDouble(v);
        } else {
            String parts[] = v.split(" ");
            UsupportCnt = Double.parseDouble(parts[1]);
        }
        // calculate here (only meaningful once the plain number has been seen)
        res = UsupportCnt / BsupportCnt;
    }
}
If the values arrive in the order shown above, this works fine.
But if the values from the mappers arrive as
key------ book
values
eraser=>book 2
pen=>book 4
pencil=>book 5
6
this won't work, because the denominator arrives last.
Otherwise I would need to store all the "=>" values in a List (which may exhaust the heap if the input is large) and do the calculation once I get the number.
UPDATE
As Vefthym suggested, I tried secondary sorting so that the values are ordered before they reach the reducer.
I used htuple to do this.
I referred to this link
Mapper1 emits eraser=>book 2 as its value:
public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configurations and other stuffs
//allWords is an ArrayList
if(allWords.size()<=2)
{
Tuple outputKey = new Tuple();
String LHS1 = allWords.get(1);
String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
outputKey.set(TupleFields.ALPHA, RHS1);
context.write(new Text(LHS1), outputKey);
}
//other stuffs
Mapper2 emits numbers as value
public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
Text valEmit = new Text();
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configuration and other stuffs
if(cnt != supCnt && cnt < supCnt){
System.out.println("emit");
Tuple outputKey = new Tuple();
outputKey.set(TupleFields.NUMBER, value);
System.out.println("v---"+value);
System.out.println("outputKey.toString()---"+outputKey.toString());
context.write(key, outputKey);
}
In the reducer I simply tried to print the key and values,
but this failed with an error:
Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
The error is raised at context.write(new Text(LHS1), outputKey); (AprioriItemMapper1.java:49),
but the output printed above is from Mapper 2.
Is there a better way to do this?
Please suggest.

I would suggest using secondary sorting, which would guarantee that the first value (sorted lexicographically) is a numeric one, supposing there are no words starting with a number.
If this cannot work, then, bearing in mind the scalability limitations that you mention, I would store the reducer's values in a HashMap<String,Double> buffer, with the rule strings (the part before the space) as keys and their numeric counts as values.
You can buffer the values until you get the denominator, BsupportCnt. Then you can emit all of the buffer's contents with the correct score, and emit the remaining values one by one as they arrive, without using the buffer again (since you now know the denominator). Something like this:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    Map<String, Double> buffer = new HashMap<>();
    double BsupportCnt = 0;
    double UsupportCnt;
    double res;
    for (Text value : values) {
        String v = value.toString();
        if (!v.contains("=>")) {
            BsupportCnt = Double.parseDouble(v);
        } else {
            String parts[] = v.split(" ");
            UsupportCnt = Double.parseDouble(parts[1]);
            if (BsupportCnt != 0) { // no need to add things to the buffer any more
                res = UsupportCnt / BsupportCnt;
                // emit the rule alone (parts[0]) so the output key matches the buffered entries
                context.write(new Text(parts[0]), new DoubleWritable(res));
            } else {
                buffer.put(parts[0], UsupportCnt);
            }
        }
    }
    // now emit the buffer's contents
    for (Map.Entry<String, Double> entry : buffer.entrySet()) {
        context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue() / BsupportCnt));
    }
}
You could gain some more space by storing only the left part of "=>" as keys of the HashMap, as the right part is always the reducer's input key.
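The buffering logic can be tried outside Hadoop. The following is a minimal sketch, assuming plain strings in place of Text values; the class and method names (SupportRatioSketch, ratios) are hypothetical and only mirror the reducer above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SupportRatioSketch {
    // Hypothetical helper: applies the reducer's buffering logic to plain strings.
    static Map<String, Double> ratios(List<String> values) {
        Map<String, Double> out = new LinkedHashMap<>();
        Map<String, Double> buffer = new LinkedHashMap<>();
        double denominator = 0;
        for (String v : values) {
            if (!v.contains("=>")) {
                denominator = Double.parseDouble(v);      // the plain support count
            } else {
                String[] parts = v.split(" ");
                double count = Double.parseDouble(parts[1]);
                if (denominator != 0) {
                    out.put(parts[0], count / denominator); // denominator known: emit directly
                } else {
                    buffer.put(parts[0], count);            // denominator not seen yet: hold back
                }
            }
        }
        // flush whatever arrived before the denominator
        for (Map.Entry<String, Double> e : buffer.entrySet()) {
            out.put(e.getKey(), e.getValue() / denominator);
        }
        return out;
    }

    public static void main(String[] args) {
        // The denominator arrives last, the case that broke the first attempt.
        List<String> late = Arrays.asList("eraser=>book 2", "pen=>book 4", "pencil=>book 5", "6");
        System.out.println(ratios(late));
    }
}
```

The buffer only grows until the plain number is seen, so in the best case (number first) it stays empty, and in the worst case (number last) it holds every rule for that key.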

Related

Mapreduce Mapper create 2 keys for reducer calculation

I'm trying to make 2 keys from my dataset, which has 2 columns of numbers separated by tab. I know how to make 1 key/value, but not sure how to make a second pair of key/value. In essence I want to make a key/value for each of the columns. Then in the reducer part, take the difference of the counts of each key.
Here's what I have for the mapper part:
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable one = new IntWritable(1);
    private Text nodeX = new Text();
    private Text nodeY = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // each input line holds two tab-separated numbers
        String[] data = value.toString().split("\\t");
        String node0 = data[0];
        String node1 = data[1];
        nodeX.set(node0);
        context.write(nodeX, one);
        nodeY.set(node1);
        context.write(nodeY, one);
    }
}
Here's the reducer:
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum0 = 0;
        for (IntWritable val : values) {
            sum0 += val.get();
        }
        int sum1 = 0;
        for (IntWritable val : values) {
            sum1 += val.get();
        }
        int diff = sum0 - sum1;
        result.set(diff);
        context.write(key, result);
    }
}
I think I did something wrong in the part where the data is passed from the mapper to the reducer; it might need 2 keys. I am new to Java and not sure how to get this right.
My input data looks like this:
123 543
123 234
543 135
135 123
I would like the output to be the following, where for each key I take the number of its occurrences in col1 minus the number of its occurrences in col2:
123 1
543 0
135 0
234 -1

I think you want to split each line into its two numbers and calculate the difference between them. You can use NLineInputFormat, where the key identifies the line's position in the file, then split the value and calculate. Otherwise, you can define a static long field to keep track of the row number.
public static class TokenizerMapper extends
Mapper<LongWritable, Text, LongWritable, IntWritable>
{
private IntWritable diffen = new IntWritable();
private static long row_num= 0;
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("\\t");
String node0 = data[0];
String node1 = data[1];
int dif = Integer.parseInt(node1)-Integer.parseInt(node0);
diffen.set(dif);
row_num++;
context.write(new LongWritable(row_num), diffen);
}
}
You can also pass the value on to the reducer, split it into two parts there, and calculate the difference. Either approach works.
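If the goal is the per-key result shown in the question (occurrences in column 1 minus occurrences in column 2), one possible approach, which is not the one taken in the answer above, is to have the mapper emit +1 for the first column and -1 for the second, so that a plain summing reducer yields the difference. Below is a minimal sketch that simulates this in plain Java without Hadoop; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnCountDiffSketch {
    // Simulates map + reduce: +1 per occurrence in column 1, -1 per occurrence in column 2.
    static Map<String, Integer> diff(List<String> lines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\\t");
            sums.merge(cols[0], 1, Integer::sum);   // what a mapper would emit as (col1, +1)
            sums.merge(cols[1], -1, Integer::sum);  // what a mapper would emit as (col2, -1)
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("123\t543", "123\t234", "543\t135", "135\t123");
        System.out.println(diff(input)); // prints {123=1, 543=0, 234=-1, 135=0}
    }
}
```

In a real job the existing IntSumReducer pattern (a single loop that sums the values) would be enough, because the sign already encodes which column a value came from.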

Getting all combination of 2 elements from a string array java

Let's say I have this array list: ['a', 'b', 'xx'].
I want to extract every ordered pair of two distinct elements, for example ['a','b'] ['a', 'xx'] ['b', 'a'] ['b', 'xx'] ['xx', 'a'] ['xx', 'b'].
I have written this code, but when the array gets really big (10k elements, for example) the JVM runs out of memory.
private Text empty = new Text("");
public void start(Iterable<Text> values,Context context) throws IOException, InterruptedException {
List<String> sitesArr = new ArrayList<String>();
HashMap<String, String> hmapPairs = new HashMap<String, String>();
for (Text site : values){
sitesArr.add(site.toString());
}
insertPairsToHash(hmapPairs, sitesArr);
writeContextFromHash(hmapPairs, context);
}
private void insertPairsToHash(HashMap<String, String> hmapPairs, List<String> sitesArr) {
for (int i=0; i<sitesArr.size(); i++) {
for (int j=i+1; j<sitesArr.size(); j++) {
String firstPair = sitesArr.get(i) + "_" + sitesArr.get(j);
String secondPair = sitesArr.get(j) + "_" + sitesArr.get(i);
hmapPairs.put(firstPair,secondPair);
}
}
}
private void writeContextFromHash(HashMap<String, String> hmapPairs, Context context) throws IOException, InterruptedException {
Text textTowriteToFile = new Text("");
for(Map.Entry<String, String> entry : hmapPairs.entrySet()) {
textTowriteToFile.set(entry.getKey());
context.write(textTowriteToFile, empty);
textTowriteToFile.set(entry.getValue());
context.write(textTowriteToFile, empty);
}
}
I use 2 for loops and in each iteration I insert 2 combinations ( ['a', 'b'] and ['b','a'] first element is the key and the second is the value so in ['a','b'] 'a' would be the key and 'b' would be the value and vice versa) to the hash.
Then I iterate once over the hash to send the values.
How can I make it faster while using less memory?

Why not just call "writeContextFromHash" right in the nested for loop and not create HashMap?

You should probably add some more information to your question. But basically, with this kind of program you will always run into memory problems as your input gets larger. With 10k entries you end up with about 100 million ordered pairs, stored as 50 million Map entries. Multiplied by the size of each entry (which depends on your input), this uses a lot of memory.
If you know the rough size of your input beforehand, you might just assign enough memory to your JVM (unless your machine is too small). If this doesn't solve the problem, you cannot keep all results in memory. Either swap out to disk or, as suggested above, write your results out directly instead of keeping them in memory.

You can simply refactor your class to stream the results, so that you never hold the whole list of combined pairs in memory.
private Text empty = new Text("");
public void start(Iterable<Text> values,Context context) throws IOException, InterruptedException {
List<String> sitesArr = new ArrayList<String>();
for (Text site : values){
sitesArr.add(site.toString());
}
insertPairsToHash(sitesArr,context);
}
private void insertPairsToHash(List<String> sitesArr, Context context) throws IOException, InterruptedException {
for (int i=0; i<sitesArr.size(); i++) {
for (int j=i+1; j<sitesArr.size(); j++) {
String firstPair = sitesArr.get(i) + "_" + sitesArr.get(j);
String secondPair = sitesArr.get(j) + "_" + sitesArr.get(i);
// write each pair immediately instead of collecting it
doWrite(context, firstPair, secondPair);
}
}
}
private void doWrite(Context context, String firstPair, String secondPair) throws IOException, InterruptedException {
Text textTowriteToFile = new Text("");
textTowriteToFile.set(firstPair);
context.write(textTowriteToFile, empty);
textTowriteToFile.set(secondPair);
context.write(textTowriteToFile, empty);
}
This will lower your memory usage.
In general, you should stream results if your input is big or unbounded. Streaming adds some complexity but keeps the memory used for results independent of the size of your input.
EDIT (After comment):
You can drop used elements by removing them from the list.
In this case you should use a LinkedList instead of an ArrayList, because removing the head element from an array list involves much more GC and CPU time than the same operation on a linked list.
This however will not lower the peak memory usage, only the usage over time (you will require less memory as the process goes on).
It could still be useful if other components consume more memory as the process progresses.
List<String> sitesArr = new LinkedList<>();
private void insertPairsToHash(List<String> sitesArr, Context context) throws IOException, InterruptedException {
while (!sitesArr.isEmpty()) {
String left = sitesArr.remove(0);
for (String right : sitesArr) {
String firstPair = left + "_" + right;
String secondPair = right + "_" + left;
doWrite(context, firstPair, secondPair);
}
}
}
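The streaming idea can be tried without Hadoop. This is a minimal sketch, assuming a java.util.function.Consumer stands in for context.write; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class StreamingPairsSketch {
    // Hands every ordered pair to the sink as soon as it is built; nothing is accumulated here.
    static void emitPairs(List<String> items, Consumer<String> sink) {
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                sink.accept(items.get(i) + "_" + items.get(j));
                sink.accept(items.get(j) + "_" + items.get(i));
            }
        }
    }

    public static void main(String[] args) {
        // Collecting into a list is only for demonstration; a real sink would write and forget.
        List<String> collected = new ArrayList<>();
        emitPairs(Arrays.asList("a", "b", "xx"), collected::add);
        System.out.println(collected); // prints [a_b, b_a, a_xx, xx_a, b_xx, xx_b]
    }
}
```

The input list still has to fit in memory, but the number of pairs held at any moment is constant instead of growing quadratically.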

Finding most common key in Reducer, Error: java.lang.ArrayIndexOutOfBoundsException: 1

I need to find the most common key emitted by Mapper in the Reducer. My reducer works fine in this way:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points= new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#"); //format of value from mapper: "Key#1.2345"
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value)); //finds the K smallest distances
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
for (Text t : k_closest_points.values()) //it perfectly emits the K smallest distances and keys
context.write(NullWritable.get(), t);
}
}
It finds the K instances with the smallest distances and writes to the output file. But I need to find the most common key in my TreeMap. So I'm trying it like below:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value));
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String[] tmp = value.toString().split("#");
if (class_counts.containsKey(tmp[0]))
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
else
class_counts.put(tmp[0], 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Then I get this error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at KNN$MyReducer.reduce(KNN.java:108)
at KNN$MyReducer.reduce(KNN.java:98)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
Can you please help me to fix this?

A few things... first, your problem is here:
double distance = Double.parseDouble(v[1]);
You're splitting on "#", which may not be present in the string. If it isn't, v has only one element and v[1] throws the ArrayIndexOutOfBoundsException. I would add a guard like:
if(v.length < 2)
continue;
Second, tmp[0] is a String, so tmp[0] + 1 concatenates "1" onto the key before the lookup (it's a parenthesis issue). This compiles, but the lookup misses and the count stored is null:
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
It should be:
class_counts.put(tmp[0], class_counts.get(tmp[0]) + 1);
It's also expensive to look the key up twice in a potentially large Map. Here's how I'd re-write your reducer based on what you've given us (this is totally untested):
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
if(v.length < 2)
continue; // consider adding an enum counter
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(v[0])); // you've already split once, why do it again later?
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
// exit early if nothing found
if(k_closest_points.isEmpty())
return;
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String tmp = value.toString();
Integer current_count = class_counts.get(tmp);
if (null != current_count) // avoid second lookup
class_counts.put(tmp, current_count + 1);
else
class_counts.put(tmp, 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
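One caveat about the reducer above: class_counts.lastKey() returns the lexicographically greatest label in the TreeMap, not the label with the highest count. A minimal sketch of picking the most frequent label instead, in plain Java with hypothetical names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MostCommonLabelSketch {
    // Returns the label with the highest count, or null for an empty list.
    // If several labels tie, whichever of them the map iterates first wins.
    static String mostCommon(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum); // single lookup per label
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // lastKey() on a TreeMap of these labels would give "zebra"; the majority is "cat".
        System.out.println(mostCommon(Arrays.asList("cat", "dog", "cat", "zebra"))); // prints cat
    }
}
```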
Next, and more semantically, you're performing a KNN operation using a TreeMap as your data structure of choice. While this makes sense in that it internally stores keys in sorted order, it doesn't make sense to use a Map for an operation that will almost certainly have to break ties. Here's why:
int k = 2;
TreeMap<Double, Text> map = new TreeMap<>();
map.put(1.0, new Text("close"));
map.put(1.0, new Text("equally close"));
map.put(1500.0, new Text("super far"));
// ... your popping logic...
Which are the two closest points you've retained? "equally close" and "super far". This is because a Map can't hold two instances of the same key: the second put with distance 1.0 overwrites the first. Thus, your algorithm is incapable of breaking ties. There are a few things you could do to fix that:
First, if you're set on performing this operation in the Reducer and you know your incoming data will not cause an OutOfMemoryError, consider using a different sorted structure, like a TreeSet and build a custom Comparable object that it will sort:
static class KNNEntry implements Comparable<KNNEntry> {
final Text text;
final Double dist;
KNNEntry(Text text, Double dist) {
this.text = text;
this.dist = dist;
}
@Override
public int compareTo(KNNEntry other) {
int comp = this.dist.compareTo(other.dist);
if(0 == comp)
return this.text.compareTo(other.text);
return comp;
}
}
And then instead of your TreeMap, use a TreeSet<KNNEntry>, which will internally sort itself based on the compareTo logic we just built above. Then after you've gone through all the values, just iterate through the first k, retaining them in order. This has a drawback, though: if your data is truly big, you can overflow the heap space by loading all of the values from the reducer into memory.
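A variation on this option, not spelled out in the answer, is to evict the farthest entry whenever the set grows past k, which keeps at most k + 1 entries in memory at a time. The following is a minimal plain-Java sketch under that assumption; it uses String instead of Hadoop's Text so it runs on its own, and all names are hypothetical.

```java
import java.util.TreeSet;

class BoundedKnnSketch {
    // Plain-Java stand-in for KNNEntry, with String in place of Hadoop's Text.
    static final class Entry implements Comparable<Entry> {
        final String label;
        final double dist;

        Entry(String label, double dist) {
            this.label = label;
            this.dist = dist;
        }

        @Override
        public int compareTo(Entry other) {
            int comp = Double.compare(this.dist, other.dist);
            // equal distances fall back to the label, so ties are kept rather than overwritten
            return comp != 0 ? comp : this.label.compareTo(other.label);
        }
    }

    // Keeps at most k entries by evicting the farthest one after each insert.
    static TreeSet<Entry> closest(int k, Entry... entries) {
        TreeSet<Entry> set = new TreeSet<>();
        for (Entry e : entries) {
            set.add(e);
            if (set.size() > k) {
                set.pollLast();
            }
        }
        return set;
    }

    public static void main(String[] args) {
        TreeSet<Entry> kept = closest(2,
                new Entry("close", 1.0),
                new Entry("equally close", 1.0),
                new Entry("super far", 1500.0));
        for (Entry e : kept) {
            System.out.println(e.label + " " + e.dist);
        }
    }
}
```

With the same three points as the TreeMap example, both points at distance 1.0 survive and "super far" is evicted. Two entries with the same label and the same distance would still collapse into one, since the set treats them as equal.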
Second option: make the KNNEntry we built above implement WritableComparable, and emit that from your Mapper, then use secondary sorting to handle the sorting of your entries. This gets a bit more hairy, as you'd have to use lots of mappers and then only one reducer to capture the first k. If your data is small enough, try the first option to allow for tie breaking.
But, back to your original question: you're getting an ArrayIndexOutOfBoundsException because the index you're trying to access does not exist, i.e., there is no "#" in the input String.

Combining results from hadoop map-reduce

I have a Mapper<AvroKey<Email>, NullWritable, Text, Text> which effectively takes in an Email and multiple times spits out a key of an email address and a value of the field it was found on (from, to, cc, etc).
Then I have a Reducer<Text, Text, NullWritable, Text> that takes in the email address and field name. It spits out a NullWritable key and a count of how many times the address is present in a given field. e.g...
{
"address": "joe.bloggs@gmail.com",
"toCount": 12,
"fromCount": 4
}
I'm using FileUtil.copyMerge to conflate the output from the jobs but (obviously) the results from different reducers aren't merged, so in practice I see:
{
"address": "joe.bloggs@gmail.com",
"toCount": 12,
"fromCount": 0
}, {
"address": "joe.bloggs@gmail.com",
"toCount": 0,
"fromCount": 4
}
Is there a more sensible way of approaching this problem so I can get a single result per email address? (I gather a combiner running pre-reduce phase is only run on a subset of the data and not guaranteed to give the results I want)?
Edit:
Reducer code would be something like:
public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {
private static final ObjectMapper mapper = new ObjectMapper();
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Map<String, Map<String, Object>> results = new HashMap<>();
for (Text value : values) {
if (!results.containsKey(value.toString())) {
Map<String, Object> result = new HashMap<>();
result.put("address", key.toString());
result.put("to", 0);
result.put("from", 0);
results.put(value.toString(), result);
}
Map<String, Object> result = results.get(value.toString());
switch (value.toString()) {
case "TO":
result.put("to", ((int) result.get("to")) + 1);
break;
case "FROM":
result.put("from", ((int) result.get("from")) + 1);
break;
}
results.values().forEach(result -> {
context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
});
}
}

Each input key of the reducer corresponds to a unique email address, so you don't need the results collection. Each time the reduce method is called, it is for a distinct email address, so my suggestion is:
public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {
    private static final ObjectMapper mapper = new ObjectMapper();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // one result per reduce call, i.e. per email address
        Map<String, Object> result = new HashMap<>();
        result.put("address", key.toString());
        result.put("to", 0);
        result.put("from", 0);
        for (Text value : values) {
            switch (value.toString()) {
                case "TO":
                    result.put("to", ((int) result.get("to")) + 1);
                    break;
                case "FROM":
                    result.put("from", ((int) result.get("from")) + 1);
                    break;
            }
        }
        // write once, after all values for this address have been counted
        context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
    }
}
I am not sure what the ObjectMapper class does, but I suppose that you need it to format the output. Otherwise, I would print the input key as the output key (i.e., the email address) and two concatenated counts for the "from" and "to" fields of each email address.
If your input is a data collection (i.e., not streams or something similar), then you should get each email address only once. If your input arrives in streams and you need to build your final output incrementally, then the output of one job can be the input of another. In that case, I suggest using MultipleInputs, where one Mapper is the one that you described earlier and another, an identity mapper, forwards the output of a previous job to the Reducer. This way, again, the same email address is handled by the same reduce task.
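The single-record-per-address idea can be checked without Hadoop or a JSON library. This is a minimal sketch that simulates one reduce call in plain Java; the class and method names are hypothetical, and the toCount and fromCount field names are taken from the question's sample output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EmailFieldCountSketch {
    // Simulates one reduce call: all field names seen for a single address.
    static Map<String, Object> reduce(String address, List<String> fields) {
        int to = 0;
        int from = 0;
        for (String field : fields) {
            switch (field) {
                case "TO":
                    to++;
                    break;
                case "FROM":
                    from++;
                    break;
                default:
                    break; // other fields (cc, ...) are ignored in this sketch
            }
        }
        // Built once, after every value has been counted, so there is one record per address.
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("address", address);
        result.put("toCount", to);
        result.put("fromCount", from);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduce("joe.bloggs@gmail.com", Arrays.asList("TO", "FROM", "TO", "CC")));
    }
}
```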

Writing only the Value on Mapper Job

I am currently working on a MapReduce job in which I use only the mapper, without a reducer. I do not need to write the key out: the values I need are stored in an array, and I want to write that array out as my final output file. How can I achieve this on Hadoop? Instead of writing both the key and the value to the output, I am interested in writing out only the values. Thanks
public void pfor(TestFor pfor,LongWritable key, Text value, Context context, int times) throws IOException, InterruptedException{
int n = 0;
while(n < times){
pfor.pforMap(key,value, context);
n++;
}
for(int i =0;i<uv.length; i++){
LOG.info(uv[i].get() + " Final output");
}
IntArrayWritable edge = new IntArrayWritable();
edge.set(uv);
context.write(new IntWritable(java.lang.Math.abs(randGen.nextInt())), edge);
uv= null;
}

Use NullWritable as value and emit your "edge" as key.
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/NullWritable.html
