In my Reducer, I need to find the most common key emitted by the Mapper. My reducer works fine this way:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points= new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#"); //format of value from mapper: "Key#1.2345"
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value)); //finds the K smallest distances
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
for (Text t : k_closest_points.values()) //it perfectly emits the K smallest distances and keys
context.write(NullWritable.get(), t);
}
}
It finds the K instances with the smallest distances and writes to the output file. But I need to find the most common key in my TreeMap. So I'm trying it like below:
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(value));
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String[] tmp = value.toString().split("#");
if (class_counts.containsKey(tmp[0]))
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
else
class_counts.put(tmp[0], 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Then I get this error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at KNN$MyReducer.reduce(KNN.java:108)
at KNN$MyReducer.reduce(KNN.java:98)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
Can you please help me to fix this?
A few things... first, your problem is here:
double distance = Double.parseDouble(v[1]);
You're splitting on "#" and it may not be in the string. If it's not, the `v[1]` access will throw the ArrayIndexOutOfBoundsException. I would add a guard like:
if(v.length < 2)
continue;
Second, tmp[0] is a String, and here you're actually concatenating '1' to it inside the get call rather than adding 1 to the count (it's a parenthesis issue):
class_counts.put(tmp[0], class_counts.get(tmp[0] + 1));
It should be:
class_counts.put(tmp[0], class_counts.get(tmp[0]) + 1);
It's also expensive to look the key up twice in a potentially large Map. Here's how I'd re-write your reducer based on what you've given us (this is totally untested):
public static class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Text result = new Text();
private TreeMap<Double, Text> k_closest_points = new TreeMap<Double, Text>();
public void reduce(NullWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int K = Integer.parseInt(conf.get("K"));
for (Text value : values) {
String v[] = value.toString().split("#");
if(v.length < 2)
continue; // consider adding an enum counter
double distance = Double.parseDouble(v[1]);
k_closest_points.put(distance, new Text(v[0])); // you've already split once, why do it again later?
if (k_closest_points.size() > K)
k_closest_points.remove(k_closest_points.lastKey());
}
// exit early if nothing found
if(k_closest_points.isEmpty())
return;
TreeMap<String, Integer> class_counts = new TreeMap<String, Integer>();
for (Text value : k_closest_points.values()) {
String tmp = value.toString();
Integer current_count = class_counts.get(tmp);
if (null != current_count) // avoid second lookup
class_counts.put(tmp, current_count + 1);
else
class_counts.put(tmp, 1);
}
context.write(NullWritable.get(), new Text(class_counts.lastKey()));
}
}
Next, and more semantically, you're performing a KNN operation using a TreeMap as your data structure of choice. While this makes sense in that it internally stores keys in comparative order, it doesn't make sense to use a Map for an operation that will almost certainly be required to break ties. Here's why:
int k = 2;
TreeMap<Double, Text> map = new TreeMap<>();
map.put(1.0, new Text("close"));
map.put(1.0, new Text("equally close"));
map.put(1500.0, new Text("super far"));
// ... your popping logic...
Which are the two closest points you've retained? "equally close" and "super far". This is because a Map can't hold two entries with the same key, so your algorithm is incapable of breaking ties. There are a few things you could do to fix that:
First, if you're set on performing this operation in the Reducer and you know your incoming data will not cause an OutOfMemoryError, consider using a different sorted structure, like a TreeSet, and build a custom Comparable object for it to sort:
static class KNNEntry implements Comparable<KNNEntry> {
final Text text;
final Double dist;
KNNEntry(Text text, Double dist) {
this.text = text;
this.dist = dist;
}
@Override
public int compareTo(KNNEntry other) {
int comp = this.dist.compareTo(other.dist);
if(0 == comp)
return this.text.compareTo(other.text);
return comp;
}
}
And then instead of your TreeMap, use a TreeSet<KNNEntry>, which will internally sort itself based on the Comparable logic we just built above. Then, after you've gone through all the keys, just iterate through the first k, retaining them in order. This has a drawback, though: if your data is truly big, you can overflow the heap space by loading all of the values from the reducer into memory.
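For illustration, here's a minimal standalone sketch of that first option, with String in place of Hadoop's Text so it runs outside MapReduce (the class and variable names are hypothetical):

```java
import java.util.TreeSet;

// Hypothetical sketch: same idea as KNNEntry above, but with String labels
// so ties on distance are broken by label instead of silently dropped.
class KNNEntry implements Comparable<KNNEntry> {
    final String label;
    final double dist;
    KNNEntry(String label, double dist) {
        this.label = label;
        this.dist = dist;
    }
    @Override
    public int compareTo(KNNEntry other) {
        int comp = Double.compare(this.dist, other.dist);
        // tie-break on the label so equal distances can coexist in the set
        return comp != 0 ? comp : this.label.compareTo(other.label);
    }
}

public class KNNTieDemo {
    public static void main(String[] args) {
        int k = 2;
        TreeSet<KNNEntry> set = new TreeSet<>();
        set.add(new KNNEntry("close", 1.0));
        set.add(new KNNEntry("equally close", 1.0)); // no longer collides
        set.add(new KNNEntry("super far", 1500.0));
        while (set.size() > k)
            set.remove(set.last()); // pop the farthest until k remain
        for (KNNEntry e : set)
            System.out.println(e.label + " " + e.dist);
    }
}
```

Unlike the TreeMap version, both points at distance 1.0 survive, and "super far" is the one removed.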
Second option: make the KNNEntry we built above implement WritableComparable, and emit that from your Mapper, then use secondary sorting to handle the sorting of your entries. This gets a bit more hairy, as you'd have to use lots of mappers and then only one reducer to capture the first k. If your data is small enough, try the first option to allow for tie breaking.
But, back to your original question, you're getting an OutOfBoundsException because the index you're trying to access does not exist, i.e., there is no "#" in the input String.
Related
I'm trying to make 2 keys from my dataset, which has 2 columns of numbers separated by tab. I know how to make 1 key/value, but not sure how to make a second pair of key/value. In essence I want to make a key/value for each of the columns. Then in the reducer part, take the difference of the counts of each key.
Here's what I have for the mapper part:
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable>{
private IntWritable one = new IntWritable(1);
private Text nodeX = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String[] data = value.toString().split("\\t");
String node0 = data[0];
String node1 = data[1];
StringTokenizer itr = new StringTokenizer(data);
while(itr.hasMoreTokens()){
nodeX.set(node0);
context.write(nodeX, one)
nodeY.set(node1);
context.write(nodeY, one)
}
}
Here's the reducer:
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum0 = 0;
for (IntWritable val : values) {
sum0 += val.get()
}
int sum1 = 0;
for (IntWritable val : values) {
sum1 += val.get()
}
diff = sum0 - sum1;
result.set(diff);
context.write(key, diff);
}
}
I think I did something wrong in the part where the data is passed from mapper to reducer; I might need 2 keys. I'm new to Java and not sure how to get this right.
My input data looks like this:
123 543
123 234
543 135
135 123
And I would like the output to be the following, where I take the difference between the counts of each key's occurrences in col1 and in col2:
123 1
543 0
135 0
234 -1
I think you want to split each line into words, treat each word as a number, and then calculate the difference. You can use NLineInputFormat, where the key is the row number, then split the value and calculate. Otherwise, you can define a static long field to track the row number:
public static class TokenizerMapper extends
Mapper<LongWritable, Text, LongWritable, IntWritable>
{
private IntWritable diffen = new IntWritable();
private static long row_num= 0;
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("\\t");
String node0 = data[0];
String node1 = data[1];
int dif = Integer.parseInt(node1)-Integer.parseInt(node0);
diffen.set(dif);
row_num++;
context.write(new LongWritable(row_num), diffen);
}
}
You can also write the whole value to the reducer, split it into two parts there, and calculate the difference. Either way works.
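The row-number approaches aside, the difference the asker wants can also be computed with a single key per node and a signed count: emit (node, +1) for a column-1 occurrence and (node, -1) for a column-2 occurrence, then let the reducer sum. A minimal plain-Java sketch of that idea (no Hadoop; diffCounts and the class name are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SignedCountDemo {
    // Simulates the mapper emitting (node0, +1) and (node1, -1), with the
    // reducer summing per key; the sum is count(col1) - count(col2).
    static Map<String, Integer> diffCounts(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\\t");
            counts.merge(cols[0], 1, Integer::sum);   // column-1 occurrence
            counts.merge(cols[1], -1, Integer::sum);  // column-2 occurrence
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = { "123\t543", "123\t234", "543\t135", "135\t123" };
        diffCounts(input).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

On the sample input this yields 123 -> 1, 543 -> 0, 135 -> 0, 234 -> -1, matching the desired output above.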
So I'm trying to create a smart data structure based on an AVL tree and a hash table.
I need to check first which implementation the data structure will use, depending on the size of the list given to it.
For example, if I have a list n of size 1000, it'll be implemented using a hash table; for anything more than 1000, an AVL tree.
Code for this:
public class SmartULS<K,V> {
protected TreeMap<K,V> tree = new TreeMap<>();
protected AbstractHashMap<K,V> hashMap = new AbstractHashMap<K,V>();
public void setSmartThresholdULS(size){
int threshold = 1000;
if (size >= threshold) {
map = new AbtractMap<K,V>();
}
else
map = new TreeMap<K,V>();
}
}
Now after this, I should be writing the standard methods such as
get(SmartULS, Key), add(SmartULS, Key, Value), remove(SmartULS,Key), nextKey(Key), previousKey(Key), etc.
I'm really lost as to how to start this. I've thought about creating these methods like this (written in pseudocode):
Algorithm add(SmartULS, Key, Value):
i<- 0
If SmartULS instanceof AbstractHashMap then
For i to SmartULS.size do
If Key equals to SmartULS[i] then
SmartULS.get(Key).setValue(Value)
Else
SmartULS.add(Key, Value)
Else if SmartULS instanceof TreeMap then
Entry newAdd equals new MapEntry(Key, Value)
Position<Entry> p = treeSearch(root( ), Key)
You're on the right track. This is how I understood your question and implemented it:
public class SmartULS<K, V> {
Map<K,V> map;
public static final int THRESHOLD = 1000;
public SmartULS(int size) {
if(size < THRESHOLD) {
map = new HashMap();
} else {
map = new TreeMap();
}
}
public V get(K key) {
return map.get(key);
}
public V put(K key, V value) {
return map.put(key, value);
}
public V remove(K key) {
return map.remove(key);
}
}
Based on the initial size given, the constructor decides whether to initialize a hash table or a tree. I also added the get, put and remove functions, delegating to the Map interface's methods.
I didn't understand what the nextKey and previousKey functions are supposed to do or return, so I couldn't help there.
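If nextKey(Key) and previousKey(Key) are meant to return the neighbouring keys in sorted order, note that TreeMap (as a NavigableMap) already offers higherKey and lowerKey, while HashMap keeps no order at all. A small sketch, assuming that interpretation:

```java
import java.util.TreeMap;

public class NextPrevDemo {
    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        // higherKey/lowerKey return the strictly next/previous key in
        // sorted order, or null if none exists
        System.out.println(map.higherKey("b")); // the key after "b"
        System.out.println(map.lowerKey("b"));  // the key before "b"
    }
}
```

This only works on the TreeMap branch of the structure; since HashMap cannot answer these queries, that asymmetry is worth keeping in mind when choosing the threshold.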
A way of using the class would be as follows:
public static void main(String[] args) {
SmartULS<String, String> smartULS = new SmartULS(952);
smartULS.put("firstKey", "firstValue");
smartULS.put("secondKey", "secondsValue");
String value = smartULS.get("firstKey");
smartULS.remove("secondKey");
}
Hope this helps:)
I have defined a HashMap with the following code:
final Map<OrderItemEntity, OrderItemEntity> savedOrderItems = new HashMap<OrderItemEntity, OrderItemEntity>();
final ListIterator<DiscreteOrderItemEntity> li = ((BundleOrderItemEntity) oi).getDiscreteOrderItems().listIterator();
while (li.hasNext()) {
final DiscreteOrderItemEntity doi = li.next();
final DiscreteOrderItemEntity savedDoi = (DiscreteOrderItemEntity) orderItemService.saveOrderItem(doi);
savedOrderItems.put(doi, savedDoi);
li.remove();
}
((BundleOrderItemEntity) oi).getDiscreteOrderItems().addAll(doisToAdd);
final BundleOrderItemEntity savedBoi = (BundleOrderItemEntity) orderItemService.saveOrderItem(oi);
savedOrderItems.put(oi, savedBoi);
I put 4 items into the HashMap. When I debug, even though the size is 4, it only shows 3 elements:
This is the list of the elements it contains.
{DiscreteOrderItemEntity#1c29ef3c=DiscreteOrderItemEntity#41949d95, DiscreteOrderItemEntity#2288b93c=DiscreteOrderItemEntity#2288b93c, BundleOrderItemEntity#1b500292=BundleOrderItemEntity#d0f29ce5, DiscreteOrderItemEntity#9203174a=DiscreteOrderItemEntity#9203174a}
What can be the problem?
HashMaps handle collisions.
Since your HashMap has only 16 buckets (the default capacity), the hash of each element must be reduced to a number between 0 and 15 (e.g. hash % 16), so two elements may end up in the same bucket (the same HashMap node).
You can inspect each node to find out which one contains two elements.
The mechanism is as enrico.bacis explained. Here is an example that reproduces it:
public class TestJava {
static class TT {
private String field;
@Override
public int hashCode() {
return 1;
}
}
public static void main(String[] args) {
Map<TT, String> test = new HashMap<>();
TT t1 = new TT();
TT t2 = new TT();
test.put(t1, "test2");
test.put(t2, "test2");
test.put(null, "test2");
test.put(null, "test2");
System.out.println(test.toString());
System.out.println(test.size());
}
}
Here we override hashCode and hard-code it to return 1, so that every TT object returns the same hash code.
We can dig into HashMap.java:
public V put(K key, V value) {
return putVal(hash(key), key, value, false, true);
}
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
boolean evict) {
We can see that when we put a key/value pair into the HashMap, it computes a hash from the object's hashCode to locate the element's bucket in the hash table.
So if two objects have the same hash code, they are stored in the same bucket, but these colliding elements are still all kept, because their keys are not equal.
After a map-side join, the data I am getting in the Reducer is:
key------ book
values
6
eraser=>book 2
pen=>book 4
pencil=>book 5
What I basically want to do is
eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6
What I initially did is this:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
System.out.println("key------ "+key);
System.out.println("Values");
for(Text value : values){
System.out.println("\t"+value.toString());
String v = value.toString();
double BsupportCnt = 0;
double UsupportCnt = 0;
double res = 0;
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
}
else{
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
}
// calculate here
res = UsupportCnt/BsupportCnt;
}
If the incoming data is as above, this works fine.
But if the incoming data from the mapper is:
key------ book
values
eraser=>book 2
pen=>book 4
pencil=>book 5
6
This won't work.
Otherwise I would need to store all the "=>" values in a List (if the incoming data is large, the list may exhaust heap space), and once I get the number, do the calculation.
UPDATE
As Vefthym suggested, I tried secondary sorting of the values before they reach the reducer.
I used htuple to do this.
I referred to this link.
Mapper1 emits eraser=>book 2 as the value.
So:
public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configurations and other stuffs
//allWords is an ArrayList
if(allWords.size()<=2)
{
Tuple outputKey = new Tuple();
String LHS1 = allWords.get(1);
String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
outputKey.set(TupleFields.ALPHA, RHS1);
context.write(new Text(LHS1), outputKey);
}
//other stuffs
Mapper2 emits numbers as values:
public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
Text valEmit = new Text();
public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
//Configuration and other stuffs
if(cnt != supCnt && cnt < supCnt){
System.out.println("emit");
Tuple outputKey = new Tuple();
outputKey.set(TupleFields.NUMBER, value);
System.out.println("v---"+value);
System.out.println("outputKey.toString()---"+outputKey.toString());
context.write(key, outputKey);
}
In the Reducer I simply tried to print the key and values.
But this caused an error:
Mapper 2:
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6,
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
The error points to context.write(new Text(LHS1), outputKey); at AprioriItemMapper1.java:49,
but the printed details above are from Mapper 2.
Is there a better way to do this?
Please suggest.
I would suggest using secondary sorting, which would guarantee that the first value (sorted lexicographically) is a numeric one, supposing there are no words starting with a number.
If this cannot work, then, bearing the scalability limitations that you mention, I would store the reducer's values in a HashMap<String,Double> buffer with keys being the left parts of "=>" and values being their numeric values.
You can store the values until you get the value of the denominator BsupportCnt. Then you can emit all the buffer's contents with the correct score, and all the remaining values as they come, one by one, without needing the buffer again (since you now know the denominator). Something like this:
public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
Map<String,Double> buffer = new HashMap<>();
double BsupportCnt = 0;
double UsupportCnt;
double res;
for(Text value : values){
String v = value.toString();
if(!v.contains("=>")){
BsupportCnt = Double.parseDouble(v);
} else {
String parts[] = v.split(" ");
UsupportCnt = Double.parseDouble(parts[1]);
if (BsupportCnt != 0) { //no need to add things to the buffer any more
res = UsupportCnt/BsupportCnt;
context.write(new Text(v), new DoubleWritable(res));
} else {
buffer.put(parts[0], UsupportCnt);
}
}
}
//now emit the buffer's contents
for (Map.Entry<String,Double> entry : buffer.entrySet()) {
context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue()/BsupportCnt));
}
}
You could gain some more space by storing only the left part of "=>" as keys of the HashMap, as the right part is always the reducer's input key.
In my MapReduce program, I have a reducer function which counts the number of items in an Iterator of Text values and then, for each item in the iterator, outputs the item as key and the count as value. Thus I need to use the iterator twice. But once the iterator has reached the end, I cannot iterate from the beginning again. How do I solve this problem?
I tried the following code for my reduce function:
public static class ReduceA extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text>output, Reporter reporter) throws IOException
{
Text t;
int count =0;
String[] attr = key.toString().split(",");
while(values.hasNext())
{
values.next();
count++;
}
//Maybe i need to reset my iterator here and start from the beginning but how do i do it?
String v=Integer.toString(count);
while(values.hasNext())
{
t=values.next();
output.collect(t,new Text(v));
}
}
}
The above code produced empty results. I tried inserting the iterator's values into a list, but since I need to deal with many GBs of data, I get a Java heap space error when using the list. Please help me modify my code so that I can traverse the iterator twice.
You could always do it the simple way: declare a List and cache the values as you iterate through the first time, then iterate through your List and write out your output. You should have something similar to this:
public static class ReduceA extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text t;
int count = 0;
String[] attr = key.toString().split(",");
List<Text> cache = new ArrayList<Text>();
while (values.hasNext()) {
cache.add(values.next());
count++;
}
// Maybe i need to reset my iterator here and start from the beginning
// but how do i do it?
String v = Integer.toString(count);
for (Text text : cache) {
output.collect(text, new Text(v));
}
}
}