Merge str value in Hadoop Reducer

Merge str value in Hadoop Reducer - java

My mapper class will output key-value pairs like:
abc 1
abc 2
abc 1
And I want to merge the values and calculate the occurrence of same pair in reducer class using HashMap, the output is like:
abc 1:2 2:1
But my output result is:
abc 1:2:1 2:1:1
It feels like there are additional Strings concatenated with the output, but I don't know why.
Here is my code:
Text combiner = new Text();
StringBuilder strBuilder = new StringBuilder();
#Override
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
HashMap<Text, Integer> result = new HashMap<Text, Integer>();
for (Text val : values) {
if(result.containsKey(val)){
int newVal = result.get(val) + 1;
result.put(val, newVal);
}else{
result.put(val, 1);
}
}
for(Map.Entry<Text, Integer> entry: result.entrySet()){
strBuilder.append(entry.getKey().toString());
strBuilder.append(":");
strBuilder.append(entry.getValue());
strBuilder.append("\t");
}
combiner.set(strBuilder.toString());
context.write(key, combiner);
}

I tested this code an it looks ok. The most likely reason you're getting output like this is because you're running this reducer as your combiner as well, which would explain why you're getting three values. The combine does the first concatenation, followed by the reduce which does a second.
You need to make sure a combiner isn't being configured in your job setup.
I would also suggest you change your code to make sure you store new versions of the Text values in your HashMap, remember Hadoop will be reusing the objects. So you should really be doing something like:
result.put(new Text(val), newVal);
or change your HashMap to store Strings, which is safe since they're immutable.

Related

How can I unit test a function that makes a string based on HashMap values?

I am trying to unit test a function that takes a HashMap and concatenates the keys into a comma separated string. The problem is that when I iterate through the HashMap using entrySet (or keySet or valueSet) the values are not in the order I .put() them in. IE:
testData = new HashMap<String, String>(0);
testData.put("colA", "valA");
testData.put("colB", "valB");
testData.put("colC", "valC");
for (Map.Entry<String, String> entry : testData.entrySet()) {
System.out.println("TestMapping " + entry.getKey());
}
Gives me the following output:
TestMapping colB
TestMapping colC
TestMapping colA
The string created by the SUT is ColB,ColC,ColA
How can I unit test this, since keySet(), valueSet(), etc are somewhat arbitrary in their order?
This is the function I am trying to test:
public String getColumns() {
String str = "";
for (String key : data.keySet()) {
str += ", " + key;
}
return str.substring(1);
}

There is no point in iterating over the HashMap in this case. The only reason to iterate over it would be to construct the expected String, in other words, perform the same operation as the method under test, so if you made an error implementing the method, you are likely to repeat the error when implementing the same for the unit test, failing to spot the error.
You should focus on the validity of the output. One way to test it, is to split it into the keys and check whether they match the keys of the source map:
testData = new HashMap<>();
testData.put("colA", "valA");
testData.put("colB", "valB");
testData.put("colC", "valC");
String result = getColumn();
assertEquals(testData.keySet(), new HashSet<>(Arrays.asList(result.split(", "))));
You are in control of the test data, so you can ensure that no ", " appears within the key strings.
Note that in its current form, your question’s method would fail, because the result String has an additional leading space. You have to decide whether it is intentional (in this case, you have to change the test to assertEquals(testData.keySet(), new HashSet<>(Arrays.asList(result.substring(1) .split(", "))));) or a spotted bug (then, you have to change the method’s last line to return str.substring(2);).
Don’t forget to make a testcase for an empty map…

HashMap does not maintain insertion order....If you want insertion order to be maintained use a linkedhashmap

Find friend# of all users: How to implement with Hadoop Mapreduce?

Say I have a input as follows:
(1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Ouput is expected as follows:
(1,(2,3,4)) -> (1,3) //second index is total friend #
(2,(1,3,4)) -> (2,3)
(3,(1,2)) -> (3,2)
(4,(1,2)) -> (4,2)
I know how to do this with hashset in java. But don't know how this work with mapreduce model. Can any one throw any ideas or sample code on this problem? I will appreciate this.
------------------------------------------------------------------------------------
Here is my naive solution: 1 mapper, two reducer.
The mapper will organize input(1,2),(2,1),(1,3);
Organize output as
*(1,hashset<2>),(2,hashSet<1>),(1,hashset<2>),(2,hashset<1>),(1,hashset<3>),(3,hashset<1>).*
Reducer1:
take mapper's output as input and output as:
*(1,hashset<2,3>), (3,hashset<1>)and (2,hashset<1>)*
Reducer2:
take reducer1's output as input and output as:
*(1,2),(3,1) and (2,1)*
This is only my naive solution. I'm not sure if this can be done by hadoop's code.

I think there should be an easy way to solve this problem.
Mapper Input: (1,2)(2,1)(1,3)(3,2)(2,4)(4,1)
Just emit two records for each pair like this:
Mapper Output/ Reducer Input:
Key => Value
1 => 2
2 => 1
2 => 1
1 => 2
1 => 3
3 => 1
3 => 2
2 => 3
2 => 4
4 => 2
4 => 1
1 => 1
At reducer side, you'll get 4 different groups like this:
Reducer Output:
Key => Values
1 => [2,3,4]
2 => [1,3,4]
3 => [1,2]
4 => [1,2]
Now, you are good to format your result as you want. :)
Let me know if anybody can see any issue in this approach

1) Intro / Problem
Before going ahead with the job driver, it is important to understand that in a simple-minded approach, the values of the reducers should be sorted in an ascending order. The first thought is to pass the value list unsorted and do some sorting in the reducer per key. This has two disadvantages:
1) It is most probably not efficient for large Value Lists
and
2) How will the framework know if (1,4) is equal to (4,1) if these pairs are processed in different parts of the cluster?
2) Solution in theory
The way to do it in Hadoop is to "mock" the framework in a way by creating a synthetic key.
So our map function instead of the "conceptually more appropriate" (if I may say that)
map(k1, v1) -> list(k2, v2)
is the following:
map(k1, v1) -> list(ksynthetic, null)
As you notice we discard the usage of values (the reducer still gets a list of null values but we don't really care about them). What happens here is that these values are actually included in ksynthetic. Here is an example for the problem in question:
`map(1, 2) -> list([1,2], null)
However, some more operations need to be done so that the keys are grouped and partitioned appropriately and we achieve the correct result in the reducer.
3) Hadoop Implementation
We will implement a class called FFGroupKeyComparator and a class FindFriendPartitioner.
Here is our FFGroupKeyComparator:
public static class FFGroupComparator extends WritableComparator
{
protected FFGroupComparator()
{
super(Text.class, true);
}
#Override
public int compare(WritableComparable w1, WritableComparable w2)
{
Text t1 = (Text) w1;
Text t2 = (Text) w2;
String[] t1Items = t1.toString().split(",");
String[] t2Items = t2.toString().split(",");
String t1Base = t1Items[0];
String t2Base = t2Items[0];
int comp = t1Base.compareTo(t2Base); // We compare using "real" key part of our synthetic key
return comp;
}
}
This class will act as our Grouping Comparator class. It controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context) This is very important as it ensures that each reducer gets the appropriate synthetic keys ( judging by the real key).
Due to the fact that Hadoop runs in a cluster with many nodes it is important to ensure that there as many reduce tasks as partitions. Their number should be the same as of the real keys (not synthetic). So, usually we do this with hash values. In our case, what we need to do is compute the partition that a synthetic key belongs based on the hash value of the real key (before the comma). So our FindFriendPartitioner is as follows:
public static class FindFriendPartitioner extends Partitioner implements Configurable
{
#Override
public int getPartition(Text key, Text NullWritable, int numPartitions)
{
String[] keyItems = key.toString().split(",");
String keyBase = keyItems[0];
int part = keyBase.hashCode() % numPartitions;
return part;
}
So now we are all set to write the actual job and solve our problem.
I am assuming your input file looks like this:
1,2
2,1
1,3
3,2
2,4
4,1
We will use the TextInputFormat.
Here's the code for the job driver using Hadoop 1.0.4:
public class FindFriendTwo
{
public static class FindFriendMapper extends Mapper<Object, Text, Text, NullWritable> {
public void map(Object, Text value, Context context) throws IOException, InterruptedException
{
context.write(value, new NullWritable() );
String tempStrings[] = value.toString().split(",");
Text value2 = new Text(tempStrings[1] + "," + tempStrings[0]); //reverse relationship
context.write(value2, new NullWritable());
}
}
Notice that we also passed the reverse relationships in the map function.
For example if the input string is (1,4) we must not forget (4,1).
public static class FindFriendReducer extends Reducer<Text, NullWritable, IntWritable, IntWritable> {
private Set<String> friendsSet;
public void setup(Context context)
{
friendSet = new LinkedHashSet<String>();
}
public void reduce(Text syntheticKey, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
String tempKeys[] = syntheticKey.toString().split(",");
friendsSet.add(tempKeys[1]);
if( friendsList.size() == 2 )
{
IntWritable key = Integer.parseInt(tempKeys[0]);
IntWritable value = Integer.parseInt(tempKeys[1]);
write(key, value);
}
}
}
Finally, we must remember to include the following in our Main Class, so that the framework uses our classes.
jobConf.setGroupingComparatorClass(FFGroupComparator.class);
jobConf.setPartitionerClass(FindFriendPartitioner.class);

I would approach this problem as follows.
Make sure we have all the relations and have them exactly once each.
Simply count the
Notes on my aproach:
My notation for key value pairs is : K -> V
Both key and value are a almost always a datastructure (not just a string or int)
I never use the key for data. The key is ONLY there to control the flow from mappers towards the right reducer. In all other places I do not look at the key at all. The framework does require a key everywhere. With '()' I mean to say that there is a key that I ignore completely.
The key about my aproach is that it never needs 'all friends' in memory at the same moment (so it works also in the really big situations).
We start with a lot of
(x,y)
and we know that we do not have all relationships in the dataset.
Mapper: Create all relations
Input: () -> (x,y)
Output: (x,y) -> (x,y)
(y,x) -> (y,x)
Reducer: Remove duplicates (simply only output the first one from the iterator)
Input: (x,y) -> [(x,y),(x,y),(x,y),(x,y),.... ]
Output: () -> (x,y)
Mapper: "Wordcount"
Input: () -> (x,y)
Output: (x) -> (x,1)
Reducer: Count them
Input: (x) -> [(x,1),(x,1),(x,1),(x,1),.... ]
Output: () -> (x,N)

Being helped by so many excellent engineers, I finally tried out the solution.
Only one Mapper and one Reducer. No combiner here.
input of Mapper:
1,2
2,1
1,3
3,1
3,2
3,4
5,1
Output of Mapper:
1,2
2,1
1,2
2,1
1,3
3,1
1,3
3,1
4,3
3,4
1,5
5,1
Output Of Reducer:
1 3
2 2
3 3
4 1
5 1
The first col is user, the second is friend#.
On the reducer stage, I add hashSet to assistant analysis.
Thanks #Artem Tsikiridis #Ashish
Your answer gave me a nice clue.
Edited:
Added Code:
//mapper
public static class TokenizerMapper extends
Mapper<Object, Text, Text, Text> {
private Text word1 = new Text();
private Text word2 = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line,",");
if(itr.hasMoreElements()){
word1.set(itr.nextToken().toLowerCase());
}
if(itr.hasMoreElements()){
word2.set(itr.nextToken().toLowerCase());
}
context.write(word1, word2);
context.write(word2, word1);
//
}
}
//reducer
public static class IntSumReducer extends
Reducer<Text, Text, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
HashSet<Text> set = new HashSet<Text>();
int sum = 0;
for (Text val : values) {
if(!set.contains(val)){
set.add(val);
sum++;
}
}
result.set(sum);
context.write(key, result);
}
}

Parsing 2 Files Line-By-Line and need to avoid Duplicates (in special cases)

I have 2 files that I'm parsing line-by-line adding the information to 2 separate ArrayList<String> containers. I'm trying to create a final container "finalPNList" that reflects the 'Resulting File/ArrayList' below.
Issue is that I'm not successfully avoiding duplicates. I've changed the code various ways without success. Sometimes I restrict the condition too much, and avoid all duplicates, and sometimes I leave it too loose and include all duplicates. I can't seem to find the conditions to make it just right.
Here is the code so far -- in this case, seeing the contents of processLine() ins't truly relevant, just know that you're getting a map with 2 ArrayLists<String>
public static Map<String, List<String>> masterList = new HashMap<String, List<String>>();
public static List<String> finalPNList = new ArrayList<String>();
public static List<String> modifier = new ArrayList<String>();
public static List<String> skipped = new ArrayList<String>();
for (Entry<String, String> e : tab1.entrySet()) {
String key = e.getKey();
String val = e.getValue();
// returns BufferedReader to start line processing
inputStream = getFileHandle(val);
// builds masterList containing all data
masterList.put(key, processLine(inputStream));
}
for (Entry<String, List<String>> e : masterList.entrySet()) {
String key = e.getKey();
List<String> val = e.getValue();
System.out.println(modifier.size());
for (String s : val) {
if (modifier.size() == 0)
finalPNList.add(s);
if (!modifier.isEmpty() && finalPNList.contains(s)
&& !modifier.contains(key)) {
// s has been added by parent process so SKIP!
skipped.add(s);
} else
finalPNList.add(s);
}
modifier.add(key);
}
Here is what the data may look like (extremely simplified dealing with about 20K lines about 10K lines in each file):
File A
123;data
123;data
456,data
File B
123;data
789,data
789,data
Resulting File/ArrayList
123;data
123;data
789,data
789,data

!modifier.contains(key) is always true, it can be removed from your if-statement.
modifier.size() == 0 can be replaced with modifier.isEmpty().
Since you seem to want to add duplicates from File B, you need to check File A, not finalPNList when checking for existence (I just checked the applicable list in masterList, feel free to change this to something more appropriate / efficient).
You need to have an else after your first if-statement, otherwise you're adding items from File A twice.
I assumed you just missed 456 in your output, otherwise I might not quite understand.
Modified code with your file-IO replaced with something that's more in the spirit of an SSCCE:
masterList.put("A", Arrays.asList("123","123","456"));
masterList.put("B", Arrays.asList("123","789","789"));
for (Map.Entry<String, List<String>> e : masterList.entrySet()) {
String key = e.getKey();
List<String> val = e.getValue();
System.out.println(modifier.size());
for (String s : val) {
if (modifier.isEmpty())
finalPNList.add(s);
else if (!modifier.isEmpty() && masterList.get("A").contains(s)) {
// s has been added by parent process so SKIP!
skipped.add(s);
} else
finalPNList.add(s);
}
modifier.add(key);
}
Test.

Comparing Strings and Returning Boolean

I am currently working on one of the usecases where you are given 6 strings which has 3 oldValues and 3 newValues like given below:
String oldFirstName = "Yogend"
String oldLastName = "Jos"
String oldUserName = "YNJos"
String newFirstName = "Yogendra"
String newLastName ="Joshi"
String newUserName = "YNJoshi"
now what I basically want to do is compare each of the oldValue with its corresponding new value and return true if they are not equal i.e
if(!oldFirstName.equalsIgnoreCase(newFirstName)) {
return true;
}
Now, since I am having 3 fields and it could very well happen that in future we might have more Strings with old and new value I am looking for an optimum solution which could work in all cases no matter how many old and new values are added and without having gazillions of if else clauses.
One possibility I thought was of having Old values as OldArrayList and new values as newArraylist and then use removeAll where it would remove the duplicate values but that is not working in some cases.
Can anyone on stack help me out with some pointers on how to optimum way get this done.
Thanks,
Yogendra N Joshi

you can use lambdaj (download here,website) and hamcrest (download here,website), this libraries are very powerfull for managing collections, the following code is very simple and works perfectly:
import static ch.lambdaj.Lambda.filter;
import static ch.lambdaj.Lambda.having;
import static ch.lambdaj.Lambda.on;
import static org.hamcrest.Matchers.isIn;
import java.util.Arrays;
import java.util.List;
public class Test {
public static void main(String[] args) {
List<String> oldNames = Arrays.asList("nameA","nameE","nameC","namec","NameC");
List<String> newNames = Arrays.asList("nameB","nameD","nameC","nameE");
List<String> newList = filter(having(on(String.class), isIn(oldNames)),newNames);
System.out.print(newList);
//print nameC, nameE
}
}
With this libraries you can solve your problem in one line. You must add to your project: hamcrest-all-1.3.jar and lambdaj-2.4.jar Hope this help serve.
NOTE: This will help you assuming you can have alternatives to your code.

You can use two HashMap<yourFieldName, yourFieldValue> instead of two Arrays / Lists / Sets of Strings (or multiple random Strings);
Then you need a method to compare each value of both maps by their keys;
The result will be an HashMap<String,Boolean> containing the name of each field key, and true if the value is equal in both maps, while false if it is different.
No matter how many fields you will add in the future, the method won't change, while the result will.
Running Example: https://ideone.com/dIaYsK
Code
private static Map<String,Boolean> scanForDifferences(Map<String,Object> mapOne,
Map<String,Object> mapTwo){
Map<String,Boolean> retMap = new HashMap<String,Boolean>();
Iterator<Map.Entry<String, Object>> it = mapOne.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String,Object> entry = (Map.Entry<String,Object>)it.next();
if (mapTwo.get(entry.getKey()).equals(entry.getValue()))
retMap.put(entry.getKey(), new Boolean(Boolean.TRUE));
else
retMap.put(entry.getKey(), new Boolean(Boolean.FALSE));
it.remove(); // prevent ConcurrentModificationException
}
return retMap;
}
Test Case Input
Map<String,Object> oldMap = new HashMap<String,Object>();
Map<String,Object> newMap = new HashMap<String,Object>();
oldMap.put("initials","Y. J.");
oldMap.put("firstName","Yogend");
oldMap.put("lastName","Jos");
oldMap.put("userName","YNJos");
oldMap.put("age","33");
newMap.put("initials","Y. J.");
newMap.put("firstName","Yogendra");
newMap.put("lastName","Joshi");
newMap.put("userName","YNJoshi");
newMap.put("age","33");
Test Case Run
Map<String,Boolean> diffMap = Main.scanForDifferences(oldMap, newMap);
Iterator<Map.Entry<String, Boolean>> it = diffMap.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String,Boolean> entry = (Map.Entry<String,Boolean>)it.next();
System.out.println("Field [" + entry.getKey() +"] is " +
(entry.getValue()?"NOT ":"") + "different" );
}
You should check too if a value is present in one map and not in another one.
You could return an ENUM instead of a Boolean with something like EQUAL, DIFFERENT, NOT PRESENT ...

You should convert your String to some Set.
One set for OLD and another for NEW. And your goal of varity number of elements will also be resolved using same.
As it's set order of it will be same.

Changing LinkedHashMapValues

Below is data from 2 linkedHashMaps:
valueMap: { y=9.0, c=2.0, m=3.0, x=2.0}
formulaMap: { y=null, ==null, m=null, *=null, x=null, +=null, c=null, -=null, (=null, )=null, /=null}
What I want to do is input the the values from the first map into the corresponding positions in the second map. Both maps take String,Double as parameters.
Here is my attempt so far:
for(Map.Entry<String,Double> entryNumber: valueMap.entrySet()){
double doubleOfValueMap = entryNumber.getValue();
for(String StringFromValueMap: strArray){
for(Map.Entry<String,Double> entryFormula: formulaMap.entrySet()){
String StringFromFormulaMap = entryFormula.toString();
if(StringFromFormulaMap.contains(StringFromValueMap)){
entryFormula.setValue(doubleOfValueMap);
}
}
}
}
The problem with doing this is that it will set all of the values i.e. y,m,x,c to the value of the last double. Iterating through the values won't work either as the values are normally in a different order those in the formulaMap. Ideally what I need is to say is if the string in formulaMap is the same as the string in valueMap, set the value in formulaMap to the same value as in valueMap.
Let me know if you have any ideas as to what I can do?

This is quite simple:
formulaMap.putAll(valueMap);
If your value map contains key which are not contained in formulaMap, and you don't want to alter the original, do:
final Map<String, Double> map = new LinkedHashMap<String, Double>(valueMap);
map.keySet().retainAll(formulaMap.keySet());
formulaMap.putAll(map);
Edit due to comment It appears the problem was not at all what I thought, so here goes:
// The result map
for (final String key: formulaMap.keySet()) {
map.put(formulaMap.get(key), valueMap.get(key));
// Either return the new map, or do:
valueMap.clear();
valueMap.putAll(map);

for(Map.Entry<String,Double> valueFormula: valueMap.entrySet()){
formulaMap.put(valueFormula.getKey(), valueFormula.value());
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Merge str value in Hadoop Reducer - java

Related

How can I unit test a function that makes a string based on HashMap values?

Find friend# of all users: How to implement with Hadoop Mapreduce?

Parsing 2 Files Line-By-Line and need to avoid Duplicates (in special cases)

Comparing Strings and Returning Boolean

Changing LinkedHashMapValues

Categories

Resources