MapReduce Combiner - Java

I have a simple MapReduce job with a mapper, reducer, and combiner.
The output from the mapper is passed to the combiner, but the reducer receives the mapper's output instead of the combiner's output.
Kindly help.
Code:
package Combiner;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageSalary {

    public static class Map extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] empDetails = value.toString().split(",");
            Text unit_key = new Text(empDetails[1]);
            DoubleWritable salary_value = new DoubleWritable(Double.parseDouble(empDetails[2]));
            context.write(unit_key, salary_value);
        }
    }

    public static class Combiner extends Reducer<Text, DoubleWritable, Text, Text> {
        public void reduce(final Text key, final Iterable<DoubleWritable> values, final Context context) {
            String val;
            double sum = 0;
            int len = 0;
            while (values.iterator().hasNext()) {
                sum += values.iterator().next().get();
                len++;
            }
            val = String.valueOf(sum) + ":" + String.valueOf(len);
            try {
                context.write(key, new Text(val));
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(final Text key, final Text values, final Context context) {
            //String[] sumDetails = values.toString().split(":");
            //double average;
            //average = Double.parseDouble(sumDetails[0]);
            try {
                context.write(key, values);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        Configuration conf = new Configuration();
        try {
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: Main <in> <out>");
                System.exit(-1);
            }
            Job job = new Job(conf, "Average salary");
            //job.setInputFormatClass(KeyValueTextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            job.setJarByClass(AverageSalary.class);
            job.setMapperClass(Map.class);
            job.setCombinerClass(Combiner.class);
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : -1);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

The #1 rule of Combiners is: do not assume that the combiner will run. Treat the combiner only as an optimization.
The Combiner is not guaranteed to run over all of your data. In some cases, when the data doesn't need to be spilled to disk, MapReduce will skip using the Combiner entirely. Note also that the Combiner may be run multiple times over subsets of the data! It'll run once per spill.
In your case, you are making this bad assumption. You should be doing the sum in the Combiner AND the Reducer.
You should also follow @user987339's answer: the input and output types of the combiner need to be identical (Text, DoubleWritable -> Text, DoubleWritable), and they need to match up with the output of the Mapper and the input of the Reducer.
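For illustration, here is a minimal sketch of a combiner whose types line up with the mapper output, assuming the reducer and the job's output value class are also switched to DoubleWritable. It only carries partial sums, so by itself it yields a total rather than an average (carrying the count as well is covered in the next answer):

    // Sketch only: both the combiner and the reducer sum DoubleWritable values,
    // so (Text, DoubleWritable) goes in and (Text, DoubleWritable) comes out.
    public static class SumCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            // Emit a partial sum with the same key/value types the mapper produced.
            context.write(key, new DoubleWritable(sum));
        }
    }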

It seems that you forgot about an important property of a combiner:
the input types for the key/value and the output types of the key/value need to be the same.
You can't take in a Text/DoubleWritable and return a Text/Text. I suggest you use Text instead of DoubleWritable, and do the proper parsing inside the Combiner.
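A rough sketch of that Text-based approach, assuming the mapper is changed to emit values of the form "salary:1" so that both a partial sum and a count survive however many times the combiner runs (the class names here are illustrative, not from the original code):

    // Sketch only: the mapper emits Text values like "2500.0:1";
    // combiner and reducer both parse "sum:count" pairs, so all types stay (Text, Text).
    public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(":");
                sum += Double.parseDouble(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            // Re-emit a partial "sum:count", safe whether the combiner runs zero, one or many times.
            context.write(key, new Text(sum + ":" + count));
        }
    }

    public static class AvgReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(":");
                sum += Double.parseDouble(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            // Only the reducer divides, so the final output is the real average.
            context.write(key, new Text(String.valueOf(sum / count)));
        }
    }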

If a combine function is used, then it is the same form as the reduce function (and is an implementation of Reducer), except its output types are the intermediate key and value types (K2 and V2), so they can feed the reduce function:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.
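As a concrete example of the common case where combine and reduce are the same function, here is a word-count-style sketch (illustrative, not taken from the question) in which one summing Reducer class is registered both as the combiner and as the reducer:

    // Sketch: input and output types are both (Text, IntWritable),
    // so the same class can legally act as combiner and as reducer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // In the driver:
    // job.setCombinerClass(IntSumReducer.class);
    // job.setReducerClass(IntSumReducer.class);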

The combiner does not always run when you run a MapReduce job.
If there are at least three spill files (mapper output written to local disk), the combiner will execute so that the file size is reduced and the data can be transferred to the reduce node more easily.
The number of spills for which a combiner needs to run can be set through the min.num.spills.for.combine property.
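If you want to experiment with that threshold, it can be set on the job's Configuration; a small sketch (the property name is the classic one quoted above, its default is 3, and newer Hadoop releases expose it under a different key):

    Configuration conf = new Configuration();
    // Only run the combiner during the merge if at least this many spill files exist.
    conf.setInt("min.num.spills.for.combine", 3);
    Job job = new Job(conf, "Average salary");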

Related

Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

I am new to Hadoop and am trying to run a sample program from a book. I am facing the error
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
Below is my code:
package com.hadoop.employee.salary;

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvgMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
    public void Map(LongWritable key, Text empRec, Context con) throws IOException, InterruptedException {
        String[] word = empRec.toString().split("\\t");
        String sex = word[3];
        Float salary = Float.parseFloat(word[8]);
        try {
            con.write(new Text(sex), new FloatWritable(salary));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
package com.hadoop.employee.salary;

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgSalReducer extends Reducer<Text, FloatWritable, Text, Text> {
    public void reduce(Text key, Iterable<FloatWritable> valuelist, Context con)
            throws IOException, InterruptedException {
        float total = (float) 0;
        int count = 0;
        for (FloatWritable var : valuelist) {
            total += var.get();
            System.out.println("reducer" + var.get());
            count++;
        }
        float avg = (float) total / count;
        String out = "Total: " + total + " :: " + "Average: " + avg;
        try {
            con.write(key, new Text(out));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
package com.hadoop.employee.salary;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgSalary {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Please provide two parameters");
        }
        Job job = new Job();
        job.setJarByClass(AvgSalary.class); // helps Hadoop find the relevant jar if there are multiple jars
        job.setJobName("Avg Salary");
        job.setMapperClass(AvgMapper.class);
        job.setReducerClass(AvgSalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        try {
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
In your mapper you've called the map method Map; it should be map. Because of this, the default implementation is called, since you aren't overriding the map method. The default implementation passes the incoming key/value pairs straight through, so the emitted key is a LongWritable.
Changing the name to map should fix this error.
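A sketch of the corrected mapper method; annotating it with @Override makes the compiler reject a misspelled name like Map in the first place:

    public class AvgMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        public void map(LongWritable key, Text empRec, Context con)
                throws IOException, InterruptedException {
            String[] word = empRec.toString().split("\\t");
            con.write(new Text(word[3]), new FloatWritable(Float.parseFloat(word[8])));
        }
    }

Note that once the mapper really emits (Text, FloatWritable), the commented-out setMapOutputKeyClass/setMapOutputValueClass lines in the driver will likely also need to be re-enabled, since the map output types differ from the job's final (Text, Text) output types.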

Use DKPro semantic similarity with UBY

I want to compute the similarity between strings with DKPro Similarity (https://dkpro.github.io/dkpro-similarity/). It works, like so:
import org.dkpro.similarity.algorithms.api.SimilarityException;
import org.dkpro.similarity.algorithms.api.TextSimilarityMeasure;
import org.dkpro.similarity.algorithms.lsr.LexSemResourceComparator;
import org.dkpro.similarity.algorithms.lsr.gloss.GlossOverlapComparator;
import org.dkpro.similarity.algorithms.lsr.path.JiangConrathComparator;
import org.dkpro.similarity.algorithms.lsr.path.LeacockChodorowComparator;
import org.dkpro.similarity.algorithms.lsr.path.LinComparator;
import org.dkpro.similarity.algorithms.lsr.path.ResnikComparator;
import org.dkpro.similarity.algorithms.lsr.path.WuPalmerComparator;

import de.tudarmstadt.ukp.dkpro.lexsemresource.LexicalSemanticResource;
import de.tudarmstadt.ukp.dkpro.lexsemresource.core.ResourceFactory;
import de.tudarmstadt.ukp.dkpro.lexsemresource.exception.LexicalSemanticResourceException;
import de.tudarmstadt.ukp.dkpro.lexsemresource.exception.ResourceLoaderException;

import learninggoals.analysis.controller.settingtypes.SimilarityAlgorithm;

public class SemResourceComparator implements WordsComparator {

    private LexicalSemanticResource resource;
    private LexSemResourceComparator comparator;

    // en lang
    public SemResourceComparator(String resourcetype, SimilarityAlgorithm algorithm, String lang) {
        try {
            resource = ResourceFactory.getInstance().get(resourcetype, lang);
        } catch (ResourceLoaderException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        try {
            switch (algorithm) {
                /*case ESA: // this is vector
                    comparator = new GlossOverlapComparator(resource, false);
                    break;*/
                case GLOSSOVERLAP:
                    comparator = new GlossOverlapComparator(resource, false);
                    break;
                case JIANG_CONRATH:
                    comparator = new JiangConrathComparator(resource, resource.getRoot());
                    break;
                case LEACOCK_CHODOROW:
                    comparator = new LeacockChodorowComparator(resource);
                    break;
                case LIN:
                    comparator = new LinComparator(resource, resource.getRoot());
                    break;
                case RESNIK:
                    comparator = new ResnikComparator(resource, resource.getRoot());
                    break;
                case WUPALMER:
                    comparator = new WuPalmerComparator(resource, resource.getRoot());
                    break;
                default:
                    break;
            }
        } catch (LexicalSemanticResourceException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    @Override
    public double compareWords(String w1, String w2) {
        try {
            return comparator.getSimilarity(resource.getEntity(w1), resource.getEntity(w2));
        } catch (SimilarityException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (LexicalSemanticResourceException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return 0;
    }
}
I use the class like this:
double intermscore = comparator.compareWords(word1, word2);
I use a LexicalSemanticResource as the resource for comparing; it can be WordNet, Wikipedia, GermaNet, etc. Now I noticed that all of the resources I need are in UBY (https://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/, https://github.com/dkpro/dkpro-uby/blob/master/de.tudarmstadt.ukp.uby.lmf.api-asl/src/main/java/de/tudarmstadt/ukp/lmf/api/Uby.java).
My question is: can I replace the resource with a resource from UBY, so I don't have to include a new resource each time I need one? So instead of ResourceFactory.getInstance().get("wordnet"), I want to use the UBY resource, something like new Uby().getLexicalResource("wordnet") - however, a LexicalResource from UBY is not the same as the LexicalSemanticResource I currently use for semantic comparison. In short: instead of using e.g. the LexicalSemanticResource for WordNet, I want to use WordNet from UBY for the comparators. Is there a way to do this?
There currently is no way to do this. Uby resources and LSR resources are not compatible.
There were plans to switch, but the issue has been open for a while:
https://github.com/dkpro/dkpro-similarity/issues/39

How to save received messages in separate files with a MessageListener

Disk class implementation:
package server;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.ObjectMessage;

import services.CustomerData;

public class Disk implements MessageListener {

    private int index;
    private FileWriter f;
    private BufferedWriter b;

    public Disk(int i) {
        this.index = i;
        try {
            f = new FileWriter("disk" + i + ".txt", true);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        b = new BufferedWriter(f);
    }

    @Override
    public void onMessage(Message m) {
        try {
            if (m instanceof ObjectMessage) {
                CustomerData c = (CustomerData) ((ObjectMessage) m).getObject();
                b.write(c.getSurname() + " " + c.getName() + " " + c.getAge());
                b.newLine();
                b.flush();
                System.out.println("disk" + index + ".txt saved");
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (JMSException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
So, what happens is that every message received by every message listener is saved in the same file (the last-indexed disk.txt file), but I want the messages saved across all the files, from 0 to N. The N txt files are created, but none of them is modified except the last one.
EDIT: I moved the FileWriter and BufferedWriter into the Disk constructor, but it still creates N files and modifies only the last one.
Main class where Disk is created:
package server;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.rmi.RemoteException;
import java.util.Hashtable;

import javax.jms.JMSException;
import javax.jms.Session;
import javax.jms.Topic;
import javax.jms.TopicConnection;
import javax.jms.TopicConnectionFactory;
import javax.jms.TopicSession;
import javax.jms.TopicSubscriber;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class StorageServer {

    public static final int N = 10;

    public static void main(String[] args) throws RemoteException {
        Hashtable<String, String> prop = new Hashtable<String, String>();
        prop.put("java.naming.factory.initial", "org.apache.activemq.jndi.ActiveMQInitialContextFactory");
        prop.put("java.naming.provider.url", "tcp://127.0.0.1:61616");
        prop.put("topic.req", "requests");
        System.setProperty("org.apache.activemq.SERIALIZABLE_PACKAGES", "*");
        try {
            Context jndiCon = new InitialContext(prop);
            TopicConnectionFactory tConnFact = (TopicConnectionFactory) jndiCon.lookup("TopicConnectionFactory");
            TopicConnection tConn = tConnFact.createTopicConnection();
            TopicSession tSess = tConn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = (Topic) jndiCon.lookup("req");
            TopicSubscriber subscriber = tSess.createSubscriber(topic);
            tConn.start();
            for (int i = 0; i < N; i++) {
                //FileWriter file = new FileWriter("disk"+i+".txt",true);
                subscriber.setMessageListener(new Disk(i));
                System.out.println("New disk" + i + " started");
            }
        } catch (NamingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (JMSException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
You have a single TopicSubscriber, which has a single MessageListener (hence setMessageListener and not addMessageListener). You need to create a separate TopicSubscriber for each listener:
for (int i = 0; i < N; i++) {
    TopicSubscriber subscriber = tSess.createSubscriber(topic);
    subscriber.setMessageListener(new Disk(i));
    System.out.println("New disk" + i + " started");
}
I'd also recommend avoiding the FileWriter (and FileReader) class, because it uses the platform encoding. This can cause surprises when the platform (or its encoding) changes. The equivalent, but longer and safer, way is:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("whatever.txt"), "UTF-8"));
with UTF-8 being a safe encoding to use.
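Applied to the Disk constructor from the question, that could look roughly like this (a sketch; it assumes java.io.FileOutputStream and java.io.OutputStreamWriter are imported):

    public Disk(int i) {
        this.index = i;
        try {
            // One append-mode file per listener, written with an explicit encoding.
            b = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("disk" + i + ".txt", true), "UTF-8"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }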

Duplicated results on bolt reception in a Storm topology

I'm using a bolt that receives tuples from another bolt (an exclamation bolt) and writes them to a file. The problem is that the results are duplicated four times: when I emit a word, I find it written four times. Where could the problem be?
// Imports added for completeness; they assume a pre-1.0 Storm (backtype.storm) build.
// On Storm 1.x+ the same classes live under org.apache.storm.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class PrinterBolty extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        try {
            BufferedWriter output;
            output = new BufferedWriter(new FileWriter("/root/src/storm-starter/hh.txt", true));
            output.newLine();
            output.append(tuple.getString(0));
            output.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer ofd) {
    }
}
The solution was to specify 1 spout in the main class:
builder.setSpout("spout", new RandomSentenceSpout(), 1);

Java program terminating after ObjectMapper.writeValue(System.out, responseData) - Jackson Library

I'm using the Jackson library to create JSON objects, but when I use the mapper.writeValue(System.out, responseData) function, the program terminates. Here is my code:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.codehaus.jackson.JsonGenerationException;
import org.codehaus.jackson.map.JsonMappingException;
import org.codehaus.jackson.map.ObjectMapper;

public class Test {

    public static void main(String[] args) {
        new Test().test();
    }

    public void test() {
        ObjectMapper mapper = new ObjectMapper();
        Map<String, Object> responseData = new HashMap<String, Object>();
        responseData.put("id", 1);
        try {
            mapper.writeValue(System.out, responseData);
            System.out.println("done");
        } catch (JsonGenerationException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (JsonMappingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
After this executes, the console shows {"id":1}, but does not show "done".
The problem is with the Jackson implementation, as ObjectMapper._configAndWriteValue calls UtfGenerator.close(), which calls PrintStream.close().
I'd log an issue at https://jira.codehaus.org/browse/JACKSON
To change the default behavior of the target being closed, you can do the following:
mapper.configure(JsonGenerator.Feature.AUTO_CLOSE_TARGET, false);
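A short sketch of the configured mapper (using the org.codehaus.jackson 1.x packages from the question, which requires importing org.codehaus.jackson.JsonGenerator; Jackson 2.x offers the same feature under com.fasterxml.jackson.core):

    ObjectMapper mapper = new ObjectMapper();
    // Stop Jackson from closing the underlying stream (System.out) after writeValue().
    mapper.configure(JsonGenerator.Feature.AUTO_CLOSE_TARGET, false);
    mapper.writeValue(System.out, responseData);
    System.out.println();        // System.out is still open
    System.out.println("done");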
When declaring variable names in your data/getter classes, the first letter should be lowercase.
