I've written some code in Hadoop that should do the following tasks:
In the Mapper: Records are read one by one from the input split and some processing is performed on them. Then, depending on the result of that processing, some records are pruned and saved in a set. At the end of the mapper, this set must be sent to the reducer.
In the Reducer: All of the sets received from the mappers are processed and the final result is generated.
My question is: how can I delay sending the mentioned set to the reducer until the last record in each mapper has been processed? By default, the code written in the mapper runs once per input record (correct me if I'm wrong), so the set is sent to the reducer multiple times (once per input record). How can I recognize the end of processing of the input split in each mapper?
(Right now I use an if-condition with a counter that counts the number of processed records, but I think there must be a better way. Also, if I don't know the total number of records in the files, this method does not work.)
If you look at the Mapper class (Javadoc) you can see it has four methods available:
cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
run(org.apache.hadoop.mapreduce.Mapper.Context context)
setup(org.apache.hadoop.mapreduce.Mapper.Context context)
The default implementation of run() looks like:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
This illustrates the order in which each of the methods is called. Typically you'll override the map() method; work at the start and end of a mapper's run can be done in setup() and cleanup().
The code shows that map() is called once for each key/value pair entering the mapper, while setup() and cleanup() are each called just once, before and after the key/value pairs are processed.
In your case you can use cleanup() to output the set of values once, when all the key/values have been processed.
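For example, a minimal sketch (the record type, the pruning condition and shouldKeep() are placeholders, not from the question) that accumulates pruned records and emits them once from cleanup():

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PruningMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Filled while map() runs once per input record
    private final Set<String> pruned = new HashSet<String>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String record = value.toString();
        if (shouldKeep(record)) {      // placeholder pruning logic
            pruned.add(record);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once, after the last record of this mapper's split
        for (String record : pruned) {
            context.write(NullWritable.get(), new Text(record));
        }
    }

    private boolean shouldKeep(String record) {
        return !record.isEmpty();      // placeholder condition
    }
}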
JavaRDD<Text> tx = counts2.map(new Function<Object, Text>() {
    @Override
    public Text call(Object o) throws Exception {
        // TODO Auto-generated method stub
        if (o.getClass() == Dict.class) {
            Dict rkd = (Dict) o;
            return new Text(rkd.getId());
        } else {
            return null;
        }
    }
});
tx.saveAsTextFile("/rowkey/Rowkey_new");
I am new to Spark. I want to save this file, but I get a null exception. I don't want to replace return null with return new Text(), because that would insert a blank line into my file. How can I solve this problem?
Instead of putting an if condition in your map, you simply use that if condition to build an RDD filter. The Spark Quick Start is a good place to start. There is also a nice overview of other transformations and actions.
Basically your code can look as follows (if you are using Java 8):
counts2
    .filter(o -> o instanceof Dict)
    .map(o -> new Text(((Dict) o).getId()))
    .saveAsTextFile("/rowkey/Rowkey_new");
You had the intention to map one incoming record to either zero or one outgoing record. This cannot be done with a map. However, filter maps each incoming record to zero or one outgoing records, where the outgoing record equals the incoming record, and flatMap gives you more flexibility by allowing you to map to zero or more outgoing records of any type.
It is strange, but not inconceivable, that you create non-Dict objects that are going to be filtered out further downstream anyhow. Possibly you can consider pushing your filter even further upstream to make sure you only create Dict instances. Without knowing the rest of your code, this is only an assumption of course, and not part of your original question anyhow.
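If you want the decision to stay inside a single transformation, a hedged sketch using flatMap (assuming counts2 holds plain Objects, as in the question) could look like this:

// Spark 1.x FlatMapFunction returns an Iterable; on Spark 2.x append
// .iterator() to each returned collection, since it expects an Iterator.
JavaRDD<Text> tx = counts2.flatMap(o -> {
    if (o instanceof Dict) {
        return Collections.singletonList(new Text(((Dict) o).getId()));
    }
    return Collections.<Text>emptyList();
});
tx.saveAsTextFile("/rowkey/Rowkey_new");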
I have a task which writes Avro output in multiple directories organized by a few fields of the input records.
For example :
Process records of countries across years
and write in a directory structure of country/year
eg:
outputs/usa/2015/outputs_usa_2015.avro
outputs/uk/2014/outputs_uk_2014.avro
AvroMultipleOutputs multipleOutputs=new AvroMultipleOutputs(context);
....
....
multipleOutputs.write("output", avroKey, NullWritable.get(),
OUTPUT_DIR + "/" + record.getCountry() + "/" + record.getYear() + "/outputs_" +record.getCountry()+"_"+ record.getYear());
What output committer would the above code use to write the output? Is it unsafe to use with speculative execution?
With speculative execution this causes (or may cause) an org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException.
In the post Hadoop Reducer: How can I output to multiple directories using speculative execution?, it is suggested to use a custom output committer.
The code below, from Hadoop's AvroMultipleOutputs, does not indicate any problem with speculative execution:
private synchronized RecordWriter getRecordWriter(TaskAttemptContext taskContext,
        String baseFileName) throws IOException, InterruptedException {
    writer = ((OutputFormat) ReflectionUtils.newInstance(taskContext.getOutputFormatClass(),
        taskContext.getConfiguration())).getRecordWriter(taskContext);
    ...
}
Neither does the write method document any issues when the base output path is outside the job directory:
public void write(String namedOutput, Object key, Object value, String baseOutputPath)
Is there a real issue with AvroMultipleOutputs (and other output mechanisms) under speculative execution when writing outside the job directory?
If so, how do I override AvroMultipleOutputs to give it its own output committer? I can't see any output format inside AvroMultipleOutputs whose output committer it uses.
AvroMultipleOutputs will use the OutputFormat that you registered in the job configuration when adding the named output, e.g. using the addNamedOutput API of AvroMultipleOutputs (for example AvroKeyValueOutputFormat).
With AvroMultipleOutputs, you might not be able to use the speculative task execution feature. Even overriding it either would not help or would not be simple.
Instead, you should write your own OutputFormat (most probably extending one of the available Avro output formats, e.g. AvroKeyValueOutputFormat) and override/implement its getRecordWriter API so that it returns a single RecordWriter instance, say MainRecordWriter (the name is just for reference).
This MainRecordWriter would maintain a map of RecordWriter instances (e.g. AvroKeyValueRecordWriter). Each of these RecordWriter instances would belong to one of the output files. In the write API of MainRecordWriter, you would get the actual RecordWriter instance from the map (based on the record you are going to write) and write the record using that record writer. So MainRecordWriter would just work as a wrapper over multiple RecordWriter instances.
For a similar implementation, you might like to study the code of the MultiStorage class from the piggybank library.
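A rough sketch of that wrapper idea follows (the class name, the country/year fields and createWriterFor() are illustrative assumptions, not the library's API; the real work of opening each delegate writer under the committer's work path is left out):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MainRecordWriter extends RecordWriter<AvroKey<GenericRecord>, NullWritable> {

    // One delegate writer per country/year directory
    private final Map<String, RecordWriter<AvroKey<GenericRecord>, NullWritable>> writers =
            new HashMap<String, RecordWriter<AvroKey<GenericRecord>, NullWritable>>();

    @Override
    public void write(AvroKey<GenericRecord> key, NullWritable value)
            throws IOException, InterruptedException {
        GenericRecord record = key.datum();
        String dir = record.get("country") + "/" + record.get("year");
        RecordWriter<AvroKey<GenericRecord>, NullWritable> delegate = writers.get(dir);
        if (delegate == null) {
            // Hypothetical helper: open an Avro writer beneath the output
            // committer's work path so speculative attempts stay isolated.
            delegate = createWriterFor(dir);
            writers.put(dir, delegate);
        }
        delegate.write(key, value);
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        for (RecordWriter<AvroKey<GenericRecord>, NullWritable> w : writers.values()) {
            w.close(context);
        }
    }

    private RecordWriter<AvroKey<GenericRecord>, NullWritable> createWriterFor(String dir) {
        throw new UnsupportedOperationException("sketch only: create the Avro record writer here");
    }
}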
When you add a named output to AvroMultipleOutputs, it will call either AvroKeyOutputFormat.getRecordWriter() or AvroKeyValueOutputFormat.getRecordWriter(), both of which call AvroOutputFormatBase.getAvroFileOutputStream(), whose content is:
protected OutputStream getAvroFileOutputStream(TaskAttemptContext context) throws IOException {
    Path path = new Path(((FileOutputCommitter) getOutputCommitter(context)).getWorkPath(),
        getUniqueFile(context,
            context.getConfiguration().get("avro.mo.config.namedOutput", "part"),
            org.apache.avro.mapred.AvroOutputFormat.EXT));
    return path.getFileSystem(context.getConfiguration()).create(path);
}
And AvroOutputFormatBase extends FileOutputFormat (the getOutputCommitter() call in the above method is in fact a call to FileOutputFormat.getOutputCommitter()). Hence, AvroMultipleOutputs should have the same constraints as MultipleOutputs.
I have some counters I created in my Mapper class:
(example written using the appengine-mapreduce Java library v.0.5)
@Override
public void map(Entity entity) {
    getContext().incrementCounter("analyzed");
    if (isSpecial(entity)) {
        getContext().incrementCounter("special");
    }
}
(The method isSpecial just returns true or false depending on the state of the entity, not relevant to the question)
I want to access those counters when I have finished processing everything, in the finish method of the Output class:
@Override
public Summary finish(Collection<? extends OutputWriter<Entity>> writers) {
    // get the counters and save/return the summary
    int analyzed = 0; // getCounter("analyzed");
    int special = 0;  // getCounter("special");
    Summary summary = new Summary(analyzed, special);
    save(summary);
    return summary;
}
... but the getCounter method is only available from the MapperContext class, which is accessible only from the Mapper's/Reducer's getContext() method.
How can I access my counters at the Output stage?
Side note: I can't send the counter values through my output, because the whole MapReduce is about transforming one set of Entities into another set (in other words, the counters are not the main purpose of the MapReduce). The counters are just for control; it makes sense to compute them here instead of creating another process just to do the counts.
Thanks.
There is currently no way to do this inside the output stage, but feel free to request it here:
https://code.google.com/p/appengine-mapreduce/issues/list
What you can do, however, is chain a job to run after your MapReduce that will receive its output and counters. There is an example of this here:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/java/example/src/com/google/appengine/demos/mapreduce/entitycount/ChainedMapReduceJob.java
The above example runs three MapReduce jobs in a row. Note that these don't have to be MapReduce jobs; you can create your own class that extends Job and has a run method which creates your Summary object.
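For illustration only, a hedged sketch of such a follow-up job, assuming the pipeline library's Job1/Value API and that the finished MapReduce hands you a MapReduceResult whose getCounters() exposes the counters; the Summary class and save() come from the question, and every other name should be checked against the library version you use:

// Sketch: a pipeline job chained after the MapReduce that turns its
// counters into a Summary. Imports from the appengine-mapreduce and
// pipeline libraries are omitted; verify the exact classes in your version.
public class SummaryJob extends Job1<Summary, MapReduceResult<?>> {

    @Override
    public Value<Summary> run(MapReduceResult<?> result) {
        Counters counters = result.getCounters();
        int analyzed = (int) counters.getCounter("analyzed").getValue();
        int special = (int) counters.getCounter("special").getValue();

        Summary summary = new Summary(analyzed, special);
        save(summary);               // save() as in the question
        return immediate(summary);   // pipeline helper to wrap a plain value
    }
}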
I have two MapReduce jobs, and I want the second job to be able to write my results into two different files, in two different directories.
I would like something similar to FileInputFormat.addInputPath(..) (which can take multiple input paths), but for the output.
I'm completely new to MapReduce, and I have the constraint of writing my code for Hadoop 0.21.0.
I use context.write(..) in my Reduce step, but I don't see how to control multiple output paths...
Thanks for your time !
Here is the reduce code from my first job, to show that I only know how to output into a /../part* file. What I would like now is to be able to specify two precise files for different outputs, depending on the key:
public static class NormalizeReducer extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {

    public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context)
            throws IOException, InterruptedException {
        NetflixUser user = new NetflixUser(key.get());
        for (NetflixRating r : values) {
            user.addRating(new NetflixRating(r));
        }
        user.normalizeRatings();
        user.reduceRatings();
        context.write(key, user);
    }
}
EDIT: so I did the method from the last comment as you mentioned, Amar. I don't know if it works; I have other problems with my HDFS, but before I forget, let's put my discoveries here for the sake of civilization:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
MultipleOutputs DOES NOT act as a replacement for FileOutputFormat. You define one output path with FileOutputFormat, and then you can add many more with MultipleOutputs.
addNamedOutput method: the String namedOutput is just a descriptive name.
You actually define the path in the write method, via the String baseOutputPath argument.
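To make that concrete, here is a hedged sketch of the question's reducer using MultipleOutputs; the named output "ratings" and the even/odd baseOutputPath choice are invented for illustration, and the driver still needs FileOutputFormat.setOutputPath(job, ...) plus MultipleOutputs.addNamedOutput(job, "ratings", ...):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public static class NormalizeReducer
        extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {

    private MultipleOutputs<LongWritable, NetflixUser> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, NetflixUser>(context);
    }

    @Override
    public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context)
            throws IOException, InterruptedException {
        NetflixUser user = new NetflixUser(key.get());
        for (NetflixRating r : values) {
            user.addRating(new NetflixRating(r));
        }
        user.normalizeRatings();
        user.reduceRatings();

        // Pick one of two base output paths depending on the key;
        // files end up under the job output dir as <path>-r-NNNNN.
        String basePath = (key.get() % 2 == 0) ? "evenUsers/part" : "oddUsers/part";
        mos.write("ratings", key, user, basePath);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // flush and close all the extra outputs
    }
}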
We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.
You could look through the source for FileInputFormat.getSplits() - this pulls back the configuration property for mapred.input.dir and then resolves this CSV to an array of Paths.
These paths can still represent folders and regexes, so the next thing getSplits() does is pass the array to a protected method, org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This actually goes through the dirs/regexes listed and lists the matching files (also invoking a PathFilter if configured).
So, with this method being protected, you could create a simple 'dummy' extension of FileInputFormat that has a listStatus method accepting the Mapper.Context as its argument, which in turn wraps a call to the FileInputFormat.listStatus method:
public class DummyFileInputFormat extends FileInputFormat {

    public List<FileStatus> listStatus(Context mapContext) throws IOException {
        return super.listStatus(mapContext);
    }

    @Override
    public RecordReader createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // dummy input format, so this will never be called
        return null;
    }
}
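Used from the mapper, that might look roughly like this (a sketch, not tested against a specific Hadoop version):

private int totalInputFiles;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Lists the files behind mapred.input.dir exactly as getSplits() would
    totalInputFiles = new DummyFileInputFormat().listStatus(context).size();
}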
EDIT: In fact it looks like FileInputFormat already does this for you, configuring a job property mapreduce.input.num.files at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203)
Here's the JIRA ticket
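If your Hadoop version does set that property, reading it from the mapper is a one-liner (the -1 default here is just a sentinel for "not set"):

int numInputFiles = context.getConfiguration().getInt("mapreduce.input.num.files", -1);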
You can set a value in your job configuration with the number of input paths, like this:
jobConf.setInt("numberOfPaths", paths.length);
Just put that code where you configure your job. After that, read it back from the configuration in your Mapper.setup(Mapper.Context context) by getting it from the context, as sketched below.
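A minimal sketch of both sides, assuming the property name numberOfPaths from above (the input paths themselves are placeholders):

// Driver side: remember how many input paths were added.
Path[] paths = { new Path("/data/in1"), new Path("/data/in2") };
for (Path p : paths) {
    FileInputFormat.addInputPath(job, p);
}
job.getConfiguration().setInt("numberOfPaths", paths.length);

// Mapper side: read the value back once per task in setup().
public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private int numberOfPaths;

    @Override
    protected void setup(Context context) {
        numberOfPaths = context.getConfiguration().getInt("numberOfPaths", 0);
    }
}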