I've got a JobControl that manages a chain of n jobs.
for (int i = 0; i < iterations; i++) {
    Job eStep = EStepJob.createJob(config);
    Job mStep = MStepJob.createJob(config);
    emChain.add(new ControlledJob(eStep, getDeps(emChain)));
    emChain.add(new ControlledJob(mStep, getDeps(emChain)));
}
jobControl.addJobCollection(emChain);
I would like to clean each job's output directory only immediately before that job starts; the directories must not be cleaned at the time the jobs are initialized.
My current solution is to put the clearing code in the map phase, which drastically slows down the execution:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path outputPath = new Path(context.getConfiguration().get(AR_PROBS_OUTPUT));
    if (fs.exists(outputPath)) {
        fs.delete(outputPath, true);
    }
    // ... actual map logic follows
}
Is there a more suitable way to do this?
You can use the Mapper.setup() method for this. It is executed once per map task, before any calls to map() are made on that task.
I believe you are using HDFS when you initialize the FileSystem in your code.
Either way, the code works the same way; the difference is that it will be executed as many times as there are mapper tasks, not once per input record as in your current map() implementation.
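For example, a minimal sketch of moving the clearing code into setup() inside the same Mapper (assuming AR_PROBS_OUTPUT is the same configuration key used in your map() code):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Same output key as in the question's map() method.
    Path outputPath = new Path(conf.get(AR_PROBS_OUTPUT));
    // Runs once per map task, before any calls to map().
    if (fs.exists(outputPath)) {
        fs.delete(outputPath, true);
    }
}
Keep in mind that setup() still runs once per map task rather than once per job, so with several map tasks the delete may be attempted concurrently by different tasks.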
Alternatively, you can write the output to a temporary directory when you initialize the job and remove that directory after the job completes.
You can then check whether the output needs to be committed; if it does, an OutputCommitter can commit it for you.
Please check the link below:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputCommitter.html
I am trying to run a MapReduce program, just WordCount, for my better understanding. Everything is working fine, as it is supposed to. I want to call a function after the MapReduce program completes, and in that function I want to merge all the part-files produced in the reduce step into a single text file containing the contents of all the part-files. I have seen a related problem, and people suggested using the FileUtil.copyMerge function. My question is how to make the function call so that it gets executed after the whole MapReduce process.
public class mapreducetask {

    private static void filesmerger() {
        // I want to merge the part-files here (maybe using FileUtil.copyMerge)
    }

    public static void main(String[] args) throws Exception {
        Configuration cnf = new Configuration();
        cnf.set("mapreduce.output.textoutputformat.separator", ":");
        Integer numberOfReducers = 3;
        Job jb = new Job(cnf, "mapreducejob");
        jb.setJarByClass(mapreducetask.class);
        jb.setMapperClass(mapper.class);
        jb.setNumReduceTasks(numberOfReducers);
        jb.setReducerClass(reducer.class);
        jb.setOutputKeyClass(Text.class);
        jb.setOutputValueClass(IntWritable.class);
        jb.setInputFormatClass(customfileinputformat.class);
        Path input = new Path("Input");
        Path output = new Path("Output");
        FileInputFormat.addInputPath(jb, input);
        FileOutputFormat.setOutputPath(jb, output);
        // Should I call my merger function here? Location 1
        System.exit(jb.waitForCompletion(true) ? 0 : 1);
    }
}
When I make the call from Location 1 (see the code), it seems to get executed even before the MapReduce program runs, which I don't want. How can I call the function after the MapReduce process has completed?
You're calling the code at Location 1 before you call jb.waitForCompletion(true). You need to call it after (and obviously not call System.exit() until afterwards). So:
jb.waitForCompletion(true);
//Run your code
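For example, a rough sketch (not from the original answer) using FileUtil.copyMerge, which is available in Hadoop 2.x (it was removed in 3.x); it reuses the cnf and output variables from your main method and needs org.apache.hadoop.fs.FileUtil imported:
boolean success = jb.waitForCompletion(true);
if (success) {
    FileSystem fs = FileSystem.get(cnf);
    // Merge all part-* files under "Output" into a single file.
    FileUtil.copyMerge(fs, output, fs, new Path("Output-merged.txt"), false, cnf, null);
}
System.exit(success ? 0 : 1);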
I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, split by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I can't use the same method I used when reading only one file. That method works fine when I read just one file and gives me the correct output. Could someone please let me know how to fix this? Thanks!
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("whole text files")
            .setMaster("local[2]").set("spark.executor.memory", "1g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, String> fileNameContentsRDD =
            jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
    JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(
            new Function<Tuple2<String, String>, String[]>() {
        @Override
        public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
            String content = fileNameContent._2();
            String[] sarray = content.split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            // This is where it breaks down: pd is the RDD I had in the
            // single-file version, and it is not available inside call().
            pd.cache();
            RowMatrix mat = new RowMatrix(pd.rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd =
                    mat.computeSVD(84, true, 1.0E-9d);
            Vector s = svd.s();
        }
    });
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might simply not be what you want here.
Since by design "Small files are preferred" (see the scaladoc), you could use mapPartitions or collect (with a filter) to grab a subset of the files and apply the parsing to them.
Once you have the files per partition in hand, you can use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
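Since your files are small enough to be read whole, one workaround (my own sketch, not part of the quoted documentation) is to collect the (path, content) pairs to the driver and run one small SVD job per file:
// Assumed imports: java.util.*, scala.Tuple2, org.apache.spark.api.java.JavaRDD,
// org.apache.spark.mllib.linalg.*, org.apache.spark.mllib.linalg.distributed.RowMatrix
List<Tuple2<String, String>> files = fileNameContentsRDD.collect();
for (Tuple2<String, String> fileNameContent : files) {
    // Each file is assumed to be a matrix: one row per line, values separated by spaces.
    String[] lines = fileNameContent._2().split("\n");
    List<Vector> rows = new ArrayList<>();
    for (String line : lines) {
        if (line.trim().isEmpty()) {
            continue;
        }
        String[] tokens = line.trim().split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        rows.add(Vectors.dense(values));
    }
    JavaRDD<Vector> pd = jsc.parallelize(rows);
    pd.cache();
    RowMatrix mat = new RowMatrix(pd.rdd());
    SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
    System.out.println(fileNameContent._1() + " -> " + svd.s());
}
As the quoted scheduling documentation suggests, you could also submit these per-file jobs from separate threads (or a parallel collection) so they run concurrently instead of one after another.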
I have a task which writes Avro output into multiple directories organized by a few fields of the input records.
For example :
Process records of countries across years
and write in a directory structure of country/year
eg:
outputs/usa/2015/outputs_usa_2015.avro
outputs/uk/2014/outputs_uk_2014.avro
AvroMultipleOutputs multipleOutputs = new AvroMultipleOutputs(context);
....
....
multipleOutputs.write("output", avroKey, NullWritable.get(),
        OUTPUT_DIR + "/" + record.getCountry() + "/" + record.getYear()
        + "/outputs_" + record.getCountry() + "_" + record.getYear());
Which output committer does this code use to write the output? Is it unsafe to use with speculative execution?
With speculative execution enabled, this causes (or may cause) an org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException.
In this post, Hadoop Reducer: How can I output to multiple directories using speculative execution?, it is suggested to use a custom output committer.
The code below from Hadoop's AvroMultipleOutputs does not indicate any problem with speculative execution:
private synchronized RecordWriter getRecordWriter(TaskAttemptContext taskContext,
        String baseFileName) throws IOException, InterruptedException {
    writer =
            ((OutputFormat) ReflectionUtils.newInstance(taskContext.getOutputFormatClass(),
                    taskContext.getConfiguration())).getRecordWriter(taskContext);
    ...
}
Nor does the write method document any issues when the base output path is outside the job directory:
public void write(String namedOutput, Object key, Object value, String baseOutputPath)
Is there a real issue with AvroMultipleOutputs (and other outputs) with speculative execution when writing outside the job directory?
If so, how do I override AvroMultipleOutputs so that it has its own output committer? I can't see any OutputFormat inside AvroMultipleOutputs whose output committer it uses.
AvroMultipleOutputs will use the OutputFormat which you registered in the Job configuration when adding the named output, e.g. via the addNamedOutput API of AvroMultipleOutputs (for example AvroKeyValueOutputFormat).
With AvroMultipleOutputs, you might not be able to use the speculative task execution feature, and overriding it would either not help or not be simple.
Instead, you should write your own OutputFormat (most probably extending one of the available Avro output formats, e.g. AvroKeyValueOutputFormat) and override/implement its getRecordWriter API so that it returns a single RecordWriter instance, say MainRecordWriter (just for reference).
This MainRecordWriter would maintain a map of RecordWriter instances (e.g. AvroKeyValueRecordWriter). Each of these RecordWriter instances would belong to one of the output files. In the write API of MainRecordWriter, you would look up the actual RecordWriter instance in the map (based on the record you are about to write) and write the record using that record writer. So MainRecordWriter would just act as a wrapper over multiple RecordWriter instances.
For some similar implementation, you might like to study the code of MultiStorage class from piggybank library.
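To make the shape of that concrete, here is a rough, hypothetical skeleton. The class names (CountryYearAvroOutputFormat, MainRecordWriter) and the helper createWriterFor are mine, not from any library, and the field names "country"/"year" are assumed to match the question's schema; the actual opening of the per-directory Avro writers is deliberately left out. The key point for speculative execution is that every file should be opened under the committer's work path (see getWorkPath() in the Avro snippet below), so only the committed attempt's files get promoted.
// Assumed imports: java.io.IOException, java.util.*, org.apache.avro.generic.GenericRecord,
// org.apache.avro.mapred.AvroKey, org.apache.hadoop.io.NullWritable,
// org.apache.hadoop.mapreduce.*, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public class CountryYearAvroOutputFormat
        extends FileOutputFormat<AvroKey<GenericRecord>, NullWritable> {

    @Override
    public RecordWriter<AvroKey<GenericRecord>, NullWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new MainRecordWriter(context);
    }

    private static class MainRecordWriter
            extends RecordWriter<AvroKey<GenericRecord>, NullWritable> {

        private final TaskAttemptContext context;
        // One underlying RecordWriter per country/year sub-directory.
        private final Map<String, RecordWriter<AvroKey<GenericRecord>, NullWritable>>
                writers = new HashMap<>();

        MainRecordWriter(TaskAttemptContext context) {
            this.context = context;
        }

        @Override
        public void write(AvroKey<GenericRecord> key, NullWritable value)
                throws IOException, InterruptedException {
            GenericRecord record = key.datum();
            String subDir = record.get("country") + "/" + record.get("year");
            RecordWriter<AvroKey<GenericRecord>, NullWritable> writer = writers.get(subDir);
            if (writer == null) {
                writer = createWriterFor(subDir, context);
                writers.put(subDir, writer);
            }
            writer.write(key, value);
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            for (RecordWriter<AvroKey<GenericRecord>, NullWritable> w : writers.values()) {
                w.close(context);
            }
        }

        private RecordWriter<AvroKey<GenericRecord>, NullWritable> createWriterFor(
                String subDir, TaskAttemptContext context) {
            // Sketch only: open an Avro writer for a file located under the
            // committer's work path plus "/" + subDir, so the commit protocol
            // can promote it safely.
            throw new UnsupportedOperationException("left as an exercise");
        }
    }
}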
When you add a named output to AvroMultipleOutputs, it will call either AvroKeyOutputFormat.getRecordWriter() or AvroKeyValueOutputFormat.getRecordWriter(), both of which call AvroOutputFormatBase.getAvroFileOutputStream(), whose content is:
protected OutputStream getAvroFileOutputStream(TaskAttemptContext context) throws IOException {
    Path path = new Path(((FileOutputCommitter) getOutputCommitter(context)).getWorkPath(),
            getUniqueFile(context,
                    context.getConfiguration().get("avro.mo.config.namedOutput", "part"),
                    org.apache.avro.mapred.AvroOutputFormat.EXT));
    return path.getFileSystem(context.getConfiguration()).create(path);
}
And AvroOutputFormatBase extends FileOutputFormat (the getOutputCommitter() call in the above method is in fact a call to FileOutputFormat.getOutputCommitter()). Hence, AvroMultipleOutputs should have the same constraints as MultipleOutputs.
I have some counters I created in my Mapper class:
(example written using the appengine-mapreduce Java library v.0.5)
@Override
public void map(Entity entity) {
    getContext().incrementCounter("analyzed");
    if (isSpecial(entity)) {
        getContext().incrementCounter("special");
    }
}
(The method isSpecial just returns true or false depending on the state of the entity; it's not relevant to the question.)
I want to access those counters when I finish processing everything, in the finish method of the Output class:
@Override
public Summary finish(Collection<? extends OutputWriter<Entity>> writers) {
    // get the counters and save/return the summary
    int analyzed = 0; // getCounter("analyzed");
    int special = 0;  // getCounter("special");
    Summary summary = new Summary(analyzed, special);
    save(summary);
    return summary;
}
... but the getCounter method is only available from the MapperContext class, which is accessible only through the Mapper's/Reducer's getContext() method.
How can I access my counters at the Output stage?
Side note: I can't carry the counter values along in my output because the whole Map/Reduce is about transforming one set of Entities into another (in other words, the counters are not the main purpose of the Map/Reduce). The counters are just for monitoring; it makes sense to compute them here instead of creating another process just to do the counting.
Thanks.
There is no way to do this inside of Output today. But feel free to request it here:
https://code.google.com/p/appengine-mapreduce/issues/list
What you can do, however, is chain a job to run after your MapReduce that will receive its output and counters. There is an example of this here:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/java/example/src/com/google/appengine/demos/mapreduce/entitycount/ChainedMapReduceJob.java
In the above example, it runs 3 MapReduce jobs in a row. Note that these don't have to be MapReduce jobs; you can create your own class that extends Job and has a run method which creates your Summary object.
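For illustration, a rough sketch of such a follow-up pipeline job. The MapReduceResult/Counters accessors and the output result type parameter are my assumptions based on the 0.5 library and may need adjusting to your version; save(...) is the same helper used in your finish method.
// Hypothetical follow-up job; check MapReduceResult and Counters against your
// appengine-mapreduce version before relying on these method names.
public class SummaryJob extends Job1<Summary, MapReduceResult<List<List<Entity>>>> {
    @Override
    public Value<Summary> run(MapReduceResult<List<List<Entity>>> result) {
        long analyzed = result.getCounters().getCounter("analyzed").getValue();
        long special = result.getCounters().getCounter("special").getValue();
        Summary summary = new Summary((int) analyzed, (int) special);
        save(summary);
        return immediate(summary);
    }
}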
I have a Hadoop program in which I need to pass a single output value, generated by the first MapReduce job, to a second MapReduce job.
Example:
MapReduce 1 -> writes a double value to HDFS (the file name is similar to part-00000).
In the second MapReduce job I want to use the double value from that part-00000 file.
How can I do this? Can anyone please provide a code snippet?
Wait for the first job to finish and then run the second one on the output of the first. You can do this:
1) In the Driver (a sketch of reading the first job's value is shown after the list of options below):
int code = firstJob.waitForCompletion(true) ? 0 : 1;
if (code == 0) {
    Job secondJob = new Job(new Configuration(), "JobChaining-Second");
    TextInputFormat.addInputPath(secondJob, outputDirOfFirstJob);
    ...
}
2) Use JobControl and ControlledJob:
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html
To use JobControl, start by wrapping your jobs with ControlledJob. Doing this is relatively simple: you create your job like you usually would, except you also create a ControlledJob that takes in your Job or Configuration as a parameter, along with a list of its dependencies (other ControlledJobs). Then, you add them one by one to the JobControl object, which handles the rest.
3) Externally (e.g. from shell script). Pass input/output paths as arguments.
4) Use Apache Oozie. You will specify your jobs in XML.
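If you go with option 1, here is one possible sketch (my own, with a hypothetical configuration key first.job.result) for reading the single double from the first job's output in the driver and handing it to the second job; it needs java.io.BufferedReader and java.io.InputStreamReader imported:
Configuration conf2 = new Configuration();
FileSystem fs = FileSystem.get(conf2);
// Assumes the first job wrote just the value on the first line of part-00000
// (adjust the file name, e.g. part-r-00000, and the parsing if the line is key<TAB>value).
Path result = new Path(outputDirOfFirstJob, "part-00000");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)))) {
    double value = Double.parseDouble(reader.readLine().trim());
    conf2.setDouble("first.job.result", value);
}
Job secondJob = new Job(conf2, "JobChaining-Second");
// In the second job's mapper/reducer, read it back with
// context.getConfiguration().getDouble("first.job.result", 0.0)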