I have two MapReduce jobs, and I want the second job to be able to write its results into two different files, in two different directories.
I would like something similar to FileInputFormat.addInputPath(..), which takes multiple input paths, but for the output.
I'm completely new to MapReduce, and I have the particular constraint of writing my code against Hadoop 0.21.0.
I use context.write(..) in my Reduce step, but I don't see how to control multiple output paths...
Thanks for your time!
Here is the reduce code from my first job, to show that I only know how to write to a single output (it goes into a /../part* file; what I would now like is to be able to specify two precise files for different outputs, depending on the key):
public static class NormalizeReducer extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {

    public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context)
            throws IOException, InterruptedException {
        NetflixUser user = new NetflixUser(key.get());
        for (NetflixRating r : values) {
            user.addRating(new NetflixRating(r));
        }
        user.normalizeRatings();
        user.reduceRatings();
        context.write(key, user);
    }
}
EDIT: so I did the method in the last comment as you mentioned, Amar. I don't know yet whether it works, as I have another problem with my HDFS, but before I forget, let me put my discoveries here for the sake of civilization:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
MultipleOutputs does NOT act in place of FileOutputFormat. You define one output path with FileOutputFormat, and then you can add many more with MultipleOutputs.
In the addNamedOutput method, the String namedOutput argument is just a descriptive name.
The actual path is defined in the write method, via the String baseOutputPath argument.
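To make this concrete for future readers, here is roughly how I understand the wiring, reusing the shape of my reducer above (an untested sketch: the "users" named output, TextOutputFormat, and the even/odd base paths are just placeholders I picked):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// The driver still needs the regular output path, e.g.
//   FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));
// plus the named output declaration, e.g.
//   MultipleOutputs.addNamedOutput(job, "users", TextOutputFormat.class,
//                                  LongWritable.class, NetflixUser.class);
public class NormalizeReducer
        extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {

    private MultipleOutputs<LongWritable, NetflixUser> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, NetflixUser>(context);
    }

    @Override
    public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context)
            throws IOException, InterruptedException {
        NetflixUser user = new NetflixUser(key.get());
        for (NetflixRating r : values) {
            user.addRating(new NetflixRating(r));
        }
        user.normalizeRatings();
        user.reduceRatings();

        // The last argument is the base output path, relative to the job output directory;
        // choosing it per key is what routes records into different files/directories.
        String basePath = (key.get() % 2 == 0) ? "even/part" : "odd/part";
        mos.write("users", key, user, basePath);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}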
picoCLI's @-file mechanism is almost what I need, but not exactly. The reason is that I want to control the exact location of the additional files being parsed, depending on previous option values.
Example: When called with the options
srcfolder=/a/b optionfile=of.txt, my program should see the additional options read from /a/b/of.txt, but when called with srcfolder=../c optionfile=of.txt, it should see those from ../c/of.txt.
The @-file mechanism can't do that, because it expands ALL the option files (always relative to the current folder, if they're relative) prior to processing ANY option values.
So I'd like to have picoCLI...
process options "from left to right",
recursively parse an option file when it's mentioned in an optionfile option,
and after that continue with the following options.
I might be able to solve this by recursively starting to parse from within the annotated setter method:
...
Config cfg = new Config();
CommandLine cmd = new CommandLine(cfg);
cmd.parseArgs(a);
...
public class Config {
    private String srcfolder;

    @Option(names = "srcfolder")
    public void setSrcfolder(String path) {
        this.srcfolder = path;
    }

    @Option(names = "optionfile")
    public void parseOptionFile(String pathAndName) {
        // validate path, do some other housekeeping...
        CommandLine cmd = new CommandLine(this /* same Config instance! */);
        cmd.parseArgs(new String[] { "@" + this.srcfolder + pathAndName });
    }
...
This way several CommandLine instances would call setter methods on the same Config instance, recursively "interrupting" each other. Now comes the actual question: Is that a problem?
Of course my Config class has state. But do CommandLine instances also have state that might get messed up if other CommandLine instances also modify cfg "in between options"?
Thanks for any insights!
Edited to add: I tried, and I'm getting an UnmatchedArgumentException on the @-file option:
Exception in thread "main" picocli.CommandLine$UnmatchedArgumentException: Unmatched argument at index 0: '@/path/to/configfile'
at picocli.CommandLine$Interpreter.validateConstraints(CommandLine.java:13490)
...
So first I have to get around this: obviously picoCLI doesn't expand the @-file option unless it comes directly from the command line.
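What I'm thinking of as a workaround is to expand the option file myself inside the setter instead of relying on the @-prefix (an untested sketch; it assumes one argument per line and no comments or quoting in the file):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import picocli.CommandLine;
import picocli.CommandLine.Option;

public class Config {
    private String srcfolder;

    @Option(names = "srcfolder")
    public void setSrcfolder(String path) {
        this.srcfolder = path;
    }

    @Option(names = "optionfile")
    public void parseOptionFile(String pathAndName) throws IOException {
        // Read the file myself (one argument per line) and feed the entries to a
        // nested parseArgs() call on the same Config instance.
        List<String> args = Files.readAllLines(Paths.get(srcfolder, pathAndName));
        new CommandLine(this).parseArgs(args.toArray(new String[0]));
    }
}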
I did get it to work: several CommandLine instances can indeed work on the same instance of an annotated class without interfering with each other.
There are some catches and I had to work around a strange picoCLI quirk, but that's not exactly part of an answer to this question, so I explain them in this other question.
I have a Spring Batch application where I am processing multiple .txt files in parallel. My simple job configuration looks like below:
#Value("file:input/*.txt")
private Resource[] inputResources;
#Bean("partitioner")
#StepScope
public Partitioner partitioner() {
log.info("In Partitioner");
MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
partitioner.setResources(inputResources);
partitioner.partition(10);
return partitioner;
}
#Bean
#StepScope
#Qualifier("nodeItemReader")
#DependsOn("partitioner")
public FlatFileItemReader<FolderNodePojo> NodeItemReader(#Value("#{stepExecutionContext['fileName']}") String filename)
throws MalformedURLException {
return new FlatFileItemReaderBuilder<FolderNodePojo>().name("NodeItemReader").delimited().delimiter("<##>")
.names(new String[] { "id" }).fieldSetMapper(new BeanWrapperFieldSetMapper<FolderNodePojo>() {
{
setTargetType(FolderNodePojo.class);
}
}).linesToSkip(0).resource(new UrlResource(filename)).build();
}
There are thousands of .txt files, each with thousands of lines, being processed. Some files contain corrupted data and cause the job to fail. I need to generate and send a report of the file names that contain corrupted data.
How can I log the names of the files whose lines were all processed successfully, or, if that is easier, the names of the unsuccessful ones? That way I can generate a report based on it, and when I start the job again I can remove the successful files from the input directory. Any pointers/solutions will be greatly appreciated.
As a strategy, you could do something like this:
once a file is in progress (say, filename "file1.txt"), step 1 is to rename that file, so the filename reflects that it's in progress; for example: "file1.txt.started"
when a file completes successfully, rename it to reflect that; for example: "file1.txt.complete"
if an error is encountered, and the code has a chance to rename the file, rename it to something like "file1.txt.error"
What's nice about this is that you let the filesystem act like a database for you, capturing the state of everything.
Once your program is finished, you can just count up the totals by file extension:
anything named *.complete is good, those are the obvious successful runs
anything with *.started is broken, and specifically it means something went really wrong
anything named *.error is broken, but your code had a chance to notice
anything named *.txt was, for some reason, not picked up
Those last three scenarios (*.started, *.error, *.txt) are the ones to study further to understand what went wrong. If all goes well, you'll end up with *.complete only.
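If it helps, here is a minimal sketch of the renaming step in plain Java (the class name and exact suffix handling are mine, not part of any framework); in the Spring Batch job above you could call these methods from a step or chunk listener, for example:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Marks a file's processing state by renaming it, as described above.
public class FileStateMarker {

    public static Path markStarted(Path file) throws IOException {
        return rename(file, ".started");
    }

    public static Path markComplete(Path file) throws IOException {
        return rename(file, ".complete");
    }

    public static Path markError(Path file) throws IOException {
        return rename(file, ".error");
    }

    private static Path rename(Path file, String suffix) throws IOException {
        // Strip any previous marker so "file1.txt.started" becomes "file1.txt.complete".
        String name = file.getFileName().toString()
                .replaceAll("\\.(started|complete|error)$", "") + suffix;
        return Files.move(file, file.resolveSibling(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative usage with a made-up path.
        Path inProgress = markStarted(Paths.get("input/file1.txt"));
        // ... process the file ...
        markComplete(inProgress);
    }
}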
I have a task which writes Avro output in multiple directories organized by a few fields of the input records.
For example :
Process records of countries across years
and write in a directory structure of country/year
eg:
outputs/usa/2015/outputs_usa_2015.avro
outputs/uk/2014/outputs_uk_2014.avro
AvroMultipleOutputs multipleOutputs = new AvroMultipleOutputs(context);
....
....
multipleOutputs.write("output", avroKey, NullWritable.get(),
        OUTPUT_DIR + "/" + record.getCountry() + "/" + record.getYear()
                + "/outputs_" + record.getCountry() + "_" + record.getYear());
Which output committer would the code below use to write the output? Is it not safe to use with speculative execution?
With speculative execution this causes (or may cause) an org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException.
In this post:
Hadoop Reducer: How can I output to multiple directories using speculative execution?
it is suggested to use a custom output committer.
The code below from Hadoop's AvroMultipleOutputs does not mention any problem with speculative execution:
private synchronized RecordWriter getRecordWriter(TaskAttemptContext taskContext,
String baseFileName) throws IOException, InterruptedException {
writer =
((OutputFormat) ReflectionUtils.newInstance(taskContext.getOutputFormatClass(),
taskContext.getConfiguration())).getRecordWriter(taskContext);
...
}
Neither does the write method document any issues when the base output path is outside the job directory:
public void write(String namedOutput, Object key, Object value, String baseOutputPath)
Is there a real issue with AvroMultipleOutputs (and other outputs) under speculative execution when writing outside the job directory?
If so, how do I override AvroMultipleOutputs to have its own output committer? I can't see any output format inside AvroMultipleOutputs whose output committer it uses.
AvroMultipleOutputs will use the OutputFormat that you registered in the Job configuration when adding the named output, e.g. using the addNamedOutput API from AvroMultipleOutputs (for instance AvroKeyValueOutputFormat).
With AvroMultipleOutputs, you might not be able to use the speculative task execution feature. Even overriding it either would not help or would not be simple.
Instead, you should write your own OutputFormat (most probably extending one of the available Avro output formats, e.g. AvroKeyValueOutputFormat) and override/implement its getRecordWriter API so that it returns a single RecordWriter instance, say MainRecordWriter (the name is just for reference).
This MainRecordWriter would maintain a map of RecordWriter (e.g. AvroKeyValueRecordWriter) instances. Each of these RecordWriter instances would belong to one of the output files. In the write API of MainRecordWriter, you would get the actual RecordWriter instance from the map (based on the record you are going to write) and write the record using that record writer. So MainRecordWriter would just work as a wrapper over multiple RecordWriter instances.
For a similar implementation, you might like to study the code of the MultiStorage class from the piggybank library.
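Just to show the shape of the idea, here is a rough sketch (untested; MyRecord stands for your record type with getCountry()/getYear(), and createWriterFor(...) is a placeholder you would implement, e.g. by opening the real Avro writer under the task attempt's work path so the output committer still manages the files):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Wrapper that routes each record to a per-bucket RecordWriter.
public class MainRecordWriter extends RecordWriter<MyRecord, NullWritable> {

    private final TaskAttemptContext context;
    private final Map<String, RecordWriter<MyRecord, NullWritable>> writers =
            new HashMap<String, RecordWriter<MyRecord, NullWritable>>();

    public MainRecordWriter(TaskAttemptContext context) {
        this.context = context;
    }

    @Override
    public void write(MyRecord record, NullWritable value) throws IOException, InterruptedException {
        // Route each record to the writer for its country/year "bucket".
        String bucket = record.getCountry() + "/" + record.getYear();
        RecordWriter<MyRecord, NullWritable> writer = writers.get(bucket);
        if (writer == null) {
            writer = createWriterFor(bucket, context); // placeholder factory method
            writers.put(bucket, writer);
        }
        writer.write(record, value);
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        for (RecordWriter<MyRecord, NullWritable> writer : writers.values()) {
            writer.close(context);
        }
    }

    private RecordWriter<MyRecord, NullWritable> createWriterFor(
            String bucket, TaskAttemptContext context) {
        // Open the real (e.g. Avro) writer for this bucket here.
        throw new UnsupportedOperationException("left as an exercise");
    }
}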
When you add a named output to AvroMultipleOutputs, it will call either AvroKeyOutputFormat.getRecordWriter() or AvroKeyValueOutputFormat.getRecordWriter(), both of which call AvroOutputFormatBase.getAvroFileOutputStream(), whose content is:
protected OutputStream getAvroFileOutputStream(TaskAttemptContext context) throws IOException {
    Path path = new Path(((FileOutputCommitter) getOutputCommitter(context)).getWorkPath(),
            getUniqueFile(context,
                    context.getConfiguration().get("avro.mo.config.namedOutput", "part"),
                    org.apache.avro.mapred.AvroOutputFormat.EXT));
    return path.getFileSystem(context.getConfiguration()).create(path);
}
And AvroOutputFormatBase extends FileOutputFormat (the getOutputCommitter() call in the above method is in fact a call to FileOutputFormat.getOutputCommitter()). Hence, AvroMultipleOutputs should have the same constraints as MultipleOutputs.
When I try to run a map/reduce job on a Hadoop cluster without specifying any input file, I get the following exception:
java.io.IOException: No input paths specified in job
Well, I can imagine cases where running a job without input files makes sense; generating a test file would be one of them. Is it possible to do that with Hadoop? If not, do you have any experience generating files? Is there a better way than keeping a dummy file with one record on the cluster to be used as the input file for generation jobs?
File paths are relevant for FileInputFormat-based inputs like SequenceFileInputFormat, etc. But input formats that read from HBase or a database do not read from files, so you could make your own implementation of InputFormat and define your own behaviour in getSplits, the RecordReader, and createRecordReader. For inspiration, look into the source code of the TextInputFormat class.
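For example, here is a rough sketch of such an input format (untested; the class name and the generator.* configuration keys are made up for illustration) that fabricates splits and records without touching any files:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Fabricates splits and records out of thin air, so no input paths are needed.
public class GeneratorInputFormat extends InputFormat<LongWritable, NullWritable> {

    // A split that carries no data; it only exists so Hadoop schedules map tasks.
    public static class EmptySplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) throws IOException { }
        @Override public void readFields(DataInput in) throws IOException { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        int numSplits = context.getConfiguration().getInt("generator.num.splits", 1);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new EmptySplit());
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, NullWritable>() {
            private long current = -1;
            private long total;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx) {
                total = ctx.getConfiguration().getLong("generator.records.per.split", 10);
            }
            @Override public boolean nextKeyValue() { return ++current < total; }
            @Override public LongWritable getCurrentKey() { return new LongWritable(current); }
            @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
            @Override public float getProgress() { return total == 0 ? 1.0f : current / (float) total; }
            @Override public void close() { }
        };
    }
}

In the driver you would then just call job.setInputFormatClass(GeneratorInputFormat.class) and skip FileInputFormat.addInputPath entirely.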
For MR job unit testing you can also use MRUnit.
If you want to generate test data with Hadoop, then I'd recommend having a look at the source code of Teragen.
I guess you are looking to test your map-reduce on a small set of data, so in that case I would recommend the following.
A unit test for your map-reduce will solve your problem.
If you want to test your mapper/combiner/reducer on a single line of input from your file, the best option is to write a unit test for each.
Sample code:
Using a mocking framework in Java, you can run these test cases in your IDE. Here I have used Mockito; MRUnit could also be used, which itself depends on Mockito (a Java mocking framework).
public class BoxPlotMapperTest {

    @Test
    public void validOutputTextMapper() throws IOException, InterruptedException {
        Mapper mapper = new Mapper(); // your Mapper object
        Text line = new Text("single line from input-file"); // single line of input from the file
        Mapper.Context context = Mockito.mock(Mapper.Context.class);
        mapper.map(null, line, context); // (key=null, value=line, context); the key was not used in my code, so it is null
        Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
    }

    @Test
    public void validOutputTextReducer() throws IOException, InterruptedException {
        Reducer reducer = new Reducer();
        final List<Text> values = new ArrayList<Text>();
        values.add(new Text("value1"));
        values.add(new Text("value2"));
        values.add(new Text("value3"));
        values.add(new Text("value4"));
        Iterable<Text> iterable = new Iterable<Text>() {
            @Override
            public Iterator<Text> iterator() {
                return values.iterator();
            }
        };
        Reducer.Context context = Mockito.mock(Reducer.Context.class);
        reducer.reduce(new Text("key"), iterable, context);
        Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
    }
}
If you want to generate a test file, why would you need to use Hadoop in the first place? Any kind of file you'd use as input to a MapReduce step can be created using type-specific APIs outside of a MapReduce step, even HDFS files.
I know I'm resurrecting an old thread, but there was no best answer chosen, so I thought I'd throw this out there. I agree MRUnit is good for many things, but sometimes I just want to play around with some real data (especially for tests where I'd need to mock it out to make it work in MRUnit).
When that's my goal, I create a separate little job to test my ideas and use SleepInputFormat to basically lie to Hadoop and say there's input when really there's not. The old API provided an example of that here: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.22/mapreduce/src/test/mapred/org/apache/hadoop/mapreduce/SleepJob.java, and I converted the input format to the new API here: https://gist.github.com/keeganwitt/6053872.
We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.
You could look through the source for FileInputFormat.getSplits() - this pulls back the configuration property for mapred.input.dir and then resolves this CSV to an array of Paths.
These paths can still represent folders and regexes, so the next thing getSplits() does is pass the array to a protected method, org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This actually goes through the dirs / regexes listed and lists the matching files (also invoking a PathFilter if configured).
So, with this method being protected, you could create a simple 'dummy' extension of FileInputFormat that exposes a public listStatus method (a Mapper.Context can be passed in, since it is a JobContext), which in turn wraps a call to the protected FileInputFormat.listStatus method:
public class DummyFileInputFormat extends FileInputFormat {
public List<FileStatus> listStatus(Context mapContext) throws IOException {
return super.listStatus(mapContext);
}
#Override
public RecordReader createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException {
// dummy input format, so this will never be called
return null;
}
}
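In the mapper you could then use it along these lines (a sketch; the mapper's type parameters are only an example):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private int numInputPaths;

    @Override
    protected void setup(Context context) throws IOException {
        // Mapper.Context is also a JobContext, so the dummy format can re-list the
        // same input dirs / globs the job was configured with.
        numInputPaths = new DummyFileInputFormat().listStatus(context).size();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use numInputPaths when formatting the output value ...
    }
}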
EDIT: In fact it looks like FileInputFormat already does this for you, configuring a job property mapreduce.input.num.files at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203)
Here's the JIRA ticket
You can set a configuration property in your job with the number of input paths, like this:
jobConf.setInt("numberOfPaths",paths.length);
Just put that code in the place where you configure your job. After that, read it out of the configuration in your Mapper.setup(Mapper.Context context) by getting it from the context.
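For example (a sketch; the property name matches the snippet above, everything else is illustrative):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class NumberOfPathsExample {

    // Driver side: resolve the input paths that were added and store the count.
    public static void configure(Job job) {
        Path[] paths = FileInputFormat.getInputPaths(job);
        job.getConfiguration().setInt("numberOfPaths", paths.length);
    }

    // Mapper side: read the count back in setup().
    public static class PathCountingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int numberOfPaths;

        @Override
        protected void setup(Context context) {
            numberOfPaths = context.getConfiguration().getInt("numberOfPaths", 0);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use numberOfPaths when formatting the output value ...
        }
    }
}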