Hadoop multiple outputs with speculative execution - java

I have a task which writes Avro output to multiple directories organized by a few fields of the input records.
For example: process records of countries across years and write them in a directory structure of country/year, e.g.:
outputs/usa/2015/outputs_usa_2015.avro
outputs/uk/2014/outputs_uk_2014.avro
AvroMultipleOutputs multipleOutputs = new AvroMultipleOutputs(context);
....
....
multipleOutputs.write("output", avroKey, NullWritable.get(),
    OUTPUT_DIR + "/" + record.getCountry() + "/" + record.getYear()
        + "/outputs_" + record.getCountry() + "_" + record.getYear());
What output committer would this code use to write the output? Is it unsafe to use with speculative execution?
With speculative execution this causes (or may cause) an org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException.
In this post, Hadoop Reducer: How can I output to multiple directories using speculative execution?, it is suggested to use a custom output committer.
The code below, from Hadoop's AvroMultipleOutputs, does not mention any problem with speculative execution:
private synchronized RecordWriter getRecordWriter(TaskAttemptContext taskContext,
    String baseFileName) throws IOException, InterruptedException {
  writer =
      ((OutputFormat) ReflectionUtils.newInstance(taskContext.getOutputFormatClass(),
          taskContext.getConfiguration())).getRecordWriter(taskContext);
  ...
}
Nor does the write method document any issues when the base output path is outside the job directory:
public void write(String namedOutput, Object key, Object value, String baseOutputPath)
Is there a real issue with AvroMultipleOutputs (and other multiple-output classes) and speculative execution when writing outside the job directory?
If so, how do I override AvroMultipleOutputs so that it has its own output committer? I can't see any output format inside AvroMultipleOutputs whose output committer it uses.

AvroMultipleOutputs will use the OutputFormat which you registered in the job configuration when adding a named output, e.g. using the addNamedOutput API from AvroMultipleOutputs (such as AvroKeyValueOutputFormat).
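For reference, the driver-side registration this refers to looks roughly like the fragment below; the name "output", the output format class and the schema are placeholders for whatever your job actually uses:
// Sketch of registering a named output with AvroMultipleOutputs.
Schema recordSchema = new Schema.Parser().parse(SCHEMA_JSON); // SCHEMA_JSON: your record schema
AvroMultipleOutputs.addNamedOutput(job, "output", AvroKeyOutputFormat.class, recordSchema);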
With AvroMultipleOutputs, you might not be able to use the speculative task execution feature. Even overriding it would either not help or not be simple.
Instead you should write your own OutputFormat (most probably extending one of the available Avro output formats, e.g. AvroKeyValueOutputFormat), and override/implement its getRecordWriter API so that it returns a single RecordWriter instance, say MainRecordWriter (just for reference).
This MainRecordWriter would maintain a map of RecordWriter (e.g. AvroKeyValueRecordWriter) instances. Each of these RecordWriter instances would belong to one of the output files. In the write API of MainRecordWriter, you would look up the actual RecordWriter instance in the map (based on the record you are about to write) and write the record using that record writer. So MainRecordWriter would just act as a wrapper over multiple RecordWriter instances; a sketch of this idea follows the reference below.
For a similar implementation, you might like to study the code of the MultiStorage class from the piggybank library.
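To make the idea concrete, here is a minimal sketch of such a wrapper, written against the new mapreduce API and Avro GenericRecords. The class names, the "country"/"year" field names and the file naming are illustrative assumptions, not a drop-in implementation:
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountryYearOutputFormat
    extends FileOutputFormat<AvroKey<GenericRecord>, NullWritable> {

  @Override
  public RecordWriter<AvroKey<GenericRecord>, NullWritable> getRecordWriter(
      TaskAttemptContext context) throws IOException {
    // Writing under the committer's work path keeps speculative attempts isolated:
    // only the committed attempt's files (including subdirectories) are promoted
    // to the final output directory.
    Path workPath = ((FileOutputCommitter) getOutputCommitter(context)).getWorkPath();
    return new MainRecordWriter(context, workPath);
  }

  // Wraps one Avro DataFileWriter per country/year combination.
  private static class MainRecordWriter
      extends RecordWriter<AvroKey<GenericRecord>, NullWritable> {

    private final Map<String, DataFileWriter<GenericRecord>> writers = new HashMap<>();
    private final TaskAttemptContext context;
    private final Path workPath;

    MainRecordWriter(TaskAttemptContext context, Path workPath) {
      this.context = context;
      this.workPath = workPath;
    }

    @Override
    public void write(AvroKey<GenericRecord> key, NullWritable ignored) throws IOException {
      GenericRecord record = key.datum();
      String subDir = record.get("country") + "/" + record.get("year");
      DataFileWriter<GenericRecord> writer = writers.get(subDir);
      if (writer == null) {
        Schema schema = record.getSchema();
        Path file = new Path(new Path(workPath, subDir),
            "outputs-" + context.getTaskAttemptID().getTaskID().getId() + ".avro");
        OutputStream out = file.getFileSystem(context.getConfiguration()).create(file);
        writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out);
        writers.put(subDir, writer);
      }
      writer.append(record);
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
      for (DataFileWriter<GenericRecord> writer : writers.values()) {
        writer.close();
      }
    }
  }
}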

When you add a named output to AvroMultipleOutputs, it will call either AvroKeyOutputFormat.getRecordWriter() or AvroKeyValueOutputFormat.getRecordWriter(), both of which call AvroOutputFormatBase.getAvroFileOutputStream(), whose content is:
protected OutputStream getAvroFileOutputStream(TaskAttemptContext context) throws IOException {
  Path path = new Path(((FileOutputCommitter) getOutputCommitter(context)).getWorkPath(),
      getUniqueFile(context, context.getConfiguration().get("avro.mo.config.namedOutput", "part"),
          org.apache.avro.mapred.AvroOutputFormat.EXT));
  return path.getFileSystem(context.getConfiguration()).create(path);
}
And AvroOutputFormatBase extends FileOutputFormat (the getOutputCommitter() call in the above method is in fact a call to FileOutputFormat.getOutputCommitter()). Hence, AvroMultipleOutputs should have the same constraints as MultipleOutputs.

Related

How can I access the Mapper/Reducer counters on the Output stage?

I have some counters I created in my Mapper class:
(example written using the appengine-mapreduce Java library v.0.5)
@Override
public void map(Entity entity) {
  getContext().incrementCounter("analyzed");
  if (isSpecial(entity)) {
    getContext().incrementCounter("special");
  }
}
(The method isSpecial just returns true or false depending on the state of the entity; it is not relevant to the question.)
I want to access those counters when I have finished processing everything, in the finish method of the Output class:
@Override
public Summary finish(Collection<? extends OutputWriter<Entity>> writers) {
  // get the counters and save/return the summary
  int analyzed = 0; // getCounter("analyzed");
  int special = 0;  // getCounter("special");
  Summary summary = new Summary(analyzed, special);
  save(summary);
  return summary;
}
... but the getCounter method is only available from the MapperContext class, which is accessible only from the Mapper/Reducer getContext() method.
How can I access my counters at the Output stage?
Side note: I can't pass the counter values through my output class, because the whole MapReduce is about transforming one set of Entities into another set (in other words: the counters are not the main purpose of the MapReduce). The counters are just for control; it makes sense to compute them here instead of creating another process just to do the counting.
Thanks.
There is no way to do this inside of Output today. But feel free to request it here:
https://code.google.com/p/appengine-mapreduce/issues/list
What you can do, however, is chain a job to run after your MapReduce that will receive its output and counters. There is an example of this here:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/java/example/src/com/google/appengine/demos/mapreduce/entitycount/ChainedMapReduceJob.java
In the above example, three MapReduce jobs run in a row. Note that these don't have to be MapReduce jobs; you can create your own class that extends Job and has a run method which creates your Summary object.
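A rough sketch of that chaining idea, assuming the Pipeline classes bundled with appengine-mapreduce 0.5 (Job1, Value, MapReduceResult, Counters); double-check the package and method names against the version you actually use:
// Package names as of appengine-mapreduce 0.5; verify against your jars.
import com.google.appengine.tools.mapreduce.Counters;
import com.google.appengine.tools.mapreduce.MapReduceResult;
import com.google.appengine.tools.pipeline.Job1;
import com.google.appengine.tools.pipeline.Value;

// Hypothetical follow-up job: it receives the MapReduceResult of the previous
// stage and turns its counters into a Summary.
public class SummaryJob extends Job1<Summary, MapReduceResult<?>> {
  @Override
  public Value<Summary> run(MapReduceResult<?> mapReduceResult) {
    Counters counters = mapReduceResult.getCounters();
    int analyzed = (int) counters.getCounter("analyzed").getValue();
    int special = (int) counters.getCounter("special").getValue();
    Summary summary = new Summary(analyzed, special); // your own Summary class
    // persist the summary here if needed, then hand it back to the pipeline
    return immediate(summary);
  }
}
In the chained setup from the linked example, you would pass the FutureValue returned by futureCall for the MapReduce stage as the argument to futureCall(new SummaryJob(), ...).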

Multiple output path (Java - Hadoop - MapReduce)

I run two MapReduce jobs, and I want the second job to be able to write its results into two different files, in two different directories.
I would like something similar to FileInputFormat.addInputPath (which lets you add multiple input paths), but for the output.
I'm completely new to MapReduce, and I have the particular constraint of writing my code for Hadoop 0.21.0.
I use context.write(..) in my Reduce step, but I don't see how to control multiple output paths...
Thanks for your time!
Here is the reduce code from my first job, to show that I only know how to produce output that goes into a single /../part* file. What I would like now is to be able to specify two precise files for different outputs, depending on the key:
public static class NormalizeReducer extends Reducer<LongWritable, NetflixRating, LongWritable, NetflixUser> {

  public void reduce(LongWritable key, Iterable<NetflixRating> values, Context context)
      throws IOException, InterruptedException {
    NetflixUser user = new NetflixUser(key.get());
    for (NetflixRating r : values) {
      user.addRating(new NetflixRating(r));
    }
    user.normalizeRatings();
    user.reduceRatings();
    context.write(key, user);
  }
}
EDIT: so I did the method from the last comment as you mentioned, Amar. I don't know if it works, and I have other problems with my HDFS, but before I forget let's put my discoveries here for the sake of civilization:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
MultipleOutputs does NOT act in place of FileOutputFormat. You define one output path with FileOutputFormat, and then you can add many more with MultipleOutputs.
addNamedOutput method: the String namedOutput is just a word that describes the output.
You define the actual path in the write method, via the String baseOutputPath argument.
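To tie those discoveries together, here is a minimal, self-contained sketch using the mapreduce-API MultipleOutputs. The job name, the "routed" named output, the even/odd routing and the paths are all made up for illustration (and on 0.21 you would use new Job(conf, name) instead of Job.getInstance):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultipleOutputPathsExample {

  // Routes each record to an "even" or "odd" directory depending on the key.
  public static class RoutingReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String basePath = (key.get() % 2 == 0) ? "even/part" : "odd/part";
      for (Text value : values) {
        mos.write("routed", key, value, basePath);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      mos.close(); // flush and close the extra record writers
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multiple-output-paths");
    job.setJarByClass(MultipleOutputPathsExample.class);
    job.setReducerClass(RoutingReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    // The normal job-wide output path is still required.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Register the named output used in the reducer; the per-record sub-path is
    // chosen via the baseOutputPath argument of mos.write(...).
    MultipleOutputs.addNamedOutput(job, "routed", TextOutputFormat.class,
        LongWritable.class, Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}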

Is it possible to run map/reduce job on Hadoop cluster with no input file?

When I try to run a map/reduce job on a Hadoop cluster without specifying any input file, I get the following exception:
java.io.IOException: No input paths specified in job
Well, I can imagine cases where running a job without input files does make sense; generating a test file would be one such case. Is it possible to do that with Hadoop? If not, do you have any experience generating files? Is there a better way than keeping a dummy file with one record on the cluster to use as the input for generation jobs?
File paths are relevant for FileInputFormat-based inputs like SequenceFileInputFormat, etc. But input formats that read from HBase or a database do not read from files, so you can make your own implementation of InputFormat and define your own behaviour in getSplits() and createRecordReader() (with its RecordReader). For inspiration, look into the source code of the TextInputFormat class; a sketch of such a format is shown below.
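As a concrete illustration of that idea, here is a minimal input format that fabricates records instead of reading files; the class names and the synthetic.splits / synthetic.records.per.split properties are invented for this sketch:
import java.io.DataInput;
import java.io.DataOutput;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SyntheticInputFormat extends InputFormat<LongWritable, NullWritable> {

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    int numSplits = context.getConfiguration().getInt("synthetic.splits", 1);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < numSplits; i++) {
      splits.add(new SyntheticSplit());
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new SyntheticRecordReader();
  }

  // An empty split; it must be Writable so the framework can ship it to tasks.
  public static class SyntheticSplit extends InputSplit implements Writable {
    @Override public long getLength() { return 0; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) { }
    @Override public void readFields(DataInput in) { }
  }

  // Emits a configurable number of synthetic (index, NullWritable) records.
  public static class SyntheticRecordReader extends RecordReader<LongWritable, NullWritable> {
    private long current = -1;
    private long total;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      total = context.getConfiguration().getLong("synthetic.records.per.split", 1000);
    }

    @Override public boolean nextKeyValue() { return ++current < total; }
    @Override public LongWritable getCurrentKey() { return new LongWritable(current); }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return total == 0 ? 1f : (float) current / total; }
    @Override public void close() { }
  }
}
Setting job.setInputFormatClass(SyntheticInputFormat.class) then avoids the "No input paths specified" error entirely, because FileInputFormat is never involved; the mapper receives the synthetic records and can generate whatever data you need.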
For MR job unit testing you can also use MRUnit.
If you want to generate test data with Hadoop, then I'd recommend you have a look at the source code of Teragen.
I guess you are looking to test your map-reduce on a small set of data, so in that case I would recommend the following.
A unit test for your map-reduce will solve your problem.
If you want to test your mapper/combiner/reducer for a single line of input from your file, the best thing is to write a unit test for each, using a mocking framework in Java; you can run these test cases in your IDE.
Here I have used Mockito; MRUnit, which itself depends on Mockito (a Java mocking framework), can also be used.
Sample code:
public class BoxPlotMapperTest {

  @Test
  public void validOutputTextMapper() throws IOException, InterruptedException {
    Mapper mapper = new Mapper(); // your Mapper object
    Text line = new Text("single line from input-file"); // single line of input from the file
    Mapper.Context context = Mockito.mock(Mapper.Context.class);
    mapper.map(null, line, context); // (key=null, value=line, context) -- the key was not used in my code, so it is null
    Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
  }

  @Test
  public void validOutputTextReducer() throws IOException, InterruptedException {
    Reducer reducer = new Reducer();
    final List<Text> values = new ArrayList<Text>();
    values.add(new Text("value1"));
    values.add(new Text("value2"));
    values.add(new Text("value3"));
    values.add(new Text("value4"));
    Iterable<Text> iterable = new Iterable<Text>() {
      @Override
      public Iterator<Text> iterator() {
        return values.iterator();
      }
    };
    Reducer.Context context = Mockito.mock(Reducer.Context.class);
    reducer.reduce(new Text("key"), iterable, context);
    Mockito.verify(context).write(new Text("your expected key-output"), new Text("your expected value-output"));
  }
}
If you want to generate a test file, why would you need to use Hadoop in the first place? Any kind of file you'd use as input to a MapReduce step can be created using type-specific APIs outside of a MapReduce step, even HDFS files.
I know I'm resurrecting an old thread, but there was no best answer chosen, so I thought I'd throw this out there. I agree MRUnit is good for many things, but sometimes I just want to play around with some real data (especially for tests where I'd need to mock it out to make it work in MRUnit).
When that's my goal, I create a separate little job to test my ideas and use SleepInputFormat to basically lie to Hadoop and say there's input when really there's not. The old API provided an example of that here: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.22/mapreduce/src/test/mapred/org/apache/hadoop/mapreduce/SleepJob.java, and I converted the input format to the new API here: https://gist.github.com/keeganwitt/6053872.

Get Total Input Path Count in Hadoop Mapper

We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.
You could look through the source for FileInputFormat.getSplits() - this pulls back the configuration property mapred.input.dir and then resolves this CSV to an array of Paths.
These paths can still represent folders and regexes, so the next thing getSplits() does is to pass the array to a protected method, org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This goes through the dirs/regexes listed and lists the matching files (also invoking a PathFilter if configured).
So with this method being protected, you could create a simple 'dummy' extension of FileInputFormat that has a listStatus method accepting the Mapper.Context as its argument, which in turn wraps a call to the FileInputFormat.listStatus method:
public class DummyFileInputFormat extends FileInputFormat {
public List<FileStatus> listStatus(Context mapContext) throws IOException {
return super.listStatus(mapContext);
}
#Override
public RecordReader createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException {
// dummy input format, so this will never be called
return null;
}
}
EDIT: In fact, it looks like FileInputFormat already does this for you: it sets a job property, mapreduce.input.num.files, at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203).
Here's the JIRA ticket
You can set a configuration property in your job with the number of input paths, like this:
jobConf.setInt("numberOfPaths", paths.length);
Just put that code where you configure your job. Afterwards, read it out of the configuration in your Mapper.setup(Mapper.Context context) by getting it from the context.
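A small sketch of that approach, shown here with the new (mapreduce) API; the numberOfPaths property name follows the snippet above, and MyMapper is just a placeholder:
// Driver side: record how many input paths the job was configured with.
Path[] paths = FileInputFormat.getInputPaths(job);
job.getConfiguration().setInt("numberOfPaths", paths.length);

// Mapper side: read the value back once per task.
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
  private int numberOfPaths;

  @Override
  protected void setup(Context context) {
    numberOfPaths = context.getConfiguration().getInt("numberOfPaths", 0);
  }
}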

GWT Maven plugin: Generating non-String parameters in the Messages class

I have a property in my "Messages.properties" file that has an argument that uses number formatting:
my.message=File exceeds {0,number,0.0}MB.
When I run the gwt:i18n Maven goal, it generates a Messages interface based on the properties in my "Messages.properties" file (like normal):
public interface Messages extends com.google.gwt.i18n.client.Messages {
  //...
  @DefaultMessage("File exceeds {0,number,0.0}MB.")
  @Key("my.message")
  String my_message(String arg0);
  //...
}
The problem is that the method parameter is a String. When I run the application, it gives me an error because the message argument expects a number, but a String is supplied (the error message is, "Only Number subclasses may be formatted as a number").
How do I configure Maven to have it change this parameter to number (like a float or Number)? Thanks.
Given the discussion above, I have decided to complement my previous answer.
First of all, as far as I know there's no way you can use the existing i18n Maven goal (and GWT's I18NCreator) to do what is asked.
Secondly, after researching a bit more on the Generator solution I had suggested, I found that:
Michael is right that you wouldn't pick up errors at compile time using the interface method with property look-ups (a sin in GWT) as suggested above. However, this is still the simplest/quickest way to do it.
You can ensure compile-time checks by writing your own interface, which is kept up to date with the properties file and has one method for each property, and then getting your generator to write a class which implements that interface. Notice that when you change a property in the properties file, you only need to change the interface you wrote. If you've written the Generator properly, it will never have to be changed again! The best way to go about method names is probably to follow GWT: if a property is called the.prop.one, then the method name is the_prop_one(..).
If you really don't want to maintain an interface manually, the only way I can see is for you to write your own version of I18NCreator. This is because the Maven i18n goal is not a GWT compiler parameter, but a call for the Maven plugin to write Messages/Constants interfaces based on properties files found in the class path. Therefore, if you write your own I18NCreator, you will also have to write a Maven plugin that you can use to call it before compiling the GWT application. Or, to make it simpler, you can simply run your I18NCreator manually (using the good old java command to run it) every time you change your properties file keys (of course, there's no need to run it when only the actual messages change).
Personally, I would just write and maintain my properties file and the interface that mirrors it manually. The Generator will always look at the properties file and generate the methods that correspond to the properties (with whatever arguments are required based on the actual message), so if the interface you wrote reflects the properties file, the class generated by the Generator will always implement it correctly.
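For example, a hand-maintained interface mirroring the Messages.properties above might look like the sketch below; the float parameter is chosen to match the {0,number,0.0} placeholder, and the generator would produce the class that implements it:
public interface MyMessages {
  // mirrors: my.message=File exceeds {0,number,0.0}MB.
  String my_message(float maxFileSizeMb);
}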
It seems to me this feature is not supported by GWT I18NCreator (which is what the maven i18n goal calls). You would have to write your own Generator to do that.
I have written a couple of Generators and it's not as hard as you may think.
In your case, you would want to write a Generator that creates an instance of an interface similar to GWT's Messages (but you can use your own) but which has the added functionality that you want when decoding messages.
The following little how-to guide may help you, as it seems it's pretty much what I did as well, and it works:
http://groups.google.com/group/Google-Web-Toolkit/msg/ae249ea67c2c3435?pli=1
I found that the easiest way to write a GWT Generator is to actually write a test class with the code you would want generated in your IDE (with the help of auto-completion, syntax checks, etc.), and then paste/adapt it into the writer calls like this:
writer.println("public void doSomething() { /* implement */ }");
And don't forget to tell your module (module.gwt.xml file) which interface needs to be generated, and with which class, like this:
<generate-with class="mycompany.utils.generators.MyGenerator">
<when-type-assignable class="mycompany.messages.MyCoolPropertiesReader" />
</generate-with>
In the Generator code, you can use Java with all its great features (it is not limited to GWT-translatable code), so it shouldn't be hard to implement what you want. In the client-side code, you can then just do:
public interface MyCoolPropertiesReader {
  public String getMessage(String propKey, Object... parameters);
}

public class MyClientSideClass {
  MyCoolPropertiesReader reader = GWT.create(MyCoolPropertiesReader.class);
  String msg = reader.getMessage("my.message", 10);
  // do more work
}
A test Generator that I wrote (a GWT "reflective" getter and setter, as it were) looks like this:
public class TestGenerator extends Generator {

  @Override
  public String generate(TreeLogger logger, GeneratorContext context,
      String typeName) throws UnableToCompleteException {
    try {
      TypeOracle oracle = context.getTypeOracle();
      JClassType requestedClass = oracle.getType(typeName);

      String packageName = requestedClass.getPackage().getName();
      String simpleClassName = requestedClass.getSimpleSourceName();
      String proxyClassName = simpleClassName + "GetterAndSetter";
      String qualifiedProxyClassName = packageName + "." + proxyClassName;
      System.out.println("Created a class called: " + qualifiedProxyClassName);

      PrintWriter printWriter = context.tryCreate(logger, packageName, proxyClassName);
      if (printWriter == null) return null;

      ClassSourceFileComposerFactory composerFactory =
          new ClassSourceFileComposerFactory(packageName, proxyClassName);
      composerFactory.addImport("test.project.shared.GetterAndSetter");
      composerFactory.addImplementedInterface("GetterAndSetter<" + typeName + ">");

      SourceWriter writer = composerFactory.createSourceWriter(context, printWriter);
      if (writer != null) {
        JField[] fields = requestedClass.getFields();

        // createSetterMethodForField(..) and createIfBlockForFields(..) are my own
        // helper methods (not shown here).
        for (JField field : fields) {
          createSetterMethodForField(typeName, writer, field);
        }

        writer.indent();
        writer.println("public void set(" + typeName + " target, String path, Object value) {");
        writer.indent();
        createIfBlockForFields(writer, fields, true);
        writer.outdent();
        writer.println("}");
        writer.println();

        writer.println("public <K> K get(" + typeName + " target, String path) {");
        writer.indent();
        createIfBlockForFields(writer, fields, false);
        writer.outdent();
        writer.println("}");
        writer.println();

        writer.outdent();
        writer.commit(logger);
      }
      return packageName + "." + proxyClassName;
    } catch (NotFoundException nfe) {
      throw new UnableToCompleteException();
    }
  }
}
I hope this helps you.
