Extending a class that extends Hadoop's Mapper - Java

This is an example of a Map class [1] from Hadoop that extends the Mapper class. [3] is Hadoop's Mapper class.
I want to create MyExampleMapper, which extends ExampleMapper, which in turn extends Hadoop's Mapper [2]. I want to do this because I just want to set a property in ExampleMapper so that, when I create MyExampleMapper or other subclasses, I don't have to set the property myself, since I have already extended ExampleMapper. Is it possible to do this?
[1] Example mapper
import org.apache.hadoop.mapreduce.Mapper;
public class ExampleMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
[2] What I want
import org.apache.hadoop.mapreduce.Mapper;
public class MyExampleMapper
extends ExampleMapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
String result = System.getProperty("job.examplemapper");
if (result.equals("true")) {
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
}
public class ExampleMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
extends Mapper{
System.setProperty("job.examplemapper", "true");
}
[3] This is Hadoop's Mapper class
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public Mapper() {
}
protected void setup(Mapper.Context context) throws IOException, InterruptedException {
}
protected void map(KEYIN key, VALUEIN value, Mapper.Context context) throws IOException, InterruptedException {
context.write(key, value);
}
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
}
public void run(Mapper.Context context) throws IOException, InterruptedException {
this.setup(context);
try {
while(context.nextKeyValue()) {
this.map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
this.cleanup(context);
}
}
public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid, RecordReader<KEYIN, VALUEIN> reader, RecordWriter<KEYOUT, VALUEOUT> writer, OutputCommitter committer, StatusReporter reporter, InputSplit split) throws IOException, InterruptedException {
super(conf, taskid, reader, writer, committer, reporter, split);
}
}
}

import org.apache.hadoop.mapreduce.Mapper;
public class ExampleMapper<T, X, Y, Z> extends Mapper<T, X, Y, Z> {
static {
System.setProperty("job.examplemapper", "true");
}
}
Then extend it in your program:
public class MyExampleMapper
extends ExampleMapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
String result = System.getProperty("job.examplemapper");
if (result.equals("true")) {
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
}
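For completeness, here is a hypothetical driver showing how the subclass could be wired into a job (the ExampleDriver class name and argument handling are illustrative, not taken from the original posts). The static initializer in ExampleMapper runs when the mapper class is loaded in each task JVM, so the property is already set by the time map() is called there.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example");
        job.setJarByClass(ExampleDriver.class);
        // MyExampleMapper inherits the static initializer from ExampleMapper,
        // so "job.examplemapper" is set without any extra code here.
        job.setMapperClass(MyExampleMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}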

Related

Can Jackson be used to create a type-tree from a class?

I'm looking for a way to use Jackson to create a type-tree instead of a value-tree.
I had assumed that this would be possible but I ran into an issue where Jackson creates a NullNode object when it encounters a field which has null as a value.
What I'm interested in is types, not values. Currently I'm doing the following as a workaround as I cannot provide Jackson with a class to build the tree:
package org.example.jackson.typetree;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
public class FunctioningTest {
static class SomeClass{
public Integer integer;
public String string;
}
@Test
void extractTypeTree() throws NoSuchMethodException, InvocationTargetException, InstantiationException, IllegalAccessException {
final Constructor<?> constructor = SomeClass.class.getDeclaredConstructor();
final Object o = constructor.newInstance((Object[]) null);
final var jsonNode = new ObjectMapper().valueToTree(o);
final var fields = jsonNode.fields();
while(fields.hasNext()){
final var child = fields.next();
if(child.getValue().isIntegralNumber() || child.getValue().isTextual()){
System.out.println("Nice!");
}else if(child.getValue().isNull()){
System.out.println("Booooh...!");
}
}
}
}
As I mentioned, this results in an ObjectNode instance which has 2 NullNode instances as children. What I would like, however, is to get an ObjectNode with an IntNode/NumericNode and a TextNode regardless of the actual values of the fields in the instance of SomeClass.
Can Jackson be used to do this?
I've put in a few more hours trying to figure out a way to traverse a Class and I've found a method that is good enough for me.
Just to highlight: Jackson does not create a type-tree. Instead it has several visitor types which can be used to visit serializers. Serializers within Jackson are the mechanism by which an instance of a class is serialized to some output (JSON, YAML, etc.). When visiting these serializers we can check for the properties of objects and in turn visit those.
Below is a rudimentary implementation of this mechanism, which can be adjusted to build a type-tree manually. I used jackson-databind:2.12.1, but I believe the mechanism was introduced in jackson-databind:2.2.0.
package org.example.jackson.type.tree;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.*;
import com.fasterxml.jackson.databind.jsonFormatVisitors.*;
import com.fasterxml.jackson.databind.ser.SerializerFactory;
import org.junit.jupiter.api.Test;
import java.util.Set;
public class FunctioningTest {
@Test
void extractTypeTree() throws JsonMappingException {
final var objectMapper = new ObjectMapper();
final var serializerProvider = objectMapper.getSerializerProviderInstance();
final var javaType = objectMapper.constructType(SomeClass.class);
final var serializerFactory = objectMapper.getSerializerFactory();
final var typeSerializer = serializerFactory.createSerializer(serializerProvider, javaType);
final var fieldVisitor = new FieldVisitor(serializerProvider, serializerFactory);
typeSerializer.acceptJsonFormatVisitor(fieldVisitor, javaType);
}
static class SomeClass {
public Integer integer;
public String string;
public FieldClass fieldClass;
}
static class FieldClass {
public boolean aBoolean;
public double floatingPoint;
}
static class FieldVisitor extends JsonFormatVisitorWrapper.Base implements JsonFormatVisitorWrapper {
private final SerializerFactory serializerFactory;
public FieldVisitor(final SerializerProvider provider, final SerializerFactory serializerFactory) {
super(provider);
this.serializerFactory = serializerFactory;
}
@Override
public JsonObjectFormatVisitor expectObjectFormat(JavaType javaType) throws JsonMappingException {
System.out.println("FieldVisitor (Object): " + javaType);
final var objectVisitor = new ObjectVisitor(getProvider());
final var objectSerializer = serializerFactory.createSerializer(getProvider(), javaType);
final var properties = objectSerializer.properties();
while (properties.hasNext()) {
final var property = properties.next();
final var propertyType = property.getType();
final var propertySerializer = serializerFactory.createSerializer(getProvider(), propertyType);
propertySerializer.acceptJsonFormatVisitor(this, propertyType);
}
return objectVisitor;
}
@Override
public JsonArrayFormatVisitor expectArrayFormat(JavaType type) throws JsonMappingException {
return null;
}
@Override
public JsonStringFormatVisitor expectStringFormat(JavaType type) throws JsonMappingException {
System.out.println("FieldVisitor (String): " + type);
return new StringVisitor();
}
@Override
public JsonNumberFormatVisitor expectNumberFormat(JavaType type) throws JsonMappingException {
System.out.println("FieldVisitor (Number): " + type);
return new NumberVisitor();
}
@Override
public JsonIntegerFormatVisitor expectIntegerFormat(JavaType type) throws JsonMappingException {
System.out.println("FieldVisitor (Integer): " + type);
return new IntegerVisitor();
}
@Override
public JsonBooleanFormatVisitor expectBooleanFormat(JavaType type) throws JsonMappingException {
System.out.println("FieldVisitor (Boolean): " + type);
return new BooleanVisitor();
}
@Override
public JsonNullFormatVisitor expectNullFormat(JavaType type) throws JsonMappingException {
System.out.println("Null: " + type);
return null;
}
@Override
public JsonAnyFormatVisitor expectAnyFormat(JavaType type) throws JsonMappingException {
return null;
}
@Override
public JsonMapFormatVisitor expectMapFormat(JavaType type) throws JsonMappingException {
return null;
}
}
static class ObjectVisitor extends JsonObjectFormatVisitor.Base implements JsonObjectFormatVisitor {
public ObjectVisitor(SerializerProvider serializerProvider) {
super(serializerProvider);
}
@Override
public void property(BeanProperty writer) throws JsonMappingException {
System.out.println("ObjectVisitor: " + writer);
}
@Override
public void property(String name, JsonFormatVisitable handler, JavaType propertyTypeHint) throws JsonMappingException {
System.out.println("ObjectVisitor: " + String.join(", ", name, handler.toString(), propertyTypeHint.toString()));
}
@Override
public void optionalProperty(BeanProperty writer) throws JsonMappingException {
System.out.println("ObjectVisitor (optional): " + writer);
}
@Override
public void optionalProperty(String name, JsonFormatVisitable handler, JavaType propertyTypeHint) throws JsonMappingException {
System.out.println("ObjectVisitor (optional): " + String.join(", ", name, handler.toString(), propertyTypeHint.toString()));
}
}
static class StringVisitor implements JsonStringFormatVisitor {
@Override
public void format(JsonValueFormat format) {
System.out.println("StringVisitor (format): " + format);
}
@Override
public void enumTypes(Set<String> enums) {
System.out.println("StringVisitor (enums): " + enums);
}
}
static class IntegerVisitor implements JsonIntegerFormatVisitor {
@Override
public void numberType(JsonParser.NumberType type) {
System.out.println("IntegerVisitor (numberType): " + type);
}
@Override
public void format(JsonValueFormat format) {
System.out.println("IntegerVisitor (format): " + format);
}
@Override
public void enumTypes(Set<String> enums) {
System.out.println("IntegerVisitor (enums): " + enums);
}
}
static public class BooleanVisitor implements JsonBooleanFormatVisitor {
@Override
public void format(JsonValueFormat format) {
System.out.println("BooleanVisitor (format): " + format);
}
@Override
public void enumTypes(Set<String> enums) {
System.out.println("BooleanVisitor (enums): " + enums);
}
}
static class NumberVisitor implements JsonNumberFormatVisitor {
@Override
public void numberType(JsonParser.NumberType type) {
System.out.println("NumberVisitor (numberType): " + type);
}
@Override
public void format(JsonValueFormat format) {
System.out.println("NumberVisitor (format): " + format);
}
@Override
public void enumTypes(Set<String> enums) {
System.out.println("NumberVisitor (enums): " + enums);
}
}
}
Which outputs:
FieldVisitor (Object): [simple type, class io.serpentes.examples.schema.sources.jackson.FunctioningTest$SomeClass]
FieldVisitor (Integer): [simple type, class java.lang.Integer]
IntegerVisitor (numberType): INT
FieldVisitor (String): [simple type, class java.lang.String]
FieldVisitor (Object): [simple type, class io.serpentes.examples.schema.sources.jackson.FunctioningTest$FieldClass]
FieldVisitor (Boolean): [simple type, class boolean]
FieldVisitor (Number): [simple type, class double]
NumberVisitor (numberType): DOUBLE
ObjectVisitor (optional): property 'aBoolean' (field "io.serpentes.examples.schema.sources.jackson.FunctioningTest$FieldClass#aBoolean, no static serializer)
ObjectVisitor (optional): property 'floatingPoint' (field "io.serpentes.examples.schema.sources.jackson.FunctioningTest$FieldClass#floatingPoint, no static serializer)
ObjectVisitor (optional): property 'integer' (field "io.serpentes.examples.schema.sources.jackson.FunctioningTest$SomeClass#integer, no static serializer)
ObjectVisitor (optional): property 'string' (field "io.serpentes.examples.schema.sources.jackson.FunctioningTest$SomeClass#string, no static serializer)
ObjectVisitor (optional): property 'fieldClass' (field "io.serpentes.examples.schema.sources.jackson.FunctioningTest$SomeClass#fieldClass, no static serializer)

Hadoop mapreduce custom writable static context

I'm working on a university homework assignment and we have to use Hadoop MapReduce for it. I'm trying to create a new custom Writable, as I want to output key-value pairs as (key, (doc_name, 1)).
public class Detector {
private static final Path TEMP_PATH = new Path("temp");
private static final String LENGTH = "gramLength";
private static final String THRESHOLD = "threshold";
public class Custom implements Writable {
private Text document;
private IntWritable count;
public Custom(){
setDocument("");
setCount(0);
}
public Custom(String document, int count) {
setDocument(document);
setCount(count);
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
document.readFields(in);
count.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
document.write(out);
count.write(out);
}
public int getCount() {
return count.get();
}
public void setCount(int count) {
this.count = new IntWritable(count);
}
public String getDocument() {
return document.toString();
}
public void setDocument(String document) {
this.document = new Text(document);
}
}
public static class NGramMapper extends Mapper<Text, Text, Text, Text> {
private int gramLength;
private Pattern space_pattern=Pattern.compile("[ ]");
private StringBuilder gramBuilder= new StringBuilder();
@Override
protected void setup(Context context) throws IOException, InterruptedException{
gramLength=context.getConfiguration().getInt(LENGTH, 0);
}
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
String[] tokens=space_pattern.split(value.toString());
for(int i=0;i<tokens.length;i++){
gramBuilder.setLength(0);
if(i+gramLength<=tokens.length){
for(int j=i;j<i+gramLength;j++){
gramBuilder.append(tokens[j]);
gramBuilder.append(" ");
}
context.write(new Text(gramBuilder.toString()), key);
}
}
}
}
public static class OutputReducer extends Reducer<Text, Text, Text, Custom> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text val : values) {
context.write(key,new Custom(val.toString(),1));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
conf.setInt(LENGTH, Integer.parseInt(args[0]));
conf.setInt(THRESHOLD, Integer.parseInt(args[1]));
// Setup first MapReduce phase
Job job1 = Job.getInstance(conf, "WordOrder-first");
job1.setJarByClass(Detector.class);
job1.setMapperClass(NGramMapper.class);
job1.setReducerClass(OutputReducer.class);
job1.setMapOutputKeyClass(Text.class);
job1.setMapOutputValueClass(Text.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Custom.class);
job1.setInputFormatClass(WholeFileInputFormat.class);
FileInputFormat.addInputPath(job1, new Path(args[2]));
FileOutputFormat.setOutputPath(job1, new Path(args[3]));
boolean status1 = job1.waitForCompletion(true);
if (!status1) {
System.exit(1);
}
}
}
When I compile the code to a class file I get this error:
Detector.java:147: error: non-static variable this cannot be referenced from a static context
context.write(key,new Custom(val.toString(),1));
I followed different tutorials about custom Writables and my solution is the same as the others. Any suggestions?
Static fields and methods are shared across all instances. They are for values that belong to the class rather than to a specific instance. Avoid them where possible.
To solve your problem, you need to instantiate your class (create an object) so the runtime can reserve memory for the instance, or change the part you are accessing to have static access (not recommended!).
The keyword this refers to an instance (hence the name) and not to something static; static members should be referenced by the class name instead. You are using it in a static context, which is not allowed.
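For the specific compiler error in the posted code, note that Custom is declared as an inner (non-static) class, so new Custom(...) inside the static reducer class implicitly needs an enclosing Detector instance. A minimal sketch of one common adjustment, assuming Custom does not need access to the enclosing instance (this is my reading, not spelled out above): declare it as a static nested class.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Detector {
    // "static" means Custom no longer carries a hidden reference to an enclosing
    // Detector instance, so it can be instantiated from the static reducer class.
    public static class Custom implements Writable {
        private Text document = new Text();
        private IntWritable count = new IntWritable();

        public Custom() {
        }

        public Custom(String document, int count) {
            this.document = new Text(document);
            this.count = new IntWritable(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            document.readFields(in);
            count.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            document.write(out);
            count.write(out);
        }
    }
}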

How to sort word count program by value or count?

How do I sort my word count output by count/value rather than by the key?
In the normal case, the output is
hi 2
hw 3
wr 1
r 3
but the desired output is
wr 1
hi 2
hw 3
r 3
My code is:
public class sortingprog {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(one,word);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<IntWritable,Text, IntWritable, Text> {
public void reduce(Iterator<IntWritable> key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
int sum=0;
while (key.hasNext()) {
sum+=key.next().get();
}
output.collect(new IntWritable(sum),value);
}
@Override
public void reduce(IntWritable arg0, Iterator<Text> arg1,
OutputCollector<IntWritable, Text> arg2, Reporter arg3)
throws IOException {
// TODO Auto-generated method stub
}
}
public static class GroupComparator extends WritableComparator {
protected GroupComparator() {
super(IntWritable.class, true);
}
@SuppressWarnings("rawtypes")
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
IntWritable v1 = (IntWritable) w1;
IntWritable v2 = (IntWritable) w2;
return -1 * v1.compareTo(v2);
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(sortingprog.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setOutputValueGroupingComparator(GroupComparator.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
What you are looking for is called "secondary sort". Here you can find two tutorials on how to achieve a value sort in your MapReduce:
http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
http://codingjunkie.net/secondary-sort/
You need to do the following.
Create a custom WritableComparable which uses both of the fields.
In the compareTo method, provide the logic for comparing the custom writable. This is called later during the sort phase to order the keys, which is the crux of the whole implementation. In compareTo, just use the second field to compare the values.
public class CustomPair implements WritableComparable<CustomPair> {
private String fld1;
private int fld2;
public CustomPair(String fld1, int fld2) {
this.fld1 = fld1; //wr
this.fld2 = fld2; //1
}
@Override
public int compareTo(CustomPair other) {
// Compare on the second (count) field so that keys end up sorted by value.
return Integer.compare(other.fld2, this.fld2);
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(fld1);
out.writeInt(fld2);
}
// You have to implement the rest of the methods (readFields, a no-arg constructor, hashCode, equals, ...).
}
Let me know if you need additional help.

Send multiple arguments to reducer-MapReduce

I've written code which does something similar to SQL GROUP BY.
The dataset I took is here:
250788681419,20090906,200937,200909,619,SUNDAY,WEEKEND,ON-NET,MORNING,OUTGOING,VOICE,25078,PAY_AS_YOU_GO_PER_SECOND_PSB,SUCCESSFUL-RELEASEDBYSERVICE,17,0,1,21.25,635-10-112-30455
public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException
{
String line = value.toString();
String[] attribute=line.split(",");
double rs=Double.parseDouble(attribute[17]);
String comb=new String();
comb=attribute[5].concat(attribute[8].concat(attribute[10]));
context.write(new Text(comb),new DoubleWritable (rs));
}
}
public class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
protected void reduce(Text key, Iterator<DoubleWritable> values, Context context)
throws IOException, InterruptedException {
double sum = 0;
Iterator<DoubleWritable> iter=values.iterator();
while (iter.hasNext())
{
double val=iter.next().get();
sum = sum+ val;
}
context.write(key, new DoubleWritable(sum));
};
}
The mapper sends the 17th field as its value to the reducer, which sums it. Now I also want to sum the 14th field; how do I send it to the reducer?
If your data types are the same, then creating an ArrayWritable class should work for this. The class should resemble:
public class DblArrayWritable extends ArrayWritable
{
public DblArrayWritable()
{
super(DoubleWritable.class);
}
}
Your mapper class then looks like:
public class MyMap extends Mapper<LongWritable, Text, Text, DblArrayWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
String[] attribute=line.split(",");
DoubleWritable[] values = new DoubleWritable[2];
values[0] = new DoubleWritable(Double.parseDouble(attribute[14]));
values[1] = new DoubleWritable(Double.parseDouble(attribute[17]));
String comb = attribute[5].concat(attribute[8].concat(attribute[10]));
DblArrayWritable outValue = new DblArrayWritable();
outValue.set(values);
context.write(new Text(comb), outValue);
}
}
In your reducer you should now be able to iterate over the values of the DblArrayWritable.
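A hypothetical sketch of that reducer-side iteration (the class name and the way the sums are emitted are illustrative; it assumes the job declares DblArrayWritable as the map output value class):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class SumBothReducer extends Reducer<Text, DblArrayWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DblArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum14 = 0;
        double sum17 = 0;
        for (DblArrayWritable pair : values) {
            Writable[] fields = pair.get(); // [attribute 14, attribute 17], in the order the mapper set them
            sum14 += ((DoubleWritable) fields[0]).get();
            sum17 += ((DoubleWritable) fields[1]).get();
        }
        // Emit whichever totals the job needs; here both, one per call.
        context.write(key, new DoubleWritable(sum14));
        context.write(key, new DoubleWritable(sum17));
    }
}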
Based on your sample data, however, it looks like they may be separate types. You may be able to implement an ObjectArrayWritable class that would do the trick, but I'm not certain of this and I can't see much to support it. If it works, the class would be:
public class ObjArrayWritable extends ArrayWritable
{
public ObjArrayWritable()
{
super(Object.class);
}
}
You could handle this by simply concatenating the values and passing them as Text to the reducer which would then split them again.
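A rough, self-contained sketch of that concatenation approach (the class names and the "|" delimiter are illustrative choices, not from the original answer):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ConcatenatedValues {
    public static class ConcatMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] attribute = value.toString().split(",");
            String comb = attribute[5] + attribute[8] + attribute[10];
            // Pack both numeric fields into a single delimited Text value.
            context.write(new Text(comb), new Text(attribute[14] + "|" + attribute[17]));
        }
    }

    public static class SplitReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum14 = 0;
            double sum17 = 0;
            for (Text v : values) {
                // Split the packed value back into its two parts.
                String[] parts = v.toString().split("\\|");
                sum14 += Double.parseDouble(parts[0]);
                sum17 += Double.parseDouble(parts[1]);
            }
            context.write(key, new DoubleWritable(sum14));
            context.write(key, new DoubleWritable(sum17));
        }
    }
}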
Another option is to implement your own Writable class. Here's a sample of how that could work:
public static class PairWritable implements Writable
{
private Double myDouble;
private String myString;
// TODO :- Override the Hadoop serialization/Writable interface methods
@Override
public void readFields(DataInput in) throws IOException {
myDouble = in.readDouble();
myString = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeDouble(myDouble);
out.writeUTF(myString);
}
//End of Implementation
//Getter and setter methods for the myDouble and myString variables
public void set(Double d, String s) {
myDouble = d;
myString = s;
}
public Double getDouble() {
return myDouble;
}
public String getString() {
return myString;
}
}

MultipleOutputFormat in Hadoop

I'm a newbie in Hadoop. I'm trying out the WordCount program.
Now, to try out multiple output files, I use MultipleOutputFormat. This link helped me in doing it: http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
In my driver class I had
MultipleOutputs.addNamedOutput(conf, "even",
org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
IntWritable.class);
MultipleOutputs.addNamedOutput(conf, "odd",
org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
IntWritable.class);
And my reduce class became this:
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
MultipleOutputs mos = null;
public void configure(JobConf job) {
mos = new MultipleOutputs(job);
}
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (sum % 2 == 0) {
mos.getCollector("even", reporter).collect(key, new IntWritable(sum));
}else {
mos.getCollector("odd", reporter).collect(key, new IntWritable(sum));
}
//output.collect(key, new IntWritable(sum));
}
@Override
public void close() throws IOException {
// TODO Auto-generated method stub
mos.close();
}
}
Things worked, but I get a LOT of files (one odd and one even for every map-reduce).
The question is: how can I have just 2 output files (odd & even), so that every odd output of every map-reduce gets written into that odd file, and the same for even?
Each reducer uses its own OutputFormat to write records. That's why you are getting a set of odd and even files per reducer. This is by design, so that each reducer can perform writes in parallel.
If you want just a single odd and single even file, you'll need to set mapred.reduce.tasks to 1. But performance will suffer, because all the mappers will be feeding into a single reducer.
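As a sketch of that setting with the old-API JobConf used elsewhere on this page (the WordCount class name is illustrative):
JobConf conf = new JobConf(WordCount.class);
// Same effect as -D mapred.reduce.tasks=1: a single reducer, and therefore
// a single "odd" and a single "even" file (at the cost of parallelism).
conf.setNumReduceTasks(1);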
Another option is to change the process that reads these files to accept multiple input files, or write a separate process that merges these files together.
I wrote a class for doing this.
Just use it in your job:
job.setOutputFormatClass(m_customOutputFormatClass);
This is my class:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
* TextOutputFormat extension which enables writing the mapper/reducer's output in multiple files.<br>
* <p>
* <b>WARNING</b>: The number of different folders shouldn't be large for one mapper since we keep a
* {@link RecordWriter} instance per folder name.
* </p>
* <p>
* In this class the folder name is defined by the written entry's key.<br>
* To change this behavior simply extend this class and override the
* {@link HdMultipleFileOutputFormat#getFolderNameExtractor()} method and create your own
* {@link FolderNameExtractor} implementation.
* </p>
*
* @author ykesten
*
* @param <K> - Keys type
* @param <V> - Values type
*/
public class HdMultipleFileOutputFormat<K, V> extends TextOutputFormat<K, V> {
private String folderName;
private class MultipleFilesRecordWriter extends RecordWriter<K, V> {
private Map<String, RecordWriter<K, V>> fileNameToWriter;
private FolderNameExtractor<K, V> fileNameExtractor;
private TaskAttemptContext job;
public MultipleFilesRecordWriter(FolderNameExtractor<K, V> fileNameExtractor, TaskAttemptContext job) {
fileNameToWriter = new HashMap<String, RecordWriter<K, V>>();
this.fileNameExtractor = fileNameExtractor;
this.job = job;
}
@Override
public void write(K key, V value) throws IOException, InterruptedException {
String fileName = fileNameExtractor.extractFolderName(key, value);
RecordWriter<K, V> writer = fileNameToWriter.get(fileName);
if (writer == null) {
writer = createNewWriter(fileName, fileNameToWriter, job);
if (writer == null) {
throw new IOException("Unable to create writer for path: " + fileName);
}
}
writer.write(key, value);
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
for (Entry<String, RecordWriter<K, V>> entry : fileNameToWriter.entrySet()) {
entry.getValue().close(context);
}
}
}
private synchronized RecordWriter<K, V> createNewWriter(String folderName,
Map<String, RecordWriter<K, V>> fileNameToWriter, TaskAttemptContext job) {
try {
this.folderName = folderName;
RecordWriter<K, V> writer = super.getRecordWriter(job);
this.folderName = null;
fileNameToWriter.put(folderName, writer);
return writer;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
@Override
public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
Path path = super.getDefaultWorkFile(context, extension);
if (folderName != null) {
String newPath = path.getParent().toString() + "/" + folderName + "/" + path.getName();
path = new Path(newPath);
}
return path;
}
@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
return new MultipleFilesRecordWriter(getFolderNameExtractor(), job);
}
public FolderNameExtractor<K, V> getFolderNameExtractor() {
return new KeyFolderNameExtractor<K, V>();
}
public interface FolderNameExtractor<K, V> {
public String extractFolderName(K key, V value);
}
private static class KeyFolderNameExtractor<K, V> implements FolderNameExtractor<K, V> {
public String extractFolderName(K key, V value) {
return key.toString();
}
}
}
Multiple output files will be generated based on the number of reducers.
You can use hadoop dfs -getmerge to merge the outputs.
You may try to change the output file name (the reducer output); since HDFS supports only append operations, it will collect all Temp-r-0000x files (partitions) from all reducers and put them together in one file.
Here is the class you need to create, which overrides methods in TextOutputFormat:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class CustomNameMultipleFileOutputFormat<K, V> extends TextOutputFormat<K, V> {
private String folderName;
private class MultipleFilesRecordWriter extends RecordWriter<K, V> {
private Map<String, RecordWriter<K, V>> fileNameToWriter;
private FolderNameExtractor<K, V> fileNameExtractor;
private TaskAttemptContext job;
public MultipleFilesRecordWriter(FolderNameExtractor<K, V> fileNameExtractor, TaskAttemptContext job) {
fileNameToWriter = new HashMap<String, RecordWriter<K, V>>();
this.fileNameExtractor = fileNameExtractor;
this.job = job;
}
@Override
public void write(K key, V value) throws IOException, InterruptedException {
String fileName = "[FOLDER_NAME_INCLUDING_SUB_DIRS]"; //fileNameExtractor.extractFolderName(key, value);
RecordWriter<K, V> writer = fileNameToWriter.get(fileName);
if (writer == null) {
writer = createNewWriter(fileName, fileNameToWriter, job);
if (writer == null) {
throw new IOException("Unable to create writer for path: " + fileName);
}
}
writer.write(key, value);
}
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
for (Entry<String, RecordWriter<K, V>> entry : fileNameToWriter.entrySet()) {
entry.getValue().close(context);
}
}
}
private synchronized RecordWriter<K, V> createNewWriter(String folderName,
Map<String, RecordWriter<K, V>> fileNameToWriter, TaskAttemptContext job) {
try {
this.folderName = folderName;
RecordWriter<K, V> writer = super.getRecordWriter(job);
this.folderName = null;
fileNameToWriter.put(folderName, writer);
return writer;
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
@Override
public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
Path path = super.getDefaultWorkFile(context, extension);
if (folderName != null) {
String newPath = path.getParent().toString() + "/" + folderName + "/[ONE_FILE_NAME]";
path = new Path(newPath);
}
return path;
}
@Override
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
return new MultipleFilesRecordWriter(getFolderNameExtractor(), job);
}
public FolderNameExtractor<K, V> getFolderNameExtractor() {
return new KeyFolderNameExtractor<K, V>();
}
public interface FolderNameExtractor<K, V> {
public String extractFolderName(K key, V value);
}
private static class KeyFolderNameExtractor<K, V> implements FolderNameExtractor<K, V> {
public String extractFolderName(K key, V value) {
return key.toString();
}
}
}
Then the reducer/mapper:
public static class ExtraLabReducer extends Reducer<CustomKeyComparable, Text, CustomKeyComparable, Text>
{
MultipleOutputs multipleOutputs;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs(context);
}
@Override
public void reduce(CustomKeyComparable key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
for(Text d : values)
{
multipleOutputs.write("batta", key, d, "[EXAMPLE_FILE_NAME]");
}
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
Then in the job config:
Job job = new Job(getConf(), "ExtraLab");
job.setJarByClass(ExtraLab.class);
job.setMapperClass(ExtraLabMapper.class);
job.setReducerClass(ExtraLabReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
job.setMapOutputKeyClass(CustomKeyComparable.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
//job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
//adding one more reducer
job.setNumReduceTasks(2);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
MultipleOutputs.addNamedOutput(job,"batta", CustomNameMultipleFileOutputFormat.class,CustomKeyComparable.class,Text.class);
