Using MultipleOutputs without context.write results in empty files

I'm not sure I'm using the MultipleOutputs class correctly. I'm using it to create multiple output files. Here is the relevant snippet from my Driver class:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(CustomKeyValueTest.class);//class with mapper and reducer
job.setOutputKeyClass(CustomKey.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(CustomKey.class);
job.setMapOutputValueClass(CustomValue.class);
job.setMapperClass(CustomKeyValueTestMapper.class);
job.setReducerClass(CustomKeyValueTestReducer.class);
job.setInputFormatClass(TextInputFormat.class);
Path in = new Path(args[1]);
Path out = new Path(args[2]);
out.getFileSystem(conf).delete(out, true);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
MultipleOutputs.addNamedOutput(job, "islnd" , TextOutputFormat.class, CustomKey.class, Text.class);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
MultipleOutputs.setCountersEnabled(job, true);
boolean status = job.waitForCompletion(true);
In the Reducer, I use MultipleOutputs like this:
private MultipleOutputs<CustomKey, Text> multipleOutputs;
@Override
public void setup(Context context) throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs<>(context);
}
@Override
public void reduce(CustomKey key, Iterable<CustomValue> values, Context context) throws IOException, InterruptedException {
...
multipleOutputs.write("islnd", key, pop, key.toString());
//context.write(key, pop);
}
public void cleanup() throws IOException, InterruptedException {
multipleOutputs.close();
}
}
When I use context.write, I get output files with data in them. When I remove context.write, the output files are empty. I don't want to call context.write because it creates an extra file, part-r-00000. As stated in the documentation (last paragraph of the class description), I used LazyOutputFormat to avoid the part-r-00000 file, but it still didn't work.

LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
This means: if the job does not produce any output, don't create empty files.
Can you please look at the Hadoop counters and find
1. map.output.records
2. reduce.input.groups
3. reduce.input.records
to verify whether your mappers are sending any data to the reducer?
Working code with an IT (integration test) for MultipleOutputs is here:
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/
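
Apart from the counters, one more thing may be worth checking in the reducer from the question: its cleanup() takes no arguments, whereas the hook that Hadoop's Reducer.run() calls is cleanup(Context). If that is the complete method, multipleOutputs.close() would never run, and the named outputs might never be flushed. The sketch below is a plain-Java model of that effect with no Hadoop dependency; MiniReducer, Ctx, NoArgCleanup and ContextCleanup are hypothetical names used only for illustration.

```java
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for the task context.
class Ctx {}

// Hypothetical stand-in for a framework base class: run() only ever calls cleanup(Ctx).
abstract class MiniReducer {
    // Plays the role of the output file.
    protected final StringWriter sink = new StringWriter();
    // Buffered writer: nothing reaches the sink until it is flushed or closed.
    protected final PrintWriter out = new PrintWriter(new BufferedWriter(sink));

    // Default hook does nothing.
    protected void cleanup(Ctx ctx) {}

    public final String run(String record) {
        out.print(record);
        cleanup(new Ctx());
        return sink.toString();
    }
}

// Mirrors the question: cleanup() without a parameter is a new method, not an override.
class NoArgCleanup extends MiniReducer {
    public void cleanup() {
        out.close();
    }
}

// Matches the hook signature, so the writer is closed and the buffer is flushed.
class ContextCleanup extends MiniReducer {
    @Override
    protected void cleanup(Ctx ctx) {
        out.close();
    }
}

class CleanupDemo {
    public static void main(String[] args) {
        System.out.println("no-arg cleanup: [" + new NoArgCleanup().run("k\tv") + "]");
        System.out.println("cleanup(Ctx):   [" + new ContextCleanup().run("k\tv") + "]");
    }
}
```

In this model only the second variant ends up with the record in the sink. Putting @Override on the no-argument method would turn the mismatch into a compile error.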

Related

mapreduce to read hive table and write to hdfs location with context

I am looking for a MapReduce program that reads from a Hive table and writes the first column value of each record to an HDFS location. It should contain only a map phase, with no reduce phase.
Below is the mapper
public class Map extends Mapper<WritableComparable, HCatRecord, NullWritable, IntWritable> {
protected void map( WritableComparable key,
HCatRecord value,
org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
NullWritable, IntWritable>.Context context)
throws IOException, InterruptedException {
// The group table from /etc/group has name, 'x', id
// groupname = (String) value.get(0);
int id = (Integer) value.get(1);
// Just select and emit the name and ID
context.write(null, new IntWritable(id));
}
}
Main class
public class mapper1 {
public static void main(String[] args) throws Exception {
mapper1 m=new mapper1();
m.run(args);
}
public void run(String[] args) throws IOException, Exception, InterruptedException {
Configuration conf =new Configuration();
// Get the input and output table names as arguments
String inputTableName = args[0];
// Assume the default database
String dbName = "xademo";
Job job = new Job(conf, "UseHCat");
job.setJarByClass(mapper1.class);
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setMapperClass(Map.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(IntWritable.class);
FileOutputFormat.setOutputPath((JobConf) conf, new Path(args[1]));
job.waitForCompletion(true);
}
}
Is there anything wrong with this code?
It fails with a NumberFormatException for the input string "5s". I am not sure where that value comes from. The error is reported at the HCatInputFormat.setInput() line.

Hadoop is skipping reduce phase entirely

I have set up a Hadoop job like so:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Legion");
job.setJarByClass(Legion.class);
job.setMapperClass(CallQualityMap.class);
job.setReducerClass(CallQualityReduce.class);
// Explicitly configure map and reduce outputs, since they're different classes
job.setMapOutputKeyClass(CallSampleKey.class);
job.setMapOutputValueClass(CallSample.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(CombineRepublicInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
CombineRepublicInputFormat.setInputDirRecursive(job, true);
CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
This job completes, but something strange happens: I get one output line per input line. Each output line consists of the output of CallSampleKey.toString(), then a tab, then something like CallSample@17ab34d.
This means that the reduce phase is never running and the CallSampleKey and CallSample are getting passed directly to the TextOutputFormat. But I don't understand why this would be the case. I've very clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!
Edit: Here's the code for the reducer:
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
while (inValues.hasNext()) {
call.addSample(inValues.next());
}
context.write(NullWritable.get(), new Text(call.getStats()));
}
}

What if you try to change your
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
to use Iterable instead of Iterator?
public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {
You'll have to then use inValues.iterator() to get the actual iterator.
If the method signature doesn't match, the method is not an override, and the job falls through to the default identity reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make it easy to detect this kind of typo, but the next best thing is to always use @Override on every method you intend to override so that the compiler can help.
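
The fall-through described above can be reproduced without Hadoop. The sketch below is a plain-Java model; BaseReducer, IteratorReducer and IterableReducer are hypothetical names, and the base class only imitates an identity reduce, it is not Hadoop's Reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a framework reducer base class.
class BaseReducer {
    protected final List<String> output = new ArrayList<>();

    // Default behaviour: identity, one output line per input value.
    protected void reduce(String key, Iterable<Integer> values) {
        for (Integer v : values) {
            output.add(key + "\t" + v);
        }
    }

    // The framework only knows about reduce(String, Iterable<Integer>).
    public final List<String> run(String key, List<Integer> values) {
        reduce(key, values);
        return output;
    }
}

// Iterator instead of Iterable: this is an overload, so run() never calls it.
class IteratorReducer extends BaseReducer {
    public void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.add(key + "\t" + sum);
    }
}

// Same parameter types as the base method: a real override, checked by the compiler.
class IterableReducer extends BaseReducer {
    @Override
    protected void reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.add(key + "\t" + sum);
    }
}

class OverrideDemo {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(new IteratorReducer().run("a", values).size() + " line(s) from the Iterator version");
        System.out.println(new IterableReducer().run("a", values).size() + " line(s) from the Iterable version");
    }
}
```

The Iterator version emits one line per input value, which matches the symptom in the question, while the Iterable version emits a single aggregated line. Adding @Override to the Iterator version would make the compiler reject it.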

hadoop parameters between jobs

I have two jobs.
The first job's map/reduce produces only two values:
context.write(one,two).
The second job's map/reduce needs a file plus those two values.
I tried to put the result of the first job into the configuration obtained from the context.
In the reducer:
@Override
protected void cleanup(Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
conf.setFloat("one",valorOne);
conf.setFloat("two",valorTwo);
}
And I read them in the mapper of the second job:
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
one=conf.getFloat("one",0f);
two=conf.getFloat("two",0f);
}
But the values are empty.
This is the run method:
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "1job");
job.setJarByClass(job.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(ValoresWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(ValoresWritable.class);
job.setMapperClass(Job1Map.class);
job.setCombinerClass(Job1Reducer.class);
job.setReducerClass(Job1Reducer.class);
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
job.waitForCompletion(true);
Job job2 = Job.getInstance(getConf(), "job2");
job2.setJarByClass(job.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
job2.setMapOutputKeyClass(IntWritable.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(IntWritable.class);
job2.setMapperClass(Job2Map.class);
job2.setReducerClass(Job2Reduce.class);
What am I doing wrong?
Thanks

job.setOutputKeyClass and setOutputValueClass in the Driver don't match the reducer's context.write call, yet the program runs fine. How?

Driver code:
public class WcDriver {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "WcDriver");
job.setJarByClass(WcDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.waitForCompletion(true);
}
}
Reducer code
public class WcReducer extends Reducer<Text, LongWritable, Text,String>
{
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
String key1 = null;
int total = 0;
for (LongWritable value : values) {
total += value.get();
key1= key.toString();
}
context.write(new Text(key1), "ABC");
}
}
Here, in the driver class I have set job.setOutputKeyClass(Text.class) and job.setOutputValueClass(LongWritable.class), but in the reducer class I am writing a String: context.write(new Text(key1), "ABC");. I think there should be an error while running the program because the output types do not match, and also because the reducer's key should implement WritableComparable and its value should implement Writable. Strangely, this program runs fine. I do not understand why there is no exception.

Try this:
// job.setOutputFormatClass(TextOutputFormat.class);
// comment out this line, and you will surely get a casting exception.
This is because TextOutputFormat assumes LongWritable as the key and Text as the value. If you don't define the output format class, the job expects the default Writable behaviour; if you do specify it, the record is implicitly cast to the given type.

Try this:
//job.setOutputValueClass(LongWritable.class); if you comment out this line you get an error
This call only declares the key/value pair types. By default, what is actually written depends on the output format,
which here is text, so no error is raised.
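
One reading that fits both answers: a text-style record writer does not need the declared classes at all, because it can format any object through toString(). The sketch below is a simplified plain-Java model of that idea, not Hadoop's actual TextOutputFormat code; TextLineWriter and TypedWriter are hypothetical names. The second class models a writer that checks each record against the declared value class, which is where a mismatch would surface.

```java
// Simplified model: writes key and value as text, whatever their types are.
class TextLineWriter {
    private final StringBuilder out = new StringBuilder();

    public void write(Object key, Object value) {
        // StringBuilder.append(Object) goes through String.valueOf, i.e. toString().
        out.append(key).append('\t').append(value).append('\n');
    }

    public String contents() {
        return out.toString();
    }
}

// Simplified model of a writer that enforces the declared value class.
class TypedWriter {
    private final Class<?> valueClass;
    private final StringBuilder out = new StringBuilder();

    public TypedWriter(Class<?> valueClass) {
        this.valueClass = valueClass;
    }

    public void write(Object key, Object value) {
        if (value.getClass() != valueClass) {
            throw new IllegalArgumentException("wrong value class: "
                    + value.getClass().getName() + " is not " + valueClass.getName());
        }
        out.append(key).append('\t').append(value).append('\n');
    }

    public String contents() {
        return out.toString();
    }
}

class WriterDemo {
    public static void main(String[] args) {
        TextLineWriter text = new TextLineWriter();
        text.write("hello", "ABC"); // the declared value type plays no role here
        System.out.print(text.contents());

        TypedWriter typed = new TypedWriter(Long.class);
        try {
            typed.write("hello", "ABC"); // String where Long was declared
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, the job in the question behaves like TextLineWriter: the String "ABC" is simply written as text, so the mismatch with the declared LongWritable goes unnoticed.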

How can I use MultipleOutputFormat in Hadoop 0.20?

I am working with Hadoop 0.20 and I want to have two reduce output files instead of one. I know that MultipleOutputFormat doesn't work in Hadoop 0.20. I added the hadoop1.1.1-core jar file to the build path of my project in Eclipse, but it still shows the error below.
Here is my code:
public static class ReduceStage extends Reducer<IntWritable, BitSetWritable, IntWritable, Text>
{
private MultipleOutputs mos;
public ReduceStage() {
System.out.println("ReduceStage");
}
public void setup(Context context) {
mos = new MultipleOutputs(context);
}
public void reduce(final IntWritable key, final Iterable<BitSetWritable> values, Context output ) throws IOException, InterruptedException
{
mos.write("text1", key, new Text("Hello"));
}
public void cleanup(Context context) throws IOException {
try {
mos.close();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
And in the run():
FileOutputFormat.setOutputPath(job, ConnectedComponents_Nodes);
job.setOutputKeyClass(MultipleTextOutputFormat.class);
MultipleOutputs.addNamedOutput(job, "text1", TextOutputFormat.class,
IntWritable.class, Text.class);
The error is:
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputName(Lorg/apache/hadoop/mapreduce/JobContext;Ljava/lang/String;)V
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:409)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:348)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:179)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What can I do to have MultipleOutputFormat? Did I use the code right?

You could write your own extension of MultipleTextOutputFormat, put the entire content of the record into the value, and use the file name or path as the key.
There is also the oddjob library, which has a range of output format implementations. The one you want is MultipleLeafValueOutputFormat: it writes to the file specified by the key, and writes only the value.
Now, say you have to write the following pairs and your separator is the tab character ('\t'):
<"key1","value1"> (you want this to be written in filename1)
<"key2","value2"> (you want this to be written in filename2)
The output from the reducer would then be transformed as follows:
<"filename1","key1\tvalue1">
<"filename2","key2\tvalue2">
Also, don't forget that the class described above must be set as the output format class of the job:
conf.setOutputFormat(MultipleLeafValueOutputFormat.class);
One thing to note here is that you will need to work with the old mapred package rather than the mapreduce package. But that shouldn't be a problem.
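
The key/value transformation described in this answer can be sketched in plain Java. This is only a model of the routing idea, not the oddjob or Hadoop API; LeafValueRouting and its methods are hypothetical names, and the fileNameFor rule is an assumption made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeafValueRouting {
    // Hypothetical rule for picking a file name from the original key.
    public static String fileNameFor(String key) {
        return key.replace("key", "filename");
    }

    // Reducer side: the original <key, value> becomes <fileName, "key\tvalue">.
    public static String[] toRoutedPair(String key, String value) {
        return new String[] { fileNameFor(key), key + "\t" + value };
    }

    // Output-format side: the key selects the file, and only the value is written.
    public static Map<String, List<String>> write(List<String[]> routedPairs) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] pair : routedPairs) {
            files.computeIfAbsent(pair[0], name -> new ArrayList<>()).add(pair[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> routed = new ArrayList<>();
        routed.add(toRoutedPair("key1", "value1"));
        routed.add(toRoutedPair("key2", "value2"));
        // Each map entry stands for one output file and its lines.
        System.out.println(write(routed));
    }
}
```

Because the original key travels inside the value, each output line still carries both fields even though the output format writes only the value.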

First, check whether FileOutputFormat.setOutputName has the same code in versions 0.20 and 1.1.1. If it does not, you need to compile and run your code against compatible versions. If it does, there is probably a parameter error in your command.
I encountered the same issue; I removed -Dmapreduce.user.classpath.first=true from the run command and it worked. Hope that helps!
