Emitting different value class than the one declared [MapReduce] - java

I am using Jimmy Lin's GitHub repo [1] in my project. However, it has come to my attention that ArrayListOfDoublesWritable is returning DoubleWritable. This is not a problem when I use it in the Reduce phase, i.e.:
public static class Reduce extends Reducer<Text,DoubleWritable,Text,ArrayListOfDoublesWritable>
The reason is that I can set the parameter of setOutputValueClass to DoubleWritable.class.
But this does not seem to work when I use it in the Map phase: Hadoop complains that it was expecting DoubleWritable while actually receiving ArrayListOfDoublesWritable.
Is there any way to set the map output value class to be different from the one declared? I have gone through the method setMapOutputValueClass, but that did not seem to be the fix for this problem.
--Driver--
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); // get all args
if (otherArgs.length != 2) {
    System.err.println("Usage: WordCount <in> <out>");
    System.exit(2);
}
Job job = new Job(conf, "job 1");
job.setJarByClass(Kira.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// set output key type
job.setOutputKeyClass(Text.class);
// set output value type
job.setOutputValueClass(DoubleWritable.class);
//set the HDFS path of the input data
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
// set the HDFS path for the output
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
//Wait till job completion
System.exit(job.waitForCompletion(true) ? 0 : 1);
Notice that my output value class is set to DoubleWritable.class although I have declared ArrayListOfDoublesWritable.
How can I possibly do this for the Mapper?
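For reference, this is roughly the driver change I experimented with (only a sketch; the classes are the ones from my job), since setMapOutputKeyClass/setMapOutputValueClass are meant to let the map output types be declared separately from the job output types:
// declare the intermediate (map output) types explicitly; when these calls are
// omitted, Hadoop assumes the map output types equal the job output types below
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);
// declared job (reducer) output types, as before
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
This still did not resolve the mismatch for me.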

Related

Map reduce Driver Code not Working

I am learning Big Data Hadoop on my own and I wrote a simple MapReduce word count program which is not working. Please have a look:
// importing all classes
public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String Line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(Line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum = sum + x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    }
}
But after replacing these lines in the driver code:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
with these:
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
it works properly.
May I know the reason, and what these lines are for?
I am assuming that when you say it is not working, you are getting the error below:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/<<your_output_directory>> already exists
The output directory must not exist before the MapReduce job is submitted, which is why you got the above exception.
The new lines of code that you added to the driver get the FileSystem (either local or HDFS, based on the conf object) from the path and delete the output path before the MapReduce job is submitted. Now the job runs because the output directory no longer exists when it starts.
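To spell out what each of those new lines does, here is the same fragment with comments (only the standard Path/FileSystem/Job APIs already used in your driver):
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// delete any previous output so the job does not fail with FileAlreadyExistsException;
// getFileSystem(conf) resolves to the local file system or HDFS depending on the configuration
outputPath.getFileSystem(conf).delete(outputPath);
// submit the job and block until it finishes; note that your original main method
// never called waitForCompletion, so without this line the job is only configured, never run
System.exit(job.waitForCompletion(true) ? 0 : 1);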
Can you please tell me what error you were getting earlier?
This error is probably related to the output path already existing.
The code you added deletes the output path every time: if the output path already exists, it deletes it.
You can write that code like this:
Path outputPath = new Path(args[1]);
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true);
}

NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs

I am trying to build an inverted index.
I chain two jobs.
Basically, the first job parses and cleans the input and stores the result in a folder 'output', which is the input folder for the second job.
The second job is supposed to actually build the inverted index.
When I just had the first job, it worked fine (at least, there were no exceptions).
I chain two jobs like this:
public class Main {
    public static void main(String[] args) throws Exception {
        String inputPath = args[0];
        String outputPath = args[1];
        String stopWordsPath = args[2];
        String finalOutputPath = args[3];

        Configuration conf = new Configuration();
        conf.set("job.stopwords.path", stopWordsPath);

        Job job = Job.getInstance(conf, "Tokenize");
        job.setJobName("Tokenize");
        job.setJarByClass(TokenizerMapper.class);
        job.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PostingListEntry.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PostingListEntry.class);
        job.setOutputFormatClass(MapFileOutputFormat.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(TokenizerReducer.class);

        // Delete the output directory if it exists already.
        Path outputDir = new Path(outputPath);
        FileSystem.get(conf).delete(outputDir, true);

        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");

        //-------------------------------------------------------------------------

        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "BuildIndex");
        job2.setJobName("BuildIndex");
        job2.setJarByClass(InvertedIndexMapper.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        job2.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(job2, new Path(outputPath));
        FileOutputFormat.setOutputPath(job2, new Path(finalOutputPath));
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(PostingListEntry.class);
        job2.setMapperClass(InvertedIndexMapper.class);
        job2.setReducerClass(InvertedIndexReducer.class);

        // Delete the output directory if it exists already.
        Path finalOutputDir = new Path(finalOutputPath);
        FileSystem.get(conf2).delete(finalOutputDir, true);

        startTime = System.currentTimeMillis();
        // THIS LINE GIVES ERROR:
        job2.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
}
I get an
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at Main.main(Main.java:79)
What is wrong with this configuration, and how should I chain the jobs?
It isn't clear if you're intentionally using MapFileOutputFormat as the output format in the first job. The more common approach is to use SequenceFileOutputFormat with SequenceFileInputFormat as the input format in the second job.
At the moment, you've specified MapFileOutputFormat as the output format of the first job and no input format in the second, so the second job will default to TextInputFormat, which is unlikely to work.
Looking at your TokenizerReducer class, the signature of the reduce method is incorrect. You have:
public void reduce(Text key, Iterator<PostingListEntry> values, Context context)
It should be:
public void reduce(Text key, Iterable<PostingListEntry> values, Context context)
Because of this, Hadoop never calls your implementation, so you just get an identity reduce.
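If you do switch to sequence files for the intermediate data, the driver changes are small. A minimal sketch, assuming the standard SequenceFileOutputFormat and SequenceFileInputFormat classes from org.apache.hadoop.mapreduce.lib.output and org.apache.hadoop.mapreduce.lib.input, and the job/job2 objects from your Main class:
// first job: write the intermediate key/value pairs as a sequence file
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// second job: read that output back with the matching input format
job2.setInputFormatClass(SequenceFileInputFormat.class);
PostingListEntry should keep working as the value type as long as it is a Writable, which MapFileOutputFormat already requires.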

Change the default delimiter of the mapreduce

Hi, I am a beginner with MapReduce, and I want to write the WordCount program so that it outputs the K/V pairs. The problem is that I don't want to use the tab character as the key/value delimiter in the output file. How can I change it?
The code I use is slightly different from the example one. Here is the driver class.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Job1");
job.setJarByClass(Simpletask.class);
job.setMapperClass(TokenizerMapper.class);
//job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
Since I want the file name to correspond to the reducer partition, I use MultipleOutputs.write() in the reduce function, and thus the code is slightly different.
public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    String accu = "";
    for (Text val : values) {
        String[] entry = val.toString().split(",");
        String MBR = entry[1];
        // ASSUME MBR IS ENTRY 1. IT CAN BE REPLACED BY INVOKING FUNCTION TO CALCULATE MBR([COORDINATES])
        String mes_line = entry[0] + ",MBR" + MBR + " ";
        result.set(mes_line);
        mos.write(key, result, generateFileName(key));
    }
}
Any help will be appreciated! Thank you!
Since you are using the default TextInputFormat (a FileInputFormat), the key is the line offset in the file and the value is a line from the input file. It's up to the mapper to split the input line on any delimiter; you can do that on the record you read in the map method. This default behaviour comes from the specific input format, e.g. TextInputFormat.
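For example, here is a minimal sketch of such a mapper, matching the IntWritable/Text output types in your driver (the comma delimiter and the choice of the first field as the key are placeholders for whatever your records actually need):
public static class TokenizerMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the whole line of text
        String line = value.toString();
        int sep = line.indexOf(',');                         // assumes comma-separated input lines
        int outKey = Integer.parseInt(line.substring(0, sep).trim());
        // emit the first field as the key and the remaining fields, still comma-separated,
        // as the value so the reducer can split them again
        context.write(new IntWritable(outKey), new Text(line.substring(sep + 1)));
    }
}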

Mapreduce job to HBase throws IOException: Pass a Delete or a Put

I am trying to output to an HBase table directly from my Mapper, using Hadoop 2.4.0 with HBase 0.94.18 on EMR.
I am getting a nasty IOException: Pass a Delete or a Put when executing the code below.
public class TestHBase {

    static class ImportMapper extends Mapper<MyKey, MyValue, ImmutableBytesWritable, Writable> {
        private byte[] family = Bytes.toBytes("f");

        @Override
        public void map(MyKey key, MyValue value, Context context)
                throws IOException, InterruptedException {
            MyItem item = //do some stuff with key/value and create item
            byte[] rowKey = Bytes.toBytes(item.getKey());
            Put put = new Put(rowKey);
            for (String attr : Arrays.asList("a1", "a2", "a3")) {
                byte[] qualifier = Bytes.toBytes(attr);
                put.add(family, qualifier, Bytes.toBytes(item.get(attr)));
            }
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String input = args[0];
        String table = "table";
        Job job = Job.getInstance(conf, "stuff");
        job.setJarByClass(ImportMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(input));
        TableMapReduceUtil.initTableReducerJob(
                table,  // output table
                null,   // reducer class
                job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Does anyone know what I am doing wrong?
Stacktrace
Error: java.io.IOException: Pass a Delete or a Put
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:125)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:84)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:646)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:775)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
It would be better if you could show the full stack trace so that I can help you solve it more easily. I have not executed your code, but from what I can see this could be the issue:
job.setNumReduceTasks(0);
With zero reduce tasks, the mapper is expected to write your Put objects directly to Apache HBase.
You can increase the number of reduce tasks, or check the API for its default value and comment that line out.
Thanks for adding the stack trace. Unfortunately you didn't include the code that threw the exception so I can't fully trace it for you. Instead I did a little searching around and discovered a few things for you.
Your stack trace is similar to one in another SO question here:
Pass a Delete or a Put error in hbase mapreduce
That one solved the issue by commenting out job.setNumReduceTasks(0);
There is a similar SO question that had the same exception but couldn't solve the problem that way. Instead it was having a problem with annotations:
"java.io.IOException: Pass a Delete or a Put" when reading HDFS and storing HBase
Here are some good examples of how to write working code both with setNumReduceTasks at 0 and at 1 or more.
"51.2. HBase MapReduce Read/Write Example
The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
null, // mapper output key
null, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
targetTable, // output table
null, // reducer class
job);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
This is the example with 1 or more reduce tasks:
"51.4. HBase MapReduce Summary to HBase Example
The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
Text.class, // mapper output key
IntWritable.class, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
targetTable, // output table
MyTableReducer.class, // reducer class
job);
job.setNumReduceTasks(1); // at least one, adjust as required
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
http://hbase.apache.org/book.html#mapreduce.example
You seem to be more closely following the first example. I wanted to show that sometimes there is a reason to set the number of reduce tasks to zero.

java.lang.RuntimeException: java.lang.InstantiationException running mapreduce code on eclipse using cygwin

Hi, I am running MapReduce code in Eclipse using Cygwin. I am able to run the wordcount program successfully in this environment, but for my new code I am getting the exception below.
My program does not have any reducer class. I also debugged the code in Eclipse: all mapper tasks run successfully and write their output to the context. After that, the exception is thrown. Temporary output folders are created, but there is no final output.
Please help me to solve this problem.
Thanks
java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:530)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:410)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:215)
Caused by: java.lang.InstantiationException
at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
... 3 more
Please find my main function below.
public static void main(String[] args) throws Exception {
if (args.length < 5) {
    System.out.println("Arguments: [model] [dictionnary] [document frequency] [tweet file] [output directory]");
    return;
}
String modelPath = args[0];
String dictionaryPath = args[1];
String documentFrequencyPath = args[2];
String tweetsPath = args[3];
String outputPath = args[4];
Configuration conf = new Configuration();
conf.setStrings(Classifier.MODEL_PATH_CONF, modelPath);
conf.setStrings(Classifier.DICTIONARY_PATH_CONF, dictionaryPath);
conf.setStrings(Classifier.DOCUMENT_FREQUENCY_PATH_CONF, documentFrequencyPath);
// do not create a new jvm for each task
conf.setLong("mapred.job.reuse.jvm.num.tasks", -1);
Job job = new Job(conf, "classifier");
job.setJarByClass(MapReduceClassifier.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(ClassifierMap.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(tweetsPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.waitForCompletion(true);
}
This looks similar to your issue:
InstantiationException in hadoop map reduce program
You may want to check that none of the classes you provided to the job are abstract.
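One more thing worth checking, though it is only a guess from the posted code and stack trace: the failure happens while the reduce task instantiates a class via reflection, yet you say the program has no reducer at all. If the job really is map-only, you can declare it as such so that no reduce task is created in the first place:
// run as a map-only job: the mapper output goes straight to the output format
// and the reduce phase (where the InstantiationException is thrown) is skipped
job.setNumReduceTasks(0);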
