I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.
I have written the following code to achieve it:
public class TfIdf {
    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
    {
        TfIdf tfIdf = new TfIdf();
        tfIdf.runWordCount();
        tfIdf.runDocWordCount();
        tfIdf.TFIDFComputation();
    }
    public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Count calculation");
        job.setMapperClass(WordFrequencyMapper.class);
        job.setReducerClass(WordFrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("ouput"));
        job.waitForCompletion(true);
    }
    public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Doc count calculation");
        job.setMapperClass(WordCountDocMapper.class);
        job.setReducerClass(WordCountDocReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
        job.waitForCompletion(true);
    }
    public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("TFIDF calculation");
        job.setMapperClass(TFIDFMapper.class);
        job.setReducerClass(TFIDFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output_job2"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
        job.waitForCompletion(true);
    }
}
However I get the error:
Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output
Could anyone help me out with this?
This answer is coming a little late, but... It's just a simple typo in your dir names. You've written your 1st job's output to dir "ouput", and your 2nd job is looking for it in "output".
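One way to make that kind of typo much harder in a chain is to declare each directory once and reuse the same Path as one job's output and the next job's input. A minimal sketch (the path constants and the runJob helper are illustrative, not part of your code; each job would still need its mapper, reducer and key/value classes set as before):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfIdfDriver {

    // Each path is declared exactly once, so two jobs can never disagree on a name.
    private static final Path INPUT      = new Path("input");
    private static final Path WORD_COUNT = new Path("output_job1");
    private static final Path DOC_COUNT  = new Path("output_job2");
    private static final Path TFIDF      = new Path("output_job3");

    public static void main(String[] args) throws Exception {
        // Chain the jobs by feeding each job's output path to the next job's input.
        runJob("Word Count calculation", INPUT, WORD_COUNT);
        runJob("Word Doc count calculation", WORD_COUNT, DOC_COUNT);
        runJob("TFIDF calculation", DOC_COUNT, TFIDF);
    }

    private static void runJob(String name, Path in, Path out) throws Exception {
        Job job = new Job();
        job.setJarByClass(TfIdfDriver.class);
        job.setJobName(name);
        // Mapper/reducer and output key/value classes would be set here, as in the original jobs.
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        if (!job.waitForCompletion(true)) {
            throw new IllegalStateException(name + " failed; stopping the chain");
        }
    }
}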
I have this script and it works on my local machine.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = Job.getInstance(new Configuration(), "ToParquet");
    job.setJarByClass(ToParquet.class);
    job.setMapperClass(BasicsMapper.class);
    job.setMapperClass(RatingsMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.basics.tsv"), TextInputFormat.class, BasicsMapper.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.ratings.tsv"), TextInputFormat.class, RatingsMapper.class);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(GenericRecord.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, getSchema());
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///to_parquet_output"));
    job.waitForCompletion(true);
}
However, when I try to run it in an HDFS environment, the following error message shows up.
I don't know what's going on. If someone can help me, I'd appreciate it.
I'm new to Hadoop and I need to read a Parquet file at the map stage of a MapReduce process. I found the following snippets of code on Cloudera's site:
public static class MyMap extends
        Mapper<LongWritable, Group, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Group value, Context context) throws IOException, InterruptedException {
        NullWritable outKey = NullWritable.get();
        String outputRecord = "";
        // Get the schema and field values of the record
        String inputRecord = value.toString();
        // Process the value, create an output record
        // ...
        context.write(outKey, new Text(outputRecord));
    }
}
Job configuration:
public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    job.setJobName(getClass().getName());
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MyMap.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(ExampleInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
}
The question is: can I use my own types for the key and value instead, and how would I implement that? I mean a sort of POJO that represents one record from the Parquet file.
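For what it's worth, one possible way with the Cloudera snippet above is to keep Group as the map input value and convert it into a plain POJO inside the mapper. A minimal sketch, assuming a hypothetical MovieRecord POJO and Parquet fields named "id" and "name" (adapt the class and field names to your actual schema):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import parquet.example.data.Group;   // or org.apache.parquet.example.data.Group, depending on the Parquet version

public class PojoMap extends Mapper<LongWritable, Group, NullWritable, Text> {

    // Hypothetical POJO representing one Parquet record.
    static class MovieRecord {
        int id;
        String name;

        @Override
        public String toString() {
            return id + "\t" + name;
        }
    }

    @Override
    public void map(LongWritable key, Group value, Context context)
            throws IOException, InterruptedException {
        // Pull typed fields out of the Group by field name and index (0 = first value of that field).
        MovieRecord record = new MovieRecord();
        record.id = value.getInteger("id", 0);
        record.name = value.getString("name", 0);

        context.write(NullWritable.get(), new Text(record.toString()));
    }
}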
I would like to implement a natural language processing algorithm on Hadoop for the Italian language.
I have 2 questions:
how can I find a stemming algorithm for Italian?
how do I integrate it into Hadoop?
Here is my code:
String pathSent=...tagged sentences...;
String pathChunk=....chunked train path....;
File fileSent=new File(pathSent);
File fileChunk=new File(pathChunk);
InputStream inSent=null;
InputStream inChunk=null;
inSent = new FileInputStream(fileSent);
inChunk = new FileInputStream(fileChunk);
POSModel posModel=POSTaggerME.train("it", new WordTagSampleStream((
new InputStreamReader(inSent))), ModelType.MAXENT, null, null, 3, 3);
ObjectStream stringStream =new PlainTextByLineStream(new InputStreamReader(inChunk));
ObjectStream chunkStream = new ChunkSampleStream(stringStream);
ChunkerModel chunkModel=ChunkerME.train("it",chunkStream ,1, 1);
this.tagger= new POSTaggerME(posModel);
this.chunker=new ChunkerME(chunkModel);
inSent.close();
inChunk.close();
You need a grammatical sentence engine:
"io voglio andare a casa"
io, sostantivo
volere, verbo
andare, verbo
a, preposizione semplice
casa, oggetto
when you have the sentence tagged you can teach OpenNLP.
On Hadoop, create a custom Map:
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // your code here
    }
}
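As an illustration of what could go in "your code here", the map() body might run the trained tagger over each input line. This is only a sketch, assuming the POSTaggerME has already been created in setup() from the trained POSModel (loading and distributing the model file, e.g. via the distributed cache, is not shown):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TaggingMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private POSTaggerME tagger;   // assumed to be initialized in setup() from a POSModel

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and tag each token with its part of speech.
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(value.toString());
        String[] tags = tagger.tag(tokens);

        // Emit "token/tag" pairs with a count of 1, word-count style.
        for (int i = 0; i < tokens.length; i++) {
            context.write(new Text(tokens[i] + "/" + tags[i]), one);
        }
    }
}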
On Hadoop, create a custom Reduce:
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(
            Text key,
            java.lang.Iterable<IntWritable> values,
            org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // your reduce here
    }
}
Configure both:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "opennlp");
    job.setJarByClass(CustomOpenNLP.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
I have a very simple "Hello world" style map/reduce job.
public class Tester extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(getClass());
        getConf().set("mapreduce.job.queuename", "adhoc");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapperClass(TesterMapper.class);
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Tester(), args);
        System.exit(exitCode);
    }
}
This implements Tool and is run through ToolRunner, but when it runs, the arguments are not parsed.
$hadoop jar target/manifold-mapreduce-0.1.0.jar ga.manifold.mapreduce.Tester -conf conf.xml etl/manifold/pipeline/ABV1T/ingest/input etl/manifold/pipeline/ABV1T/ingest/output
15/02/04 16:35:24 INFO client.RMProxy: Connecting to ResourceManager at lxjh116-pvt.phibred.com/10.56.100.23:8050
15/02/04 16:35:25 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
I can verify that the configuration is not being added.
Anyone know why Hadoop thinks the ToolRunner isn't implemented?
$hadoop version
Hadoop 2.4.0.2.1.2.0-402
Hortonworks
Thanks,
Chris
Since your question pops up near the top of a Google search for this warning, I'll give a proper answer here:
As you said yourself, user1797538 (sorry about that): "The problem was the call to get a Job instance."
The superclass Configured must be used. As its name suggests, it is already configured, so the Tester class must use the existing Configuration rather than setting a new, empty one.
If we extract the Job creation into a method:
private Job createJob() throws IOException {
    // On this line use getConf() instead of new Configuration()
    Job job = Job.getInstance(getConf(), Tester.class.getCanonicalName());
    // Other job setter calls here, for example:
    job.setJarByClass(Tester.class);
    job.setMapperClass(TesterMapper.class);
    job.setCombinerClass(TesterReducer.class);
    job.setReducerClass(TesterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // adapt this to your needs of course
    return job;
}
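The run() method can then use this helper; a sketch of how it might look inside the same Tester class, reusing the question's argument handling:

@Override
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.printf("Usage: %s [generic options] <input> <output>\n",
                getClass().getSimpleName());
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
    }
    // getConf() already holds whatever ToolRunner parsed from -conf, -D, -fs, etc.,
    // so extra settings go onto it before the Job is created from it.
    getConf().set("mapreduce.job.queuename", "adhoc");

    Job job = createJob();
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}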
Another example from the Javadoc: org.apache.hadoop.util.Tool
And the Javadoc: Configured.getConf()
My MapReduce structure:
public class ChainingMapReduce {

    public static class ChainingMapReduceMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceMapper1
            extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceReducer1
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "First");
        job.setJarByClass(ChainingMapReduce.class);
        job.setMapperClass(ChainingMapReduceMapper.class);
        job.setCombinerClass(ChainingMapReduceReducer.class);
        job.setReducerClass(ChainingMapReduceReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/home/Desktop/log"));
        FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output"));
        job.waitForCompletion(true);
        System.out.println("First Job Completed.....Starting Second Job");
        System.out.println(job.isSuccessful());
        /* FileSystem hdfs = FileSystem.get(conf);
        Path fromPath = new Path("/home/Desktop/temp/output/part-r-00000");
        Path toPath = new Path("/home/Desktop/temp/output1");
        hdfs.rename(fromPath, toPath);
        conf.clear();
        */
        if (job.isSuccessful()) {
            Configuration conf1 = new Configuration();
            Job job1 = new Job(conf1, "Second");
            job1.setJarByClass(ChainingMapReduce.class);
            job1.setMapperClass(ChainingMapReduceMapper1.class);
            job1.setCombinerClass(ChainingMapReduceReducer1.class);
            job1.setReducerClass(ChainingMapReduceReducer1.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000"));
            FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
            System.exit(job1.waitForCompletion(true) ? 0 : 1);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
When I run this program, the first job executes perfectly, and after that the following error comes up:
First Job Completed.....Starting Second Job true
12/01/27 15:24:21 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/01/27 15:24:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/01/27 15:24:21 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/01/27 15:24:21 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop/mapred/staging/4991311720439552/.staging/job_local_0002
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
    at ChainingMapReduce.main(ChainingMapReduce.java:129)
I tried using "conf" for both jobs, and also "conf" and "conf1" for the respective jobs.
Change
FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000"));
FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
to
FileInputFormat.addInputPath(job1, new Path("/home/Desktop/temp/output/part-r-00000"));
FileOutputFormat.setOutputPath(job1, new Path("/home/Desktop/temp/output1"));
for the second job.
Also consider using o.a.h.mapred.jobcontrol.Job (or its new-API counterpart, sketched below) and Apache Oozie.
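For reference, a rough sketch of chaining the two jobs with the new-API JobControl/ControlledJob classes instead, assuming job and job1 are fully configured as above (the class name and group name here are just illustrative):

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainWithJobControl {

    // Runs "second" only after "first" succeeds; both jobs are assumed to be
    // fully configured (mapper, reducer, input/output paths) before this call.
    public static void runChain(Job first, Job second) throws IOException, InterruptedException {
        ControlledJob firstCtrl = new ControlledJob(first, null);
        ControlledJob secondCtrl = new ControlledJob(second, null);
        secondCtrl.addDependingJob(firstCtrl);

        JobControl control = new JobControl("chaining");
        control.addJob(firstCtrl);
        control.addJob(secondCtrl);

        // JobControl is a Runnable: drive it in its own thread and poll until both jobs finish.
        Thread controller = new Thread(control);
        controller.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}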