I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.
I have written the following code to achieve it:
public class TfIdf {
    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
    {
        TfIdf tfIdf = new TfIdf();
        tfIdf.runWordCount();
        tfIdf.runDocWordCount();
        tfIdf.TFIDFComputation();
    }
    public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Count calculation");
        job.setMapperClass(WordFrequencyMapper.class);
        job.setReducerClass(WordFrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("ouput"));
        job.waitForCompletion(true);
    }
    public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Doc count calculation");
        job.setMapperClass(WordCountDocMapper.class);
        job.setReducerClass(WordCountDocReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
        job.waitForCompletion(true);
    }
    public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("TFIDF calculation");
        job.setMapperClass(TFIDFMapper.class);
        job.setReducerClass(TFIDFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output_job2"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
        job.waitForCompletion(true);
    }
}
However I get the error:
Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output
Could anyone help me out with this?
This answer is coming a little late, but... It's just a simple typo in your dir names. You've written your 1st job's output to dir "ouput", and your 2nd job is looking for it in "output".
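One way to make that kind of typo much harder in a chain is to declare each directory once and reuse the same Path as one job's output and the next job's input. A minimal sketch (the path constants and the runJob helper are illustrative, not part of your code; each job would still need its mapper, reducer and key/value classes set as before):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfIdfDriver {

    // Each path is declared exactly once, so two jobs can never disagree on a name.
    private static final Path INPUT      = new Path("input");
    private static final Path WORD_COUNT = new Path("output_job1");
    private static final Path DOC_COUNT  = new Path("output_job2");
    private static final Path TFIDF      = new Path("output_job3");

    public static void main(String[] args) throws Exception {
        // Chain the jobs by feeding each job's output path to the next job's input.
        runJob("Word Count calculation", INPUT, WORD_COUNT);
        runJob("Word Doc count calculation", WORD_COUNT, DOC_COUNT);
        runJob("TFIDF calculation", DOC_COUNT, TFIDF);
    }

    private static void runJob(String name, Path in, Path out) throws Exception {
        Job job = new Job();
        job.setJarByClass(TfIdfDriver.class);
        job.setJobName(name);
        // Mapper/reducer and output key/value classes would be set here, as in the original jobs.
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        if (!job.waitForCompletion(true)) {
            throw new IllegalStateException(name + " failed; stopping the chain");
        }
    }
}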
I have this script and it works on my local machine.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = Job.getInstance(new Configuration(), "ToParquet");
    job.setJarByClass(ToParquet.class);
    job.setMapperClass(BasicsMapper.class);
    job.setMapperClass(RatingsMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.basics.tsv"), TextInputFormat.class, BasicsMapper.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.ratings.tsv"), TextInputFormat.class, RatingsMapper.class);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(GenericRecord.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, getSchema());
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///to_parquet_output"));
    job.waitForCompletion(true);
}
However, when I try to run it in an HDFS environment, the following error message shows up.
I don't know what's going on. If someone can help me, I'd appreciate it.
I'm new to Hadoop and I need to read a Parquet file at the map stage of a MapReduce process. I found the following snippets of code on Cloudera's site:
public static class MyMap extends
        Mapper<LongWritable, Group, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Group value, Context context) throws IOException, InterruptedException {
        NullWritable outKey = NullWritable.get();
        String outputRecord = "";
        // Get the schema and field values of the record
        String inputRecord = value.toString();
        // Process the value, create an output record
        // ...
        context.write(outKey, new Text(outputRecord));
    }
}
Job configuration:
public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    job.setJobName(getClass().getName());
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MyMap.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(ExampleInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
}
The question is: can I use my own types for the key and value instead, and how would I implement that? I mean a sort of POJO that represents one record from the Parquet file.
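For what it's worth, one possible way with the Cloudera snippet above is to keep Group as the map input value and convert it into a plain POJO inside the mapper. A minimal sketch, assuming a hypothetical MovieRecord POJO and Parquet fields named "id" and "name" (adapt the class and field names to your actual schema):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import parquet.example.data.Group;   // or org.apache.parquet.example.data.Group, depending on the Parquet version

public class PojoMap extends Mapper<LongWritable, Group, NullWritable, Text> {

    // Hypothetical POJO representing one Parquet record.
    static class MovieRecord {
        int id;
        String name;

        @Override
        public String toString() {
            return id + "\t" + name;
        }
    }

    @Override
    public void map(LongWritable key, Group value, Context context)
            throws IOException, InterruptedException {
        // Pull typed fields out of the Group by field name and index (0 = first value of that field).
        MovieRecord record = new MovieRecord();
        record.id = value.getInteger("id", 0);
        record.name = value.getString("name", 0);

        context.write(NullWritable.get(), new Text(record.toString()));
    }
}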
I would like to implement a natural language processing algorithm on Hadoop for the Italian language.
I have 2 questions:
how can I find a stemming algorithm for Italian?
how do I integrate it into Hadoop?
Here is my code:
String pathSent=...tagged sentences...;
String pathChunk=....chunked train path....;
File fileSent=new File(pathSent);
File fileChunk=new File(pathChunk);
InputStream inSent=null;
InputStream inChunk=null;
inSent = new FileInputStream(fileSent);
inChunk = new FileInputStream(fileChunk);
POSModel posModel=POSTaggerME.train("it", new WordTagSampleStream((
new InputStreamReader(inSent))), ModelType.MAXENT, null, null, 3, 3);
ObjectStream stringStream =new PlainTextByLineStream(new InputStreamReader(inChunk));
ObjectStream chunkStream = new ChunkSampleStream(stringStream);
ChunkerModel chunkModel=ChunkerME.train("it",chunkStream ,1, 1);
this.tagger= new POSTaggerME(posModel);
this.chunker=new ChunkerME(chunkModel);
inSent.close();
inChunk.close();
You need a grammatical sentence engine:
"io voglio andare a casa"
io, sostantivo
volere, verbo
andare, verbo
a, preposizione semplice
casa, oggetto
when you have the sentence tagged you can teach OpenNLP.
On Hadoop, create a custom Map:
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // your code here
    }
}
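As an illustration of what could go in "your code here", the map() body might run the trained tagger over each input line. This is only a sketch, assuming the POSTaggerME has already been created in setup() from the trained POSModel (loading and distributing the model file, e.g. via the distributed cache, is not shown):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TaggingMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private POSTaggerME tagger;   // assumed to be initialized in setup() from a POSModel

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and tag each token with its part of speech.
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(value.toString());
        String[] tags = tagger.tag(tokens);

        // Emit "token/tag" pairs with a count of 1, word-count style.
        for (int i = 0; i < tokens.length; i++) {
            context.write(new Text(tokens[i] + "/" + tags[i]), one);
        }
    }
}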
On Hadoop, create a custom Reduce:
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(
            Text key,
            java.lang.Iterable<IntWritable> values,
            org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // your reduce here
    }
}
Configure both:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "opennlp");
    job.setJarByClass(CustomOpenNLP.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
I have a very simple "Hello world" style map/reduce job.
public class Tester extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(getClass());
        getConf().set("mapreduce.job.queuename", "adhoc");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapperClass(TesterMapper.class);
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Tester(), args);
        System.exit(exitCode);
    }
}
This implements Tool and is run through ToolRunner, but when it runs, the arguments are not parsed.
$hadoop jar target/manifold-mapreduce-0.1.0.jar ga.manifold.mapreduce.Tester -conf conf.xml etl/manifold/pipeline/ABV1T/ingest/input etl/manifold/pipeline/ABV1T/ingest/output
15/02/04 16:35:24 INFO client.RMProxy: Connecting to ResourceManager at lxjh116-pvt.phibred.com/10.56.100.23:8050
15/02/04 16:35:25 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
I can verify that the configuration is not being added.
Anyone know why Hadoop thinks the ToolRunner isn't implemented?
$hadoop version
Hadoop 2.4.0.2.1.2.0-402
Hortonworks
Thanks,
Chris
Since your question pops up near the top of a Google search for this warning, I'll give a proper answer here:
As you said yourself, user1797538 (sorry about that): "The problem was the call to get a Job instance."
The superclass Configured must be used. As its name suggests, it is already configured, so the Tester class must use the existing Configuration rather than setting a new, empty one.
If we extract the Job creation into a method:
private Job createJob() throws IOException {
    // On this line use getConf() instead of new Configuration()
    Job job = Job.getInstance(getConf(), Tester.class.getCanonicalName());
    // Other job setter calls here, for example:
    job.setJarByClass(Tester.class);
    job.setMapperClass(TesterMapper.class);
    job.setCombinerClass(TesterReducer.class);
    job.setReducerClass(TesterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // adapt this to your needs of course
    return job;
}
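The run() method can then use this helper; a sketch of how it might look inside the same Tester class, reusing the question's argument handling:

@Override
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.printf("Usage: %s [generic options] <input> <output>\n",
                getClass().getSimpleName());
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
    }
    // getConf() already holds whatever ToolRunner parsed from -conf, -D, -fs, etc.,
    // so extra settings go onto it before the Job is created from it.
    getConf().set("mapreduce.job.queuename", "adhoc");

    Job job = createJob();
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}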
Another example from the Javadoc: org.apache.hadoop.util.Tool
And the Javadoc: Configured.getConf()
My MapReduce structure:
public class ChainingMapReduce {

    public static class ChainingMapReduceMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceMapper1
            extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }

    public static class ChainingMapReduceReducer1
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // code
        }
    }
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "First");
        job.setJarByClass(ChainingMapReduce.class);
        job.setMapperClass(ChainingMapReduceMapper.class);
        job.setCombinerClass(ChainingMapReduceReducer.class);
        job.setReducerClass(ChainingMapReduceReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/home/Desktop/log"));
        FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output"));
        job.waitForCompletion(true);
        System.out.println("First Job Completed.....Starting Second Job");
        System.out.println(job.isSuccessful());
        /* FileSystem hdfs = FileSystem.get(conf);
        Path fromPath = new Path("/home/Desktop/temp/output/part-r-00000");
        Path toPath = new Path("/home/Desktop/temp/output1");
        hdfs.rename(fromPath, toPath);
        conf.clear();
        */
        if (job.isSuccessful()) {
            Configuration conf1 = new Configuration();
            Job job1 = new Job(conf1, "Second");
            job1.setJarByClass(ChainingMapReduce.class);
            job1.setMapperClass(ChainingMapReduceMapper1.class);
            job1.setCombinerClass(ChainingMapReduceReducer1.class);
            job1.setReducerClass(ChainingMapReduceReducer1.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000"));
            FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
            System.exit(job1.waitForCompletion(true) ? 0 : 1);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
When I run this program, the first job executes perfectly, and after that the following error comes up:
First Job Completed.....Starting Second Job true
12/01/27 15:24:21 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/01/27 15:24:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/01/27 15:24:21 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/01/27 15:24:21 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop/mapred/staging/4991311720439552/.staging/job_local_0002
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
    at ChainingMapReduce.main(ChainingMapReduce.java:129)
I tried using "conf" for both jobs, and also "conf" and "conf1" for the respective jobs.
Change
FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000"));
FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
to
FileInputFormat.addInputPath(job1, new Path("/home/Desktop/temp/output/part-r-00000"));
FileOutputFormat.setOutputPath(job1, new Path("/home/Desktop/temp/output1"));
for the second job.
Also consider using o.a.h.mapred.jobcontrol.Job (or its new-API counterpart, sketched below) and Apache Oozie.
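For reference, a rough sketch of chaining the two jobs with the new-API JobControl/ControlledJob classes instead, assuming job and job1 are fully configured as above (the class name and group name here are just illustrative):

import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainWithJobControl {

    // Runs "second" only after "first" succeeds; both jobs are assumed to be
    // fully configured (mapper, reducer, input/output paths) before this call.
    public static void runChain(Job first, Job second) throws IOException, InterruptedException {
        ControlledJob firstCtrl = new ControlledJob(first, null);
        ControlledJob secondCtrl = new ControlledJob(second, null);
        secondCtrl.addDependingJob(firstCtrl);

        JobControl control = new JobControl("chaining");
        control.addJob(firstCtrl);
        control.addJob(secondCtrl);

        // JobControl is a Runnable: drive it in its own thread and poll until both jobs finish.
        Thread controller = new Thread(control);
        controller.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}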