I have this MapReduce job, and it works on my local machine.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration(), "ToParquet");
job.setJarByClass(ToParquet.class);
job.setMapperClass(BasicsMapper.class);
job.setMapperClass(RatingsMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path("hdfs:///title.basics.tsv"), TextInputFormat.class, BasicsMapper.class);
MultipleInputs.addInputPath(job, new Path("hdfs:///title.ratings.tsv"), TextInputFormat.class, RatingsMapper.class);
job.setOutputKeyClass(Void.class);
job.setOutputValueClass(GenericRecord.class);
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, getSchema());
FileOutputFormat.setOutputPath(job, new Path("hdfs:///to_parquet_output"));
job.waitForCompletion(true);
}
However, when I try to run it in the HDFS environment, the following error message shows up.
I don't know what's going on. If someone can help me, I would appreciate it.
I'm new to Hadoop and I need to read a Parquet file at the map stage of a MapReduce process. I've found the following snippet of code from Cloudera:
public static class MyMap extends
Mapper<LongWritable, Group, NullWritable, Text> {
@Override
public void map(LongWritable key, Group value, Context context) throws IOException, InterruptedException {
NullWritable outKey = NullWritable.get();
String outputRecord = "";
// Get the schema and field values of the record
String inputRecord = value.toString();
// Process the value, create an output record
// ...
context.write(outKey, new Text(outputRecord));
}
}
Job configuration:
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
job.setJobName(getClass().getName());
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMap.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(ExampleInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
return 0;
}
The question is: can I use my own types for the key and value instead, and how would I implement that? I mean some sort of POJO that represents one record from the Parquet file.
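The input value type itself is fixed by ExampleInputFormat, which always hands the mapper a Group, but each Group can be converted into a POJO at the top of map(). A minimal sketch, assuming a hypothetical MyRecord class whose field names ("id", "count") would have to match the actual Parquet schema:
// Hypothetical POJO for one Parquet record; the field names are assumptions.
public static class MyRecord {
    public final String id;
    public final int count;
    public MyRecord(Group g) {
        // Group getters take a field name and a repetition index (0 for non-repeated fields)
        this.id = g.getString("id", 0);
        this.count = g.getInteger("count", 0);
    }
}
public static class MyMap extends Mapper<LongWritable, Group, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Group value, Context context) throws IOException, InterruptedException {
        MyRecord record = new MyRecord(value); // wrap the Group in the POJO
        context.write(NullWritable.get(), new Text(record.id + "\t" + record.count));
    }
}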
I would like to implement a natural language processing algorithm on Hadoop for the Italian language.
I have two questions:
How can I find a stemming algorithm for Italian?
How do I integrate it into Hadoop?
Here is my code:
String pathSent=...tagged sentences...;
String pathChunk=....chunked train path....;
File fileSent=new File(pathSent);
File fileChunk=new File(pathChunk);
InputStream inSent=null;
InputStream inChunk=null;
inSent = new FileInputStream(fileSent);
inChunk = new FileInputStream(fileChunk);
POSModel posModel = POSTaggerME.train("it", new WordTagSampleStream(new InputStreamReader(inSent)), ModelType.MAXENT, null, null, 3, 3);
ObjectStream stringStream =new PlainTextByLineStream(new InputStreamReader(inChunk));
ObjectStream chunkStream = new ChunkSampleStream(stringStream);
ChunkerModel chunkModel=ChunkerME.train("it",chunkStream ,1, 1);
this.tagger= new POSTaggerME(posModel);
this.chunker=new ChunkerME(chunkModel);
inSent.close();
inChunk.close();
You need grammatically tagged sentences as training data, for example:
"io voglio andare a casa" ("I want to go home")
io, sostantivo (noun)
volere, verbo (verb)
andare, verbo (verb)
a, preposizione semplice (simple preposition)
casa, oggetto (object)
Once you have sentences tagged like this, you can train OpenNLP on them.
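Once trained, the models built in the question's code can be used to tag new sentences, for example (a sketch against the OpenNLP 1.5 API used above; tagger and chunker are the fields built there):
// opennlp.tools.tokenize.WhitespaceTokenizer splits the sentence on whitespace
String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("io voglio andare a casa");
String[] posTags = tagger.tag(tokens);               // one POS tag per token
String[] chunkTags = chunker.chunk(tokens, posTags); // chunk labels built on top of the POS tags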
On Hadoop, create a custom Mapper:
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // your code here
    }
}
On Hadoop, create a custom Reducer:
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // your reduce here
    }
}
Configure both:
public static void main(String[] args)
throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "opennlp");
job.setJarByClass(CustomOpenNLP.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
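Packaged into a jar, the job is then launched in the usual way; the jar name and paths below are placeholders:
hadoop jar opennlp-job.jar CustomOpenNLP /user/me/input /user/me/output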
I have a very simple "Hello world" style map/reduce job.
public class Tester extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = Job.getInstance(new Configuration());
job.setJarByClass(getClass());
getConf().set("mapreduce.job.queuename", "adhoc");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setMapperClass(TesterMapper.class);
job.setNumReduceTasks(0);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Tester(), args);
System.exit(exitCode);
    }
}
It implements Tool and is run via ToolRunner, but the arguments are not being parsed at run time.
$hadoop jar target/manifold-mapreduce-0.1.0.jar ga.manifold.mapreduce.Tester -conf conf.xml etl/manifold/pipeline/ABV1T/ingest/input etl/manifold/pipeline/ABV1T/ingest/output
15/02/04 16:35:24 INFO client.RMProxy: Connecting to ResourceManager at lxjh116-pvt.phibred.com/10.56.100.23:8050
15/02/04 16:35:25 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
I can verify that the configuration is not being added.
Anyone know why Hadoop thinks the ToolRunner isn't implemented?
$hadoop version
Hadoop 2.4.0.2.1.2.0-402
Hortonworks
Thanks,
Chris
Since this question pops up near the top of a Google search for this warning, I'll give a proper answer here.
As you (user1797538) said yourself:
user1797538: "The problem was the call to get a Job instance"
The superclass Configured must be used. As its name suggests, it already carries a Configuration, so the Tester class must use that existing Configuration instead of creating a new, empty one.
If we extract the Job creation into a method:
private Job createJob() throws IOException {
// On this line use getConf() instead of new Configuration()
Job job = Job.getInstance(getConf(), Tester.class.getCanonicalName());
// Other job setter call here, for example
job.setJarByClass(Tester.class);
job.setMapperClass(TesterMapper.class);
job.setCombinerClass(TesterReducer.class);
job.setReducerClass(TesterReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// adapt this to your needs of course.
return job;
}
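With that helper in place, the Tool's run() method can be as simple as this (a sketch; the argument handling is the same as in the question):
@Override
public int run(String[] args) throws Exception {
    Job job = createJob(); // built from getConf(), so -conf and -D options are honored
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}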
Another example can be found in the Javadoc for org.apache.hadoop.util.Tool, and see also the Javadoc for Configured.getConf().
I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.
I have written the following code to achieve it:
public class TfIdf {
public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
{
TfIdf tfIdf = new TfIdf();
tfIdf.runWordCount();
tfIdf.runDocWordCount();
tfIdf.TFIDFComputation();
}
public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("Word Count calculation");
job.setMapperClass(WordFrequencyMapper.class);
job.setReducerClass(WordFrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path("ouput"));
job.waitForCompletion(true);
}
public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("Word Doc count calculation");
job.setMapperClass(WordCountDocMapper.class);
job.setReducerClass(WordCountDocReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("output"));
FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
job.waitForCompletion(true);
}
public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(TfIdf.class);
job.setJobName("TFIDF calculation");
job.setMapperClass(TFIDFMapper.class);
job.setReducerClass(TFIDFReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("output_job2"));
FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
job.waitForCompletion(true);
}
}
However, I get the error:
Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output
Could anyone help me out with this?
This answer is coming a little late, but... it's just a simple typo in your directory names. Your first job writes its output to the directory "ouput", while your second job looks for its input in "output". (The same mismatch exists between the second and third jobs: "ouput_job2" vs. "output_job2".)
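One way to avoid this class of bug is to define each intermediate path once as a constant and reuse it on both sides of the hand-off, along the lines of this sketch (the constant names are mine):
private static final String WORD_COUNT_OUTPUT = "output_job1";
private static final String DOC_COUNT_OUTPUT = "output_job2";
private static final String TFIDF_OUTPUT = "output_job3";
// in runWordCount():
FileOutputFormat.setOutputPath(job, new Path(WORD_COUNT_OUTPUT));
// in runDocWordCount():
FileInputFormat.setInputPaths(job, new Path(WORD_COUNT_OUTPUT));
FileOutputFormat.setOutputPath(job, new Path(DOC_COUNT_OUTPUT));
// in TFIDFComputation():
FileInputFormat.setInputPaths(job, new Path(DOC_COUNT_OUTPUT));
FileOutputFormat.setOutputPath(job, new Path(TFIDF_OUTPUT));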
I'm using Hadoop 0.20.203.0. I want to output to two different files, so I'm trying to get MultipleOutputs working.
Here's my configuration method:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: indycascade <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "indy cascade");
job.setJarByClass(IndyCascade.class);
job.setMapperClass(ICMapper.class);
job.setCombinerClass(ICReducer.class);
job.setReducerClass(ICReducer.class);
TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
However, this won't compile. The offending line is the MultipleOutputs.addNamedOutput(...) call, which produces a "cannot find symbol" error:
isaac/me/saac/i/IndyCascade.java:94: cannot find symbol
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
Of course, I tried using a JobConf instead of Configuration, as the API demands, but that leads to the same error. Additionally, JobConf is deprecated.
How do I get MultipleOutputs to work? Is that even the correct class to use?
You're mixing old and new API types:
You're using the old API org.apache.hadoop.mapred.lib.MultipleOutputs:
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
With the new API org.apache.hadoop.mapreduce.lib.output.TextOutputFormat:
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
Make the APIs consistent and you should be OK.
Edit: In fact, 0.20.203 doesn't have a port of MultipleOutputs for the new API, so you'll have to use the old API, find a new-API port online (Cloudera 0.20.2+320), or port it yourself.
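For example, staying entirely on the old org.apache.hadoop.mapred API would look roughly like this sketch (JobConf plus the mapred versions of TextOutputFormat and MultipleOutputs):
// Old-API classes: org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TextOutputFormat,
// org.apache.hadoop.mapred.lib.MultipleOutputs and org.apache.hadoop.mapred.JobClient
JobConf conf = new JobConf(IndyCascade.class);
// ... the rest of the old-API job setup goes here ...
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
JobClient.runJob(conf);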
Also, you should look at the ToolRunner class to execute your jobs; it removes the need to call GenericOptionsParser explicitly:
public static class Driver extends Configured implements Tool {
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Driver(), args));
}
public int run(String args[]) {
if (args.length != 2) {
System.err.println("Usage: indycascade <in> <out>");
System.exit(2);
}
Job job = new Job(getConf());
Configuration conf = job.getConfiguration();
// insert other job set up here
return job.waitForCompletion(true) ? 0 : 1;
}
}
A final point: any reference to conf after you create the Job instance refers to the original conf. Job makes a deep copy of the conf object, so calling MultipleOutputs.addNamedOutput(conf, ...) will not have the desired effect; use MultipleOutputs.addNamedOutput(job.getConfiguration(), ...) instead. See my example code above for the correct way to do this.