java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init> - java

I have this code and it works on my local machine.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = Job.getInstance(new Configuration(), "ToParquet");
    job.setJarByClass(ToParquet.class);
    job.setReducerClass(MyReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // Mappers are assigned per input path by MultipleInputs below.
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.basics.tsv"), TextInputFormat.class, BasicsMapper.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///title.ratings.tsv"), TextInputFormat.class, RatingsMapper.class);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(GenericRecord.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, getSchema());
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///to_parquet_output"));
    job.waitForCompletion(true);
}
However, when I try to run it on the cluster (HDFS environment), the following error message shows up.
I don't know what's going on. If someone can help me, I'd appreciate it.
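For reference, the getSchema() helper called above isn't shown in the post. A minimal sketch, assuming a hypothetical joined record with a title ID, a title, and a rating (the real field names and types would have to match what the reducer actually emits), could look like this:

import org.apache.avro.Schema;

// Hypothetical Avro schema for the joined basics/ratings record; the field
// names below are placeholders, not the poster's actual schema.
private static Schema getSchema() {
    return new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"TitleRating\",\"fields\":["
        + "{\"name\":\"tconst\",\"type\":\"string\"},"
        + "{\"name\":\"primaryTitle\",\"type\":\"string\"},"
        + "{\"name\":\"averageRating\",\"type\":\"double\"}]}");
}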

Related

Reading parquet files in hadoop mapreduce

I'm new to Hadoop and I need to read a Parquet file at the map stage of a MapReduce process. I found the following snippet of code at Cloudera:
public static class MyMap extends Mapper<LongWritable, Group, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Group value, Context context) throws IOException, InterruptedException {
        NullWritable outKey = NullWritable.get();
        String outputRecord = "";
        // Get the schema and field values of the record
        String inputRecord = value.toString();
        // Process the value, create an output record
        // ...
        context.write(outKey, new Text(outputRecord));
    }
}
Job configuration:
public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    job.setJobName(getClass().getName());
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MyMap.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(ExampleInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
}
The question is: can I use my own type for the key and value, and how would I implement it? I mean some sort of POJO that represents one record from the Parquet file.
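For illustration only, one way to approach that is to pull the individual fields out of the Group inside map() and populate your own class. A minimal sketch, assuming hypothetical columns name and age in the Parquet schema:

// Hypothetical POJO; the fields are placeholders for whatever your Parquet schema contains.
public static class PersonRecord {
    String name;
    int age;
}

public static class MyMap extends Mapper<LongWritable, Group, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Group value, Context context)
            throws IOException, InterruptedException {
        PersonRecord record = new PersonRecord();
        // Group exposes typed getters addressed by column name and repetition index.
        record.name = value.getString("name", 0);
        record.age = value.getInteger("age", 0);
        context.write(NullWritable.get(), new Text(record.name + "\t" + record.age));
    }
}

If you want a real class per record instead of hand-written mapping, AvroParquetInputFormat (or the Thrift/Protobuf equivalents) can deserialize each row into an Avro record for you, which is closer to the POJO-per-record idea.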

How to train an Italian language model in OpenNLP on Hadoop?

I would like to implement a natural language processing algorithm on Hadoop for the Italian language.
I have two questions:
How can I find a stemming algorithm for Italian?
How do I integrate it into Hadoop?
Here is my code:
String pathSent = ...tagged sentences...;
String pathChunk = ...chunked train path...;
File fileSent = new File(pathSent);
File fileChunk = new File(pathChunk);
InputStream inSent = null;
InputStream inChunk = null;
inSent = new FileInputStream(fileSent);
inChunk = new FileInputStream(fileChunk);

// Train the Italian POS tagger model from the tagged sentences
POSModel posModel = POSTaggerME.train("it", new WordTagSampleStream(
        new InputStreamReader(inSent)), ModelType.MAXENT, null, null, 3, 3);

// Train the Italian chunker model from the chunked training data
ObjectStream stringStream = new PlainTextByLineStream(new InputStreamReader(inChunk));
ObjectStream chunkStream = new ChunkSampleStream(stringStream);
ChunkerModel chunkModel = ChunkerME.train("it", chunkStream, 1, 1);

this.tagger = new POSTaggerME(posModel);
this.chunker = new ChunkerME(chunkModel);
inSent.close();
inChunk.close();
You need a grammatical sentence engine (a POS tagger) for Italian:
"io voglio andare a casa" ("I want to go home")
io, sostantivo (noun)
volere, verbo (verb)
andare, verbo (verb)
a, preposizione semplice (simple preposition)
casa, oggetto (object)
Once you have the sentences tagged, you can train OpenNLP on them.
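For illustration, OpenNLP's WordTagSampleStream (used in the training code above) expects one sentence per line, with each token in word_TAG form. A minimal sketch feeding it a tiny in-memory sample, assuming the same OpenNLP 1.5-era API as the question and with invented tag names (use whatever Italian tagset your corpus actually defines):

import java.io.IOException;
import java.io.StringReader;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.ObjectStream;

public class TaggedFormatDemo {
    public static void main(String[] args) throws IOException {
        // One sentence per line, tokens in word_TAG form; the tags below are
        // placeholders, not a real Italian tagset.
        String tagged = "io_PRON volere_VERB andare_VERB a_PREP casa_NOUN";
        ObjectStream<POSSample> samples = new WordTagSampleStream(new StringReader(tagged));
        POSSample sample = samples.read();
        System.out.println(sample);
    }
}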
On Hadoop, create a custom Mapper:
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // your code here
    }
}
On Hadoop, create a custom Reducer:
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // your reduce here
    }
}
Configure both:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "opennlp");
    job.setJarByClass(CustomOpenNLP.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

Getting the Tool Interface warning even though it is implemented

I have a very simple "Hello world" style map/reduce job.
public class Tester extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(getClass());
        getConf().set("mapreduce.job.queuename", "adhoc");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapperClass(TesterMapper.class);
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Tester(), args);
        System.exit(exitCode);
    }
}
It implements the Tool interface and is launched via ToolRunner, but the generic options are not being parsed when it runs:
$hadoop jar target/manifold-mapreduce-0.1.0.jar ga.manifold.mapreduce.Tester -conf conf.xml etl/manifold/pipeline/ABV1T/ingest/input etl/manifold/pipeline/ABV1T/ingest/output
15/02/04 16:35:24 INFO client.RMProxy: Connecting to ResourceManager at lxjh116-pvt.phibred.com/10.56.100.23:8050
15/02/04 16:35:25 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
I can verify that the configuration from conf.xml is not being applied.
Anyone know why Hadoop thinks the ToolRunner isn't implemented?
$hadoop version
Hadoop 2.4.0.2.1.2.0-402
Hortonworks
Thanks,
Chris
Since your question pops up near the top of a Google search for this warning, I'll give a proper answer here.
As you (user1797538) said yourself (sorry about that):
user1797538: "The problem was the call to get a Job instance"
The superclass Configured must be used for a reason. As its name suggests, it is already configured, so the Tester class must use the existing Configuration rather than set up a new, empty one.
If we extract the Job creation into a method:
private Job createJob() throws IOException {
    // On this line, use getConf() instead of new Configuration()
    Job job = Job.getInstance(getConf(), Tester.class.getCanonicalName());
    // Other job setter calls here, for example:
    job.setJarByClass(Tester.class);
    job.setMapperClass(TesterMapper.class);
    job.setCombinerClass(TesterReducer.class);
    job.setReducerClass(TesterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // adapt this to your needs, of course
    return job;
}
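With that helper in place, run() only needs to wire up the paths and launch the job; a minimal sketch reusing the arguments from the question:

@Override
public int run(String[] args) throws Exception {
    // createJob() builds on getConf(), so -conf/-D generic options are honored.
    Job job = createJob();
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}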
Another example from the Javadoc: org.apache.hadoop.util.Tool
And the Javadoc for Configured.getConf()

How to create a chain of Hadoop job without using OOzie

I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.
I have written the following code to achieve it:
public class TfIdf {

    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException {
        TfIdf tfIdf = new TfIdf();
        tfIdf.runWordCount();
        tfIdf.runDocWordCount();
        tfIdf.TFIDFComputation();
    }

    public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Count calculation");
        job.setMapperClass(WordFrequencyMapper.class);
        job.setReducerClass(WordFrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("ouput"));
        job.waitForCompletion(true);
    }

    public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Doc count calculation");
        job.setMapperClass(WordCountDocMapper.class);
        job.setReducerClass(WordCountDocReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
        job.waitForCompletion(true);
    }

    public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("TFIDF calculation");
        job.setMapperClass(TFIDFMapper.class);
        job.setReducerClass(TFIDFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output_job2"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
        job.waitForCompletion(true);
    }
}
However I get the error:
Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output
Could anyone help me out with this?
This answer is coming a little late, but... it's just a simple typo in your directory names. Your first job writes its output to the directory "ouput", while your second job looks for its input in "output".
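One way to make this class of mistake harder when chaining jobs by hand is to keep the intermediate directories in shared constants, so the job that writes a directory and the job that reads it can't drift apart. A minimal sketch (the constant names are illustrative):

// Hypothetical shared path constants: each job's output is the next job's input.
private static final String WORD_COUNT_OUT = "output_job1";
private static final String DOC_COUNT_OUT = "output_job2";

public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException {
    Job job = new Job();
    job.setJarByClass(TfIdf.class);
    job.setJobName("Word Count calculation");
    // ... mapper/reducer/key/value setup as above ...
    FileInputFormat.setInputPaths(job, new Path("input"));
    FileOutputFormat.setOutputPath(job, new Path(WORD_COUNT_OUT));
    // Fail fast so a broken stage doesn't feed a missing directory to the next job.
    if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("Word count job failed");
    }
}

public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException {
    Job job = new Job();
    // ... same setup as above ...
    FileInputFormat.setInputPaths(job, new Path(WORD_COUNT_OUT)); // reads exactly what job 1 wrote
    FileOutputFormat.setOutputPath(job, new Path(DOC_COUNT_OUT));
    if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("Doc word count job failed");
    }
}

Checking the return value of waitForCompletion before starting the next job also keeps a failed stage from silently feeding an empty or missing directory into the one after it.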

Hadoop MultipleOutputs.addNamedOutput throws "cannot find symbol"

I'm using Hadoop 0.20.203.0. I want to output to two different files, so I'm trying to get MultipleOutputs working.
Here's my configuration method:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: indycascade <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "indy cascade");
    job.setJarByClass(IndyCascade.class);
    job.setMapperClass(ICMapper.class);
    job.setCombinerClass(ICReducer.class);
    job.setReducerClass(ICReducer.class);
    TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
    TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
However, this won't compile. The offending line is MultipleOutputs.addNamedOutput(...), which throws a "cannot find symbol" error.
isaac/me/saac/i/IndyCascade.java:94: cannot find symbol
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
Of course, I tried using a JobConf instead of Configuration, as the API demands, but that leads to the same error. Additionally, JobConf is deprecated.
How do I get MultipleOutputs to work? Is that even the correct class to use?
You're mixing old and new API types:
You're using the old API org.apache.hadoop.mapred.lib.MultipleOutputs:
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
With the new API org.apache.hadoop.mapreduce.lib.output.TextOutputFormat:
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
Make the APIs consistent and you should be OK.
Edit: In fact, 0.20.203 doesn't have a port of MultipleOutputs for the new API, so you'll have to use the old API, find a new-API port online (e.g. Cloudera 0.20.2+320), or port it yourself.
Also, you should look at the ToolRunner class to execute your jobs; it removes the need to explicitly call GenericOptionsParser:
public static class Driver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Driver(), args));
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: indycascade <in> <out>");
            System.exit(2);
        }
        Job job = new Job(getConf());
        Configuration conf = job.getConfiguration();
        // insert other job set up here
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Final point: any reference to conf after you create the Job instance still points to the original conf. Job makes a deep copy of the conf object, so calling MultipleOutputs.addNamedOutput(conf, ...) will not have the desired effect; use MultipleOutputs.addNamedOutput(job.getConfiguration(), ...) instead. See my example code above for the correct way to do this.
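For reference, on Hadoop releases that do include the new-API class org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, the named output is registered against the Job itself rather than a Configuration. A minimal sketch of that variant:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Registered on the Job; inside the mapper/reducer you then create
// new MultipleOutputs<>(context) in setup() and call mos.write("sql", key, value).
MultipleOutputs.addNamedOutput(job, "sql", TextOutputFormat.class,
        LongWritable.class, Text.class);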
