I need to convert a text file to an Avro file in Hadoop HDFS using MapReduce.
I have already placed the text file in HDFS.
I don't know how to implement this in MapReduce.
This is an example demonstrating how to convert a text file to Avro.
(Our input file is a movie database; it contains the following information.)
serial number :: movie name (year)::tag1|tag2
Example records:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
The result schema is:
{
"name":"movies",
"type":"record",
"fields":[
{"name":"movieName",
"type":"string"
},
{"name":"year",
"type":"string"
},
{"name":"tags",
"type":{"type":"array","items":"string"}
}
]
}
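For example, with this schema the first input record above ("1::Toy Story (1995)::Animation|Children's|Comedy") corresponds roughly to the Avro record {"movieName":"Toy Story","year":"1995","tags":["Animation","Children's","Comedy"]}.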
CODE EXAMPLE:
The map function extracts the different fields from the input record and constructs a generic record. The reduce function simply writes its key to the output; no processing is done in the reducer.
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MRTextToAvro extends Configured implements Tool{
public static void main(String[] args) throws Exception {
int exitCode=ToolRunner.run(new MRTextToAvro(),args );
System.out.println("Exit code "+exitCode);
}
public int run(String[] arg0) throws Exception {
Job job= Job.getInstance(getConf(),"Text To Avro");
job.setJarByClass(getClass());
FileInputFormat.setInputPaths(job, new Path(arg0[0]));
FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
Schema.Parser parser = new Schema.Parser();
Schema schema=parser.parse(getClass().getResourceAsStream("movies.avsc"));
job.getConfiguration().setBoolean(
Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
AvroJob.setMapOutputKeySchema(job,schema);
job.setMapOutputValueClass( NullWritable.class);
AvroJob.setOutputKeySchema(job, schema);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyOutputFormat.class);
job.setMapperClass(TextToAvroMapper.class);
job.setReducerClass(TextToAvroReduce.class);
return job.waitForCompletion(true)?0:1;
}
public static class TextToAvroMapper extends Mapper<LongWritable ,Text,AvroKey<GenericRecord>,NullWritable>{
Schema schema;
protected void setup(Context context) throws IOException, InterruptedException {
super.setup(context);
Schema.Parser parser = new Schema.Parser();
schema=parser.parse(getClass().getResourceAsStream("movies.avsc"));
}
public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
GenericRecord record=new GenericData.Record(schema);
String inputRecord=value.toString();
record.put("movieName", getMovieName(inputRecord));
record.put("year", getMovieRelaseYear(inputRecord));
record.put("tags", getMovieTags(inputRecord));
context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
}
public String getMovieName(String record){
String movieName=record.split("::")[1];
return movieName.substring(0, movieName.lastIndexOf('(' )).trim();
}
public String getMovieReleaseYear(String record){
String movieName=record.split("::")[1];
return movieName.substring( movieName.lastIndexOf( '(' )+1,movieName.lastIndexOf( ')' )).trim();
}
public String[] getMovieTags(String record){
return (record.split("::")[2]).split("\\|");
}
}
public static class TextToAvroReduce extends Reducer<AvroKey<GenericRecord>,NullWritable,AvroKey<GenericRecord>,NullWritable>{
@Override
public void reduce(AvroKey<GenericRecord> key,Iterable<NullWritable> value,Context context) throws IOException, InterruptedException{
context.write(key, NullWritable.get());
}
}
}
To run the job, you need to put the Avro-related libraries on the classpath (the movies.avsc schema shown above also needs to be packaged in the jar, since it is loaded with getResourceAsStream):
export HADOOP_CLASSPATH=/path/to/targets/avro-mapred-1.7.4-hadoop2.jar
yarn jar HadoopAvro-1.0-SNAPSHOT.jar MRTextToAvro -libjars avro-mapred-1.7.4-hadoop2.jar /input/path output/path
I'm getting the following error when I try to deserialize a previously serialized file:
Exception in thread "main" java.lang.ClassCastException:
com.ssgg.bioinfo.effect.Sample$Zygosity cannot be cast to
com.ssgg.bioinfo.effect.Sample$.Zygosity
at com.ssgg.ZygosityTest.deserializeZygosityToAvroStructure(ZygosityTest.java:45)
at com.ssgg.ZygosityTest.main(ZygosityTest.java:30)
In order to reproduce the error, the main class is as follows:
public class ZygosityTest {
public static void main(String[] args) throws JsonParseException, JsonMappingException, IOException {
String filepath = "/home/XXXX/zygosity.avro";
/* Populate Zygosity*/
com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity = com.ssgg.bioinfo.effect.Sample$.Zygosity.HET;
/* Create file serialized */
createZygositySerialized(zygosity, filepath);
/* Deserialize the file */
com.ssgg.bioinfo.effect.Sample$.Zygosity avroZygosityOutput = deserializeZygosityToAvroStructure(filepath);
}
private static com.ssgg.bioinfo.effect.Sample$.Zygosity deserializeZygosityToAvroStructure(String filepath)
throws IOException {
com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity = null;
File myFile = new File(filepath);
DatumReader<com.ssgg.bioinfo.effect.Sample$.Zygosity> reader = new SpecificDatumReader<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
com.ssgg.bioinfo.effect.Sample$.Zygosity.class);
DataFileReader<com.ssgg.bioinfo.effect.Sample$.Zygosity> dataFileReader = new DataFileReader<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
myFile, reader);
while (dataFileReader.hasNext()) {
zygosity = dataFileReader.next(zygosity);
}
dataFileReader.close();
return zygosity;
}
private static void createZygositySerialized(com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity, String filepath)
throws IOException {
DatumWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity> datumWriter = new SpecificDatumWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
com.ssgg.bioinfo.effect.Sample$.Zygosity.class);
DataFileWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity> fileWriter = new DataFileWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
datumWriter);
Schema schema = com.ssgg.bioinfo.effect.Sample$.Zygosity.getClassSchema();
fileWriter.create(schema, new File(filepath));
fileWriter.append(zygosity);
fileWriter.close();
}
}
The avro generated enum for Zygosity is as follows:
/**
* Autogenerated by Avro
*
* DO NOT EDIT DIRECTLY
*/
package com.ssgg.bioinfo.effect.Sample$;
@SuppressWarnings("all")
@org.apache.avro.specific.AvroGenerated
public enum Zygosity {
HOM_REF, HET, HOM_VAR, HEMI, UNK ;
public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"enum\",\"name\":\"Zygosity\",\"namespace\":\"com.ssgg.bioinfo.effect.Sample$\",\"symbols\":[\"HOM_REF\",\"HET\",\"HOM_VAR\",\"HEMI\",\"UNK\"]}");
public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
}
I'm new to Avro; can somebody please help me find the problem?
In my project I try to serialize and deserialize a bigger structure, but I have problems with the enums, so I isolated a smaller problem here.
If you need more info I can post it.
Thanks.
I believe the main issue here is that $ has a special meaning in Java class names; a secondary point is that package names are conventionally lowercase.
So you should at least edit the namespaces to remove the $.
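For example, the enum's schema could declare a namespace without the $ (and lowercased), something like {"type":"enum","name":"Zygosity","namespace":"com.ssgg.bioinfo.effect.sample","symbols":["HOM_REF","HET","HOM_VAR","HEMI","UNK"]}, so that the regenerated class becomes com.ssgg.bioinfo.effect.sample.Zygosity and the Sample$Zygosity / Sample$.Zygosity mismatch should go away.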
I'm facing a problem parsing CSV in an Apache Beam pipeline project.
I used line.split(",") to get an array of strings, but some CSV fields contain conversation text that itself includes the "," character, "|", etc.
Here are snippets of my code:
public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {
private final Logger log = LoggerFactory.getLogger(ParseCsv.class);
@ProcessElement
public void processElement(ProcessContext c) {
String startConversationMessage = c.element();
JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
c.output(new PubsubMessage(conversation.toString().getBytes(),null ));
}
}
I am using TextIO.read() to read the CSV from Google Cloud Storage:
public class CsvToPubsub {
public interface Options extends PipelineOptions {
#Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
#Required
ValueProvider<String> getInputFilePattern();
void setInputFilePattern(ValueProvider<String> value);
#Description("The name of the topic which data should be published to. "
+ "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
@Required
ValueProvider<String> getOutputTopic();
void setOutputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
PipelineUtils pipelineUtils = new PipelineUtils();
Options options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(Options.class);
run(options,configurationLoader,pipelineUtils);
}
public static PipelineResult run(Options options,ConfigurationLoader configurationLoader,PipelineUtils pipelineUtils) {
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
.apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
.apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
.apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
.apply("Publish conveIorsations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));
return pipeline.run();
}
}
Is there any CSV library that supports TextIO output?
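One possible direction (a sketch only, assuming the Apache Commons CSV dependency is on the classpath; the helper class name here is made up) is to let a CSV parser handle quoted fields instead of calling split(","):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
// Hypothetical helper: parses one line delivered by TextIO.read(), honouring quoted commas.
public class CsvLineParser {
    public static List<String> parseLine(String line) throws IOException {
        List<String> fields = new ArrayList<>();
        // CSVFormat.DEFAULT keeps commas that appear inside double-quoted fields.
        try (CSVParser parser = CSVParser.parse(line, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {
                for (String field : record) {
                    fields.add(field);
                }
            }
        }
        return fields;
    }
}
A helper like this could be called from processElement in place of line.split(","). Note that TextIO.read() emits one element per line, so this only works if no CSV field contains embedded line breaks.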
I was trying to do a simple sort example with TotalOrderPartitioner. The input is a sequence file with IntWritable as key and NullWritable as value. I want to sort based on the key. The output is a sequence file with IntWritable as key and NullWritable as value. I'm running this job in a clustered environment. This is my driver class:
public class SortDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf);
job.setJobName("SORT-WITH-TOTAL-ORDER-PARTITIONER");
job.setJarByClass(SortDriver.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
SequenceFileInputFormat.setInputPaths(job, new Path("/user/client/seq-input"));
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(conf, new Path("/user/client/partition.lst"));
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
SequenceFileOutputFormat.setOutputPath(job, new Path("/user/client/sorted-output"));
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(3);
InputSampler.Sampler<IntWritable, NullWritable> sampler = new InputSampler.RandomSampler<>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);
boolean res = job.waitForCompletion(true);
return res ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Configuration(), new SortDriver(), args));
}
}
Mapper class:
public class SortMapper extends Mapper<IntWritable, NullWritable, IntWritable, NullWritable>{
@Override
protected void map(IntWritable key, NullWritable value, Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
Reducer class:
public class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
@Override
protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
context.write(key, NullWritable.get());
}
}
When I run this job I get:
Error: java.lang.IllegalArgumentException: Can't read partitions file
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:678)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.FileNotFoundException: File file:/grid/hadoop/yarn/local/usercache/client/appcache/application_1406784047304_0002/container_1406784047304_0002_01_000003/_partition.lst does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1749)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:301)
at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:88)
... 10 more
I found the partition file in my home directory (/user/client) with the name _partition.lst. The partition file name does not match the one in the code: TotalOrderPartitioner.setPartitionFile(conf, new Path("/user/client/partition.lst"));. Can anyone help me with this problem? I'm using Hadoop 2.4 in the HDP 2.1 distribution.
I think the problem is in the line:
TotalOrderPartitioner.setPartitionFile(conf, new Path("/user/client/partition.lst"));
You have to replace it with:
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/user/client/partition.lst"));
since you are using
InputSampler.writePartitionFile(job, sampler);
Otherwise, try replacing the last line only with:
InputSampler.writePartitionFile(conf, sampler);
But I am not sure if it works like that in the new API.
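In the new API the relevant detail is that Job.getInstance(conf) copies the configuration, so anything set on the local conf afterwards is not seen by the submitted job. A rough sketch of the affected driver lines (everything else as in the question) would be:
Job job = Job.getInstance(conf);
// ... input/output, mapper, reducer and partitioner setup as before ...
// Register the partition file on the job's own configuration so the tasks can find it.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/user/client/partition.lst"));
InputSampler.Sampler<IntWritable, NullWritable> sampler = new InputSampler.RandomSampler<>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);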
Hope it helps! Good luck!
I also got this error when I was using Hadoop MapReduce and the MapReduce service had not been installed and started. After installing and starting MapReduce, the exception disappeared.
I got this error when I had job.setNumReduceTasks(3); and was running my code in standalone mode.
Changing it to job.setNumReduceTasks(1); made it work fine in standalone mode.
I am trying to tweak an existing problem to suit my needs.
Basically, the input is simple text.
I process it and pass a key/value pair to the reducer.
There I create a JSON, so there is a key but no value.
So the mapper:
Input: Text/Text
Output: Text/Text
Reducer: Text/Text
Output: Text/None
My signatures are as follows:
public class AdvanceCounter {
/**
* The map class of WordCount.
*/
public static class TokenCounterMapper
extends Mapper<Object, Text, Text, Text> { // <--- See this signature
public void map(Object key, Text value, Context context) // <--- See this signature
throws IOException, InterruptedException {
context.write(key,value); //both are of type text OUTPUT TO REDUCER
}
}
public static class TokenCounterReducer
extends Reducer<Text, Text, Text, NullWritable> { // <--- See this signature: NullWritable here
public void reduce(Text key, Iterable<Text> values, Context context) // <--- See this signature
throws IOException, InterruptedException {
for (Text value : values) {
JSONObject jsn = new JSONObject();
//String output = "";
String[] vals = value.toString().split("\t");
String[] targetNodes = vals[0].toString().split(",",-1);
try {
jsn.put("source",vals[1]);
jsn.put("targets",targetNodes);
context.write(new Text(jsn.toString()),null); // no value
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
// ...
//
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
But on execution I am getting this error:
13/06/04 13:08:26 INFO mapred.JobClient: Task Id : attempt_201305241622_0053_m_000008_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.sogou.Stinger$TokenCounterMapper.map(Stinger.java:72)
at org.sogou.Stinger$TokenCounterMapper.map(Stinger.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
You haven't specified your map output types, so Hadoop assumes they are the same as the ones you set for your reducer output, Text and NullWritable, which is incorrect for your mapper. To avoid any confusion, it's better to specify all your types explicitly for both the mapper and the reducer:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
My MapReduce structure:
public class ChainingMapReduce {
public static class ChainingMapReduceMapper
extends Mapper<Object, Text, Text, IntWritable>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
// code
}
}
public static class ChainingMapReduceReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
//code
}
}
public static class ChainingMapReduceMapper1
extends Mapper<Object, Text, Text, IntWritable>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
//code
}
}
public static class ChainingMapReduceReducer1
extends Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
//code
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "First");
job.setJarByClass(ChainingMapReduce.class);
job.setMapperClass(ChainingMapReduceMapper.class);
job.setCombinerClass(ChainingMapReduceReducer.class);
job.setReducerClass(ChainingMapReduceReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/home/Desktop/log"));
FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output"));
job.waitForCompletion( true );
System.out.println("First Job Completed.....Starting Second Job");
System.out.println(job.isSuccessful());
/* FileSystem hdfs = FileSystem.get(conf);
Path fromPath = new Path("/home/Desktop/temp/output/part-r-00000");
Path toPath = new Path("/home/Desktop/temp/output1");
hdfs.rename(fromPath, toPath);
conf.clear();
*/
if(job.isSuccessful()){
Configuration conf1 = new Configuration();
Job job1 = new Job(conf1,"Second");
job1.setJarByClass(ChainingMapReduce.class);
job1.setMapperClass(ChainingMapReduceMapper1.class);
job1.setCombinerClass(ChainingMapReduceReducer1.class);
job1.setReducerClass(ChainingMapReduceReducer1.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000)");
FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I run this program, the first job executes perfectly, and after that the following error comes:
First Job Completed.....Starting Second Job
true
12/01/27 15:24:21 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/01/27 15:24:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/01/27 15:24:21 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/01/27 15:24:21 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop/mapred/staging/4991311720439552/.staging/job_local_0002
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
at ChainingMapReduce.main(ChainingMapReduce.java:129)
I tried using "conf" for both jobs, and also "conf" and "conf1" for the respective jobs.
Change
FileInputFormat.addInputPath(job, new Path("/home/Desktop/temp/output/part-r-00000)");
FileOutputFormat.setOutputPath(job, new Path("/home/Desktop/temp/output1"));
to
FileInputFormat.addInputPath(job1, new Path("/home/Desktop/temp/output/part-r-00000)");
FileOutputFormat.setOutputPath(job1, new Path("/home/Desktop/temp/output1"));
for the second job.
Also consider using o.a.h.mapred.jobcontrol.Job and Apache Oozie.
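For instance, with the new-API job-control classes the chain could look roughly like this (a sketch only; job and job1 are assumed to be configured exactly as in the question, with job1 reading job's output):
// Requires org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob and JobControl.
ControlledJob first = new ControlledJob(job.getConfiguration());
first.setJob(job);
ControlledJob second = new ControlledJob(job1.getConfiguration());
second.setJob(job1);
second.addDependingJob(first); // second starts only after first succeeds
JobControl control = new JobControl("chaining-example");
control.addJob(first);
control.addJob(second);
new Thread(control).start(); // JobControl implements Runnable
while (!control.allFinished()) {
    Thread.sleep(500);
}
control.stop();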