Parse CSV in an Apache Beam pipeline to post it to Google Pub/Sub - Java

I'm facing a problem parsing CSV in an Apache Beam pipeline project.
I used line.split(",") to get an array of strings, but some CSV fields contain conversation text that includes the "," character, "|", etc.
Here's snippets of my code:
public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {

    private final Logger log = LoggerFactory.getLogger(ParseCsv.class);

    @ProcessElement
    public void processElement(ProcessContext c) {
        String startConversationMessage = c.element();
        JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
        c.output(new PubsubMessage(conversation.toString().getBytes(), null));
    }
}
I am using TextIO.read() to read the CSV from Google Cloud Storage:
public class CsvToPubsub {

    public interface Options extends PipelineOptions {
        @Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
        @Required
        ValueProvider<String> getInputFilePattern();

        void setInputFilePattern(ValueProvider<String> value);

        @Description("The name of the topic which data should be published to. "
                + "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
        @Required
        ValueProvider<String> getOutputTopic();

        void setOutputTopic(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
        PipelineUtils pipelineUtils = new PipelineUtils();
        Options options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(Options.class);
        run(options, configurationLoader, pipelineUtils);
    }

    public static PipelineResult run(Options options, ConfigurationLoader configurationLoader, PipelineUtils pipelineUtils) {
        Pipeline pipeline = Pipeline.create(options);
        pipeline
                .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
                .apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
                .apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
                .apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
                .apply("Publish conversations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));
        return pipeline.run();
    }
}
Is there any CSV library that supports TextIO output?
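For reference, a dedicated CSV parser copes with quoted fields that contain delimiters. Below is a minimal sketch, not taken from the original pipeline, that uses Apache Commons CSV to split one line; the helper class and the choice of library are my own assumptions:
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parse a single CSV line so that quoted fields
// containing ',' or '|' stay intact instead of being split apart.
public class CsvLineParser {

    public static List<String> parseLine(String line) throws IOException {
        try (CSVParser parser = CSVParser.parse(line, CSVFormat.DEFAULT)) {
            List<CSVRecord> records = parser.getRecords();
            List<String> fields = new ArrayList<>();
            if (!records.isEmpty()) {
                records.get(0).forEach(fields::add);
            }
            return fields;
        }
    }
}
One caveat: TextIO.read() emits one element per physical line, so a CSV record with embedded newlines inside quoted fields would still arrive split across elements and needs a different reading strategy.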

Related

Checkpoint with spark file streaming in java

I want to implement checkpointing in my Spark file-streaming application so that it processes all unprocessed files from Hadoop if the application stops or terminates for any reason. I am following the streaming programming guide, but I cannot find JavaStreamingContextFactory. Please help me; what should I do?
My code is:
public class StartAppWithCheckPoint {

    public static void main(String[] args) {
        try {
            String filePath = "hdfs://Master:9000/mmi_traffic/listenerTransaction/2020/*/*/*/";
            String checkpointDirectory = "hdfs://Mongo1:9000/probeAnalysis/checkpoint";
            SparkSession sparkSession = JavaSparkSessionSingleton.getInstance();

            JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
                @Override
                public JavaStreamingContext create() {
                    SparkConf sparkConf = new SparkConf().setAppName("ProbeAnalysis");
                    JavaSparkContext sc = new JavaSparkContext(sparkConf);
                    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(300));

                    JavaDStream<String> lines = jssc.textFileStream(filePath).cache();

                    jssc.checkpoint(checkpointDirectory);
                    return jssc;
                }
            };

            JavaStreamingContext context = JavaStreamingContext.getOrCreate(checkpointDirectory, contextFactory);
            context.start();
            context.awaitTermination();
            context.close();
            sparkSession.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You must use checkpointing.
For checkpointing, use stateful transformations, either updateStateByKey or reduceByKeyAndWindow. There are plenty of examples in the spark-examples module shipped with the prebuilt Spark distribution and in the Spark source on GitHub. For your specific case, see JavaStatefulNetworkWordCount.java.
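For illustration only, here is a minimal sketch of a stateful transformation with updateStateByKey; the key/value types and the running-count logic are assumptions, not taken from the question:
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import java.util.List;

public class RunningCount {

    // Keep a running count per key across batches; the per-key state is persisted
    // through the checkpoint directory set via jssc.checkpoint(...).
    public static JavaPairDStream<String, Integer> countByKey(JavaPairDStream<String, Integer> pairs) {
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
                (values, state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : values) {
                        sum += v;
                    }
                    return Optional.of(sum);
                };
        return pairs.updateStateByKey(updateFunction);
    }
}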

How to create GRPC client directly from protobuf without compiling it into java code

When working with gRPC, we need to generate the gRPC client and server interfaces from our .proto service definition via the protocol buffer compiler (protoc) or the Gradle/Maven protoc build plugin.
The flow now: protobuf file -> Java code -> gRPC client.
So, is there any way to skip this step?
How can I create a generic gRPC client that can call the server directly from the protobuf file without compiling it into Java code?
Or, is there a way to generate the code at runtime?
The flow I expect: protobuf file -> gRPC client.
I want to build a generic gRPC client system whose input is protobuf files along with a description of the method, package, request message ... without having to recompile for each protobuf.
Thank you very much.
Protobuf systems really need protoc to be run. However, the generated code could be skipped. Instead of passing something like --java_out and --grpc_java_out to protoc you can pass --descriptor_set_out=FILE which will parse the .proto file into a descriptor file. A descriptor file is a proto-encoded FileDescriptorSet. This is the same basic format as used with the reflection service.
Once you have a descriptor, you can load it, one FileDescriptor at a time, and create a DynamicMessage.
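As a rough sketch of that step (the file name and message type are placeholders, and a real .proto usually has dependencies that must be built first):
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;

import java.io.FileInputStream;

public class DescriptorLoading {

    public static DynamicMessage emptyMessage() throws Exception {
        // Descriptor set produced e.g. with: protoc --include_imports --descriptor_set_out=service.dsc service.proto
        DescriptorProtos.FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream("service.dsc")) {
            set = DescriptorProtos.FileDescriptorSet.parseFrom(in);
        }
        // Build the first file, assuming it has no dependencies
        // (otherwise build the dependencies first and pass them here)
        Descriptors.FileDescriptor fd =
                Descriptors.FileDescriptor.buildFrom(set.getFile(0), new Descriptors.FileDescriptor[0]);
        // Create a DynamicMessage for one of its message types
        Descriptors.Descriptor type = fd.findMessageTypeByName("HelloMessage");
        return DynamicMessage.newBuilder(type).build();
    }
}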
Then for the gRPC piece, you need to create a gRPC MethodDescriptor.
static MethodDescriptor<DynamicMessage, DynamicMessage> from(
    Descriptors.MethodDescriptor methodDesc
) {
  return MethodDescriptor.<DynamicMessage, DynamicMessage>newBuilder()
      // UNKNOWN is fine, but the "correct" value can be computed from
      // methodDesc.toProto().getClientStreaming()/getServerStreaming()
      .setType(getMethodTypeFromDesc(methodDesc))
      .setFullMethodName(MethodDescriptor.generateFullMethodName(
          methodDesc.getService().getFullName(), methodDesc.getName()))
      .setRequestMarshaller(ProtoUtils.marshaller(
          DynamicMessage.getDefaultInstance(methodDesc.getInputType())))
      .setResponseMarshaller(ProtoUtils.marshaller(
          DynamicMessage.getDefaultInstance(methodDesc.getOutputType())))
      .build();
}

static MethodDescriptor.MethodType getMethodTypeFromDesc(
    Descriptors.MethodDescriptor methodDesc
) {
  if (!methodDesc.isServerStreaming()
      && !methodDesc.isClientStreaming()) {
    return MethodDescriptor.MethodType.UNARY;
  } else if (methodDesc.isServerStreaming()
      && !methodDesc.isClientStreaming()) {
    return MethodDescriptor.MethodType.SERVER_STREAMING;
  } else if (!methodDesc.isServerStreaming()) {
    return MethodDescriptor.MethodType.CLIENT_STREAMING;
  } else {
    return MethodDescriptor.MethodType.BIDI_STREAMING;
  }
}
At that point you have everything you need and can call Channel.newCall(method, CallOptions.DEFAULT) in gRPC. You're also free to use ClientCalls to use something more similar to the stub APIs.
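For example, with the pieces above in hand, a unary call could look roughly like this; a sketch only, where the Channel, the dynamic MethodDescriptor, and the request DynamicMessage are assumed to be built elsewhere:
import com.google.protobuf.DynamicMessage;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.MethodDescriptor;
import io.grpc.stub.ClientCalls;

public class DynamicUnaryCall {

    // Issue a unary call with the dynamically built MethodDescriptor
    public static DynamicMessage call(Channel channel,
                                      MethodDescriptor<DynamicMessage, DynamicMessage> method,
                                      DynamicMessage request) {
        return ClientCalls.blockingUnaryCall(channel, method, CallOptions.DEFAULT, request);
    }
}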
So dynamic calls are definitely possible, and they are used for things like grpcurl. But it is not easy, so it is generally only done when necessary.
I did it in Java, and the steps are:
Call the reflection service to get the FileDescriptorProto list by method name
Get the FileDescriptor of the method from the FileDescriptorProto list by package name and service name
Get the MethodDescriptor from the ServiceDescriptor, which is obtained from the FileDescriptor
Generate a MethodDescriptor<DynamicMessage, DynamicMessage> from the MethodDescriptor
Build the request DynamicMessage from content such as JSON
Call the method
Parse the DynamicMessage response into JSON
You can find the full sample in the project helloworlde/grpc-java-sample#reflection
And the proto is:
syntax = "proto3";

package io.github.helloworlde.grpc;

option go_package = "api;grpc_gateway";
option java_package = "io.github.helloworlde.grpc";
option java_multiple_files = true;
option java_outer_classname = "HelloWorldGrpc";

service HelloService {
    rpc SayHello(HelloMessage) returns (HelloResponse) {
    }
}

message HelloMessage {
    string message = 2;
}

message HelloResponse {
    string message = 1;
}
Start a server for this proto yourself; the full code in Java looks like this:
import com.google.protobuf.ByteString;
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.TypeRegistry;
import com.google.protobuf.util.JsonFormat;
import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.MethodDescriptor;
import io.grpc.protobuf.ProtoUtils;
import io.grpc.reflection.v1alpha.ServerReflectionGrpc;
import io.grpc.reflection.v1alpha.ServerReflectionRequest;
import io.grpc.reflection.v1alpha.ServerReflectionResponse;
import io.grpc.stub.ClientCalls;
import io.grpc.stub.StreamObserver;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
@Slf4j
public class ReflectionCall {

    public static void main(String[] args) throws InterruptedException {
        // The reflection lookup only supports the formats package.service.method or package.service
        String methodSymbol = "io.github.helloworlde.grpc.HelloService.SayHello";
        String requestContent = "{\"message\": \"Reflection\"}";

        // Build the Channel
        ManagedChannel channel = ManagedChannelBuilder.forAddress("127.0.0.1", 9090)
                .usePlaintext()
                .build();
        // Build the reflection stub using the Channel
        ServerReflectionGrpc.ServerReflectionStub reflectionStub = ServerReflectionGrpc.newStub(channel);

        // Response observer
        StreamObserver<ServerReflectionResponse> streamObserver = new StreamObserver<ServerReflectionResponse>() {
            @Override
            public void onNext(ServerReflectionResponse response) {
                try {
                    // Only file-descriptor responses need to be handled
                    if (response.getMessageResponseCase() == ServerReflectionResponse.MessageResponseCase.FILE_DESCRIPTOR_RESPONSE) {
                        List<ByteString> fileDescriptorProtoList = response.getFileDescriptorResponse().getFileDescriptorProtoList();
                        handleResponse(fileDescriptorProtoList, channel, methodSymbol, requestContent);
                    } else {
                        log.warn("Unknown response type: " + response.getMessageResponseCase());
                    }
                } catch (Exception e) {
                    log.error("Failed to handle response: {}", e.getMessage(), e);
                }
            }

            @Override
            public void onError(Throwable t) {
            }

            @Override
            public void onCompleted() {
                log.info("Complete");
            }
        };

        // Request observer
        StreamObserver<ServerReflectionRequest> requestStreamObserver = reflectionStub.serverReflectionInfo(streamObserver);

        // Build and send the request for the file descriptor containing the method
        ServerReflectionRequest getFileContainingSymbolRequest = ServerReflectionRequest.newBuilder()
                .setFileContainingSymbol(methodSymbol)
                .build();
        requestStreamObserver.onNext(getFileContainingSymbolRequest);

        channel.awaitTermination(10, TimeUnit.SECONDS);
    }
    /**
     * Handle the response
     */
    private static void handleResponse(List<ByteString> fileDescriptorProtoList,
                                       ManagedChannel channel,
                                       String methodFullName,
                                       String requestContent) {
        try {
            // Parse the method and service names
            String fullServiceName = extraPrefix(methodFullName);
            String methodName = extraSuffix(methodFullName);
            String packageName = extraPrefix(fullServiceName);
            String serviceName = extraSuffix(fullServiceName);
            // Resolve the FileDescriptor from the response
            Descriptors.FileDescriptor fileDescriptor = getFileDescriptor(fileDescriptorProtoList, packageName, serviceName);
            // Find the service descriptor
            Descriptors.ServiceDescriptor serviceDescriptor = fileDescriptor.getFile().findServiceByName(serviceName);
            // Find the method descriptor
            Descriptors.MethodDescriptor methodDescriptor = serviceDescriptor.findMethodByName(methodName);
            // Issue the call
            executeCall(channel, fileDescriptor, methodDescriptor, requestContent);
        } catch (Exception e) {
            log.error(e.getMessage(), e);
        }
    }
    /**
     * Parse and find the file descriptor for the method
     */
    private static Descriptors.FileDescriptor getFileDescriptor(List<ByteString> fileDescriptorProtoList,
                                                                String packageName,
                                                                String serviceName) throws Exception {

        Map<String, DescriptorProtos.FileDescriptorProto> fileDescriptorProtoMap =
                fileDescriptorProtoList.stream()
                        .map(bs -> {
                            try {
                                return DescriptorProtos.FileDescriptorProto.parseFrom(bs);
                            } catch (InvalidProtocolBufferException e) {
                                e.printStackTrace();
                            }
                            return null;
                        })
                        .filter(Objects::nonNull)
                        .collect(Collectors.toMap(DescriptorProtos.FileDescriptorProto::getName, f -> f));

        if (fileDescriptorProtoMap.isEmpty()) {
            log.error("Service does not exist");
            throw new IllegalArgumentException("No file descriptor exists for the method");
        }

        // Find the Proto descriptor for the service
        DescriptorProtos.FileDescriptorProto fileDescriptorProto = findServiceFileDescriptorProto(packageName, serviceName, fileDescriptorProtoMap);
        // Get this Proto's dependencies
        Descriptors.FileDescriptor[] dependencies = getDependencies(fileDescriptorProto, fileDescriptorProtoMap);
        // Build the FileDescriptor for the Proto
        return Descriptors.FileDescriptor.buildFrom(fileDescriptorProto, dependencies);
    }
    /**
     * Find the corresponding file descriptor by package name and service name
     */
    private static DescriptorProtos.FileDescriptorProto findServiceFileDescriptorProto(String packageName,
                                                                                       String serviceName,
                                                                                       Map<String, DescriptorProtos.FileDescriptorProto> fileDescriptorProtoMap) {
        for (DescriptorProtos.FileDescriptorProto proto : fileDescriptorProtoMap.values()) {
            if (proto.getPackage().equals(packageName)) {
                boolean exist = proto.getServiceList()
                        .stream()
                        .anyMatch(s -> serviceName.equals(s.getName()));
                if (exist) {
                    return proto;
                }
            }
        }

        throw new IllegalArgumentException("Service does not exist");
    }
    /**
     * Get the prefix
     */
    private static String extraPrefix(String content) {
        int index = content.lastIndexOf(".");
        return content.substring(0, index);
    }

    /**
     * Get the suffix
     */
    private static String extraSuffix(String content) {
        int index = content.lastIndexOf(".");
        return content.substring(index + 1);
    }
    /**
     * Get the dependency types
     */
    private static Descriptors.FileDescriptor[] getDependencies(DescriptorProtos.FileDescriptorProto proto,
                                                                Map<String, DescriptorProtos.FileDescriptorProto> finalDescriptorProtoMap) {
        return proto.getDependencyList()
                .stream()
                .map(finalDescriptorProtoMap::get)
                .map(f -> toFileDescriptor(f, getDependencies(f, finalDescriptorProtoMap)))
                .toArray(Descriptors.FileDescriptor[]::new);
    }

    /**
     * Convert a FileDescriptorProto into a FileDescriptor
     */
    @SneakyThrows
    private static Descriptors.FileDescriptor toFileDescriptor(DescriptorProtos.FileDescriptorProto fileDescriptorProto,
                                                               Descriptors.FileDescriptor[] dependencies) {
        return Descriptors.FileDescriptor.buildFrom(fileDescriptorProto, dependencies);
    }
    /**
     * Execute the method call
     */
    private static void executeCall(ManagedChannel channel,
                                    Descriptors.FileDescriptor fileDescriptor,
                                    Descriptors.MethodDescriptor originMethodDescriptor,
                                    String requestContent) throws Exception {

        // Regenerate the MethodDescriptor
        MethodDescriptor<DynamicMessage, DynamicMessage> methodDescriptor = generateMethodDescriptor(originMethodDescriptor);

        CallOptions callOptions = CallOptions.DEFAULT;

        TypeRegistry registry = TypeRegistry.newBuilder()
                .add(fileDescriptor.getMessageTypes())
                .build();

        // Convert the request content from a JSON string into the corresponding message type
        JsonFormat.Parser parser = JsonFormat.parser().usingTypeRegistry(registry);
        DynamicMessage.Builder messageBuilder = DynamicMessage.newBuilder(originMethodDescriptor.getInputType());
        parser.merge(requestContent, messageBuilder);
        DynamicMessage requestMessage = messageBuilder.build();

        // Make the call; the call type can be inferred from originMethodDescriptor.isClientStreaming() and originMethodDescriptor.isServerStreaming()
        DynamicMessage response = ClientCalls.blockingUnaryCall(channel, methodDescriptor, callOptions, requestMessage);

        // Parse the response into a JSON string
        JsonFormat.Printer printer = JsonFormat.printer()
                .usingTypeRegistry(registry)
                .includingDefaultValueFields();
        String responseContent = printer.print(response);
        log.info("Response: {}", responseContent);
    }
    /**
     * Regenerate the method descriptor
     */
    private static MethodDescriptor<DynamicMessage, DynamicMessage> generateMethodDescriptor(Descriptors.MethodDescriptor originMethodDescriptor) {
        // Generate the full method name
        String fullMethodName = MethodDescriptor.generateFullMethodName(originMethodDescriptor.getService().getFullName(), originMethodDescriptor.getName());
        // Request and response types
        MethodDescriptor.Marshaller<DynamicMessage> inputTypeMarshaller = ProtoUtils.marshaller(DynamicMessage.newBuilder(originMethodDescriptor.getInputType())
                .buildPartial());
        MethodDescriptor.Marshaller<DynamicMessage> outputTypeMarshaller = ProtoUtils.marshaller(DynamicMessage.newBuilder(originMethodDescriptor.getOutputType())
                .buildPartial());

        // Generate the method descriptor; the fullMethodName of originMethodDescriptor is not correct
        return MethodDescriptor.<DynamicMessage, DynamicMessage>newBuilder()
                .setFullMethodName(fullMethodName)
                .setRequestMarshaller(inputTypeMarshaller)
                .setResponseMarshaller(outputTypeMarshaller)
                // Use UNKNOWN; it is adjusted automatically
                .setType(MethodDescriptor.MethodType.UNKNOWN)
                .build();
    }
}
There isn't much to prevent this technically. The two big hurdles are:
having a runtime-callable parser for reading the .proto, and
having a general purpose gRPC client available that takes things like the service method name as literals
Both are possible, but neither is trivial.
For 1, the crude way would be to shell/invoke protoc using the descriptor-set option to generate a schema binary, then deserialize that as a FileDescriptorSet (from descriptor.proto); this model gives you access to how protoc sees the file. Some platforms also have native parsers (essentially reimplementing protoc as a library in that platform), for example protobuf-net.Reflection does this in .NET-land
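A sketch of that crude route in Java follows; the paths and flags are illustrative, and it assumes protoc is on the PATH:
import com.google.protobuf.DescriptorProtos;

import java.io.FileInputStream;

public class ProtocInvocation {

    public static DescriptorProtos.FileDescriptorSet compile(String protoFile) throws Exception {
        // Shell out to protoc and have it write a serialized FileDescriptorSet
        Process p = new ProcessBuilder(
                "protoc", "--include_imports", "--descriptor_set_out=schema.dsc", protoFile)
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("protoc failed");
        }
        // Deserialize the descriptor file (a FileDescriptorSet as defined in descriptor.proto)
        try (FileInputStream in = new FileInputStream("schema.dsc")) {
            return DescriptorProtos.FileDescriptorSet.parseFrom(in);
        }
    }
}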
For 2, here's an implementation of that in C#. The approach should be fairly portable to Java, even if the details vary. You can look at a generated implementation to see how it works in any particular language.
(Sorry that the specific examples are C#/.NET, but that's where I live; the approaches should be portable, even if the specific code: not directly)
Technically, both are possible.
The codegen simply generates a handful of classes: mainly protobuf messages, gRPC method descriptors, and stubs. You can implement them yourself or check in the generated code to bypass the codegen. I am not sure what the benefit of doing this is, to be honest. Also, it will be very annoying whenever the proto changes.
It is also possible to do it dynamically using bytecode generation, as long as you check in some interfaces/abstract classes to represent the generated stubs/method descriptors and protobuf messages. You have to make sure that non-dynamic code stays in sync with the proto definition, though (most likely via runtime checks/exceptions).

SnakeYAML Dump nested key

I am using SnakeYAML as my YAML parser for a project, and I don't know how to set keys that are nested. For instance, here is a YAML file with nested keys in it.
control:
  advertising:
    enabled: true
logging:
  chat: true
  commands: true
  format: '%id% %date% %username% | %value%'
My goal is to be able to easily set the path control.advertising.enabled or any other path to any value.
When I use
void Set(String key, Object value, String configName) {
    Yaml yaml = new Yaml();
    OutputStream oS;
    try {
        oS = new FileOutputStream(main.getDataFolder() + File.separator + configName);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return;
    }
    Map<String, Object> data = new HashMap<String, Object>();
    // set data based on original + modified
    data.put(key, value);
    String output = yaml.dump(data);
    try {
        oS.write(output.getBytes());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
to set the value, instead of getting
logging:
  chat: true
  commands: true
  format: '%id% %date% %username% | %value%'
the entire YAML file is cleared and I only get
{logging.chat: false}
Thank you!
Define your structure with Java classes:
public class Config {
    public static class Advertising {
        public boolean enabled;
    }

    public static class Control {
        public Advertising advertising;
    }

    public static class Logging {
        public boolean chat;
        public boolean commands;
        public String format;
    }

    Control control;
    Logging logging;
}
Then you can modify it like this:
Yaml yaml = new Yaml();
Config config = yaml.loadAs(inputStream, Config.class);
config.control.advertising.enabled = false;
String output = yaml.dump(config);
Note that loading & saving YAML data this way might mess with the order of mapping keys, because this order is not preserved. I assume that the output order will be according to the field order in the Java classes, but I'm not sure.
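If defining classes is not an option, the same load, modify, and dump round trip can be done over nested maps. Below is a rough, untested sketch of my own; the helper name and file handling are assumptions:
import org.yaml.snakeyaml.DumperOptions;
import org.yaml.snakeyaml.Yaml;

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.Map;

public class NestedYamlSetter {

    @SuppressWarnings("unchecked")
    public static void set(String file, String path, Object value) throws IOException {
        DumperOptions options = new DumperOptions();
        options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK); // keep block style instead of {k: v}
        Yaml yaml = new Yaml(options);

        Map<String, Object> root;
        try (InputStream in = new FileInputStream(file)) {
            root = yaml.load(in); // the whole document as nested maps
        }

        // Walk e.g. "control.advertising.enabled" down to the parent map of the last segment
        String[] keys = path.split("\\.");
        Map<String, Object> current = root;
        for (int i = 0; i < keys.length - 1; i++) {
            current = (Map<String, Object>) current.get(keys[i]);
        }
        current.put(keys[keys.length - 1], value);

        try (Writer out = new FileWriter(file)) {
            yaml.dump(root, out); // write the whole structure back
        }
    }
}
Dumping the whole loaded structure back, rather than a fresh map containing only the dotted key, is what keeps the rest of the file intact, and the BLOCK flow style avoids output like {logging.chat: false}.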

Avro: ClassCastException while serializing / deserializing a file that contains an Enum value

I'm getting the following error when I try to deserialize a previously serialized file:
Exception in thread "main" java.lang.ClassCastException:
com.ssgg.bioinfo.effect.Sample$Zygosity cannot be cast to
com.ssgg.bioinfo.effect.Sample$.Zygosity
at com.ssgg.ZygosityTest.deserializeZygosityToAvroStructure(ZygosityTest.java:45)
at com.ssgg.ZygosityTest.main(ZygosityTest.java:30)
In order to reproduce the error, the main class is as follows:
public class ZygosityTest {

    public static void main(String[] args) throws JsonParseException, JsonMappingException, IOException {
        String filepath = "/home/XXXX/zygosity.avro";

        /* Populate Zygosity */
        com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity = com.ssgg.bioinfo.effect.Sample$.Zygosity.HET;

        /* Create serialized file */
        createZygositySerialized(zygosity, filepath);

        /* Deserialize file */
        com.ssgg.bioinfo.effect.Sample$.Zygosity avroZygosityOutput = deserializeZygosityToAvroStructure(filepath);
    }

    private static com.ssgg.bioinfo.effect.Sample$.Zygosity deserializeZygosityToAvroStructure(String filepath)
            throws IOException {
        com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity = null;
        File myFile = new File(filepath);
        DatumReader<com.ssgg.bioinfo.effect.Sample$.Zygosity> reader = new SpecificDatumReader<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
                com.ssgg.bioinfo.effect.Sample$.Zygosity.class);
        DataFileReader<com.ssgg.bioinfo.effect.Sample$.Zygosity> dataFileReader = new DataFileReader<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
                myFile, reader);
        while (dataFileReader.hasNext()) {
            zygosity = dataFileReader.next(zygosity);
        }
        dataFileReader.close();
        return zygosity;
    }

    private static void createZygositySerialized(com.ssgg.bioinfo.effect.Sample$.Zygosity zygosity, String filepath)
            throws IOException {
        DatumWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity> datumWriter = new SpecificDatumWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
                com.ssgg.bioinfo.effect.Sample$.Zygosity.class);
        DataFileWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity> fileWriter = new DataFileWriter<com.ssgg.bioinfo.effect.Sample$.Zygosity>(
                datumWriter);
        Schema schema = com.ssgg.bioinfo.effect.Sample$.Zygosity.getClassSchema();
        fileWriter.create(schema, new File(filepath));
        fileWriter.append(zygosity);
        fileWriter.close();
    }
}
The Avro-generated enum for Zygosity is as follows:
/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
package com.ssgg.bioinfo.effect.Sample$;

@SuppressWarnings("all")
@org.apache.avro.specific.AvroGenerated
public enum Zygosity {
    HOM_REF, HET, HOM_VAR, HEMI, UNK;
    public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"enum\",\"name\":\"Zygosity\",\"namespace\":\"com.ssgg.bioinfo.effect.Sample$\",\"symbols\":[\"HOM_REF\",\"HET\",\"HOM_VAR\",\"HEMI\",\"UNK\"]}");
    public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
}
I'm a newbie in Avro; can somebody please help me find the problem?
In my project I try to serialize and deserialize a bigger structure, but I have problems with the enums, so I isolated a smaller problem here.
If you need more info I can post it.
Thanks.
I believe the main issue here is that $ has a special meaning in Java class names, and, less importantly, package names are typically lowercase.
So, you should at least edit the namespaces to remove the $.
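For illustration, the enum schema with a namespace that avoids the $ (the lowercase package here is just an example, not from the original post) would look like:
{"type":"enum","name":"Zygosity","namespace":"com.ssgg.bioinfo.effect.sample","symbols":["HOM_REF","HET","HOM_VAR","HEMI","UNK"]}
After regenerating the classes from such a schema, the $ in the namespace is no longer interpreted as an inner-class separator when Avro maps the schema name back to a Java class, so the writer and reader resolve the same enum class.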

Textfile to Avrofile in hadoop using Mapreduce

I need to convert a text file to an Avro file in Hadoop HDFS using MapReduce.
I have already placed the text file in HDFS.
I don't know how to implement this in MapReduce.
This is an example demonstrating how to convert a text file to Avro.
(Our input file is a movie database and contains the following information.)
serial number :: movie name (year)::tag1|tag2
Example Records :
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
The result schema is:
{
  "name": "movies",
  "type": "record",
  "fields": [
    { "name": "movieName", "type": "string" },
    { "name": "year", "type": "string" },
    { "name": "tags", "type": { "type": "array", "items": "string" } }
  ]
}
CODE EXAMPLE :
The map function extracts the different fields from the input record and constructs a generic record. The reduce function simply writes its key to the output; no processing is done in the reducer.
public class MRTextToAvro extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MRTextToAvro(), args);
        System.out.println("Exit code " + exitCode);
    }

    public int run(String[] arg0) throws Exception {
        Job job = Job.getInstance(getConf(), "Text To Avro");
        job.setJarByClass(getClass());

        FileInputFormat.setInputPaths(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(getClass().getResourceAsStream("movies.avsc"));

        job.getConfiguration().setBoolean(Job.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);

        AvroJob.setMapOutputKeySchema(job, schema);
        job.setMapOutputValueClass(NullWritable.class);
        AvroJob.setOutputKeySchema(job, schema);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);

        job.setMapperClass(TextToAvroMapper.class);
        job.setReducerClass(TextToAvroReduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class TextToAvroMapper extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {

        Schema schema;

        protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);
            Schema.Parser parser = new Schema.Parser();
            schema = parser.parse(getClass().getResourceAsStream("movies.avsc"));
        }

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            GenericRecord record = new GenericData.Record(schema);
            String inputRecord = value.toString();
            record.put("movieName", getMovieName(inputRecord));
            record.put("year", getMovieRelaseYear(inputRecord));
            record.put("tags", getMovieTags(inputRecord));
            context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
        }

        public String getMovieName(String record) {
            String movieName = record.split("::")[1];
            return movieName.substring(0, movieName.lastIndexOf('(')).trim();
        }

        public String getMovieRelaseYear(String record) {
            String movieName = record.split("::")[1];
            return movieName.substring(movieName.lastIndexOf('(') + 1, movieName.lastIndexOf(')')).trim();
        }

        public java.util.List<String> getMovieTags(String record) {
            // Return a Collection rather than a String[] so Avro's array type serializes correctly
            return java.util.Arrays.asList(record.split("::")[2].split("\\|"));
        }
    }

    public static class TextToAvroReduce extends Reducer<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {

        @Override
        public void reduce(AvroKey<GenericRecord> key, Iterable<NullWritable> value, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
To run the job, you need to put the Avro-related libraries on the classpath:
export HADOOP_CLASSPATH=/path/to/targets/avro-mapred-1.7.4-hadoop2.jar
yarn jar HadoopAvro-1.0-SNAPSHOT.jar MRTextToAvro -libjars avro-mapred-1.7.4-hadoop2.jar /input/path output/path
