I am trying to dynamically parse a given .proto file in Java to decode a Protobuf-encoded binary.
I have the following parsing method, in which the "proto" string contains the content of the .proto file:
public static Descriptors.FileDescriptor parseProto(String proto) throws InvalidProtocolBufferException, Descriptors.DescriptorValidationException {
    DescriptorProtos.FileDescriptorProto descriptorProto = DescriptorProtos.FileDescriptorProto.parseFrom(proto.getBytes());
    return Descriptors.FileDescriptor.buildFrom(descriptorProto, null);
}
However, on execution the previous method throws an exception with the message "Protocol message tag had invalid wire type.". I am using the example .proto file from Google, so I assume it is valid: https://github.com/google/protobuf/blob/master/examples/addressbook.proto
Here is the stack trace:
15:43:24.707 [pool-1-thread-1] ERROR com.github.whiver.nifi.processor.ProtobufDecoderProcessor - ProtobufDecoderProcessor[id=42c8ab94-2d8a-491b-bd99-b4451d127ae0] Protocol message tag had invalid wire type.
com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:115)
at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:551)
at com.google.protobuf.GeneratedMessageV3.parseUnknownField(GeneratedMessageV3.java:293)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.<init>(DescriptorProtos.java:88)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.<init>(DescriptorProtos.java:53)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet$1.parsePartialFrom(DescriptorProtos.java:773)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet$1.parsePartialFrom(DescriptorProtos.java:768)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:163)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:197)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:209)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:214)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at com.google.protobuf.DescriptorProtos$FileDescriptorSet.parseFrom(DescriptorProtos.java:260)
at com.github.whiver.nifi.parser.SchemaParser.parseProto(SchemaParser.java:9)
at com.github.whiver.nifi.processor.ProtobufDecoderProcessor.lambda$onTrigger$0(ProtobufDecoderProcessor.java:103)
at org.apache.nifi.util.MockProcessSession.write(MockProcessSession.java:895)
at org.apache.nifi.util.MockProcessSession.write(MockProcessSession.java:62)
at com.github.whiver.nifi.processor.ProtobufDecoderProcessor.onTrigger(ProtobufDecoderProcessor.java:100)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:251)
at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:245)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Any idea?
Thank you!
It looks like you're trying to use FileDescriptorSet.parseFrom to populate a FileDescriptorSet. This will only work if the bytes you're providing are the binary protobuf contents - which is to say: a compiled schema. You can get a compiled schema by using the protoc command-line-tool with the --descriptor_set_out option. What you're actually passing it right now is the text bytes that make up the text schema, which is not what parseFrom expects.
Without a compiled schema, you would need a runtime .proto parser. I'm not aware of one for Java; protobuf-net includes one (protobuf-net.Reflection), but that is C#/.NET. Failing that, you'd need to shell-execute protoc instead.
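If you go the protoc route, a minimal Java sketch of loading the compiled output could look like this (the file name addressbook.desc and the helper class are illustrative assumptions, and it assumes a schema with no imports):

import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;

import java.io.FileInputStream;
import java.io.InputStream;

public class CompiledSchemaLoader {
    // Reads a descriptor set produced by: protoc --descriptor_set_out=addressbook.desc addressbook.proto
    public static Descriptors.FileDescriptor loadFirstDescriptor(String descPath) throws Exception {
        try (InputStream in = new FileInputStream(descPath)) {
            DescriptorProtos.FileDescriptorSet set = DescriptorProtos.FileDescriptorSet.parseFrom(in);
            DescriptorProtos.FileDescriptorProto fdp = set.getFile(0);
            // No imports assumed here, so no dependency descriptors are needed.
            return Descriptors.FileDescriptor.buildFrom(fdp, new Descriptors.FileDescriptor[0]);
        }
    }
}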
Drawing from the other answers, here's a snippet of working Kotlin code from a library I'm developing.
https://github.com/asarkar/okgrpc
private fun lookupProtos(
    protoPaths: List<String>,
    protoFile: String,
    tempDir: Path,
    resolved: MutableSet<String>
): List<DescriptorProtos.FileDescriptorProto> {
    val schema = generateSchema(protoPaths, protoFile, tempDir)
    return schema.fileList
        .filter { resolved.add(it.name) }
        .flatMap { fd ->
            fd.dependencyList
                .filterNot(resolved::contains)
                .flatMap { lookupProtos(protoPaths, it, tempDir, resolved) } + fd
        }
}
private fun generateSchema(
    protoPaths: List<String>,
    protoFile: String,
    tempDir: Path
): DescriptorProtos.FileDescriptorSet {
    val outFile = Files.createTempFile(tempDir, null, null)
    val stderr = ByteArrayOutputStream()
    val exitCode = Protoc.runProtoc(
        (protoPaths.map { "--proto_path=$it" } + listOf("--descriptor_set_out=$outFile", protoFile)).toTypedArray(),
        DevNull,
        stderr
    )
    if (exitCode != 0) {
        throw IllegalStateException("Failed to generate schema for: $protoFile")
    }
    return Files.newInputStream(outFile).use { DescriptorProtos.FileDescriptorSet.parseFrom(it) }
}
The idea is to use os72/protoc-jar to write out a compiled schema/file descriptor. Then use FileDescriptorSet.parseFrom to read that file, and recurse on its dependencies.
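If it helps, resolving the dependencies inside a FileDescriptorSet in Java might look roughly like this (a sketch, assuming protoc was run with --include_imports so that every imported file is present in the same set):

import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;

import java.util.HashMap;
import java.util.Map;

public class DescriptorResolver {
    // Builds FileDescriptors for every file in the set, resolving imports recursively.
    public static Map<String, Descriptors.FileDescriptor> buildAll(DescriptorProtos.FileDescriptorSet set)
            throws Descriptors.DescriptorValidationException {
        Map<String, DescriptorProtos.FileDescriptorProto> protosByName = new HashMap<>();
        for (DescriptorProtos.FileDescriptorProto fdp : set.getFileList()) {
            protosByName.put(fdp.getName(), fdp);
        }
        Map<String, Descriptors.FileDescriptor> built = new HashMap<>();
        for (String name : protosByName.keySet()) {
            build(name, protosByName, built);
        }
        return built;
    }

    private static Descriptors.FileDescriptor build(String name,
            Map<String, DescriptorProtos.FileDescriptorProto> protosByName,
            Map<String, Descriptors.FileDescriptor> built)
            throws Descriptors.DescriptorValidationException {
        Descriptors.FileDescriptor existing = built.get(name);
        if (existing != null) {
            return existing;
        }
        DescriptorProtos.FileDescriptorProto proto = protosByName.get(name);
        // Build all imports first, then the file itself.
        Descriptors.FileDescriptor[] deps = new Descriptors.FileDescriptor[proto.getDependencyCount()];
        for (int i = 0; i < deps.length; i++) {
            deps[i] = build(proto.getDependency(i), protosByName, built);
        }
        Descriptors.FileDescriptor fd = Descriptors.FileDescriptor.buildFrom(proto, deps);
        built.put(name, fd);
        return fd;
    }
}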
Don't use a Java String to hold the protobuf payload. The issue is that String does translations behind the scenes and makes assumptions about character sets.
Protobuf works on byte arrays, and the exact representation in the array has to be unchanged. Going to and from String does not work.
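To make that concrete, here is a tiny demonstration (the byte values are arbitrary, chosen only because they are not valid UTF-8, just like typical protobuf output):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringRoundTripDemo {
    public static void main(String[] args) {
        // 0x96 is a lone UTF-8 continuation byte, so decoding replaces it with U+FFFD.
        byte[] original = {(byte) 0x08, (byte) 0x96, (byte) 0x01};
        byte[] roundTripped = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, roundTripped)); // prints false
    }
}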
I have a relatively straightforward use case:
Read Avro data from a Kafka topic
Use KPL (v0.14.12) to send this data to Kinesis Data Streams
Use Kinesis Firehose to transform this data into Parquet and transfer it to S3.
The Kafka topic was written to by Kafka Streams using the following producer configuration:
private void addAwsGlueSpecificProperties(Map<String, Object> props) {
    props.put(AWSSchemaRegistryConstants.AWS_REGION, "eu-central-1");
    props.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.AVRO.name());
    props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
    props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "Kinesis_Schema_Registry");
    props.put(AWSSchemaRegistryConstants.COMPRESSION_TYPE, AWSSchemaRegistryConstants.COMPRESSION.ZLIB.name());
    props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, GlueSchemaRegistryKafkaStreamsSerde.class.getName());
}
Most notably, I've set SCHEMA_AUTO_REGISTRATION_SETTING to true to try and rule out problems with my schema definition. The auto-registration itself worked without any issues.
I have a very simple loop running for test purposes, which does steps 1 and 2 of the above. It looks as follows:
KinesisProducer kinesisProducer = new KinesisProducer(getKinesisConfig());
try (final KafkaConsumer<String, AvroEvent> consumer = new KafkaConsumer<>(properties)) {
    consumer.subscribe(Collections.singletonList(TOPIC));
    while (true) {
        log.info("Polling...");
        final ConsumerRecords<String, AvroEvent> records = consumer.poll(Duration.ofMillis(100));
        for (final ConsumerRecord<String, AvroEvent> record : records) {
            final String key = record.key();
            final AvroEvent value = record.value();
            ListenableFuture<UserRecordResult> request = kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
            Futures.addCallback(request, CALLBACK, executor);
        }
        Thread.sleep(Duration.ofSeconds(10).toMillis());
    }
}
The callback just does a bit of logging on success/failure.
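For context, the callback is roughly of this shape (a sketch rather than my exact code; the UserRecordResult accessors are from the KPL API as I understand it):

import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.util.concurrent.FutureCallback;

// Logs the target shard on success and the cause on failure.
private static final FutureCallback<UserRecordResult> CALLBACK = new FutureCallback<UserRecordResult>() {
    @Override
    public void onSuccess(UserRecordResult result) {
        log.info("Record written to shard {}", result.getShardId());
    }

    @Override
    public void onFailure(Throwable t) {
        log.error("Failed to write record", t);
    }
};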
My Kinesis Config looks as follows:
private static KinesisProducerConfiguration getKinesisConfig() {
    KinesisProducerConfiguration config = new KinesisProducerConfiguration();
    GlueSchemaRegistryConfiguration schemaRegistryConfiguration = getGlueSchemaRegistryConfiguration();
    config.setGlueSchemaRegistryConfiguration(schemaRegistryConfiguration);
    config.setRegion("eu-central-1");
    config.setCredentialsProvider(new DefaultAWSCredentialsProviderChain());
    config.setMaxConnections(2);
    config.setThreadingModel(KinesisProducerConfiguration.ThreadingModel.POOLED);
    config.setThreadPoolSize(2);
    config.setRateLimit(100L);
    return config;
}
private static GlueSchemaRegistryConfiguration getGlueSchemaRegistryConfiguration() {
    GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("eu-central-1");
    gsrConfig.setAvroRecordType(AvroRecordType.GENERIC_RECORD); // have also tried SPECIFIC_RECORD
    gsrConfig.setRegistryName("Kinesis_Schema_Registry");
    gsrConfig.setCompressionType(AWSSchemaRegistryConstants.COMPRESSION.ZLIB);
    return gsrConfig;
}
This setup allows me to read Specific Avro records from Kafka and send them to Kinesis. I have also verified that the correct schema version ID is queried from GSR by my code. However, when my data gets to Firehose, I receive only the following error message for all my records (one per record):
{
  "attemptsMade": 1,
  "arrivalTimestamp": 1659622848304,
  "lastErrorCode": "DataFormatConversion.ParseError",
  "lastErrorMessage": "Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 3)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream#6252e7eb; line: 1, column: 2]",
  "attemptEndingTimestamp": 1659623152452,
  "rawData": "<base64EncodedData>",
  "sequenceNumber": "<seqNum>",
  "dataCatalogTable": {
    "databaseName": "<Glue database name>",
    "tableName": "<Glue table name>",
    "region": "eu-central-1",
    "versionId": "LATEST",
    "roleArn": "<arn>"
  }
}
Unfortunately I can't post the entirety of the data as it is sensitive. However, the relevant part is that it always starts with the above control character that is causing the problem:
0x03 0x05 <schemaVersionId> <data>
My original data does not contain these control characters. After some debugging, I've found that KPL explicitly adds these bytes to the beginning of a UserRecord. In com.amazonaws.services.schemaregistry.serializers.SerializationDataEncoder#write:
public byte[] write(final byte[] objectBytes, UUID schemaVersionId) {
    byte[] bytes;
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        writeHeaderVersionBytes(out);
        writeCompressionBytes(out);
        writeSchemaVersionId(out, schemaVersionId);
        boolean shouldCompress = this.compressionHandler != null;
        bytes = writeToExistingStream(out, shouldCompress ? compressData(objectBytes) : objectBytes);
    } catch (Exception e) {
        throw new AWSSchemaRegistryException(e.getMessage(), e);
    }
    return bytes;
}
With writeHeaderVersionBytes(out) and writeCompressionBytes(out) writing to the front of the stream, respectively:
// byte HEADER_VERSION_BYTE = (byte) 3;
private void writeHeaderVersionBytes(ByteArrayOutputStream out) {
    out.write(AWSSchemaRegistryConstants.HEADER_VERSION_BYTE);
}

// byte COMPRESSION_BYTE = (byte) 5
// byte COMPRESSION_DEFAULT_BYTE = (byte) 0
private void writeCompressionBytes(ByteArrayOutputStream out) {
    out.write(compressionHandler != null ? AWSSchemaRegistryConstants.COMPRESSION_BYTE
            : AWSSchemaRegistryConstants.COMPRESSION_DEFAULT_BYTE);
}
Why is Kinesis unable to parse a message that is produced by the library that is supposed to be best suited for writing to it? What am I missing?
I've finally figured out the problem and it's quite dumb.
What it boils down to is that the transformer that converts data to Parquet in Firehose expects a pure JSON payload. It expects records in the form:
{"itemId": 1, "itemName": "someItem"}{"itemId": 2, "itemName": "otherItem"}
It seemingly does not accept the same data in a different format.
This means that Avro-compatible JSON (where the above itemId would look like "itemId": {"long": 1}), or e.g. binary Avro data, is not compatible with the Kinesis Firehose Parquet transformer, regardless of the fact that my schema definition in the Glue Schema Registry is explicitly registered as being in Avro format.
In addition, the Firehose Parquet transformer requires the use of a Glue table. Creating this table from an imported Avro schema simply does not work (see this answer); it had to be created manually. Luckily, even though Firehose can't use the table that is based on an existing schema, the table definition was the same (with the exception of the Serde it needs to use), so it was relatively easy to fix...
To sum up, to get the above code to work I had to:
Create a Glue table for the schema manually (you can use the first table created from the existing schema as a template for creating this second table, but you can't have Firehose link to the first table)
Change the above code:
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
to:
ByteBuffer data = ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), data);
Note that I am now using the overloaded addUserRecord function that does not include a Schema parameter, which internally invokes the previous function with a null schema parameter. This prevents the KPL from encoding my payload and instead sends the 'plain' JSON over to KDS.
This is contrary to the only AWS Docs example that I could find on the topic, which is likely meant for a Firehose stream that does not convert the data before sending it to its destination.
I can't quite understand the reasons for all these undocumented limitations, and it was a pain to debug, since neither the KPL functions nor KDS explicitly mention anywhere I could find that this is the expected behaviour. I feel like it's not worth trying to open an issue/PR over at the KPL repo seeing how it seems like Amazon doesn't really care about maintaining it that much...
I'll probably switch over to the plain Kinesis Client + Kinesis Aggregation for a more robust solution in the future, but hey, at least it works.
I am learning Kotlin and facing some difficulties understanding how I can proceed.
Currently I have a KML file that gets sent from the front end, but now I would like to accept GeoJSON as well and store it in the database, so I need to create a function in Kotlin to validate the file type and, based on the type, return the correct object.
This is the function that accepts a KML file and calls parseKmlToPolygons:
fun parseKmlToPolygons(file: MultipartFile, applicationConfiguration: ApplicationConfiguration): Geometry {
    if (file.size > applicationConfiguration.getMaxKmlUploadFileSizeLimitInBytes()) {
        throw FileUploadSizeLimitReachedException()
    }
    return parseMultiParFileToPolygons(file.inputStream)
}
private fun parseKmlToPolygons(content: InputStream): Geometry {
    try {
        val kml = Kml.unmarshal(content) ?: throw InvalidKmlException("Failed to parse the kml file")
        return toGeometry(kml.feature)
    } catch (ex: IllegalArgumentException) {
        throw InvalidKmlException(ex.localizedMessage, ex)
    } catch (ex: InvalidGeometryException) {
        throw InvalidKmlException(ex.localizedMessage, ex)
    }
}
So I probably need to create a function that detects the correct file type, but is it OK for me to return type Any here? Also, is it possible to get the type of the file from the InputStream?
private fun detectFileType(): Any {
}
My apologies if I am not really clear here; all I need is to replace the function that takes KML files so that it can take either KML or GeoJSON.
Update
// todo would be better to have detection logic separate
private fun parseKmlToPolygons(file: MultipartFile): Geometry {
    val fileExtension: String = FilenameUtils.getExtension(file.originalFilename)
    if (fileExtension.equals(PolygonFileType.KML.name, ignoreCase = true)) {
        return parseKmlToPolygons(file.inputStream)
    } else if (fileExtension.equals(PolygonFileType.GEOJSON.name, ignoreCase = true)) {
        return parseKmlToPolygons(file.inputStream) // todo call a GeoJSON parser here instead
    }
    throw FormatNotSupportedException("File format is not supported")
}
Actually, what do you mean by "file type"? Both formats, GeoJSON and KML, are text files. They do not have a magic number encoded in them that defines the type. So, I see the following options:
use extension of the original file uploaded by the user. For that you could use MultipartFile.getOriginalFilename
use content type set by the FE when uploading the file. MultipartFile.getContentType. Most likely it won't work out of the box and you will need to adjust your frontend.
check the actual file content. It's the most complex option, but as KML is XML-based and GeoJSON is JSON-based, it should be feasible (see the sketch after this list).
and finally the simplest solution: create separate endpoints for both types.
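If you go with the content check, here is a small sketch of the idea (written in Java, but it translates directly to Kotlin; the class name is made up, and the enum mirrors the PolygonFileType from the question):

import java.io.BufferedInputStream;
import java.io.IOException;

public final class PolygonFormatSniffer {
    public enum PolygonFileType { KML, GEOJSON, UNKNOWN }

    // Peeks at the first non-whitespace byte of a mark-supporting stream:
    // '<' suggests KML (XML), '{' or '[' suggests GeoJSON. The stream is reset
    // afterwards, so the caller keeps parsing from the same BufferedInputStream.
    public static PolygonFileType sniff(BufferedInputStream in) throws IOException {
        in.mark(1024); // assumes no more than ~1 KB of leading whitespace
        int c;
        do {
            c = in.read();
        } while (c != -1 && Character.isWhitespace(c));
        in.reset();
        if (c == '<') return PolygonFileType.KML;
        if (c == '{' || c == '[') return PolygonFileType.GEOJSON;
        return PolygonFileType.UNKNOWN;
    }
}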
I have the following method to load resources as String where path is the String to the resource on my classpath (which works just fine on plain text):
try (Scanner scanner = new Scanner(MyClass.class.getResourceAsStream(path))) {
    return scanner.useDelimiter("\\A").hasNext() ? scanner.next() : "";
}
Now I want to load a PNG image as a base64 String so I can send it back through sparkjava with Content-Type: image/png.
How can I do that?
Do not use any libraries, only plain old Java.
After setting the MIME type in the header with response.header("Content-Type", "image/png") (use whatever MIME type matches your content), you can use this:
try {
    return Files.readAllBytes(Paths.get(MyClass.class.getResource(path).toURI()));
} catch (IOException | URISyntaxException exception) {
    exception.printStackTrace();
}
return null;
Apart from that, to base64-encode a String in Java 8, you can use the java.util.Base64.Encoder class, so you'd just run the result of the method I posted in my description through
Base64.getMimeEncoder().encodeToString(resourceAsString.getBytes(StandardCharsets.UTF_8))
and send it back as the response. I haven't got it to work though, for some odd reason, so I simply used my framework's static files feature instead.
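For reference, a sketch that avoids the String round-trip entirely, reading the resource as raw bytes and base64-encoding them (the helper class name is made up; MyClass is the class from the question):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;

public class ResourceBase64 {
    // Reads a classpath resource as raw bytes (no String round-trip, Java 8 compatible)
    // and base64-encodes it for embedding in a response.
    public static String asBase64(String path) throws IOException {
        try (InputStream in = MyClass.class.getResourceAsStream(path)) {
            if (in == null) {
                throw new IOException("Resource not found: " + path);
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return Base64.getEncoder().encodeToString(buffer.toByteArray());
        }
    }
}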
Because Scala Dispatch 0.9.5 doesn't seem to have a default handler for decompressing GZIP streams, I'm attempting to modify its as.stream.Lines handler for incoming data. Since the same project is also using Spray.io, I attempted to use its GzipDecompressor on the HttpResponseBodyPart bytes, but it threw an exception. See below:
def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
  if (state == CONTINUE) {
    val decomp = new GzipDecompressor
    val bytes = decomp.decompress(bodyPart.getBodyPartBytes)
    onString(new String(bytes, charset))
  }
  state
}
The method above modifies the following (line 23): https://github.com/dispatch/reboot/blob/master/core/src/main/scala/stream/strings.scala#L23
It throws an exception: java.util.zip.ZipException: Not in GZIP format. For those wondering what GzipDecompressor looks like, here it is: https://github.com/spray/spray/blob/master/spray-httpx/src/main/scala/spray/httpx/encoding/Gzip.scala
I know that it is indeed a Gzipped stream. Any workarounds? Thanks!
Update
Playing around with the following:
val machine = new GZIPDecompressMachine
val bytes = machine.write( bodyPart.getBodyPartBytes )
onString(machine.stop().flatMap((a) => a).mkString)
The class GZIPDecompressMachine comes from here: https://gist.github.com/841435/09921c8dcbb6b2ad01a3161589ed4fe68f256fbc
It's throwing an exception: java.io.EOFException: Unexpected end of ZLIB input stream. Any ideas here?
I'm starting to design an application, that will, in part, run through a directory of files and compare their extensions to their file headers.
Does anyone have any advice as to the best way to approach this? I know I could simply have a lookup table that will contain the file's header signature. e.g., JPEG: \xFF\xD8\xFF\xE0
I was hoping there might be a simpler way.
Thanks in advance for your help.
I'm afraid it'll have to be more complicated than that. Not every file type has a header at all, and some (such as RAR) have their characteristic data structures at the end rather than at the beginning.
You may want to take a look at the Unix file command, which does the same job:
http://linux.die.net/man/1/file
http://linux.die.net/man/5/magic
If you don't need to do dirty work on these values (and you are not on Linux), you could simply use an external program, like TrID, which is able to do this for you.
Maybe you can just work on its output without having to do it yourself. In any case, if you only have around 20 kinds of files to manage, a simple lookup table (e.g. HashMap<String, byte[]>) is not that bad. Of course this will only work if the desired file format has a magic number; otherwise you are on your own (or with an external program).
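If you stick with the lookup-table idea, a sketch of the HashMap<String, byte[]> approach could look like this (the two signatures shown are only examples; a real table needs one entry per format you care about):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MagicNumberTable {
    // Minimal magic-number table keyed by lower-case extension.
    private static final Map<String, byte[]> MAGIC_BY_EXTENSION = new HashMap<>();
    static {
        MAGIC_BY_EXTENSION.put("jpg", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0});
        MAGIC_BY_EXTENSION.put("png", new byte[] {(byte) 0x89, (byte) 0x50, (byte) 0x4E, (byte) 0x47}); // 0x89 'P' 'N' 'G'
    }

    // Returns true if the file starts with the signature registered for the given extension.
    public static boolean matchesExtension(String extension, String filePath) throws IOException {
        byte[] expected = MAGIC_BY_EXTENSION.get(extension.toLowerCase());
        if (expected == null) {
            return false; // unknown extension: no signature to compare against
        }
        byte[] actual = new byte[expected.length];
        try (InputStream in = new FileInputStream(filePath)) {
            int read = in.read(actual);
            return read == expected.length && Arrays.equals(actual, expected);
        }
    }
}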
Because of the problem that some file types have no significant header (thanks @Michael), I would create a map from extension to a kind of type checker with a simple API like:
public interface TypeCheck {
    boolean isValid(InputStream data) throws IOException;
}
Now you can code something like
File toBeTested = ...;
Map<String, TypeCheck> typeCheckByExtension = ...;

TypeCheck check = typeCheckByExtension.get(getExtension(toBeTested.getName()));
if (check != null) {
    InputStream in = new FileInputStream(toBeTested);
    if (check.isValid(in)) {
        // process valid file
    } else {
        // process invalid file
    }
    in.close();
} else {
    // process unknown file
}
The header check for JPEG, for example, may look like this:
public class JpegTypeCheck implements TypeCheck {
    private static final byte[] HEADER = new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};

    public boolean isValid(InputStream data) throws IOException {
        byte[] header = new byte[4];
        return data.read(header) == 4 && Arrays.equals(header, HEADER);
    }
}
For other types with no significant header you can implement completely different type checks.
You can extract the MIME type for each file and compare it to a map of MIME type/extension (Map<String, List<String>>, where the first String is the MIME type and the second is a list of valid extensions); a sketch follows the links below.
Resources:
Get the Mime Type from a File
JMimeMagic
On the same topic:
Java - HowTo extract MimeType from a byte[]
Getting A File's Mime Type In Java
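A sketch of that MIME type/extension map, using java.nio.file.Files.probeContentType for the detection step (results depend on the platform's installed detectors; jMimeMagic or Tika can be swapped in for stronger detection; the map entries are illustrative):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MimeExtensionChecker {
    // MIME type -> extensions considered valid for it; extend as needed.
    private static final Map<String, List<String>> VALID_EXTENSIONS = new HashMap<>();
    static {
        VALID_EXTENSIONS.put("image/jpeg", Arrays.asList("jpg", "jpeg"));
        VALID_EXTENSIONS.put("image/png", Arrays.asList("png"));
    }

    // Compares the detected MIME type against the file's extension.
    public static boolean extensionMatchesContent(Path file) throws Exception {
        String mime = Files.probeContentType(file); // may return null; detection is platform dependent
        if (mime == null) {
            return false;
        }
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String extension = dot >= 0 ? name.substring(dot + 1).toLowerCase() : "";
        List<String> allowed = VALID_EXTENSIONS.get(mime);
        return allowed != null && allowed.contains(extension);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extensionMatchesContent(Paths.get(args[0])));
    }
}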
You can determine a file's type by reading its header with Apache Tika. The following code requires the Apache Tika jar.
InputStream is = MainApp.class.getResourceAsStream("/NetFx20SP1_x64.txt");
BufferedInputStream bis = new BufferedInputStream(is);
AutoDetectParser parser = new AutoDetectParser();
Detector detector = parser.getDetector();
Metadata md = new Metadata();
md.add(Metadata.RESOURCE_NAME_KEY,MainApp.class.getResource("/NetFx20SP1_x64.txt").getPath());
MediaType mediaType = detector.detect(bis, md);
System.out.println("MIMe Type of File : " + mediaType.toString());