I am using Kafka Streams to read and process protobuf messages.
I am using the following properties for the stream:
Properties properties = new Properties();
properties.put(ConsumerConfig.GROUP_ID_CONFIG, kafkaConfig.getGroupId());
properties.put(StreamsConfig.CLIENT_ID_CONFIG, kafkaConfig.getClientId());
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, kafkaConfig.getApplicationId());
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaConfig.getBootstrapServers());
properties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
properties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, KafkaProtobufSerde.class);
properties.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, kafkaConfig.getSchemaRegistryUrl());
properties.put(KafkaProtobufDeserializerConfig.SPECIFIC_PROTOBUF_VALUE_TYPE, ProtobufData.class);
return properties;
}
but while running I encounter this error:
Caused by: java.lang.ClassCastException: class com.google.protobuf.DynamicMessage cannot be cast to class model.schema.proto.input.ProtobufDataProto$ProtobufData (com.google.protobuf.DynamicMessage and model.schema.proto.input.ProtobufDataProto$ProtobufData are in unnamed module of loader 'app')
My .proto files looks as follows:
import "inner_data.proto";
package myPackage;
option java_package = "model.schema.proto.input";
option java_outer_classname = "ProtobufDataProto";
message OuterData {
string timestamp = 1;
string x = 3;
repeated InnerObject flows = 4;
}
(I have two separate proto files)
package myPackage;
option java_package = "model.schema.proto.input";
option java_outer_classname = "InnerDataProto";
message InnerData {
string a = 1;
string b = 2;
string c = 3;
}
I would like to know why Kafka uses DynamicMessage even though I gave the specific protobuf value class in the properties and how to fix this?
I had the same issue while trying to get Kafka Streams working with protobuf.
I solved it by using KafkaProtobufSerde explicitly to configure the StreamsBuilder AND by specifying the class to deserialize into with this line: serdeConfig.put(SPECIFIC_PROTOBUF_VALUE_TYPE, ProtobufDataProto.OuterData.class.getName()); (OuterData is the message class generated inside the ProtobufDataProto outer class).
/*
 * Define a specific serde for the protobuf message (OuterData)
 */
final KafkaProtobufSerde<ProtobufDataProto.OuterData> protoSerde = new KafkaProtobufSerde<>();
Map<String, String> serdeConfig = new HashMap<>();
serdeConfig.put(SCHEMA_REGISTRY_URL_CONFIG, registryUrl);
/*
 * Technically, the following line is only mandatory in order to deserialize into the generated
 * message class (a GeneratedMessageV3 subclass) and NOT into DynamicMessage:
 * https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/DynamicMessage
 */
serdeConfig.put(SPECIFIC_PROTOBUF_VALUE_TYPE, ProtobufDataProto.OuterData.class.getName());
protoSerde.configure(serdeConfig, false);
Then I can create my input stream and it will be deserialized:
//Define a Serde for the key
final Serde<byte[]> bytesSerde = Serdes.ByteArray();
//Define the stream
StreamsBuilder streamsBuilder = new StreamsBuilder();
KStream<byte[], ProtobufDataProto.OuterData> inputStream = streamsBuilder.stream("inputTopic", Consumed.with(bytesSerde, protoSerde));
/*
add your processing here: map, filter, etc.
...
*/
streamsBuilder.build();
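With the specific serde configured, the values in inputStream are instances of the generated class rather than DynamicMessage, so the generated getters are available directly. A minimal illustration (the getter follows from the timestamp field of OuterData in the .proto above):
// values are OuterData, not DynamicMessage, so generated accessors work
inputStream.foreach((key, value) -> System.out.println(value.getTimestamp()));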
I'm relying on Confluent's schema registry to store my protobuf schemas.
I posted the following schema in the schema registry:
{
"schema": "syntax = 'proto3'; package com.xyz.message; option java_package = 'com.xyz.message'; option java_outer_classname = 'ActionMessage'; message Action { reserved 7; string id = 1; string version = 2; string action_name = 3; string unique_event_i_d = 4; string rule_i_d = 5; map<string, Value> parameters = 6; string secondary_id = 8; message Value { string value = 1; repeated string values = 2; } }",
"schemaType" : "PROTOBUF"
}
I then query the schema registry REST API from my application to retrieve it:
...
JsonElement schemaRegistryResponse = new JsonParser().parse(inputStreamReader);
String schema = schemaRegistryResponse.getAsJsonObject().get("schema").getAsString();
This indeed makes schema variable holding a string containing the protobuf schema. I now want to create a com.google.protobuf.Descriptors.Descriptor instance from it.
I proceed as follows:
byte[] encoded = Base64.getEncoder().encode(schema.getBytes());
FileDescriptorSet set = FileDescriptorSet.parseFrom(encoded);
FileDescriptor f = FileDescriptor.buildFrom(set.getFile(0), new FileDescriptor[] {});
Descriptors.Descriptor descriptor = f.getMessageTypes().get(0);
However, this throws a Protocol message end-group tag did not match expected tag exception when invoking the parseFrom(encoded) method.
Any idea what I might be doing wrong?
You're trying to parse a base64-encoded representation of a .proto file. That's not at all what FileDescriptorSet.parseFrom expects. It expects a protobuf binary representation of a FileDescriptorSet message, which is typically created by protoc using the descriptor_set_out option.
I don't believe there's any way of getting the protobuf library to parse the text of a .proto file - you really need to run protoc.
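For illustration, assuming the .proto file is available locally (the file names below are hypothetical), you could generate a binary descriptor set with protoc and parse that instead:
import java.nio.file.Files;
import java.nio.file.Paths;
import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
import com.google.protobuf.Descriptors;
import com.google.protobuf.Descriptors.FileDescriptor;

// generated beforehand with:
//   protoc --include_imports --descriptor_set_out=action_message.desc action_message.proto
byte[] descriptorBytes = Files.readAllBytes(Paths.get("action_message.desc"));
FileDescriptorSet set = FileDescriptorSet.parseFrom(descriptorBytes);
// build the first file's descriptor (pass dependencies if your .proto imports other files)
FileDescriptor fd = FileDescriptor.buildFrom(set.getFile(0), new FileDescriptor[0]);
Descriptors.Descriptor descriptor = fd.getMessageTypes().get(0);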
I'm working on creating a framework that allows customers to create their own plugins for my software built on Apache Flink. I've outlined in the snippet below what I'm trying to get working (just as a proof of concept); however, when I try to upload it I get org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
I want to be able to branch the input stream into an arbitrary number of different pipelines, then have those combine into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
// It seemed as though I needed a seed DataStream to union all others on
ObjectMapper mapper = new ObjectMapper();
ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"}");
DataStream<ObjectNode> alerts = see.fromElements(seedObject);
// Union the output of each "rule" above with the seed object to then output
for (DataStream<ObjectNode> output : outputs) {
alerts.union(output);
}
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here
There were a few errors I found.
1) A StreamExecutionEnvironment can apparently only have one input (I could be wrong about this), so adding the .fromElements source was not good.
2) I forgot that all DataStreams are immutable, so the .union operation returns a new DataStream rather than modifying the one it is called on.
The final result ended up being much simpler
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}
The code you posted does not compile because of the last part (converting to string). alerts is an Optional<DataStream<ObjectNode>>, so you are calling Optional's map rather than Flink's DataStream map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
Good luck.
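For reference, applying that fix to the sink section above gives roughly the following (same broker and topic names as in the original snippet):
// unwrap the Optional, then use Flink's map and sink as before
alerts.get()
        .map((MapFunction<ObjectNode, String>) ObjectNode::toString)
        .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();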
I have a YAML configuration file that contains the below list:
name:
- string1
- string2
- string3
I am reading the configuration file as follows:
Yaml yaml = new Yaml();
InputStream in = Resources.getResource("myconfigfile.yml").openStream();
Map cfg_map = (Map) yaml.load(in);
in.close();
String[] values = cfg_map.get("name");
Here the line String[] values = cfg_map.get("name"); gives me an Object. How can I convert it to a String array?
I tried with cfg_map.get("name").toString().split("\n") but it didn't work.
By default, SnakeYAML does not know the underlying types you want your YAML file to be parsed into. You can tell it the structure of your files by setting the root type. For example (matching the structure of your input):
class Config {
public List<String> name;
}
You can then load the YAML like this:
/* We need a constructor to tell SnakeYaml that the type parameter of
* the 'name' List is String
* (SnakeYAML cannot figure it out itself due to type erasure)
*/
Constructor constructor = new Constructor(Config.class);
TypeDescription configDesc = new TypeDescription(Config.class);
configDesc.putListPropertyType("name", String.class);
constructor.addTypeDescription(configDesc);
// Now we use our constructor to tell SnakeYAML how to load the YAML
Yaml yaml = new Yaml(constructor);
Config config = yaml.loadAs(in, Config.class);
// You can now easily access your strings
List<String> values = config.name;
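If you specifically need a String array rather than a List, you can then convert it (a small addition to the code above, using the same Config class):
String[] valuesArray = config.name.toArray(new String[0]);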
I'm reading from a Kafka topic, which contains Avro messages serialized using the KafkaAvroEncoder (which automatically registers the schemas with the topics). I'm using the maven-avro-plugin to generate plain Java classes, which I'd like to use upon reading.
The KafkaAvroDecoder only supports deserializing into GenericData.Record types, which (in my opinion) misses the whole point of having a statically typed language. My deserialization code currently looks like this:
SpecificDatumReader<event> reader = new SpecificDatumReader<>(
event.getClassSchema() // event is my class generated from the schema
);
byte[] in = ...; // my input bytes;
ByteBuffer stuff = ByteBuffer.wrap(in);
// the KafkaAvroEncoder puts a magic byte and the ID of the schema (as stored
// in the schema-registry) before the serialized message
if (stuff.get() != 0x0) {
return;
}
int id = stuff.getInt();
// let's just ignore those special bytes
int length = stuff.limit() - 4 - 1;
int start = stuff.position() + stuff.arrayOffset();
Decoder decoder = DecoderFactory.get().binaryDecoder(
stuff.array(), start, length, null
);
try {
event ev = reader.read(null, decoder);
} catch (IOException e) {
e.printStackTrace();
}
I found my solution cumbersome, so I'd like to know if there is a simpler solution to do this.
Thanks to the comment I was able to find the answer. The secret was to instantiate KafkaAvroDecoder with a Properties specifying the use of the specific Avro reader, that is:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "...");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
VerifiableProperties vProps = new VerifiableProperties(props);
KafkaAvroDecoder decoder = new KafkaAvroDecoder(vProps);
MyLittleData data = (MyLittleData) decoder.fromBytes(input);
The same configuration applies when using the KafkaConsumer<K, V> class directly. (I'm consuming from Kafka in Storm using the KafkaSpout from the storm-kafka project, which uses the SimpleConsumer, so I have to deserialize the messages manually. For the courageous, there is the storm-kafka-client project, which does this automatically by using the new-style consumer.)
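For the plain KafkaConsumer case mentioned above, a sketch of the equivalent configuration might look like this (broker address, group id, and registry URL are placeholders):
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "...");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
consumerProps.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "...");
consumerProps.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
// with the specific reader enabled, values are returned as the generated class
KafkaConsumer<String, MyLittleData> consumer = new KafkaConsumer<>(consumerProps);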
I am testing Avro for java with a simple record composed of a string and a map. Here's my schema:
{
"type":"record",
"name":"TableRecord",
"fields":[
{"name":"ActionCode","type":"string"},
{
"name":"Fields",
"type":{"type":"map","values":["string","long","double","null"]}
}
]
}
And here's a very simple test case that fails:
@Test
public void testSingleMapSerialization() throws IOException {
final String schemaStr; // see above
// create some data
Map<String, Object> originalMap = new Hashtable<>();
originalMap.put("Ric", "sZwmXAdYKv");
originalMap.put("QuoteId", 4342740204922826921L);
originalMap.put("CompanyName", "8PKQ9va3nW8pRWb4SjPF2DvdQDBmlZ");
originalMap.put("Category", "AvrIfd");
// serialize data
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(schemaStr);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
GenericRecord datum = new GenericData.Record(schema);
datum.put("ActionCode", "R");
datum.put("Map", originalMap);
writer.write(datum, encoder);
encoder.flush();
out.flush();
// deserialize data
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
datum = new GenericData.Record(schema);
Map<String, Object> deserializedMap = (Map<String, Object>) reader.read(datum, decoder).get("Fields");
System.out.println(originalMap);
System.out.println(deserializedMap);
Assert.assertEquals("Maps data don't match", originalMap, deserializedMap);
}
And here's the output of the test:
{CompanyName=8PKQ9va3nW8pRWb4SjPF2DvdQDBmlZ, Ric=sZwmXAdYKv, Category=AvrIfd, QuoteId=4342740204922826921}
{QuoteId=4342740204922826921, Category=AvrIfd, CompanyName=8PKQ9va3nW8pRWb4SjPF2DvdQDBmlZ, Ric=sZwmXAdYKv}
java.lang.AssertionError: Maps data don't match expected:<{CompanyName=8PKQ9va3nW8pRWb4SjPF2DvdQDBmlZ, Ric=sZwmXAdYKv, Category=AvrIfd, QuoteId=4342740204922826921}> but was:<{QuoteId=4342740204922826921, Category=AvrIfd, CompanyName=8PKQ9va3nW8pRWb4SjPF2DvdQDBmlZ, Ric=sZwmXAdYKv}>
As you can see, the two maps look identical, but the test fails. JUnit calls the equals method under the covers, and that should return true. BTW, if you're wondering what the gibberish is: I usually create test cases with randomly generated data, so that's where it comes from.
Am I doing something wrong? Is there a catch with string serialization/de-serialization I'm not aware of? I searched online with no success.
Ideas?
Thanks
Giodude
I figured out what the "catch" was. I was comparing a map containing java.lang.String instances with one containing org.apache.avro.util.Utf8 instances, and Utf8's equals method returns false when compared against a String. I realized this by adding the following to my test case:
for (Object o : deserializedMap.values())
System.out.println(o.getClass());
for (Object o : deserializedMap.keySet())
System.out.println(o.getClass());
which prints the following:
class java.lang.Long
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
class org.apache.avro.util.Utf8
I guess this was to be expected, since Avro always converts strings to its native Utf8 type. I assumed it would reproduce my map as-is, but that's not the case. The unchecked cast to Map<String, Object> still succeeds because generic type arguments are erased at runtime; only the values themselves carry the Utf8 type.
Yes, Avro maps have used org.apache.avro.util.Utf8 as the default key type since 1.5, and this can be changed to String. For more details, see https://issues.apache.org/jira/browse/AVRO-803 or Apache Avro: map uses CharSequence as key.
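If you only need the assertion in the test to pass, one option is to normalize the deserialized Utf8 keys and values back to java.lang.String before comparing; a small sketch (not from the original posts), reusing originalMap and deserializedMap from the test above:
Map<String, Object> normalized = new HashMap<>();
for (Map.Entry<?, ?> entry : ((Map<?, ?>) deserializedMap).entrySet()) {
    Object value = entry.getValue();
    // Utf8 implements CharSequence, so convert any CharSequence back to a String
    normalized.put(entry.getKey().toString(), value instanceof CharSequence ? value.toString() : value);
}
Assert.assertEquals("Maps data don't match", originalMap, normalized);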