I'm trying to create a Kafka Connect value converter which wraps invalid JSON records in a valid JSON object.
I'm reading the values from Kinesis (using the KinesisSourceConnector), so the input is base64 encoded.
My implementation tries to process the input through ByteArrayConverter, which decodes the data, and delegates the output to JsonConverter as follows (decode is set to true in the configure method):
private final Converter delegate = new JsonConverter();
private final Converter decoder = new ByteArrayConverter();
private boolean decode = false;
@Override
public byte[] fromConnectData(String topic, Schema schema, Object value) {
try {
String decoded = new String(decoder.fromConnectData(topic, schema, value));
LOG.info("decoded string\n" + decoded);
if(decode) {
byte[] bytes = decoder.fromConnectData(topic, schema, value);
return delegate.fromConnectData(topic, schema, bytes);
}
return delegate.fromConnectData(topic, schema, value);
} catch (Exception e) {
LOG.error("something went wrong", e);
return delegate.fromConnectData(topic, schema, wrapInvalidJson(new String(decoder.fromConnectData(topic, schema, value))));
}
}
When I print the decoded string it looks fine (a decoded JSON string).
But when I consume the output topic the data looks base64 encoded again, and I'm not sure what I am missing.
I'm not sure it is optimal, but I went with the following approach:
private final Converter delegate = new JsonConverter();
private final Converter decoder = new ByteArrayConverter();
private final Converter stringConverter = new StringConverter();
private final ObjectMapper mapper = new ObjectMapper();
private boolean decode = false;
@Override
public void configure(Map<String, ?> configs, boolean isKey) {
delegate.configure(Collections.singletonMap("schemas.enable", false), false);
if (configs.containsKey("ni.decode.data") && Boolean.valueOf((String) configs.get("ni.decode.data"))) {
decode = true;
}
}
@Override
public byte[] fromConnectData(String topic, Schema schema, Object value) {
if (decode) {
String decoded = new String(decoder.fromConnectData(topic, schema, value));
try {
return mapper.readTree(decoded).toString().getBytes();
} catch (Exception e) {
return wrapInvalidJson(decoded).getBytes();
}
} else {
try {
return delegate.fromConnectData(topic, schema, value);
} catch (Exception e) {
byte[] msg = stringConverter.fromConnectData(topic, schema, value);
return wrapInvalidJson(new String(msg)).getBytes();
}
}
}
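For reference, a minimal sketch of how such a converter might be wired into the connector configuration; the converter class name here is hypothetical, and only the ni.decode.data property comes from the code above:
value.converter=com.example.converters.WrappingJsonConverter
value.converter.ni.decode.data=true
Connect strips the value.converter. prefix when it configures the converter, so ni.decode.data should show up in the configs map checked in configure().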
I have a DoFn that is supposed to split input into two separate PCollections. The pipeline builds and runs up until it is time to output in the DoFn, and then I get the following exception:
"java.lang.IllegalArgumentException: Unknown output tag Tag<edu.mayo.mcc.cdh.pipeline.PubsubToAvro$PubsubMessageToArchiveDoFn$2.<init>:219#2587af97b4865538>
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)...
If I declare the TupleTags I'm using inside the ParDo, I get that error, but if I declare them outside of the ParDo I get a syntax error saying the OutputReceiver can't find the tags. Below are the apply and the ParDo/DoFn:
PCollectionTuple results = (messages.apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()).withOutputTags(noTag, TupleTagList.of(medaPcollection))));
PCollection<AvroPubsubMessageRecord> medaPcollectionTransformed = results.get(medaPcollection);
PCollection<AvroPubsubMessageRecord> noTagPcollectionTransformed = results.get(noTag);
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context, MultiOutputReceiver out) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if(appCode == "MEDA"){
LOGGER.info("Made it to MEDA tag");
out.get(medaPcollection).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
out.get(noTag).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
Can you try without the MultiOutputReceiver out parameter in the processElement method?
Outputs are then emitted with context.output, passing the element and the corresponding TupleTag.
Here is your example using only the context:
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if("MEDA".equals(appCode)){ // use equals() rather than == to compare strings
LOGGER.info("Made it to MEDA tag");
context.output(medaPcollection, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
context.output(noTag, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
I'll also show you an example that works for me:
public class WordCountFn extends DoFn<String, Integer> {
private final TupleTag<Integer> outputTag = new TupleTag<Integer>() {};
private final TupleTag<Failure> failuresTag = new TupleTag<Failure>() {};
@ProcessElement
public void processElement(ProcessContext ctx) {
try {
// Could throw ArithmeticException.
final String word = ctx.element();
ctx.output(1 / word.length());
} catch (Throwable throwable) {
final Failure failure = Failure.from("step", ctx.element(), throwable);
ctx.output(failuresTag, failure);
}
}
public TupleTag<Integer> getOutputTag() {
return outputTag;
}
public TupleTag<Failure> getFailuresTag() {
return failuresTag;
}
}
In my first output (the good case), there is no need to pass the TupleTag: ctx.output(1 / word.length());
For my second output (the failure case), I pass the Failure tag with the corresponding element.
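For completeness, a rough sketch of how these tags could be wired in when applying the ParDo (the input collection name here is assumed):
WordCountFn fn = new WordCountFn();
PCollectionTuple results = words.apply("Count Words",
        ParDo.of(fn).withOutputTags(fn.getOutputTag(), TupleTagList.of(fn.getFailuresTag())));
PCollection<Integer> counts = results.get(fn.getOutputTag());
PCollection<Failure> failures = results.get(fn.getFailuresTag());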
I was able to get around this by making my ParDo an anonymous function instead of a class. I put the whole function inline and had no problem finding the output tags after I did that. Thanks for the suggestions!
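For anyone hitting the same error, a rough sketch of that workaround, with the tags declared outside the DoFn and the DoFn written inline as an anonymous class (only the main-output branch is shown; the MEDA branch would use context.output(medaPcollection, ...) as above):
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
PCollectionTuple results = messages.apply("Map to Archive",
        ParDo.of(new DoFn<PubsubMessage, AvroPubsubMessageRecord>() {
            @ProcessElement
            public void processElement(ProcessContext context) {
                PubsubMessage message = context.element();
                // noTag is the main output declared in withOutputTags, so no tag is needed here
                context.output(new AvroPubsubMessageRecord(
                        message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
            }
        }).withOutputTags(noTag, TupleTagList.of(medaPcollection)));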
I am using Redis in one of our projects. I realize that Redis needs objects to be serialized in order to be persisted, but how do I deal with classes that refer to external library classes (StandardServletEnvironment in my case) which don't implement Serializable and which we can't modify? I am getting a NotSerializableException in these cases.
If you want to store user-defined Java objects in Redis, serialization is the suitable option. However, Java native serialization comes with drawbacks like the one you faced, and it is also quite slow. I faced the same kind of problem, and after a long search I settled on Kryo serialization. Kryo does not require your classes to implement Serializable and is considerably faster than Java native serialization.
P.S.: If you don't want to use Kryo and prefer the built-in Java serialization, create a serializable class, copy your object's state into it, and work with that (a sketch follows below).
I hope this helps.
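To illustrate the P.S. (the class and its fields here are made up), the idea is to copy the state you need into a plain serializable holder and cache that instead of the non-serializable object:
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder: copies only the properties you actually need from the
// non-serializable environment object, so the holder itself can be serialized.
public class EnvironmentSnapshot implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Object> properties = new HashMap<>();

    public EnvironmentSnapshot(Map<String, Object> source) {
        this.properties.putAll(source);
    }

    public Map<String, Object> getProperties() {
        return properties;
    }
}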
As Praga stated, Kryo is a good solution for (de)serialization of objects that do not implement the Serializable interface. Here is some sample code for serialization with Kryo, hope it helps:
Kryo kryo = new Kryo();
private byte[] encode(Object obj) {
ByteArrayOutputStream objStream = new ByteArrayOutputStream();
Output objOutput = new Output(objStream);
kryo.writeClassAndObject(objOutput, obj);
objOutput.close();
return objStream.toByteArray();
}
private <T> T decode(byte[] bytes) {
return (T) kryo.readClassAndObject(new Input(bytes));
}
Maven dependency:
<dependency>
<groupId>com.esotericsoftware</groupId>
<artifactId>kryo</artifactId>
<version>4.0.1</version>
</dependency>
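Assuming the encode/decode helpers above, a round trip looks something like this (MyCachedObject is just a placeholder for whatever non-serializable type you need to cache):
MyCachedObject original = new MyCachedObject();   // hypothetical type, no Serializable needed
byte[] bytes = encode(original);                  // Kryo writes the class info plus the fields
MyCachedObject restored = decode(bytes);          // reads it back via readClassAndObject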
Full Implementation for redis integration:
RedisInterface :
public class RedisInterface {
static final Logger logger = LoggerFactory.getLogger(RedisInterface.class);
private static RedisInterface instance =null;
public static RedisInterface getInstance ()
{
if(instance ==null)
createInstance();
return instance ;
}
private static synchronized void createInstance()
{
if(instance ==null)//in case of multi thread instances
instance =new RedisInterface();
}
JedisConfig jedis = new JedisConfig();
public boolean setAttribute(String key, Object value)
{
return this.setAttribute(key, key, value);
}
public boolean setAttribute(String key, Object value, int expireSeconds)
{
return this.setAttribute(key, key, value, expireSeconds);
}
public boolean setAttribute(String key, String field, Object value)
{
int expireSeconds = 20 *60; //20 minutes
return this.setAttribute(key, field, value, expireSeconds);
}
public boolean setAttribute(String key, String field, Object value, int expireSeconds)
{
try
{
if(key==null || "".equals(key) || field==null || "".equals(field))
return false;
byte[]keyBytes = key.getBytes();
byte[]fieldBytes = field.getBytes();
byte []valueBytes = encode(value);
long start = new Date().getTime();
jedis.set(keyBytes, fieldBytes, valueBytes, expireSeconds);
long end = new Date().getTime();
long waitTime =end-start;
logger.info("{} key saved to redis in {} milliseconds with timeout: {} seconds", new Object[] {key, waitTime, expireSeconds} );
return true;
}
catch(Exception e)
{
logger.error( "error on saving object to redis. key: " + key, e);
return false;
}
}
public <T> T getAttribute(String key)
{
return this.getAttribute(key, key);
}
public <T> T getAttribute(String key, String field)
{
try
{
if(key==null || "".equals(key) || field==null || "".equals(field)) return null;
byte[]keyBytes = key.getBytes();
byte[]fieldBytes = field.getBytes();
long start = new Date().getTime();
byte[] valueBytes = jedis.get(keyBytes, fieldBytes);
T o =null;
if(valueBytes!=null && valueBytes.length>0)
o = decode(valueBytes);
long end = new Date().getTime();
long waitTime =end-start;
logger.info("{} key read operation from redis in {} milliseconds. key found?: {}", new Object[] {key, waitTime, (o!=null)});
return o;
}
catch (Exception e)
{
logger.error( "error on getting object from redis. key: "+ key, e);
return null;
}
}
Kryo kryo = new Kryo();
private byte[] encode(Object obj) {
ByteArrayOutputStream objStream = new ByteArrayOutputStream();
Output objOutput = new Output(objStream);
kryo.writeClassAndObject(objOutput, obj);
objOutput.close();
return objStream.toByteArray();
}
private <T> T decode(byte[] bytes) {
return (T) kryo.readClassAndObject(new Input(bytes));
}
}
JedisConfig :
public class JedisConfig implements Closeable
{
private Pool<Jedis> jedisPool = null;
private synchronized void initializePool()
{
if(jedisPool!=null) return;
JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxTotal(Integer.parseInt(Config.REDIS_MAX_ACTIVE_CONN)); // maximum active connections
poolConfig.setMaxIdle(Integer.parseInt(Config.REDIS_MAX_IDLE_CONN)); // maximum idle connections
poolConfig.setMaxWaitMillis(Long.parseLong(Config.REDIS_MAX_WAIT_MILLIS)); // max wait time for new connection (before throwing an exception)
if("true".equals(Config.REDIS_SENTINEL_ACTIVE))
{
String [] sentinelsArray = Config.REDIS_SENTINEL_HOST_LIST.split(",");
Set<String> sentinels = new HashSet<>();
for(String sentinel : sentinelsArray)
{
sentinels.add(sentinel);
}
String masterName = Config.REDIS_SENTINEL_MASTER_NAME;
jedisPool = new JedisSentinelPool(masterName, sentinels, poolConfig, Integer.parseInt(Config.REDIS_CONN_TIMEOUT));
}
else
{
jedisPool = new JedisPool(poolConfig,
Config.REDIS_IP,
Integer.parseInt(Config.REDIS_PORT),
Integer.parseInt(Config.REDIS_CONN_TIMEOUT));
}
}
protected Jedis getJedis()
{
if(jedisPool==null)
initializePool();
Jedis jedis = jedisPool.getResource();
return jedis;
}
public Long set(final byte[] key, final byte[] field, final byte[] value, int expireSeconds)
{
Jedis redis = null;
Long ret =0L;
try
{
redis = getJedis();
ret = redis.hset(key, field, value);
redis.expire(key, expireSeconds);
}
finally
{
if(redis!=null)
redis.close();
}
return ret;
}
public byte[] get(final byte[] key, final byte[] field) {
Jedis redis = null ;
byte[] valueBytes = null;
try
{
redis = getJedis();
valueBytes = redis.hget(key, field);
}
finally
{
if(redis!=null)
redis.close();
}
return valueBytes;
}
@Override
public void close() throws IOException {
if(jedisPool!=null)
jedisPool.close();
}
}
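A minimal usage sketch of the two classes above (the key and the cached type are made up):
RedisInterface redis = RedisInterface.getInstance();
// store with a 10-minute expiry; the value goes through the Kryo encode() shown above
redis.setAttribute("user:42:profile", profileObject, 10 * 60);
// read it back; returns null if the key is missing, expired, or on any error
UserProfile cached = redis.getAttribute("user:42:profile");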
My KafkaProducer is able to use KafkaAvroSerializer to serialize objects to my topic. However, KafkaConsumer.poll() returns a deserialized GenericRecord instead of my serialized class.
My KafkaProducer
KafkaProducer<CharSequence, MyBean> producer;
try (InputStream props = Resources.getResource("producer.props").openStream()) {
Properties properties = new Properties();
properties.load(props);
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
properties.put("schema.registry.url", "http://localhost:8081");
MyBean bean = new MyBean();
producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<>(topic, bean.getId(), bean));
My KafkaConsumer
try (InputStream props = Resources.getResource("consumer.props").openStream()) {
Properties properties = new Properties();
properties.load(props);
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
properties.put("schema.registry.url", "http://localhost:8081");
consumer = new KafkaConsumer<>(properties);
}
consumer.subscribe(Arrays.asList(topic));
try {
while (true) {
ConsumerRecords<CharSequence, MyBean> records = consumer.poll(100);
if (records.isEmpty()) {
continue;
}
for (ConsumerRecord<CharSequence, MyBean> record : records) {
MyBean bean = record.value(); // <-------- This is throwing a cast Exception because it cannot cast GenericRecord to MyBean
System.out.println("consumer received: " + bean);
}
}
The line MyBean bean = record.value(); throws a ClassCastException because a GenericRecord cannot be cast to MyBean.
I'm using kafka-client-0.9.0.1, kafka-avro-serializer-3.0.0.
KafkaAvroDeserializer supports SpecificData
It's not enabled by default. To enable it:
properties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
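For context, a sketch of the consumer configuration with the specific reader enabled (the schema registry URL is assumed to match the producer's; this requires MyBean to be an Avro-generated SpecificRecord class):
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroDeserializer.class);
properties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
properties.put("schema.registry.url", "http://localhost:8081");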
KafkaAvroDeserializer does not support ReflectData
Confluent's KafkaAvroDeserializer does not know how to deserialize using Avro ReflectData. I had to extend it to support Avro ReflectData:
/**
* Extends deserializer to support ReflectData.
*
* @param <V>
* value type
*/
public abstract class ReflectKafkaAvroDeserializer<V> extends KafkaAvroDeserializer {
private Schema readerSchema;
private DecoderFactory decoderFactory = DecoderFactory.get();
protected ReflectKafkaAvroDeserializer(Class<V> type) {
readerSchema = ReflectData.get().getSchema(type);
}
@Override
protected Object deserialize(
boolean includeSchemaAndVersion,
String topic,
Boolean isKey,
byte[] payload,
Schema readerSchemaIgnored) throws SerializationException {
if (payload == null) {
return null;
}
int schemaId = -1;
try {
ByteBuffer buffer = ByteBuffer.wrap(payload);
if (buffer.get() != MAGIC_BYTE) {
throw new SerializationException("Unknown magic byte!");
}
schemaId = buffer.getInt();
Schema writerSchema = schemaRegistry.getByID(schemaId);
int start = buffer.position() + buffer.arrayOffset();
int length = buffer.limit() - 1 - idSize;
DatumReader<Object> reader = new ReflectDatumReader(writerSchema, readerSchema);
BinaryDecoder decoder = decoderFactory.binaryDecoder(buffer.array(), start, length, null);
return reader.read(null, decoder);
} catch (IOException e) {
throw new SerializationException("Error deserializing Avro message for id " + schemaId, e);
} catch (RestClientException e) {
throw new SerializationException("Error retrieving Avro schema for id " + schemaId, e);
}
}
}
Define a custom deserializer class which deserializes to MyBean:
public class MyBeanDeserializer extends ReflectKafkaAvroDeserializer<MyBean> {
public MyBeanDeserializer() {
super(MyBean.class);
}
}
Configure KafkaConsumer to use the custom deserializer class:
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyBeanDeserializer.class);
Edit: ReflectData support got merged (see below).
To add to Chin Huang's answer, for minimal code and better performance, you should probably implement it this way:
/**
* Extends deserializer to support ReflectData.
*
* @param <T>
* value type
*/
public abstract class SpecificKafkaAvroDeserializer<T extends SpecificRecordBase> extends AbstractKafkaAvroDeserializer implements Deserializer<T> {
private final Schema schema;
private Class<T> type;
private DecoderFactory decoderFactory = DecoderFactory.get();
protected SpecificKafkaAvroDeserializer(Class<T> type, Map<String, ?> props) {
this.type = type;
this.schema = ReflectData.get().getSchema(type);
this.configure(this.deserializerConfig(props));
}
public void configure(Map<String, ?> configs) {
this.configure(new KafkaAvroDeserializerConfig(configs));
}
@Override
protected T deserialize(
boolean includeSchemaAndVersion,
String topic,
Boolean isKey,
byte[] payload,
Schema readerSchemaIgnore) throws SerializationException {
if (payload == null) {
return null;
}
int schemaId = -1;
try {
ByteBuffer buffer = ByteBuffer.wrap(payload);
if (buffer.get() != MAGIC_BYTE) {
throw new SerializationException("Unknown magic byte!");
}
schemaId = buffer.getInt();
Schema schema = schemaRegistry.getByID(schemaId);
Schema readerSchema = ReflectData.get().getSchema(type);
int start = buffer.position() + buffer.arrayOffset();
int length = buffer.limit() - 1 - idSize;
SpecificDatumReader<T> reader = new SpecificDatumReader(schema, readerSchema);
BinaryDecoder decoder = decoderFactory.binaryDecoder(buffer.array(), start, length, null);
return reader.read(null, decoder);
} catch (IOException e) {
throw new SerializationException("Error deserializing Avro message for id " + schemaId, e);
} catch (RestClientException e) {
throw new SerializationException("Error retrieving Avro schema for id " + schemaId, e);
}
}
}
I am trying to write a generic serializer for my Avro-generated Java objects. By begging, borrowing and stealing I came up with the following method:
public byte[] serialize(T data) {
SpecificDatumWriter<T> writer = new SpecificDatumWriter<>(tClass);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
try {
writer.write(data, encoder);
encoder.flush();
ByteBuffer serialized = ByteBuffer.allocate(out.toByteArray().length);
serialized.put(out.toByteArray());
return serialized.array();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
I have two questions:
Is the schema included in this byte array? If I try to deserialize this byte array with a tClass of a different version, will it be ok? (as long as the schemas are backwards-compatible)
If this is how I'm meant to serialize Avro POJOs, what, if anything, are the following used for in my Avro-generated POJO:
public org.apache.avro.Schema getSchema() {
return SCHEMA$;
}
private static final org.apache.avro.io.DatumWriter WRITER$ = new org.apache.avro.specific.SpecificDatumWriter(SCHEMA$);
@Override
public void writeExternal(java.io.ObjectOutput out) throws java.io.IOException {
WRITER$.write(this, SpecificData.getEncoder(out));
}
private static final org.apache.avro.io.DatumReader READER$ = new org.apache.avro.specific.SpecificDatumReader(SCHEMA$);
@Override
public void readExternal(java.io.ObjectInput in) throws java.io.IOException {
READER$.read(this, SpecificData.getDecoder(in));
}
Am I missing something?
I am completely new to Kafka and Avro and am trying to use the Confluent package. We have existing POJOs we use for JPA, and I'd like to be able to simply produce an instance of my POJOs without having to reflect each value into a generic record manually. I seem to be missing how this is done in the documentation.
The examples use a generic record and set each value one by one like so:
String key = "key1";
String userSchema = "{\"type\":\"record\"," +
"\"name\":\"myrecord\"," +
"\"fields\":[{\"name\":\"f1\",\"type\":\"string\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("f1", "value1");
record = new ProducerRecord<Object, Object>("topic1", key, avroRecord);
try {
producer.send(record);
} catch(SerializationException e) {
// may need to do something with it
}
There are several examples for getting a schema from a class and I found the annotations to alter that schema as necessary. Now how do I take an instance of a POJO and just send it to the serializer as is and have the library do the work of matching up the schema from the class and then copying the values into a generic record? Am I going about this all wrong? What I want to end up doing is something like this:
String key = "key1";
Schema schema = ReflectData.get().getSchema(myObject.getClass());
GenericRecord avroRecord = ReflectData.get().getRecord(myObject, schema);
record = new ProducerRecord<Object, Object>("topic1", key, avroRecord);
try {
producer.send(record);
} catch(SerializationException e) {
// may need to do something with it
}
Thanks!
I wound up creating my own serializer in this instance:
public class KafkaAvroReflectionSerializer extends KafkaAvroSerializer {
private final EncoderFactory encoderFactory = EncoderFactory.get();
@Override
protected byte[] serializeImpl(String subject, Object object) throws SerializationException {
//TODO: consider caching schemas
Schema schema = null;
if(object == null) {
return null;
} else {
try {
schema = ReflectData.get().getSchema(object.getClass());
int e = this.schemaRegistry.register(subject, schema);
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(0);
out.write(ByteBuffer.allocate(4).putInt(e).array());
BinaryEncoder encoder = encoderFactory.directBinaryEncoder(out, null);
DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
writer.write(object, encoder);
encoder.flush();
out.close();
byte[] bytes = out.toByteArray();
return bytes;
} catch (IOException ioe) {
throw new SerializationException("Error serializing Avro message", ioe);
} catch (RestClientException rce) {
throw new SerializationException("Error registering Avro schema: " + schema, rce);
} catch (RuntimeException re) {
throw new SerializationException("Error serializing Avro message", re);
}
}
}
}
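With that in place, the producer just points at the custom serializer; a rough sketch (the key serializer choice and the MyObject type are assumptions, the schema registry URL as before):
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroReflectionSerializer.class);
properties.put("schema.registry.url", "http://localhost:8081");
KafkaProducer<String, MyObject> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<>("topic1", "key1", myObject));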