How to implement FlinkKafkaProducer serializer for Kafka 2.2 - java

I've been working on updating a Flink processor (Flink version 1.9) that reads from Kafka and then writes to Kafka. We originally wrote this processor to run against a Kafka 0.10.2 cluster, and we have now deployed a new Kafka cluster running version 2.2. Therefore I set out to update the processor to use the latest FlinkKafkaConsumer and FlinkKafkaProducer (as suggested by the Flink docs). However, I've run into some problems with the Kafka producer. I'm unable to get it to serialize data using the deprecated constructors (not surprising), and I've been unable to find any implementations or examples online of how to implement a serializer (all the examples use the older Kafka connectors).
The current implementation (for Kafka 0.10.2) is as follows:
FlinkKafkaProducer010<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer010<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        (FlinkKafkaPartitioner) null
);
When trying to implement the following FlinkKafkaProducer:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        null
);
I get the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:525)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:483)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:357)
at com.ebs.flink.sessionprocessor.SessionProcessor.main(SessionProcessor.java:122)
and I haven't been able to figure out why.
That constructor of FlinkKafkaProducer is also deprecated, and when I try implementing the non-deprecated constructor I can't figure out how to serialize the data.
The following is how it would look:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String s, @Nullable Long aLong) {
                return null;
            }
        },
        producerProps,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE
);
But I don't understand how to implement the KafkaSerializationSchema, and I can find no examples of this online or in the Flink docs.
Does anyone have any experience implementing this, or any tips on why the FlinkKafkaProducer throws a NullPointerException in that step?

If you are just sending a String to Kafka:
import java.nio.charset.StandardCharsets;

import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerStringSerializationSchema implements KafkaSerializationSchema<String> {

    private String topic;

    public ProducerStringSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
        return new ProducerRecord<byte[], byte[]>(topic, element.getBytes(StandardCharsets.UTF_8));
    }
}
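Wiring that schema into the non-deprecated constructor (the same constructor the question attempts, taking a KafkaSerializationSchema and a Semantic) would then look roughly like this; the topic name and producerProps are taken from the question, and the Semantic is whatever your job needs:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<>(
        "playerSessions",                                         // default target topic
        new ProducerStringSerializationSchema("playerSessions"),  // schema above
        producerProps,
        FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);               // or EXACTLY_ONCE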
For sending a Java Object:
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonProcessingException;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ObjSerializationSchema implements KafkaSerializationSchema<MyPojo> {

    private String topic;
    private ObjectMapper mapper;

    public ObjSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(MyPojo obj, Long timestamp) {
        byte[] b = null;
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            b = mapper.writeValueAsBytes(obj);
        } catch (JsonProcessingException e) {
            // TODO: handle the serialization failure (log, count, or rethrow)
        }
        return new ProducerRecord<byte[], byte[]>(topic, b);
    }
}
In your code:
.addSink(new FlinkKafkaProducer<>(producerTopic, new ObjSerializationSchema(producerTopic),
        params.getProperties(), FlinkKafkaProducer.Semantic.EXACTLY_ONCE));

To deal with the timeout in the case of FlinkKafkaProducer.Semantic.EXACTLY_ONCE you should read https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-011-and-newer, particularly this part:
Semantic.EXACTLY_ONCE mode relies on the ability to commit transactions that were started before taking a checkpoint, after recovering from the said checkpoint. If the time between Flink application crash and completed restart is larger than Kafka’s transaction timeout there will be data loss (Kafka will automatically abort transactions that exceeded timeout time). Having this in mind, please configure your transaction timeout appropriately to your expected down times.
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than it’s value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
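As a rough sketch (the 15-minute value is only an illustration that stays within the broker default; adjust both sides to your expected downtime), the producer-side timeout can be set in the same producerProps that are handed to the FlinkKafkaProducer:
// Keep the producer's transaction.timeout.ms at or below the broker's
// transaction.max.timeout.ms (15 minutes by default) when using EXACTLY_ONCE.
producerProps.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));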

Related

Apache Flink does not return data for idle partitions

I am trying to calculate the rate of incoming events per minute from a Kafka topic based on event time. I am using TumblingEventTimeWindows of 1 minute for this. The code snippet is given below.
I have observed that if I am not receiving any event for a particular window, e.g. from 2.34 to 2.35, then the previous window of 2.33 to 2.34 does not get closed. I understand the risk of losing data for the window of 2.33 to 2.34 (may happen due to system failure, bigger Kafka lag, etc.), but I cannot wait indefinitely. I need to close this window after waiting for a certain period of time, and subsequent windows can continue after the system recovers. How can I achieve this?
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,
        org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)
));
executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
executionEnvironment.setParallelism(1);

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "AllEventCountConsumerGroup");

FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("event_input_topic", new SimpleStringSchema(), properties);
DataStreamSource<String> kafkaDataStream = executionEnvironment.addSource(kafkaConsumer);

kafkaDataStream
        .flatMap(new EventFlatter())
        .filter(Objects::nonNull)
        .assignTimestampsAndWatermarks(WatermarkStrategy
                .<Entity>forMonotonousTimestamps()
                .withIdleness(Duration.ofSeconds(60))
                .withTimestampAssigner((SerializableTimestampAssigner<Entity>) (element, recordTimestamp) -> element.getTimestamp()))
        .assignTimestampsAndWatermarks(new EntityWatermarkStrategy())
        .keyBy((KeySelector<Entity, String>) Entity::getTenant)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.seconds(10))
        .aggregate(new EventCountAggregator())
        .addSink(eventRateProducer);
private static class EntityWatermarkStrategy implements WatermarkStrategy<Entity> {
    @Override
    public WatermarkGenerator<Entity> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new EntityWatermarkGenerator();
    }
}

private static class EntityWatermarkGenerator implements WatermarkGenerator<Entity> {

    private long maxTimestamp;

    public EntityWatermarkGenerator() {
        this.maxTimestamp = Long.MIN_VALUE + 1;
    }

    @Override
    public void onEvent(Entity event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(maxTimestamp + 2));
    }
}
Also, I tried adding some custom triggers, but that didn't help. I am using Apache Flink 1.11.
Can somebody suggest what I am doing wrong?
When I push more data with a newer timestamp (say t+1) to the topic, the data from the earlier timeframe (t) gets pushed out, but then for the t+1 data the same issue occurs as for t.
One reason why withIdleness() isn't helping in your case is that you are calling assignTimestampsAndWatermarks on the datastream after it has been emitted by the kafka source, rather than calling it on the FlinkKafkaConsumer itself. If you were to do the latter, then the FlinkKafkaConsumer would be able to assign timestamps and watermarks on a per-partition basis, and would consider idleness at the granularity of each individual kafka partition. See Watermark Strategies and the Kafka Connector for more info.
To make this work, however, you'll need to use a deserializer other than a SimpleStringSchema (such as a KafkaDeserializationSchema) that is able to create individual stream records with timestamps. See https://stackoverflow.com/a/62072265/2000823 for an example of how to implement a KafkaDeserializationSchema.
Keep in mind, however, that withIdleness() will not advance the watermark if all partitions are idle. What it will do is to prevent idle partitions from holding back the watermark, which may advance if there are events from other partitions.
See the idle partitions documentation for an approach to solving your problem.
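A rough sketch of that approach, assuming a hypothetical EntityDeserializationSchema (a KafkaDeserializationSchema<Entity>) in place of SimpleStringSchema and reusing the properties and executionEnvironment from the question, with the watermark strategy attached to the consumer itself:
FlinkKafkaConsumer<Entity> entityConsumer = new FlinkKafkaConsumer<>(
        "event_input_topic",
        new EntityDeserializationSchema(),   // hypothetical KafkaDeserializationSchema<Entity>
        properties);

// Watermarks are now generated per Kafka partition, so withIdleness() applies to
// each idle partition individually instead of to the stream as a whole.
entityConsumer.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Entity>forMonotonousTimestamps()
                .withIdleness(Duration.ofSeconds(60))
                .withTimestampAssigner((SerializableTimestampAssigner<Entity>) (element, recordTimestamp) -> element.getTimestamp()));

DataStream<Entity> events = executionEnvironment.addSource(entityConsumer);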
Using the Flink 1.11+ WatermarkStrategy API should help you avoid pumping dummy data. What you need is to generate a watermark at the end of each minute periodically. This is the reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/event_timestamps_watermarks.html
Create a FlinkKafkaConsumer with a CustomKafkaSerializer:
FlinkKafkaConsumer otherConsumer = new FlinkKafkaConsumer(
        topics, new CustomKafkaSerializer(apacheFlinkEnvironmentLoader), props);
How to create the CustomKafkaSerializer?
Ans - Two questions about Flink deserializing
Now use a watermark strategy for this FlinkKafkaConsumer:
FlinkKafkaConsumer<Tuple3<String,String,String>> flinkKafkaConsumer = apacheKafkaConfig.getOtherConsumer();
flinkKafkaConsumer.assignTimestampsAndWatermarks(new ApacheFlinkWaterMarkStrategy(envConfig.getOutOfOrderDurationSeconds())
        .withIdleness(Duration.ofSeconds(envConfig.getIdlePartitionTimeout())));
So this is how the watermark strategy looks:
Ans ->
public class ApacheFlinkWaterMarkStrategy implements WatermarkStrategy<Tuple3<String, String, String>> {

    private long outOfOrderDuration;

    public ApacheFlinkWaterMarkStrategy(long outOfOrderDuration) {
        super();
        this.outOfOrderDuration = outOfOrderDuration;
    }

    @Override
    public TimestampAssigner<Tuple3<String, String, String>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new ApacheFlinkTimeForEvent();
    }

    @Override
    public WatermarkGenerator<Tuple3<String, String, String>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new ApacheFlinkWaterMarkGenerator(this.outOfOrderDuration);
    }
}
This is how we get the event time from the payload:
Ans ->
public class ApacheFlinkTimeForEvent implements SerializableTimestampAssigner<Tuple3<String,String,String>> {

    public static final Logger logger = LoggerFactory.getLogger(ApacheFlinkTimeForEvent.class);
    private static final FhirContext fhirContext = FhirContext.forR4();

    @Override
    public long extractTimestamp(Tuple3<String,String,String> o, long l) {
        //get timestamp from payload
    }
}
This is how we generate watermarks periodically, so that irrespective of whether data arrives or not, the watermark gets updated every minute in each partition.
public class ApacheFlinkWaterMarkGenerator implements WatermarkGenerator<Tuple3<String,String,String>> {

    public static final Logger logger = LoggerFactory.getLogger(ApacheFlinkWaterMarkGenerator.class);

    private long outOfOrderGenerator;
    private long maxEventTimeStamp;

    public ApacheFlinkWaterMarkGenerator(long outOfOrderGenerator) {
        super();
        this.outOfOrderGenerator = outOfOrderGenerator;
    }

    @Override
    public void onEvent(Tuple3<String, String, String> stringStringStringTuple3, long l, WatermarkOutput watermarkOutput) {
        maxEventTimeStamp = Math.max(maxEventTimeStamp, l);
        Watermark eventWatermark = new Watermark(maxEventTimeStamp);
        watermarkOutput.emitWatermark(eventWatermark);
        logger.info("Current Watermark emitted from event is {}", eventWatermark.getFormattedTimestamp());
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
        long currentUtcTime = Instant.now().toEpochMilli();
        Watermark periodicWaterMark = new Watermark(currentUtcTime - outOfOrderGenerator);
        watermarkOutput.emitWatermark(periodicWaterMark);
        logger.info("Current Watermark emitted periodically is {}", periodicWaterMark.getFormattedTimestamp());
    }
}
Also, periodic emitting of the watermark has to be set at the start of the application:
streamExecutionEnvironment.getConfig().setAutoWatermarkInterval(intervalInMillis); // interval in milliseconds
This is how we add the custom watermarks and timestamps to the FlinkKafkaConsumer:
flinkKafkaConsumer.assignTimestampsAndWatermarks(new ApacheFlinkWaterMarkStrategy(outOfOrderSeconds)
        .withIdleness(Duration.ofSeconds(idlePartitionSeconds)));

Spring Boot: Kafka health indicator

I have something like below which works well, but I would prefer checking health without sending any message (not only checking the socket connection). I know Kafka has something like KafkaHealthIndicator out of the box; does someone have experience with it or an example of using it?
public class KafkaHealthIndicator implements HealthIndicator {

    private final Logger log = LoggerFactory.getLogger(KafkaHealthIndicator.class);

    private KafkaTemplate<String, String> kafka;

    public KafkaHealthIndicator(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    @Override
    public Health health() {
        try {
            kafka.send("kafka-health-indicator", "❥").get(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            return Health.down(e).build();
        }
        return Health.up().build();
    }
}
In order to trip the health indicator, retrieve data from one of the future objects; otherwise the indicator is UP even when Kafka is down!
When Kafka is not connected, future.get() throws an exception, which in turn sets the indicator to DOWN.
@Configuration
public class KafkaConfig {

    @Autowired
    private KafkaAdmin kafkaAdmin;

    @Bean
    public AdminClient kafkaAdminClient() {
        return AdminClient.create(kafkaAdmin.getConfigurationProperties());
    }

    @Bean
    public HealthIndicator kafkaHealthIndicator(AdminClient kafkaAdminClient) {
        final DescribeClusterOptions options = new DescribeClusterOptions()
                .timeoutMs(1000);
        return new AbstractHealthIndicator() {
            @Override
            protected void doHealthCheck(Health.Builder builder) throws Exception {
                DescribeClusterResult clusterDescription = kafkaAdminClient.describeCluster(options);
                // In order to trip the health indicator DOWN, retrieve data from one of the
                // future objects; otherwise the indicator is UP even when Kafka is down!!!
                // When Kafka is not connected, future.get() throws an exception, which
                // in turn sets the indicator DOWN.
                clusterDescription.clusterId().get();
                // or clusterDescription.nodes().get().size()
                // or clusterDescription.controller().get();
                builder.up().build();

                // Alternatively, directly use data from the futures in the health detail.
                builder.up()
                        .withDetail("clusterId", clusterDescription.clusterId().get())
                        .withDetail("nodeCount", clusterDescription.nodes().get().size())
                        .build();
            }
        };
    }
}
Use the AdminClient API to check the health of the cluster by describing the cluster and/or the topic(s) you'll be interacting with, and verifying that those topics have the required number of in-sync replicas, for example.
Kafka has something like KafkaHealthIndicator out of the box
It doesn't. Spring's Kafka integration might.
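A rough sketch of that check, meant to sit inside doHealthCheck() of the indicator above; the topic name and the required ISR count of 2 are placeholders for your own values:
// Additional imports assumed: DescribeTopicsResult, TopicDescription (kafka-clients),
// Status (Spring Boot actuator), java.util.Collections, java.util.concurrent.TimeUnit.
DescribeTopicsResult topicsResult =
        kafkaAdminClient.describeTopics(Collections.singletonList("kafka-health-indicator"));
TopicDescription description =
        topicsResult.values().get("kafka-health-indicator").get(1, TimeUnit.SECONDS);
boolean enoughReplicas = description.partitions().stream()
        .allMatch(partition -> partition.isr().size() >= 2); // 2 = assumed min.insync.replicas
builder.status(enoughReplicas ? Status.UP : Status.DOWN);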

How to skip an Avro serialization exception in KafkaStreams API?

I have a Kafka application that is written with the Kafka Streams Java API. It reads data from a MySQL binlog and does some stuff that is irrelevant to my question. The problem is that one particular row produces an error in deserialization from Avro. I can dig into the Avro schema file and find the problem, but as a whole what I need is a forgiving exception handler that, upon encountering such an error, does not bring the whole application to a halt.
This is the main part of my stream app:
StreamsBuilder streamsBuilder = watchForCourierUpdate(builder);
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), properties);
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
}

private static StreamsBuilder watchForCourierUpdate(StreamsBuilder builder) {
    CourierUpdateListener courierUpdateListener = new CourierUpdateListener(builder);
    courierUpdateListener.start();
    return builder;
}

private static Properties configProperties() {
    Properties streamProperties = new Properties();
    streamProperties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, Configs.getConfig("schemaRegistryUrl"));
    streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, "courier_app");
    streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, Configs.getConfig("bootstrapServerUrl"));
    streamProperties.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
    streamProperties.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/state_dir");
    streamProperties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
    streamProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
    streamProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
    streamProperties.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
    streamProperties.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
            CourierSerializationException.class);
    return streamProperties;
}
This is my CourierSerializationException class:
public class CourierSerializationException implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> producerRecord, Exception e) {
        Logger.logError("Failed to de/serialize entity from " + producerRecord.topic() + " topic.\n" + e);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(Map<String, ?> map) {
    }
}
Still, whenever an Avro deserialization exception happens, the stream shuts down and the application does not continue. Am I missing something?
Have you tried to do this with the default.deserialization.exception.handler provided by Kafka? You can use LogAndContinueExceptionHandler, which will log and continue.
I may be wrong, but I think creating a custom exception handler by implementing ProductionExceptionHandler only works for network-related errors on the Kafka side.
Add this to the properties and see what happens:
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);
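If you want the same "log and continue" behaviour but with your own logging, a sketch of a handler on the deserialization side could look like this (it mirrors the ProductionExceptionHandler from the question but implements DeserializationExceptionHandler; the class name and the reuse of the question's Logger helper are assumptions):
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

public class CourierDeserializationExceptionHandler implements DeserializationExceptionHandler {

    @Override
    public DeserializationHandlerResponse handle(ProcessorContext context, ConsumerRecord<byte[], byte[]> record, Exception exception) {
        // Log the poison record and keep the stream running.
        Logger.logError("Failed to deserialize record from " + record.topic() + " topic.\n" + exception);
        return DeserializationHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}
and register it via:
streamProperties.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, CourierDeserializationExceptionHandler.class);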

How to get a Kafka Topic Lag in Java

I want to see the lag position of a Kafka topic in Java. Someone here says that the below code will work.
AdminClient client = AdminClient.createSimplePlaintext("localhost:9092");
Map<TopicPartition, Object> offsets = JavaConversions.asJavaMap(
        client.listGroupOffsets("groupID"));
Long offset = (Long) offsets.get(new TopicPartition("topic", 0));
But when I tried to import kafka.admin.AdminClient, that listGroupOffsets method is not there. Please help me with this.
You can use https://github.com/yahoo/kafka-manager and its HTTP REST APIs to get consumer group lag and other details.
The listGroupOffsets method was introduced to AdminClient.scala starting with 0.10.2. See KAFKA-3853 for details. So you should use Kafka 0.10.2.0 or later.
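If you can move to a newer client (roughly kafka-clients 2.5+), a sketch along these lines computes the lag with the Java org.apache.kafka.clients.admin.AdminClient instead of the old Scala one; the broker address and group id are placeholders:
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerGroupLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed, per partition
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("groupID")
                    .partitionsToOffsetAndMetadata().get();
            // Latest (end) offsets for the same partitions
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();
            // Lag = end offset - committed offset, per partition
            committed.forEach((tp, meta) ->
                    System.out.println(tp + " lag=" + (ends.get(tp).offset() - meta.offset())));
        }
    }
}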
I am using the Spring framework. Using the below code, you can get the metrics via Java. The code works.
@Component
public class Receiver {

    private static final Logger LOGGER = LoggerFactory.getLogger(Receiver.class);

    @Autowired
    private KafkaListenerEndpointRegistry kafkaListenerEndpointRegistry;

    public void testlag() {
        for (MessageListenerContainer messageListenerContainer : kafkaListenerEndpointRegistry
                .getListenerContainers()) {
            Map<String, Map<MetricName, ? extends Metric>> metrics = messageListenerContainer.metrics();
            metrics.forEach((clientid, metricMap) -> {
                System.out.println("------------------------For client id : " + clientid);
                metricMap.forEach((metricName, metricValue) -> {
                    //if(metricName.name().contains("lag"))
                    System.out.println("------------Metric name: " + metricName.name() + "-----------Metric value: " + metricValue.metricValue());
                });
            });
        }
    }
}

How to create Processor with Transaction and DLQ with Rabbit binding?

I'm just starting to learn Spring Cloud Streams and Dataflow, and I want to ask about one of the important use cases for me. I created an example processor, Multiplier, which takes a message and resends it 5 times to the output.
@EnableBinding(Processor.class)
public class MultiplierProcessor {

    @Autowired
    private Source source;

    private int repeats = 5;

    @Transactional
    @StreamListener(Processor.INPUT)
    public void handle(String payload) {
        for (int i = 0; i < repeats; i++) {
            if (i == 4) {
                throw new RuntimeException("EXCEPTION");
            }
            source.output().send(new GenericMessage<>(payload));
        }
    }
}
What you can see is that before the 5th send this processor crashes. Why? Because it can (programs throw exceptions). In this case I wanted to practice fault prevention in Spring Cloud Stream.
What I would like to achieve is to have the input message backed up in the DLQ, and the 4 messages that were sent before it rolled back and not consumed by the next operand (just like in a normal JMS transaction). I have already tried defining the following properties in my processor project, but without success.
spring.cloud.stream.bindings.output.producer.autoBindDlq=true
spring.cloud.stream.bindings.output.producer.republishToDlq=true
spring.cloud.stream.bindings.output.producer.transacted=true
spring.cloud.stream.bindings.input.consumer.autoBindDlq=true
Could you tell me if this is possible, and also what am I doing wrong? I would be overwhelmingly thankful for some examples.
You have several issues with your configuration:
missing .rabbit in the rabbit-specific properties
you need a group name and durable subscription to use autoBindDlq
autoBindDlq doesn't apply on the output side
The consumer has to be transacted so that the producer sends are performed in the same transaction.
I just tested this with 1.0.2.RELEASE:
spring.cloud.stream.bindings.output.destination=so8400out
spring.cloud.stream.rabbit.bindings.output.producer.transacted=true
spring.cloud.stream.bindings.input.destination=so8400in
spring.cloud.stream.bindings.input.group=so8400
spring.cloud.stream.rabbit.bindings.input.consumer.durableSubscription=true
spring.cloud.stream.rabbit.bindings.input.consumer.autoBindDlq=true
spring.cloud.stream.rabbit.bindings.input.consumer.transacted=true
and it worked as expected.
EDIT
Actually, no, the published messages were not rolled back. Investigating...
EDIT2
OK; it does work, but you can't use republishToDlq - because when that is enabled, the binder publishes the failed message to the DLQ and the transaction is committed.
When that is false, the exception is thrown to the container, the transaction is rolled back, and RabbitMQ moves the failed message to the DLQ.
Note, however, that retry is enabled by default (3 attempts) so, if your processor succeeds during retry, you will get duplicates in your output.
For this to work as you want, you need to disable retry by setting the max attempts to 1 (and don't use republishToDlq).
EDIT3
OK, if you want more control over the publishing of the errors, this will work, when the fix for this JIRA is applied to Spring AMQP...
@SpringBootApplication
@EnableBinding({ Processor.class, So39018400Application.Errors.class })
public class So39018400Application {

    public static void main(String[] args) {
        SpringApplication.run(So39018400Application.class, args);
    }

    @Bean
    public Foo foo() {
        return new Foo();
    }

    public interface Errors {

        @Output("errors")
        MessageChannel errorChannel();
    }

    private static class Foo {

        @Autowired
        Source source;

        @Autowired
        Errors errors;

        @StreamListener(Processor.INPUT)
        public void handle(Message<byte[]> in) {
            try {
                source.output().send(new GenericMessage<>("foo"));
                source.output().send(new GenericMessage<>("foo"));
                throw new RuntimeException("foo");
            }
            catch (RuntimeException e) {
                errors.errorChannel().send(MessageBuilder.fromMessage(in)
                        .setHeader("foo", "bar") // add whatever you want, stack trace etc.
                        .build());
                throw e;
            }
        }
    }
}
with properties:
spring.cloud.stream.bindings.output.destination=so8400out
spring.cloud.stream.bindings.errors.destination=so8400errors
spring.cloud.stream.rabbit.bindings.errors.producer.transacted=false
spring.cloud.stream.rabbit.bindings.output.producer.transacted=true
spring.cloud.stream.bindings.input.destination=so8400in
spring.cloud.stream.bindings.input.group=so8400
spring.cloud.stream.rabbit.bindings.input.consumer.transacted=true
spring.cloud.stream.rabbit.bindings.input.consumer.requeue-rejected=false
spring.cloud.stream.bindings.input.consumer.max-attempts=1
