KStream-KTable Inner Join Lost Messages with Exactly Once Config - java

When I do not set processing.guarantee, so the stream starts with its default value (at_least_once), this code logs successfully and sends the joined messages to the relevant topic.
When exactly_once is enabled on the same stream application, some of the data does not pass through the join successfully. Even though there are logs from the first peek block, some of the second peek logs, and some of the messages I need, never appear.
I'm sure that both the KStream and the KTable have the required non-null values, and both sides receive messages regularly.
Stream Configs:
processing.guarantee=exactly_once
replication.factor=3 (this increases the replication factor of internal topics)
Kafka (with 3 brokers) Details:
version=2.2.0
log.roll.ms=3600000
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
message.max.bytes=2000024
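For context, the stream-side settings above are applied roughly like this (a minimal sketch; the application id and bootstrap servers are placeholders, not from the original setup):
// Minimal sketch; application id and bootstrap servers are placeholders.
final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-profile-enricher");        // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3); // replication factor for internal topics

final KafkaStreams streams = new KafkaStreams(builder.build(), props);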
My question is: how can the exactly_once processing guarantee cause this kind of behavior?
final KStream<String, UserProfile> userProfileStream = builder.stream(TOPIC_USER_PROFILE);
final KTable<String, Device> deviceKTable = builder.table(TOPIC_DEVICE);

userProfileStream
    .peek((genericId, userProfile) ->
        log.debug("[{}] Processing user profile: {}", genericId, userProfile)
    )
    .join(
        deviceKTable,
        (userProfile, device) -> {
            userProfile.setDevice(device);
            return userProfile;
        },
        Joined.with(Serdes.String(), userProfileSerde, deviceSerde)
    )
    .peek((genericId, userProfile) ->
        log.debug("[{}] Updated user profile: {}", genericId, userProfile)
    )
    .to(TOPIC_UPDATED_USER_PROFILE, Produced.with(Serdes.String(), userProfileSerde));


How to re-consume messages in Kafka

I am very new to Kafka and I am working on a project to learn and understand it.
I am running Kafka on my laptop, so I have 1 consumer and 1 producer, and I'm using Java (Spring Boot) to listen to those streams and consume the messages.
Let's say I have 2 different groups created, called "automatic" and "manual".
For the "automatic" one, I do not want the messages to be acted on right away. I want to aggregate the messages for 1 minute, and when that minute passes, fire off a custom event.
For the "manual" one, I want to consume the message and fire off the event right away.
Both kinds of messages are sent by the producer to the same common topic, and a property in the message says whether it is a "manual" or "automatic" type.
Here is my Kafka topic declaration in my application.properties file.
spring.cloud.stream.kafka.bindings.automatic.consumer.configuration.client.id=automatic-consumption-event
spring.cloud.stream.bindings.automatic.destination=main.event
spring.cloud.stream.bindings.automatic.binder=test-stream-app
spring.cloud.stream.bindings.automatic.group=consumer-automatic-group
spring.cloud.stream.bindings.automatic.consumer.concurrency=1
spring.cloud.stream.kafka.bindings.manual.consumer.configuration.client.id=manual-consumption-event
spring.cloud.stream.bindings.manual.destination=main.event
spring.cloud.stream.bindings.manual.binder=test-app
spring.cloud.stream.bindings.manual.group=consumer-manual-group
spring.cloud.stream.bindings.manual.consumer.concurrency=1
I have created 2 separate methods that consume the messages and perform different actions, like this:
private SessionWindows windows;

@PostConstruct
private void init() {
    this.windows = SessionWindows.with(Duration.ofSeconds(5)).grace(Duration.ZERO);
}

public void automatic(Ktream<string, CustomObjectType> eventStream) {
    eventStream.filter((x, y) -> y != null && !y.isManual(), Named.as("automatic_event"))
        .groupByKey(Grouped.with("participant_id", Serdes.String(), Serdes.Long()))
        .windowedBy(windows)
        .reduce(Long::sum, Named.as("participant_id_sum"))
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .toStream(Named.as("participant_id_stream"))
        .foreach(this::fireCustomEvent);
}

@StreamListener("manual-event")
public void manual(@Payload String payload) {
    var parsedObject = this.parseJSON(payload);
    if (!parsedObject.isManual()) {
        return;
    }
    this.fireCustomEvent();
}

private CustomObjectType parseJSON(String json) {
    return JSONObject.parseObject(json.substring(json.indexOf("{")), CustomObjectType.class);
}

private void fireCustomEvent() {
    // Should do something.
}
I ran the producer with this command on my system.
bin/kafka-console-producer.sh --topic main.event --property "parse.key=true" --property "key.separator=:" --bootstrap-server localhost:62341
And I ran the consumer with this command:
bin/kafka-console-consumer.sh --topic main.event --from-beginning --bootstrap-server localhost:62341
These are the events I'm sending from the producer:
123: {"eventHeader": null, "data": {"headline": "You are winner", "id": "42", "isManual": true}}
987: {"eventHeader": null, "data": {"headline": "You will win", "id": "43", "isManual": false}}
Whenever an event is sent by the producer, I can see manual() triggering with the message. It does the expected thing of taking the message and firing the event right away. But it is consuming both types of messages, and the problem is that the "automatic" messages are no longer being aggregated, because they have already been taken by that consumer.
Every time I restart my Spring Boot application, the automatic() method triggers, but it does not find any messages to filter because, as per my understanding, they were consumed already.
Can someone help me figure out where I am going wrong?
I'm not sure I understand the question. Spring will start both functions "automatically". But you have a typo, Ktream, in the automatic() parameters.
consuming both the type of messages
Right... Because both exist in the same topic. Perhaps you want to use branch/split operator in Kafka Streams to make a separate topic of all manual events, which your "manual" method reads instead?
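For example, a rough sketch of the split/branch idea (Kafka Streams 2.8+ API; the target topic names below are made up, not from your setup):
// Rough sketch, not the poster's code: route manual and automatic events to
// separate topics so each consumer group only sees the kind it cares about.
eventStream.split()
    .branch((key, value) -> value != null && value.isManual(),
            Branched.withConsumer(manual -> manual.to("main.event.manual")))          // made-up topic
    .defaultBranch(Branched.withConsumer(auto -> auto.to("main.event.automatic")));   // made-up topic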
because they were consumed already
That doesn't matter. What matters is that offsets were committed. You can reconsume a topic as many times as you want, as long as the data is retained in the topic.
To force reconsumption, you can use
KafkaConsumer.seek
kafka-consumer-groups --reset-offsets after you stop the app
give the app a new application.id/group.id along with consumer config auto.offset.reset=earliest
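For instance, the offset-reset route might look roughly like this (group and topic names taken from the question; run it while the application is stopped):
# --execute actually applies the reset; omit it for a dry run.
bin/kafka-consumer-groups.sh --bootstrap-server localhost:62341 \
  --group consumer-automatic-group --topic main.event \
  --reset-offsets --to-earliest --execute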

Adjusting parallelism based on the number of partitions assigned in Consumer.committablePartitionedSource

I am trying to use Consumer.committablePartitionedSource() and create a stream per partition, as shown below:
public void setup() {
    control = Consumer.committablePartitionedSource(consumerSettings,
            Subscriptions.topics("chat").withPartitionAssignmentHandler(new PartitionAssignmentListener()))
        .mapAsyncUnordered(Integer.MAX_VALUE, pair -> setupSource(pair, committerSettings))
        .toMat(Sink.ignore(), Consumer::createDrainingControl)
        .run(Materializer.matFromSystem(actorSystem));
}

private CompletionStage<Done> setupSource(Pair<TopicPartition, Source<ConsumerMessage.CommittableMessage<String, String>, NotUsed>> pair, CommitterSettings committerSettings) {
    LOGGER.info("SETTING UP PARTITION-{} SOURCE", pair.first().partition());
    return pair.second().mapAsync(16, msg -> CompletableFuture.supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
            .thenApply(param -> msg.committableOffset()))
        .withAttributes(ActorAttributes.supervisionStrategy(ex -> Supervision.restart()))
        .runWith(Committer.sink(committerSettings), Materializer.matFromSystem(actorSystem));
}
While setting up the source per partition, I use a parallelism value that I want to change based on the number of partitions assigned to the node. I can do that for the first assignment of partitions to the node, but as new nodes join the cluster, assigned partitions are revoked and re-assigned. At that point the stream does not re-emit the already-assigned partitions (due to the Kafka cooperative rebalancing protocol), so I cannot reconfigure the parallelism.
I am also sharing the same dispatcher across all sources, and if I keep the same parallelism after rebalancing, I feel that fair message processing across partitions is no longer possible. Am I correct? Please correct me.
If I understand you correctly, you want to have a fixed parallelism across a dynamically changing number of Sources that come and go as Kafka rebalances topic partitions.
Have a look at the first example in the Alpakka Kafka documentation here. It can be adjusted to your example like this:
Consumer.DrainingControl<Done> control =
    Consumer.committablePartitionedSource(consumerSettings, Subscriptions.topics("chat"))
        .wireTap(p -> LOGGER.info("SETTING UP PARTITION-{} SOURCE", p.first().partition()))
        .flatMapMerge(Integer.MAX_VALUE, Pair::second)
        .mapAsync(
            16,
            msg -> CompletableFuture
                .supplyAsync(() -> consumeMessage(msg), actorSystem.dispatcher())
                .thenApply(param -> msg.committableOffset()))
        .withAttributes(
            ActorAttributes.supervisionStrategy(ex -> Supervision.restart()))
        .toMat(Committer.sink(committerSettings), Consumer::createDrainingControl)
        .run(Materializer.matFromSystem(actorSystem));
So basically Consumer.committablePartitionedSource() will emit a Source any time Kafka assigns a partition to this consumer, and will terminate that Source when a previously assigned partition is rebalanced and taken away from this consumer.
The flatMapMerge will take those Sources and merge the messages they output.
All those messages will compete in the mapAsync stage to get processed. The fairness of this competition really comes down to the flatMapMerge above, which should give an equal chance to all the Sources to emit their messages. Regardless of how many Sources are outputting messages, they will all share a fixed parallelism here, which I believe is what you're after.
All those messages eventually get to the Committer.sink that handles offset committing.

Handling commits for errors with @KafkaListener in batch consumers

We have a Kafka consumer set up like below:
@Bean
public ConsumerFactory<String, Object> consumerFactory() {
    final Map<String, Object> props = kafkaProperties.buildConsumerProperties();
    return new DefaultKafkaConsumerFactory<>(props);
}

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> batchFactory(
        final ConsumerFactory<String, Object> consumerFactory,
        @Value("${someProp.batch}") final boolean enableBatchListener,
        @Value("${someProp.concurrency}") final int consumerConcurrency,
        @Value("${someProp.error.backoff.ms}") final int errorBackoffInterval
) {
    final SeekToCurrentBatchErrorHandler errorHandler = new SeekToCurrentBatchErrorHandler();
    errorHandler.setBackOff(new FixedBackOff(errorBackoffInterval, UNLIMITED_ATTEMPTS));
    final var containerFactory = new ConcurrentKafkaListenerContainerFactory<String, Object>();
    containerFactory.setConsumerFactory(consumerFactory);
    containerFactory.getContainerProperties().setAckMode(MANUAL_IMMEDIATE);
    containerFactory.getContainerProperties().setMissingTopicsFatal(false);
    containerFactory.setBatchListener(enableBatchListener);
    containerFactory.setConcurrency(consumerConcurrency);
    containerFactory.setBatchErrorHandler(errorHandler);
    return containerFactory;
}
someProp:
  concurrency: 16
  batch: true
  error.backoff.ms: 2000
spring:
  kafka:
    bootstrap-servers: ${KAFKA_BOOTSTRAP_SERVERS}
    consumer:
      groupId: some-grp
      autoOffsetReset: earliest
      keyDeserializer: org.apache.kafka.common.serialization.StringDeserializer
      valueDeserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
      properties:
        schema.registry.url: ${SCHEMA_REGISTRY_URL}
        specific.avro.reader: true
        security.protocol: SSL
In the batch listener method annotated with @KafkaListener, we call acknowledgment.acknowledge() at the end of processing the list. Assuming that when the service comes up there are already a million messages in the topic ready to be consumed, I have the following questions about this scenario, as I could not find documentation that covers batch listening in detail:
The listener will receive 500 messages in the list, because max.poll.records is not set and hence defaults to 500. Is this understanding correct?
Given the above, where does consumer concurrency come into the picture? Does the stated configuration mean I will have 16 consumers, each of which can read 500 messages in parallel from the same topic?
I understand that in this case I must have at least 16 partitions to make use of all the consumers; otherwise I would be left with consumers that do nothing?
Due to SeekToCurrentBatchErrorHandler, the batch will be replayed if there is any exception during processing inside the listener method. So, if in a particular batch there is an exception processing the 50th message, the first 49 will be replayed (basically duplicates, which I am fine with), and messages 50 to 500 will be replayed and processed as usual. Is this understanding correct?
If multiple batches are being read continuously and a particular consumer thread gets stuck in the SeekToCurrentBatchErrorHandler, how is the offset commit handled? Other consumer threads would still be processing messages successfully, moving the offset pointer far beyond the stuck consumer's offsets.
The doc for MANUAL_IMMEDIATE states
/**
 * User takes responsibility for acks using an
 * {@link AcknowledgingMessageListener}. The consumer
 * immediately processes the commit.
 */
MANUAL_IMMEDIATE,
Does this mean that calling acknowledgment.acknowledge() is not sufficient and AcknowledgingMessageListener has to be used in some way? If yes, what is the preferred approach?
You will get "up to" 500; there is no guarantee you will get exactly 500.
Yes; 16 consumers (assuming you have at least 16 partitions).
Correct.
Correct; but version 2.5 now has the RecoveringBatchErrorHandler, whereby you can throw a special exception to tell it where in the batch the error occurred; it will commit the offsets of the successful records and seek the remaining ones (a rough sketch follows these answers).
The consumers get unique partitions so a consumer that is "stuck" has no impact on other consumers.
I am not sure what you are asking there; if you are calling ack.acknowledge() you are already using an AcknowledgingMessageListener (@KafkaListener always has that capability); we only populate the ack argument when a manual ack mode is used.
However, you really don't need manual acks for this use case; the container will commit the offsets automatically when the listener exits normally, so there is no need to complicate your code unnecessarily.
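As a rough sketch of the RecoveringBatchErrorHandler approach mentioned above (Spring for Apache Kafka 2.5+; the topic name and process() call are made up for illustration):
// Container factory side: swap SeekToCurrentBatchErrorHandler for the recovering one.
containerFactory.setBatchErrorHandler(
        new RecoveringBatchErrorHandler(new FixedBackOff(errorBackoffInterval, UNLIMITED_ATTEMPTS)));

// Listener side: report the index of the failing record so the handler can commit
// everything before it and seek/retry from the failure onward.
@KafkaListener(topics = "some-topic", containerFactory = "batchFactory") // topic name is illustrative
public void listen(List<ConsumerRecord<String, Object>> records) {
    for (int i = 0; i < records.size(); i++) {
        try {
            process(records.get(i)); // hypothetical business logic
        } catch (Exception e) {
            throw new BatchListenerFailedException("failed at index " + i, e, i);
        }
    }
}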

Suppress aggregation until custom condition

I am using the Kafka Streams DSL. How would I suppress the output of an aggregation (similar behavior to this) based on a custom condition?
Let's say that for every key I may have a START and a STOP event. I only want to emit the aggregate for a key when its STOP event arrives, or after a timeout.
The desired flow would be something roughly like this:
time   input-topic                  output-topic
1      key1:{type:start, time:0}    ...
3      key2:{type:start, time:2}    ...
4      key1:{type:stop, time:3}     ...
4+e    ...                          key1:{type:closed, duration:3}
61     ...                          ...
61+e   ...                          key2:{type:timeout, duration:60}
where the timeout is 60 units of time and e is an arbitrary time the stream takes to process the event.
The code (pseudocode for now) would be something like:
KStream<String, String> sourceStream = builder.stream("input-topic", Consumed.with(stringSerde, stringSerde));

KGroupedStream<String, String> groupedStream = sourceStream
    .groupByKey();

KTable<String, String> aggregatedStream = groupedStream
    .aggregate(
        () -> null,
        (aggKey, newValue, aggValue) -> aggregateStartStop(aggValue, newValue),
        Materialized
            .<String, String, KeyValueStore<Bytes, byte[]>>as("aggregated-stream-store")
            .withValueSerde(Serdes.String())
    )
    .suppress(Suppressed.untilWindowCloses(myCustomCondition()));

aggregatedStream.toStream();

KafkaStreams streams = new KafkaStreams(builder.build(), streamsSettings);
streams.start();
You could use a KTable to store the state (in your case, the type) along with a 60-second window. Whenever you receive an event for a particular key, you update the state and time. Then you can use a filter before the .to() call to either send or not send a message to the outgoing topic, based on the state (type).
Take a look at Neil Avery's blog post here:
https://www.confluent.io/blog/journey-to-event-driven-part-4-four-pillars-of-event-streaming-microservices
Scroll down to "Event Flow Breakdown 1. Payments inflight"; that's where I got the idea from.
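A very rough sketch of that filter-before-to() idea, reusing the aggregateStartStop() pseudocode from the question (the state checks and topic names are illustrative only):
// Illustrative only: keep a per-key aggregate of start/stop events, then only
// forward keys whose aggregated state marks them as closed or timed out.
KTable<String, String> aggregated = builder
    .stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .aggregate(
        () -> "",
        (key, event, agg) -> aggregateStartStop(agg, event),
        Materialized.with(Serdes.String(), Serdes.String()));

aggregated.toStream()
    .filter((key, state) -> state.contains("closed") || state.contains("timeout")) // custom condition
    .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));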

Does KStream process one message at a time?

I am using Kafka Streams and I have a question.
My code is:
final KStream<String, Entity> inStream = builder.stream(TOPIC);

inStream.map((key, entity) -> {
    ....
    return new KeyValue<>(key, entity);
}).to(NEW_TOPIC);
The value of NEW_TOPIC is present in the entity object. My problem is how to extract the value of NEW_TOPIC from the entity when multiple tasks are running.
It drills down to this: if there are multiple tasks, does Kafka Streams process an incoming message to the end (by calling the to() method to push it back to a new Kafka topic) before pulling the next message from the incoming topic? If that is the behavior, I can store this value in a local/final variable to use later. If it is not, I need to use some other way.
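As an aside, if the underlying goal is to route each record to a topic name carried inside the record itself, to() also has an overload that takes a TopicNameExtractor, so no variable needs to be captured at all. A minimal sketch (getTargetTopic() is a hypothetical accessor on Entity, not from the original code):
// Sketch only: choose the destination topic per record instead of storing it
// in a local variable; default serdes apply unless a Produced is supplied.
inStream
    .map((key, entity) -> new KeyValue<>(key, entity))
    .to((key, entity, recordContext) -> entity.getTargetTopic()); // hypothetical accessor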
