Reduce kafka offset by one message size

I have a Kafka cluster set up. Most of the time the consumer runs unattended: it reads messages and invokes an external API. If the external API is down (which happens rarely), I need to retry the message for a fixed time. If the retries fail, I need to stop the consumer and start it again after a fixed time.
The problem is that I need to handle the last message read from the topic. Is there a way to reset the ZooKeeper offset to the one before that message was read (that is, reduce the offset by one message), so that the next time I start the consumer I can read the message again?
This can be done using the low-level consumer, but is there a way to do it with the high-level consumer?
I am using the Java-based consumer client.

You need to control when offsets are committed to ZooKeeper.
Look at the following consumer configs:
http://kafka.apache.org/documentation.html#consumerconfigs
auto.commit.enable: If true, periodically commit to ZooKeeper the offset of messages already fetched by the consumer. This committed offset will be used when the process fails as the position from which the new consumer will begin.
auto.commit.interval.ms: The frequency in ms that the consumer offsets are committed to zookeeper.
If you want to commit the offset after each message, set:
auto.commit.enable=false
and commit offsets after each successful operation:
consumer.commitOffsets(true)
Committing after every message, or using a small commit interval, is not recommended because it increases the read/write load on ZooKeeper.
You can also look at the new offset management, where offsets are stored in the brokers instead of ZooKeeper:
http://kafka.apache.org/documentation.html
Kafka provides the option to store all the offsets for a given
consumer group in a designated broker (for that group) called the
offset manager. i.e., any consumer instance in that consumer group
should send its offset commits and fetches to that offset manager
(broker)
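To illustrate the "commit only after a successful operation" idea, here is a minimal in-memory sketch. It does not use any Kafka classes: CommitAfterSuccess, the list standing in for the topic, and the Predicate standing in for the external API are all hypothetical names made up for this example. The point is only that if the offset is not committed when the retries fail, a restarted consumer reads the same message again without having to "reduce the offset by one".

```java
import java.util.List;
import java.util.function.Predicate;

/** In-memory model only: no Kafka classes are used here. */
final class CommitAfterSuccess {
    /** Stand-in for the offset stored in ZooKeeper or the broker (next offset to read). */
    private long committedOffset = 0;

    long committedOffset() {
        return committedOffset;
    }

    /**
     * Processes messages starting at the committed offset. Each message gets up to
     * maxAttempts calls to the external API. Returns true if the whole log was
     * processed, false if the consumer had to stop on a message.
     */
    boolean run(List<String> log, Predicate<String> externalApi, int maxAttempts) {
        for (long offset = committedOffset; offset < log.size(); offset++) {
            String message = log.get((int) offset);
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                ok = externalApi.test(message);
            }
            if (!ok) {
                // No commit here: the stored offset still points at this message.
                return false;
            }
            committedOffset = offset + 1; // commit only after success
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = List.of("m0", "m1", "m2");
        CommitAfterSuccess consumer = new CommitAfterSuccess();
        // First run: the API is "down" for m1, so the consumer stops there.
        boolean finished = consumer.run(log, m -> !m.equals("m1"), 3);
        System.out.println(finished + " committed=" + consumer.committedOffset());
        // Later restart with the API back up: m1 is read again.
        finished = consumer.run(log, m -> true, 3);
        System.out.println(finished + " committed=" + consumer.committedOffset());
    }
}
```

With a real consumer the commit call would take the place of the assignment to committedOffset, and the trade-off mentioned above still applies: committing per message costs more than committing per batch.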

Related

Kafka - Offset Commit & Seek

I am working on fetching messages from topics starting at a specific offset, using seek(). But when I set enable.auto.commit to true or commit manually (commitSync()/commitAsync()), seek() does not seem to work: poll does not return messages from the specified offset but picks up from the last committed offset.
So when using seek(), is it mandatory to store the offsets in an external DB and not commit to Kafka? Can seek and commit not be used together?
Client version: kafka-clients 2.4.0
Thanks!!

When you commit (either auto or manual makes little difference) you are storing at the broker end a record of how far in a partition a consumer has reached. This committed offset is only ever used in the event of a rebalance, so that when a consumer is assigned that partition they can pick up from a point where all previous messages are known to have been processed. This provides a guarantee that as long as consumers are coded correctly messages will not be lost on consumption in the event of changes in group membership, when messages are being processed sequentially.
When the group membership is stable, the committed offset does nothing. Each consumer maintains its own in-memory offset, which it uses each time it fetches a batch of records from the broker. By default this offset increases sequentially. The seek method only changes this in-memory offset, so the next poll fetches from whatever offset you specified. If that offset doesn't exist, the auto.offset.reset policy applies, and an exception is thrown only when that policy is none.
If you store offsets externally, seek may be used after a rebalance to retrieve the externally stored offsets and fetch from there, but in that case you have to call seek in a ConsumerRebalanceListener. If you call seek before poll it has no effect, because the consumer only finds out about the rebalance and its new partition assignment during the poll method; without intervening during poll, it will consume from the last committed offset.
This slightly unintuitive situation also arises when you pause consumers, something I wrote about at https://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html?m=1
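The distinction between the committed offset and the in-memory position can be made concrete with a toy model. This is not the Kafka client: PositionModel and its methods are hypothetical stand-ins, and reassign() is only a rough approximation of what happens to a partition's fetch position after a rebalance.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one consumer on one partition; not the Kafka client. */
final class PositionModel {
    private final List<String> log;
    private long committed; // stored on the "broker" side
    private long position;  // in-memory fetch position

    PositionModel(List<String> log, long committed) {
        this.log = log;
        this.committed = committed;
        this.position = committed; // on assignment, fetching starts at the committed offset
    }

    void seek(long offset) { position = offset; } // touches only the in-memory position
    void commit() { committed = position; }       // records progress for later assignments
    void reassign() { position = committed; }     // roughly what a rebalance does
    long position() { return position; }
    long committed() { return committed; }

    /** Returns up to max records from the current position and advances it. */
    List<String> poll(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && position < log.size()) {
            batch.add(log.get((int) position++));
        }
        return batch;
    }

    public static void main(String[] args) {
        PositionModel consumer = new PositionModel(List.of("a", "b", "c", "d", "e"), 0);
        consumer.poll(4);
        consumer.commit();                    // committed offset is now 4
        consumer.seek(1);                     // committed offset is untouched
        System.out.println(consumer.poll(2)); // records from offset 1 onward
        consumer.reassign();                  // position jumps back to the committed offset
        System.out.println(consumer.position());
    }
}
```

In this model seek and commit coexist without trouble; the seek is only "lost" when a reassignment resets the position, which matches the behaviour described above.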

Kafka Streams does not increment offset by 1 when producing to topic

I have implemented a simple Kafka Dead letter record processor.
It works perfectly when using records produced from the Console producer.
However, I find that when our Kafka Streams applications produce records to the sink topics, there is no guarantee that the offsets are incremented by 1 for each record produced.
Dead Letter Processor Background:
I have a scenario where records may be received before all the data required to process them is published.
When records are not matched for processing by the streams app, they are moved to a Dead Letter topic instead of continuing to flow downstream. When new data is published, we dump the latest messages from the Dead Letter topic back into the stream application's source topic for reprocessing with the new data.
The Dead Letter processor:
At the start of a run, the application records the ending offsets of each partition.
The ending offsets mark the point at which to stop processing records for a given Dead Letter topic, to avoid an infinite loop if reprocessed records return to the Dead Letter topic.
The application resumes from the last offsets produced by the previous run via consumer groups.
The application uses transactions and KafkaProducer#sendOffsetsToTransaction to commit the last produced offsets.
To track when all records in my range have been processed for a topic's partition, my service compares its last produced offset from the producer to the consumer's saved map of ending offsets. When we reach the ending offset, the consumer pauses that partition via KafkaConsumer#pause, and when all partitions are paused (meaning they have reached the saved ending offset) the application exits.
The Kafka Consumer API States:
Offsets and Consumer Position
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5.
The Kafka Producer API also indicates that the next offset is always +1:
Sends a list of specified offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered committed only if the transaction is committed successfully. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
But you can clearly see in my debugger that the records consumed for a single partition are anything but incremented 1 at a time...
I thought maybe this was a Kafka configuration issue such as max.message.bytes but none really made sense.
Then I thought perhaps it is from joining but didn't see any way that would change the way the producer would function.
Not sure if it is relevant or not but all of our Kafka applications are using Avro and Schema Registry...
Should the offsets always increment by 1 regardless of method of producing or is it possible that using Kafka streams API does not offer the same guarantees as the normal Producer Consumer clients?
Is there just something entirely that I am missing?

It is not an official API contract that message offsets are increased by one, even if the JavaDocs indicate this (it seems that the JavaDocs should be updated).
If you don't use transactions, you get either at-least-once semantics or no guarantees (some call this at-most-once semantics). For at-least-once, records might be written twice and thus, offsets for two consecutive messages are not really increased by one as the duplicate write "consumes" two offsets.
If you use transactions, each commit (or abort) of a transaction writes a commit (or abort) marker into the topic -- those transactional markers also "consume" one offset (this is what you observe).
Thus, in general you should not rely on consecutive offsets. The only guarantee you get is that each offset is unique within a partition.
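The effect of transaction markers on offsets can be shown with a small model. This is only a sketch: OffsetGaps is a hypothetical class, a null entry stands in for a commit or abort marker, and real brokers and clients handle control records in their own way.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy partition log in which transaction markers occupy offsets; not Kafka code. */
final class OffsetGaps {
    // One slot per offset; null stands for a transaction marker.
    private final List<String> slots = new ArrayList<>();

    /** Appends a data record and returns the offset it was given. */
    long append(String value) {
        slots.add(value);
        return slots.size() - 1;
    }

    /** Writes a commit/abort marker, which also takes up one offset. */
    void writeMarker() {
        slots.add(null);
    }

    /** Offset that the next appended entry would get. */
    long endOffset() {
        return slots.size();
    }

    /** Offsets of the data records an application would actually see. */
    List<Long> visibleOffsets() {
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i < slots.size(); i++) {
            if (slots.get(i) != null) {
                offsets.add((long) i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        OffsetGaps partition = new OffsetGaps();
        partition.append("r1");
        partition.append("r2");
        partition.writeMarker(); // first transaction committed
        partition.append("r3");
        partition.append("r4");
        partition.writeMarker(); // second transaction committed
        System.out.println(partition.visibleOffsets() + " end=" + partition.endOffset());
    }
}
```

In this model the last visible record sits at offset 4 while the end offset is 6, so a stop condition of the form lastOffset + 1 == endOffset would never fire. Comparing the consumer's position against the saved end offset with >= is one way around that, assuming the position advances past markers.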

I know that knowing the offset of messages can be useful. However, Kafka only guarantees that the offset of a message X is greater than the offset of the previous message (X-1). BTW, an ideal solution should not be based on offset calculations.
Under the hood, the Kafka producer may try to resend messages. Also, if a broker goes down, re-balancing may occur. Exactly-once semantics may append an additional message. Therefore, the offset assigned to your message may not be the one you expected if any of the above events occur.
Kafka may add additional messages to the topic for internal purposes, and Kafka's consumer API may discard those internal messages. Therefore, you only see your own messages, and their offsets do not necessarily increment by 1.

Apache Spark Time based Kafka off set

I am using the Spark consumer (from spark-streaming-kafka_2.10, version 1.6.0).
My Spark launcher listens for messages on a Kafka queue with 5 partitions. Suppose I stop my Spark application; when it restarts, it reads from either the smallest or the largest offset, depending on what I configure. But the Spark application should read the messages that arrived after I stopped it. For example, if I stop the process at 3:00 PM and start the Spark launcher at 3:30 PM, I want to read all the messages that arrived between 3:00 PM and 3:30 PM.

I hope you are using the high-level consumer from the Kafka library. In that case the consumers periodically commit their offsets, and Kafka itself maintains the offset records, either in ZooKeeper or in a Kafka topic. So, when you restart the consumers in the group after some time, they will start from where they left off. The offset records function as markers for where a consumer should start consuming after a restart or rebalance. Offsets may be committed automatically or explicitly. In either case, message processing and offset commit don't happen atomically, so there is a chance that a few messages will be processed again when the consumers restart.
The smallest and largest offset values are only relevant when we start the consumers in the consumer group for the first time, as there are no offset records available to tell the consumers from which offsets (of the partitions) they should start consuming.
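The rule described above, that the smallest/largest setting only matters when there is no committed offset, can be written down as a small function. This is a sketch of the decision only, not Kafka or Spark code; StartOffset, Reset and resolve are hypothetical names.

```java
import java.util.OptionalLong;

/** Sketch of how a starting offset is chosen for one partition; not Kafka code. */
final class StartOffset {
    enum Reset { SMALLEST, LARGEST }

    /**
     * A committed offset, when present, always wins. The reset policy is consulted
     * only when the group has no committed offset for the partition.
     */
    static long resolve(OptionalLong committed, Reset reset, long earliest, long latest) {
        if (committed.isPresent()) {
            return committed.getAsLong();
        }
        return reset == Reset.SMALLEST ? earliest : latest;
    }

    public static void main(String[] args) {
        // Group stopped at offset 120; the partition now holds offsets 0..199.
        System.out.println(resolve(OptionalLong.of(120), Reset.LARGEST, 0, 200));
        // Brand-new group: the policy decides.
        System.out.println(resolve(OptionalLong.empty(), Reset.LARGEST, 0, 200));
    }
}
```

So in the 3:00 PM / 3:30 PM example, messages produced during the gap are read after the restart as long as the offsets reached at 3:00 PM were committed somewhere the restarted application looks them up.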

who is responsible for offset maintenance?

Here are the Kafka docs for public ConsumerRecords<K,V> poll(long timeout)
Fetch data for the topics or partitions specified using one of the
subscribe/assign APIs. It is an error to not have subscribed to any
topics or partitions before polling for data. On each poll, consumer
will try to use the last consumed offset as the starting offset and
fetch sequentially. The last consumed offset can be manually set
through seek(TopicPartition, long) or automatically set as the last
committed offset for the subscribed list of partitions
My question is: who (broker, consumer, or ZooKeeper) is responsible for maintaining the offset, and where is it stored (memory or disk)? If the consumer maintains it in memory, will the consumer start reading from the beginning after a restart, or does the consumer application need to persist it to disk?

As the "Offsets and Consumer Position" section in the docs you referenced mentions, the offsets are stored by Kafka (the broker):
Kafka maintains a numerical offset for each record in a partition
Specifically, it stores them in an "internal" consumer offsets topic called "__consumer_offsets".
The "old consumer" API (deprecated in the upcoming v0.11) allows you to choose to store offsets in Kafka or ZooKeeper.
Additionally, you are free to save offsets on the consumer side and always seek to those offsets at startup, if you so choose.
So, in summary, depending on your consumer api version and your preference, offsets can be stored on the broker or zookeeper and/or on the consumer side.
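For the "save offsets on the consumer side" option, a minimal sketch of a file-backed store might look like the following. FileOffsetStore, its method names, and the "topic-partition" key format are assumptions made for this example and are not part of any Kafka API. At startup the application would read the stored value and pass it to seek.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical consumer-side offset store backed by a properties file. */
final class FileOffsetStore {
    private final Path file;

    FileOffsetStore(Path file) {
        this.file = file;
    }

    /** Persists the next offset to read for the given partition key. */
    void save(String partition, long nextOffset) {
        Properties props = load();
        props.setProperty(partition, Long.toString(nextOffset));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "next offset to read, per partition");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored offset, or fallback if nothing was saved for this key. */
    long nextOffset(String partition, long fallback) {
        String value = load().getProperty(partition);
        return value == null ? fallback : Long.parseLong(value);
    }

    private Properties load() {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }
}
```

This sketch rewrites the whole file on every save and is not atomic, so a crash in the middle of a write could leave it corrupted; a real store would want something sturdier, such as a database or a write-then-rename scheme.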

How to handle offset commit failures with enable.auto.commit disabled in Spark Streaming with Kafka?

I use Spark 2.0.0 with Kafka 0.10.2.
I have an application that is processing messages from Kafka and is a long running job.
From time to time I see the following message in the logs. I understand that I can increase the timeout and so on, but what I want to know is: given that I do get this error, how can I recover from it?
ERROR ConsumerCoordinator: Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
This question is not about how to avoid this error but about how to handle it once it occurs.
Background: In normal situations I will not see commit errors, but if I do get one I should be able to recover from it. I am using an AT_LEAST_ONCE setup, so I am completely happy with reprocessing a few messages.
I am running Java and using a direct Kafka stream with manual commits.
Creating the stream:
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
Committing the offsets:
((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges);

My understanding of the situation is that you use the Kafka Direct Stream integration (using spark-streaming-kafka-0-10_2.11 module as described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)).
As said in the error message:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
Kafka manages what topic partition a consumer consumes so the Direct Stream will create a pool of consumers (inside a single consumer group).
As with any consumer group you should expect rebalancing which (quoting Chapter 4. "Kafka Consumers - Reading Data from Kafka" from Kafka: The Definitive Guide):
consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group it starts consuming messages from partitions which were previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes, it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Reassignment of partitions to consumers also happen when the topics the consumer group is consuming are modified, for example if an administrator adds new partitions.
There are quite a few cases in which rebalancing can occur, and it should be expected. That is what you are seeing.
You asked:
how can I recover from it? This is not on how I escape this error but how to handle it once it occurs?
My answer would be to use the other method of CanCommitOffsets:
def commitAsync(offsetRanges: Array[OffsetRange], callback: OffsetCommitCallback): Unit
that gives you access to Kafka's OffsetCommitCallback:
OffsetCommitCallback is a callback interface that the user can implement to trigger custom actions when a commit request completes. The callback may be executed in any thread calling poll().
I think onComplete tells you how the async commit finished, so that you can act accordingly.
Something I can't help you with much is how to revert the changes in a Spark Streaming application when some offsets could not be committed. I think that requires tracking offsets and accepting that some offsets can't be committed and that the corresponding records will be re-processed.
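A rough idea of what "tracking offsets and accepting that some commits fail" could look like, without pulling in Spark or Kafka: the sketch below uses a hypothetical CommitTracker whose onComplete method has the same general shape as a commit-completion callback (the offsets that were sent, plus an exception that is null on success). The String partition keys and Long offsets are stand-ins for the real types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker for commit results under at-least-once processing. */
final class CommitTracker {
    private final Map<String, Long> lastCommitted = new HashMap<>();
    private int failures = 0;

    /** Callback-shaped method: exception is null when the commit succeeded. */
    void onComplete(Map<String, Long> offsets, Exception exception) {
        if (exception != null) {
            // At-least-once: note the failure and move on. The next successful
            // commit covers these offsets; until then they may be re-processed.
            failures++;
            return;
        }
        offsets.forEach((partition, offset) -> lastCommitted.merge(partition, offset, Long::max));
    }

    /** Highest offset known to be committed for the partition, or fallback if none. */
    long lastCommitted(String partition, long fallback) {
        return lastCommitted.getOrDefault(partition, fallback);
    }

    int failures() {
        return failures;
    }

    public static void main(String[] args) {
        CommitTracker tracker = new CommitTracker();
        tracker.onComplete(Map.of("events-0", 100L), new RuntimeException("group rebalanced"));
        tracker.onComplete(Map.of("events-0", 150L), null);
        System.out.println(tracker.lastCommitted("events-0", -1L) + " failures=" + tracker.failures());
    }
}
```

Since the setup in the question is at-least-once, ignoring a failed commit and letting a later commit supersede it is usually enough; the tracker mainly gives you something to log or alert on.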
