Kafka last offset increases on application restart - java

I have a Java Akka application that reads from Kafka, processes the messages and commits them manually.
I'm using the high-level consumer of the 0.10.1.1 API.
The strange thing is that when I shut the application down and start it again, the offset is a little higher than the last commit, and I cannot find out why.
I have only one commit point in the code.
else if (message.getClass() == ProcessedBatches.class) {
    try {
        Logger.getRootLogger().info("[" + this.name + "/Reader] Commiting ...");
        ProcessedBatches msg = (ProcessedBatches) message;
        consumer.commitSync(msg.getCommitInfo());
        lastCommitData = msg.getCommitInfo();
        lastCommit = System.currentTimeMillis();
    } catch (CommitFailedException e) {
        Logger.getRootLogger().info("[" + this.name + "/Reader] Failed to commit... Last commit: " + lastCommit + " | Last batch: " + lastBatch + ". Current uncommited messages: " + uncommitedMessages);
        self().tell(HarakiriMessage.getInstance(), self());
    }
}
After the commit I save the offsets HashMap in lastCommitData in order to debug it.
Then I added a shutdown hook to print the lastCommitData variable and check the last offset committed for each partition.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    String output =
        "############## SHUTTING DOWN CONSUMER ############### \n" +
        lastCommitData + "\n";
    System.out.println(output);
}));
I also have a consumer rebalance listener to check the start position of each partition when the consumer starts.
new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> collection) {}

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> collection) {
        for (TopicPartition p : collection) {
            System.out.println("Starting position " + p.toString() + ":" + consumer.position(p));
        }
        coordinator.setRebalanceTimestamp(System.currentTimeMillis());
    }
});
Example for one partition:
Offset before shutdown: 3107169023
Offset when partition assigned: 3107180350
As you can see, there is a gap of almost 10K messages between them.
The consumer properties are the following:
Properties props = new Properties();
props.put("bootstrap.servers", bootstrapServers);
props.put("group.id", group_id);
props.put("enable.auto.commit", "false");
props.put("auto.commit.interval.ms", "100000000");
props.put("session.timeout.ms", "10000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("max.poll.records", "40000");
props.put("auto.offset.reset", "latest");
I'm not sure what I'm doing wrong.

Am I correct in thinking you base your assumed "Offset before shutdown: 3107169023" on what your shutdown hook prints?
If so, I see two potential issues.
When you register your shutdown hook you are closing over the lastCommitData field.
Since you are accessing it from another thread, the shutdown hook thread, is the field declared volatile? Otherwise you may be printing a stale value.
Also, the Javadoc of java.lang.Runtime.addShutdownHook says:
When the virtual machine begins its shutdown sequence it will start all registered shutdown hooks in some unspecified order and let them run concurrently
so there is no guarantee that your consumer won't manage to commit further offsets after your shutdown hook has already printed the lastCommitData value.
I suggest you inspect Kafka to check what the actual committed offsets are after your app shuts down, just to be sure.
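For example, here is a minimal sketch of how that could be done with the AdminClient, assuming a kafka-clients version that provides AdminClient.listConsumerGroupOffsets (2.0+); on 0.10.x the kafka-consumer-groups.sh tool with --describe gives the same information. The bootstrap servers and group id are placeholders:
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ShowCommittedOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // the offsets Kafka itself has recorded for the group, per partition
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group") // placeholder group id
                         .partitionsToOffsetAndMetadata()
                         .get();
            committed.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
        }
    }
}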

Check the retention policy of your topic.
It may be the case that, by the time you start your consumer back up, the last committed offset has already been purged from the partition, so the consumer moves forward to the latest offset for that partition (you have auto.offset.reset set to latest).
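If you want to verify the retention from code rather than from the broker configuration, here is a rough sketch with the AdminClient (again assuming a newer kafka-clients version; the topic name and broker address are placeholders, and the imports come from org.apache.kafka.clients.admin and org.apache.kafka.common.config):
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
    Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
    System.out.println("retention.ms    = " + config.get("retention.ms").value());
    System.out.println("retention.bytes = " + config.get("retention.bytes").value());
}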

When you poll Kafka using the Consumer API, it reads from the last consumed offset in the partition. There must be other consumers in the system that took over the partitions previously consumed by the instance you just stopped, so the latest offset changed. Since you know which offset you were at before exiting, you need to save it to some durable store; use ConsumerRebalanceListener#onPartitionsRevoked for this. Read that offset when you restart the consumer process and point your consumer to start from there by calling seek(partition, offset) in ConsumerRebalanceListener#onPartitionsAssigned, as sketched below.
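A minimal sketch of that idea follows; offsetStore stands for whatever durable store you choose (file, database, ...) and is not part of the Kafka API, while consumer and topic are assumed to exist in the surrounding code:
consumer.subscribe(Collections.singletonList(topic), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // save the current position of each partition before it is taken away
        for (TopicPartition p : partitions) {
            offsetStore.save(p, consumer.position(p));
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // resume from the saved position instead of the broker-side committed offset
        for (TopicPartition p : partitions) {
            Long saved = offsetStore.load(p);
            if (saved != null) {
                consumer.seek(p, saved);
            }
        }
    }
});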

Related

Kafka - committing offset before consumer is shut down on app stop + committing offset from the past

I've got a spring-kafka consumer set up. It consumes Avro data from the topic, maps the values and writes CSV files. I manually commit the offset once the file is either 25000 records long or 5 minutes have passed, whichever comes first.
A problem occurs when we restart the app because of patching/releases.
I have a method like this:
@PreDestroy
public void destroy() {
    LOGGER.info("shutting down");
    writeCsv(true);
    acknowledgment.acknowledge(); // this normally commits the current offset
    LOGGER.info("package commited: " + acknowledgment.toString());
    LOGGER.info("shutting down completed");
}
So I've added some loggers there and this is how the log looks:
08:05:47 INFO KafkaMessageListenerContainer$ListenerConsumer - myManualConsumer: Consumer stopped
08:05:47 INFO CsvWriter - shutting down
08:05:47 INFO CsvWriter - created file: FEEDBACK1630476236079.csv
08:05:47 INFO CsvWriter - package commited: Acknowledgment for ConsumerRecord(topic = feedback-topic, partition = 1, leaderEpoch = 17, offset = 544, CreateTime = 1630415419703, serialized key size = -1, serialized value size = 156)
08:05:47 INFO CsvWriter - shutting down completed
The offset is never committed since the consumer stops working before the acknowledge() method is called. There are no errors in the log, and we are getting duplicates after the app is started again.
Is there a way to call a method before the consumer is shut down?
Also, one more question:
I want to set up a filter on the consumer like this:
if (event.getValue().equals("GOOD")) {
    addCsvRecord(event);
} else {
    acknowledgment.acknowledge(); // to let it read the next event
}
Let's say I get offset 100 and a GOOD event comes in: I add it to the CSV file, the file waits for more records, and the offset is not committed yet.
A BAD event comes up next; it is filtered out and offset 101 is committed immediately.
Then the file reaches its timeout, is about to close, and calls
acknowledgment.acknowledge()
on offset 100.
What can possibly happen there? Can the previous offset be committed?
@PreDestroy is too late in the context lifecycle; the containers have already been stopped by then.
Implement SmartLifecycle and do the acknowledgment in stop().
For your second question, just don't commit the bad offset; you will still get the next record(s).
Kafka maintains two pointers, position and committed. They are related, but independent, for a running application.
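A minimal sketch of the SmartLifecycle approach mentioned above, assuming Spring 5+ (where the remaining SmartLifecycle methods have defaults); the acknowledgment field and the commented-out writeCsv call are placeholders for the poster's own components, not spring-kafka requirements:
import org.springframework.context.SmartLifecycle;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class CsvShutdownFlusher implements SmartLifecycle {

    private volatile boolean running = false;
    private volatile Acknowledgment lastAcknowledgment; // placeholder: updated by the listener after each record/batch

    @Override
    public void start() {
        running = true;
    }

    @Override
    public void stop() {
        // runs during context shutdown, earlier than @PreDestroy
        // writeCsv(true); // the existing flush logic would go here
        if (lastAcknowledgment != null) {
            lastAcknowledgment.acknowledge();
        }
        running = false;
    }

    @Override
    public boolean isRunning() {
        return running;
    }
}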

How to manage Async in failure?

I'm working on a requirement where I need to consume messages from a Kafka broker. The frequency is very high, which is why I've chosen the async commit mechanism.
I want to know what happens if, while consuming messages, the connection to the broker breaks down, or the broker itself fails for some reason, and the offset cannot be committed back to the broker. After restarting, I would have to consume again the same messages which were consumed earlier but not committed back to the broker.
private static void startConsumer() {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("consumed: key = %s, value = %s, partition id= %s, offset = %s%n",
                    record.key(), record.value(), record.partition(), record.offset());
        }
        if (records.isEmpty()) {
            System.out.println("-- terminating consumer --");
            break;
        }
        printOffsets("before commitAsync() call", consumer, topicPartition);
        consumer.commitAsync();
        printOffsets("after commitAsync() call", consumer, topicPartition);
    }
    printOffsets("after consumer loop", consumer, topicPartition);
}
May I know, please, what can be done to overcome this situation, so that I don't need to consume the same messages again after a restart?
You need to manage your offsets on your own in an atomic way. That means you need to build your own "transaction" around
fetching data from Kafka,
processing the data, and
storing the processed offsets externally (or, in your case, printing them to the logs).
The methods commitSync and commitAsync will not get you far here, as they can only ensure at-most-once or at-least-once processing within the consumer. In addition, it is beneficial if your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
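In the same spirit, here is a rough sketch of the idea; process() and offsetStore are placeholders for your own processing step and external storage, and the key point is that storing the result and storing the offset must happen in one atomic step (for example, in the same database transaction):
// resume from the externally stored offset instead of Kafka's committed offset
consumer.assign(Collections.singletonList(topicPartition));
consumer.seek(topicPartition, offsetStore.read(topicPartition)); // next offset to consume

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
    for (ConsumerRecord<String, String> record : records) {
        // process(record) and the offset write below must succeed or fail together
        process(record);
        offsetStore.write(topicPartition, record.offset() + 1); // store the next offset to read
    }
}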

What does KafkaConsumer.commitSync() actually commit?

Does KafkaConsumer.commitSync just commit the "offsets returned on the last poll()", as the JavaDoc claims (which may miss some partitions not included in the last poll result), or does it actually commit the latest positions for all subscribed partitions? I'm asking because the code suggests the latter, considering allConsumed:
https://github.com/apache/kafka/blob/2.4.0/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L1387
@Override
public void commitSync(Duration timeout) {
    acquireAndEnsureOpen();
    try {
        maybeThrowInvalidGroupIdException();
        if (!coordinator.commitOffsetsSync(subscriptions.allConsumed(), time.timer(timeout))) {
            throw new TimeoutException("Timeout of " + timeout.toMillis() + "ms expired before successfully " +
                    "committing the current consumed offsets");
        }
    } finally {
        release();
    }
}
It only commits the offsets that were actually polled and processed. If some offsets were not included in the last poll, then those offsets will not be committed.
It will not commit the latest positions for all subscribed partitions. This would interfere with the Consumer Offset management concept of Kafka to be able to re-start an application where it left off.
From my understanding, allConsumed is equivalent to all offsets included in the last poll, which the Javadoc of commitSync also documents:
Commit offsets returned on the last poll() for all the subscribed list of topics and partitions.
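For completeness: if you want explicit control over what gets committed per partition, commitSync (and commitAsync) also accept a map of offsets. The topic, partition and lastProcessedOffset below are illustrative; note that the committed value should be the offset of the next record you want to read:
Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
toCommit.put(new TopicPartition("my-topic", 0), new OffsetAndMetadata(lastProcessedOffset + 1));
consumer.commitSync(toCommit);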

Kafka-consumer. commitSync vs commitAsync

The quote from https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html#callout_kafka_consumers__reading_data_from_kafka_CO2-1
The drawback is that while commitSync() will retry the commit until it
either succeeds or encounters a non-retriable failure, commitAsync()
will not retry.
This phrase is not clear to me. I suppose the consumer sends a commit request to the broker and, if the broker doesn't respond within some timeout, the commit has failed. Am I wrong?
Can you clarify the difference of commitSync and commitAsync in details?
Also, please provide use cases when which commit type should I prefer.
As it is said in the API documentation:
commitSync
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
That means commitSync is a blocking method. Calling it will block your thread until it either succeeds or fails.
For example,
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
        consumer.commitSync();
    }
}
For each iteration of the for-loop, your code only moves on to the next iteration after consumer.commitSync() successfully returns or is interrupted by a thrown exception.
commitAsync
This is an asynchronous call and will not block. Any errors encountered are either passed to the callback (if provided) or discarded.
That means commitAsync is a non-blocking method. Calling it will not block your thread. Instead, the thread continues with the following instructions, no matter whether the commit eventually succeeds or fails.
For example, similar to previous example, but here we use commitAsync:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
        consumer.commitAsync(callback);
    }
}
For each iteration of the for-loop, no matter what eventually happens to consumer.commitAsync(), your code moves on to the next iteration, and the result of the commit is handled by the callback function you defined.
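The callback used above could be as simple as the following fragment, which only logs failures (purely illustrative):
OffsetCommitCallback callback = (offsets, exception) -> {
    if (exception != null) {
        // the commit failed; offsets contains what we tried to commit
        System.err.println("Commit failed for " + offsets + ": " + exception.getMessage());
    }
};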
Trade-offs: latency vs. data consistency
If you have to ensure data consistency, choose commitSync(), because it makes sure that, before doing any further actions, you know whether the offset commit succeeded or failed. But because it is synchronous and blocking, you will spend more time waiting for the commit to finish, which leads to higher latency.
If you are OK with a certain amount of data inconsistency and want lower latency, choose commitAsync(), because it does not wait for the commit to finish. Instead, it just sends out the commit request and handles the response from Kafka (success or failure) later, while your code continues executing.
This is all generally speaking; the actual behaviour will depend on your actual code and where you are calling the method.
Robust Retry handling with commitAsync()
In the book "Kafka - The Definitive Guide", there is a hint on how to mitigate the potential problem of commiting lower offsets due to an asynchronous commit:
Retrying Async Commits: A simple pattern to get commit order right for asynchronous retries is to use a monotonically increasing sequence number. Increase the sequence number every time you commit and add the sequence number at the time of the commit to the commitAsync callback. When you’re getting ready to send a retry, check if the commit sequence number the callback got is equal to the instance variable; if it is, there was no newer commit and it is safe to retry. If the instance sequence number is higher, don’t retry because a newer commit was already sent.
The following code depicts a possible solution:
import java.util.Properties
import java.util.concurrent.atomic.AtomicLong
import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, KafkaConsumer, OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.{KafkaException, TopicPartition}
import scala.collection.JavaConverters._

object AsyncCommitWithCallback extends App {

  // define topic
  val topic = "myOutputTopic"

  // set properties
  val props = new Properties()
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "AsyncCommitter")
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // [set more properties...]

  // create KafkaConsumer and subscribe
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List(topic).asJavaCollection)

  // initialize global counter
  val atomicLong = new AtomicLong(0)

  // consume messages
  try {
    while (true) {
      val records = consumer.poll(Duration.ofMillis(1)).asScala
      if (records.nonEmpty) {
        for (data <- records) {
          // do something with the records
        }
        consumer.commitAsync(new KeepOrderAsyncCommit)
      }
    }
  } catch {
    case ex: KafkaException => ex.printStackTrace()
  } finally {
    consumer.commitSync()
    consumer.close()
  }

  class KeepOrderAsyncCommit extends OffsetCommitCallback {
    // keep the position of this callback instance
    val position = atomicLong.incrementAndGet()

    override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
      // retry only if no other commit incremented the global counter in the meantime
      if (exception != null) {
        if (position == atomicLong.get) {
          consumer.commitAsync(this)
        }
      }
    }
  }
}
Both commitSync and commitAsync use Kafka's offset management feature, and both have demerits.
If message processing succeeds but the offset commit fails (the two are not atomic), and at the same time a partition rebalance happens, your processed messages get processed again (duplicate processing) by some other consumer. If you are okay with duplicate message processing, then you can go for commitAsync (it doesn't block, provides low latency, and provides a higher-order commit, so you should be okay). Otherwise, go for custom offset management that takes care of atomicity while processing and updating the offset (use an external offset storage).
commitAsync will not retry, because if it retried it could make a mess.
Imagine that you are trying to commit offset 20 (async) and it does not commit (it fails), and then the next poll block tries to commit offset 40 (async) and it succeeds.
Now the commit of offset 20 is still pending; if it were retried and succeeded, it would make a mess.
The mess is that the committed offset should be 40, not 20.

Meaning of sendOffsetsToTransaction in Kafka 0.11

The new Kafka version (0.11) supports exactly once semantics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I've got a producer set up with Kafka transactional code in Java, like this.
producer.initTransactions();
try {
    producer.beginTransaction();
    for (ProducerRecord<String, String> record : payload) {
        producer.send(record);
    }

    Map<TopicPartition, OffsetAndMetadata> groupCommit = new HashMap<TopicPartition, OffsetAndMetadata>() {
        {
            put(new TopicPartition(TOPIC, 0), new OffsetAndMetadata(42L, null));
        }
    };

    producer.sendOffsetsToTransaction(groupCommit, "groupId");
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction();
}
I'm not quite sure how to use sendOffsetsToTransaction and what its intended use case is. AFAIK, consumer groups are a multithreaded read feature on the consumer end.
javadoc says
" Sends a list of consumed offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered consumed only if the transaction is committed successfully. This method should be used when you need to batch consumed and produced messages together, typically in a consume-transform-produce pattern."
How would a producer maintain a list of consumed offsets? What's the point of it?
This is only relevant to workflows in which you are consuming and then producing messages based on what you consumed. This function allows you to commit offsets you consumed only if the downstream producing succeeds. If you consume data, process it somehow, and then produce the result, this enables transactional guarantees across the consumption/production.
Without transactions, you normally use Consumer#commitSync() or Consumer#commitAsync() to commit consumer offsets. But if you use these methods before you've produced with your producer, you will have committed offsets before knowing whether the producer succeeded sending.
So, instead of committing your offsets with the consumer, you can use Producer#sendOffsetsToTransaction() on the producer to commit the offsets instead. This sends the offsets to the transaction manager handling the transaction. It will commit the offsets only if the entire transaction (consuming and producing) succeeds.
(Note: when you send the offsets to commit, you should add 1 to the offset last read, so that future reads resume from the offset you haven't read. This is true regardless of whether you commit with the consumer or the producer. See: KafkaProducer sendOffsetsToTransaction need offset+1 to successfully commit current offset).
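Putting that together, a rough consume-transform-produce sketch could look like the following. Topic names, the group id and transform() are placeholders, the consumer is assumed to have enable.auto.commit=false, and newer clients take a Duration in poll():
consumer.subscribe(Collections.singletonList("input-topic"));
producer.initTransactions();

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) {
        continue;
    }
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output-topic", record.key(), transform(record.value())));
            // +1: commit the position of the next record to be read
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, "my-consumer-group");
        producer.commitTransaction();
    } catch (ProducerFencedException e) {
        producer.close();
        break;
    } catch (KafkaException e) {
        // in a real application you would also rewind the consumer to the last committed offsets here
        producer.abortTransaction();
    }
}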
