I'm working on one requirement where I need to consume messages from Kafka broker. The frequency is very high, so that's why I've choosen Async mechanism.
I want to know, while consuming messages, lets say connection break down with broker or broker itself failed due to any reason and offset could not get commit back to broker. So after restarting, I've to consume same messages again which was consumed earlier but not commited back in broker.
private static void startConsumer() {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("consumed: key = %s, value = %s, partition id= %s, offset = %s%n",
record.key(), record.value(), record.partition(), record.offset());
}
if (records.isEmpty()) {
System.out.println("-- terminating consumer --");
break;
}
printOffsets("before commitAsync() call", consumer, topicPartition);
consumer.commitAsync();
printOffsets("after commitAsync() call", consumer, topicPartition);
}
printOffsets("after consumer loop", consumer, topicPartition);
}
may I know please, what can be done to overcome this situation where I dont need to consume same message again after restart ?
You need to manage your offsets on your own in a atomic way. That means, you need build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally (or in your case printing it to the logs).
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
Related
The quote from https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html#callout_kafka_consumers__reading_data_from_kafka_CO2-1
The drawback is that while commitSync() will retry the commit until it
either succeeds or encounters a non-retriable failure, commitAsync()
will not retry.
This phrase is not clear to me. I suppose that consumer sends commit request to broker and in case if the broker doesn't respond within some timeout it means that the commit failed. Am I wrong?
Can you clarify the difference of commitSync and commitAsync in details?
Also, please provide use cases when which commit type should I prefer.
As it is said in the API documentation:
commitSync
This is a synchronous commits and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
That means, the commitSync is a blocking method. Calling it will block your thread until it either succeeds or fails.
For example,
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
consumer.commitSync();
}
}
For each iteration in the for-loop, only after consumer.commitSync() successfully returns or interrupted with exception thrown, your code will move to the next iteration.
commitAsync
This is an asynchronous call and will not block. Any errors encountered are either passed to the callback (if provided) or discarded.
That means, the commitAsync is a non-blocking method. Calling it will not block your thread. Instead, it will continue processing the following instructions, no matter whether it will succeed or fail eventually.
For example, similar to previous example, but here we use commitAsync:
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
consumer.commitAsync(callback);
}
}
For each iteration in the for-loop, no matter what will happen to consumer.commitAsync() eventually, your code will move to the next iteration. And, the result of the commit is going to be handled by the callback function you defined.
Trade-offs: latency vs. data consistency
If you have to ensure the data consistency, choose commitSync() because it will make sure that, before doing any further actions, you will know whether the offset commit is successful or failed. But because it is sync and blocking, you will spend more time on waiting for the commit to be finished, which leads to high latency.
If you are ok of certain data inconsistency and want to have low latency, choose commitAsync() because it will not wait to be finished. Instead, it will just send out the commit request and handle the response from Kafka (success or failure) later, and meanwhile, your code will continue executing.
This is all generally speaking, the actually behaviour will depend on your actual code and where you are calling the method.
Robust Retry handling with commitAsync()
In the book "Kafka - The Definitive Guide", there is a hint on how to mitigate the potential problem of commiting lower offsets due to an asynchronous commit:
Retrying Async Commits: A simple pattern to get commit order right for asynchronous retries is to use a monotonically increasing sequence number. Increase the sequence number every time you commit and add the sequence number at the time of the commit to the commitAsync callback. When you’re getting ready to send a retry, check if the commit sequence number the callback got is equal to the instance variable; if it is, there was no newer commit and it is safe to retry. If the instance sequence number is higher, don’t retry because a newer commit was already sent.
The following code depicts a possible solution:
import java.util._
import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, KafkaConsumer, OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.{KafkaException, TopicPartition}
import collection.JavaConverters._
object AsyncCommitWithCallback extends App {
// define topic
val topic = "myOutputTopic"
// set properties
val props = new Properties()
props.put(ConsumerConfig.GROUP_ID_CONFIG, "AsyncCommitter")
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// [set more properties...]
// create KafkaConsumer and subscribe
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List(topic).asJavaCollection)
// initialize global counter
val atomicLong = new AtomicLong(0)
// consume message
try {
while(true) {
val records = consumer.poll(Duration.ofMillis(1)).asScala
if(records.nonEmpty) {
for (data <- records) {
// do something with the records
}
consumer.commitAsync(new KeepOrderAsyncCommit)
}
}
} catch {
case ex: KafkaException => ex.printStackTrace()
} finally {
consumer.commitSync()
consumer.close()
}
class KeepOrderAsyncCommit extends OffsetCommitCallback {
// keeping position of this callback instance
val position = atomicLong.incrementAndGet()
override def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
// retrying only if no other commit incremented the global counter
if(exception != null){
if(position == atomicLong.get) {
consumer.commitAsync(this)
}
}
}
}
}
Both commitSync and commitAsync uses kafka offset management feature and both has demerits.
If the message processing succeeds and commit offset failed(not atomic) and at same time partition re balancing happens, your processed message gets processed again(duplicate processing) by some other consumer. If you are okay with duplicate message processing, then you can go for commitAsync(because it doesn't block and provide low latency, and it provides a higher order commit. so you should be okay). Otherwise go for a custom offset management that takes care of atomicity while processing and updating the offset(use an external offset storage)
commitAync will not retry because if it retries it will make a mess.
Imagine that you are trying to commit offset 20 (async), and it did not commit (failed), and then the next poll block tries to commit the offset 40 (async), and it succeeded.
Now, commit offset 20 is still waiting to commit, if it reties and succeed it will make a mess.
The mess is that the committed offset should be 40 not 20.
I want to verify the partition size before producing the the record in Kafka.
I have a custom partitioned class which gives me exact partition number in which my message is supposed to drop.
Now my requirement is I want to check the size of partition before sending my record.
List<String> users = userService.findAllUsers();
for (String user : users) {
String msg = "Hello " + user;
//Check size here
producer.send(new ProducerRecord<String, String>(topic, user, msg), new Callback() {
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
e.printStackTrace();
}
}
});
Is there any way in kafka I can achieve this ?
Capacity is purely a Kafka-broker level aspect.
Basically, if there is disk space in broker's data directory, you can deliver the message. The message are cleaned up by broker based on time and partition size (that's in broker configuration), so if you configure your broker accordingly, you might always have space - the old messages would just get thrown away. It might not suit your business usecase though.
Also, responding to your comment Can we check size of Topic, you can actually check the current size of partition by using beginningOffsets & endOffsets methods in KafkaConsumer. Beware that these methods might block if the partitions do not exist (at least in 0.10.2). e.g. when you request data for partition 4 when topic actually contains 3 partitions.
Kafka 0.11 introduces administrative capabilities in client, however it is still work in progress.
The new Kafka version (0.11) supports exactly once semantics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I've got a producer setup with kafka transactional code in java like this.
producer.initTransactions();
try {
producer.beginTransaction();
for (ProducerRecord<String, String> record : payload) {
producer.send(record);
}
Map<TopicPartition, OffsetAndMetadata> groupCommit = new HashMap<TopicPartition, OffsetAndMetadata>() {
{
put(new TopicPartition(TOPIC, 0), new OffsetAndMetadata(42L, null));
}
};
producer.sendOffsetsToTransaction(groupCommit, "groupId");
producer.commitTransaction();
} catch (ProducerFencedException e) {
producer.close();
} catch (KafkaException e) {
producer.abortTransaction();
}
I'm not quite sure how to use the sendOffsetsToTransaction and the the intended use case of it. AFAIK, consumer groups is a multithreaded read feature on consumer end.
javadoc says
" Sends a list of consumed offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered consumed only if the transaction is committed successfully. This method should be used when you need to batch consumed and produced messages together, typically in a consume-transform-produce pattern."
How would produce maintain a list of consumed offsets? Whats the point of it?
This is only relevant to workflows in which you are consuming and then producing messages based on what you consumed. This function allows you to commit offsets you consumed only if the downstream producing succeeds. If you consume data, process it somehow, and then produce the result, this enables transactional guarantees across the consumption/production.
Without transactions, you normally use Consumer#commitSync() or Consumer#commitAsync() to commit consumer offsets. But if you use these methods before you've produced with your producer, you will have committed offsets before knowing whether the producer succeeded sending.
So, instead of committing your offsets with the consumer, you can use Producer#sendOffsetsToTransaction() on the producer to commit the offsets instead. This sends the offsets to the transaction manager handling the transaction. It will commit the offsets only if the entire transactions—consuming and producing—succeeds.
(Note: when you send the offsets to commit, you should add 1 to the offset last read, so that future reads resume from the offset you haven't read. This is true regardless of whether you commit with the consumer or the producer. See: KafkaProducer sendOffsetsToTransaction need offset+1 to successfully commit current offset).
I have a simple kafka setup. A producer is producing messages to a single partition with a single topic at a high rate. A single consumer is consuming messages from this partition. During this process, the consumer may pause processing messages several times. The pause can last a couple of minutes. After the producer stops producing messages, all messages queued up will be processed by the consumer. It appears that messages produced by the producer are not being seen immediately by the consumer. I am using kafka 0.10.1.0. What can be happening here? Here is the section of code that consumes the message:
while (true)
{
try
{
ConsumerRecords<String, byte[]> records = consumer.poll(100);
for (final ConsumerRecord<String, byte[]> record : records)
{
serviceThread.submit(() ->
{
externalConsumer.accept(record);
});
}
consumer.commitAsync();
} catch (org.apache.kafka.common.errors.WakeupException e)
{
}
}
where consumer is a KafkaConsumer with auto commit disabled, max poll record of 100, and session timeout of 30000. serviceThread is an ExecutorService.
The producer just involves the KafkaProducer.send call to send a ProducerRecord.
All configurations on the broker are left as kafka defaults.
I am also using kafka-consumer-groups.sh to check what is happening when consumer is not consuming the message. But when this happens, the kafka-consumer-groups.sh will be hanging there also, not able to get information back. Sometimes it triggers the consumer re-balance. But not always.
For those who can find this helpful. I've encountered this problem (when kafka silently supposedly stops consuming) often enough and every single time it wasn't actually problem with Kafka.
Usually it is some long-running or hanged silent process that keeps Kafka from committing the offset. For example a DB client trying to connect to the DB. If you wait for long enough (e.g. 15 minutes for SQLAlchemy and Postgres), you will see a exception will be printed to the STDOUT, saying something like connection timed out.
I have a list of topics (for now it's 10) whose size can increase in future. I know we can spawn multiple threads (per topic) to consume from each topic, but in my case if the number of topics increases, then the number of threads consuming from the topics increases, which I do not want, since the topics are not going to get data too frequently, so the threads will sit ideal.
Is there any way to have a single consumer to consume from all topics? If yes, then how can we achieve it? Also how will the offset be maintained by Kafka? Please suggest answers.
We can subscribe for multiple topic using following API :
consumer.subscribe(Arrays.asList(topic1,topic2), ConsumerRebalanceListener obj)
Consumer has the topic info and we can comit using consumer.commitAsync or consumer.commitSync() by creating OffsetAndMetadata object as follows.
ConsumerRecords<String, String> records = consumer.poll(long value);
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, String> record : partitionRecords) {
System.out.println(record.offset() + ": " + record.value());
}
long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
There is no need for multiple threads, you can have one consumer, consuming from multiple topics.
Offsets are maintained by zookeeper, as kafka-server itself is stateless.
Whenever a consumer consumes a message,its offset is commited with zookeeper to keep a future track to process each message only once. So even in case of kafka failure, consumer will start consuming from the next of last commited offset.