The new Kafka version (0.11) supports exactly once semantics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I have a producer set up with Kafka transactional code in Java like this:
producer.initTransactions();
try {
producer.beginTransaction();
for (ProducerRecord<String, String> record : payload) {
producer.send(record);
}
Map<TopicPartition, OffsetAndMetadata> groupCommit = new HashMap<TopicPartition, OffsetAndMetadata>() {
{
put(new TopicPartition(TOPIC, 0), new OffsetAndMetadata(42L, null));
}
};
producer.sendOffsetsToTransaction(groupCommit, "groupId");
producer.commitTransaction();
} catch (ProducerFencedException e) {
producer.close();
} catch (KafkaException e) {
producer.abortTransaction();
}
I'm not quite sure how to use sendOffsetsToTransaction or what its intended use case is. AFAIK, consumer groups are a feature for parallel (multithreaded) reads on the consumer end.
The Javadoc says:
"Sends a list of consumed offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered consumed only if the transaction is committed successfully. This method should be used when you need to batch consumed and produced messages together, typically in a consume-transform-produce pattern."
How would the producer maintain a list of consumed offsets? What's the point of it?
This is only relevant to workflows in which you are consuming and then producing messages based on what you consumed. This function allows you to commit offsets you consumed only if the downstream producing succeeds. If you consume data, process it somehow, and then produce the result, this enables transactional guarantees across the consumption/production.
Without transactions, you normally use Consumer#commitSync() or Consumer#commitAsync() to commit consumer offsets. But if you use these methods before you've produced with your producer, you will have committed offsets before knowing whether the producer succeeded sending.
So, instead of committing your offsets with the consumer, you can call Producer#sendOffsetsToTransaction() on the producer. This adds the offsets to the producer's current transaction, and they are committed only if the entire transaction (consuming and producing) succeeds.
(Note: when you send the offsets to commit, you should add 1 to the offset last read, so that future reads resume from the offset you haven't read. This is true regardless of whether you commit with the consumer or the producer. See: KafkaProducer sendOffsetsToTransaction need offset+1 to successfully commit current offset).
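To make the offset + 1 rule concrete, here is a minimal sketch that uses only the JDK. The class name NextOffsets is hypothetical and partitions are modelled as plain strings such as "topic-0"; in a real application the keys would be TopicPartition and the values OffsetAndMetadata, as in the question's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns "last offset read" per partition into "offset to commit".
class NextOffsets {
    // The committed offset is the position of the NEXT record to read,
    // so it is always lastRead + 1.
    static Map<String, Long> toCommit(Map<String, Long> lastRead) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : lastRead.entrySet()) {
            result.put(e.getKey(), e.getValue() + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> lastRead = new HashMap<>();
        lastRead.put("topic-0", 41L); // last record processed had offset 41
        System.out.println(toCommit(lastRead));
    }
}
```

With this convention, a record read at offset 41 leads to a commit of 42, which is the first offset a restarted consumer should read.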
I have a Beam pipeline that consumes streaming events and processes them in multiple stages (PTransforms). See the following code:
pipeline.apply("Read Data from Stream", StreamReader.read())
.apply("Decode event and extract relevant fields", ParDo.of(new DecodeExtractFields()))
.apply("Deduplicate process", ParDo.of(new Deduplication()))
.apply("Conversion, Mapping and Persisting", ParDo.of(new DataTransformer()))
.apply("Build Kafka Message", ParDo.of(new PrepareMessage()))
.apply("Publish", ParDo.of(new PublishMessage()))
.apply("Commit offset", ParDo.of(new CommitOffset()));
The streaming events are read using KafkaIO, and the StreamReader.read() method is implemented like this:
public static KafkaIO.Read<String, String> read() {
return KafkaIO.<String, String>read()
.withBootstrapServers(Constants.BOOTSTRAP_SERVER)
.withTopics(Constants.KAFKA_TOPICS)
.withConsumerConfigUpdates(Constants.CONSUMER_PROPERTIES)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(StringDeserializer.class);
}
After a streamed event/message has been read through KafkaIO, its offset can be committed.
What I need to do is commit the offset manually, inside the last "Commit offset" PTransform, once all the previous PTransforms have executed.
The reason is that I am doing some conversions, mappings and persisting in the middle of the pipeline, and I only want to commit the offset once all of that has completed without failing.
That way, if the processing fails in the middle, I can consume the same record/event again and reprocess it.
My question is: how do I commit the offset manually?
I would appreciate any resources or sample code.
Well, there is the Read.commitOffsetsInFinalize() method, which commits offsets when checkpoints are finalized, and the AUTO_COMMIT consumer config option, which makes the Kafka consumer auto-commit the records it has read.
However, neither will work in your case. You need to do it manually: group the offsets by topic/partition/window and create a new Kafka client instance in your CommitOffset DoFn to commit them. You need to group the offsets by partition, otherwise there may be a race condition when different workers commit offsets of the same partition.
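The grouping step could look roughly like the sketch below. It is JDK-only and not Beam or Kafka API: OffsetGrouping and Processed are hypothetical names, partitions are plain string labels, and it assumes that every record up to the highest offset in the group has been processed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: collapse many (partition, offset) pairs into one commit per partition.
class OffsetGrouping {
    // A processed record, identified by a partition label and its offset.
    static final class Processed {
        final String partition;
        final long offset;

        Processed(String partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }
    }

    // Keeps the highest processed offset per partition and adds 1,
    // because the committed value is the next offset to read.
    static Map<String, Long> commitPoints(List<Processed> processed) {
        Map<String, Long> next = new TreeMap<>();
        for (Processed p : processed) {
            next.merge(p.partition, p.offset + 1, Long::max);
        }
        return next;
    }

    public static void main(String[] args) {
        List<Processed> done = Arrays.asList(
                new Processed("events-0", 7),
                new Processed("events-1", 3),
                new Processed("events-0", 9));
        System.out.println(commitPoints(done));
    }
}
```

Issuing a single commit per partition from the grouped result avoids two workers racing to commit different offsets for the same partition.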
I need exactly-once semantics, so I use the Kafka transactional API, and I'm trying to understand how to work with the Producer efficiently. As I read in some articles, it is more efficient to use only one Producer per application instance (over one TCP connection) because of its buffering mechanism. On the other hand, when I call producer.commitTransaction() for a single message, it flushes the message immediately without using the message buffer.
Do I need to implement a buffer manually and call producer.commitTransaction() for the buffered messages? Or is there another way to use buffering with transactions?
I know that Spring caches producers when transactions are enabled. But I don't use Spring, and I'm not sure how Spring's producer cache actually works. Maybe I should implement something similar and create a new Producer if another one is busy?
Example of my method:
public void produce(@NotNull T payload) {
var key = UUID.randomUUID();
var value = JsonUtils.toJson(payload);
try {
ProducerRecord<UUID, String> record = new ProducerRecord<>(topic, key, value);
producer.beginTransaction();
producer.send(record);
producer.commitTransaction();
} catch (ProducerFencedException e) {
log.error("Producer with the same transactional id already exists", e);
producer = KafkaProducerFactory.getInstance().recreateProducer();
} catch (KafkaException e) {
log.error("Failed to produce to kafka", e);
producer.abortTransaction();
}
log.info("Message with key {} produced to topic {}", key, topic);
}
Let's begin with the non-transactional Kafka producer. A set of configurable properties controls its buffering mechanism:
batch.size
linger.ms
buffer.memory
Basically, Kafka batches internally according to this configuration.
If linger.ms=0, the producer sends as soon as it can, even if the batch is not full; a non-zero value makes it wait up to that many milliseconds for the batch to fill.
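As a small illustration, these three settings can be collected in a java.util.Properties object before it is handed to the producer. The class name and the values are placeholders chosen for this sketch, not recommendations; only the three property keys come from the list above.

```java
import java.util.Properties;

// Hypothetical holder for the batching-related producer settings.
class ProducerBatchingConfig {
    static Properties batching() {
        Properties props = new Properties();
        props.put("batch.size", "32768");       // maximum bytes per partition batch (illustrative value)
        props.put("linger.ms", "20");           // wait up to 20 ms for a batch to fill (illustrative value)
        props.put("buffer.memory", "33554432"); // total bytes the producer may buffer (illustrative value)
        return props;
    }

    public static void main(String[] args) {
        batching().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The remaining required settings (bootstrap servers, serializers and so on) would be added to the same object.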
When it comes to the transactional producer, there are some differences:
commitTransaction() flushes the messages of the transaction immediately; it does not wait for the batch to fill up.
This is why the single message is sent immediately in the example above.
If there are multiple producer.send() calls within a transaction boundary, they are all part of a single transaction. This does not hold for a non-transactional producer, where records are grouped only by batching and the other configuration.
When commitTransaction() is called, it basically wakes up the sender thread to send the messages.
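If you want fewer, larger transactions, one option is to collect messages yourself and wrap each group in a single transaction. The sketch below only shows that control flow with the JDK: TxSink is a hypothetical stand-in for the begin/send/commit/abort subset of a transactional producer, and BatchingPublisher is likewise an invented name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the transactional subset of a producer API.
interface TxSink {
    void begin();
    void send(String message);
    void commit();
    void abort();
}

// Buffers messages and writes each full batch inside a single transaction.
class BatchingPublisher {
    private final TxSink sink;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchingPublisher(TxSink sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
    }

    // Adds a message and flushes once the buffer reaches batchSize.
    void publish(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends everything buffered as one transaction. On failure the transaction
    // is aborted and the buffer is kept, so the caller can retry the flush.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.begin();
        try {
            for (String message : buffer) {
                sink.send(message);
            }
            sink.commit();
            buffer.clear();
        } catch (RuntimeException e) {
            sink.abort();
            throw e;
        }
    }
}
```

A caller would also invoke flush() on a timer or at shutdown so that a partially filled buffer does not wait indefinitely.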
I'm working on a requirement where I need to consume messages from a Kafka broker. The message frequency is very high, which is why I've chosen the async mechanism.
Suppose that, while consuming messages, the connection to the broker breaks or the broker itself fails for some reason, so the offset cannot be committed back to the broker. After restarting, I then have to consume the same messages again, which were consumed earlier but not committed to the broker.
private static void startConsumer() {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("consumed: key = %s, value = %s, partition id= %s, offset = %s%n",
record.key(), record.value(), record.partition(), record.offset());
}
if (records.isEmpty()) {
System.out.println("-- terminating consumer --");
break;
}
printOffsets("before commitAsync() call", consumer, topicPartition);
consumer.commitAsync();
printOffsets("after commitAsync() call", consumer, topicPartition);
}
printOffsets("after consumer loop", consumer, topicPartition);
}
What can be done so that I don't need to consume the same messages again after a restart?
You need to manage your offsets on your own in an atomic way. That means you need to build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally (or in your case printing it to the logs).
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
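The idea of storing the result and the offset in one atomic step can be sketched as follows. This is an in-memory, JDK-only illustration and not the implementation from the blog: AtomicOffsetStore is a hypothetical name, and in practice both writes would go into the same external transaction (for example one database transaction).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an external store that keeps results and offsets together.
class AtomicOffsetStore {
    private final Map<Integer, Long> nextOffset = new HashMap<>();
    private final List<String> results = new ArrayList<>();

    // Stores the result and advances the offset in a single synchronized step.
    // Returns false for records that were already processed.
    synchronized boolean processOnce(int partition, long offset, String result) {
        long expected = nextOffset.getOrDefault(partition, 0L);
        if (offset < expected) {
            return false; // duplicate delivered after a restart or rebalance
        }
        results.add(result);
        nextOffset.put(partition, offset + 1);
        return true;
    }

    // Position to resume from after a restart or when a partition is assigned.
    synchronized long resumeFrom(int partition) {
        return nextOffset.getOrDefault(partition, 0L);
    }

    synchronized List<String> results() {
        return new ArrayList<>(results);
    }
}
```

On restart, or in a rebalance listener, the consumer would seek to resumeFrom(partition) instead of relying on the offsets committed to Kafka.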
I have a custom Kafka consumer that I use to send requests to a REST API.
Depending on the response from the API, I either commit the offset or skip the message without committing.
Minimal example:
while (true) {
ConsumerRecords<String, Object> records = consumer.poll(200);
for (ConsumerRecord<String, Object> record : records) {
// Sending a POST request and retrieving the answer
// ...
if (responseCode.startsWith("2")) {
try {
consumer.commitSync();
} catch(CommitFailedException ex) {
ex.printStackTrace();
}
} else {
// Do Nothing
}
}
}
Now when a response from the REST API does not start with a 2 the offset is not committed, but the message is not re-consumed. How can I force the consumer to re-consume messages with uncommitted offsets?
Make sure your processing is idempotent if you are planning to use seek(). Since you are selectively committing offsets, the records you skipped may lie before records that were committed (successfully processed). If you call seek(), which moves your consumer's position back to the uncommitted offset, and start the replay, you will also get those successfully processed messages again. It also has the potential to become an infinite loop.
Alternatively, you can save the metadata of unsuccessful records in memory or in a database and replay the topic from the beginning with "poll(retention.ms)" so that all records are replayed, but add a filter so that only records whose metadata matches what you saved earlier are sent through the API. Do this as a batch job once every hour or every few hours.
Committing offsets is just a way to store the current offset, also known as the position, of the consumer. So in case it stops, it (or the new consumer instance taking over) can find its previous position and restart consuming from there.
So even if you don't commit, the consumer's position is moved once you receive records. If you want to reconsume some records, you have to change the consumer's current position.
With the Java client, you can set the position using seek().
In your scenario, you probably want to calculate the new position relative to the current position. If so you can find the current position using position().
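One way to pick the new position is to rewind each partition to the lowest offset whose processing failed. The sketch below only does that bookkeeping with the JDK; SeekTargets is a hypothetical helper, partitions are plain integers, and the returned offsets are what you would then pass to seek() for the matching partition.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides where to rewind each partition after failed records.
class SeekTargets {
    // Returns, per partition, the lowest offset whose processing failed.
    // Seeking there replays that record and everything after it.
    static Map<Integer, Long> earliestFailed(Map<Integer, List<Long>> failedOffsets) {
        Map<Integer, Long> targets = new HashMap<>();
        for (Map.Entry<Integer, List<Long>> e : failedOffsets.entrySet()) {
            if (!e.getValue().isEmpty()) {
                targets.put(e.getKey(), Collections.min(e.getValue()));
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<Integer, List<Long>> failed = new HashMap<>();
        failed.put(0, Arrays.asList(12L, 5L, 9L));
        System.out.println(earliestFailed(failed));
    }
}
```

Because everything after the target offset is replayed as well, this only works safely when processing is idempotent.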
Below are alternative approaches you can take (instead of seek):
When the REST call fails, move the message to an ad hoc Kafka topic. You can write another program that reads the messages from this topic on a schedule.
When the REST call fails, write the request to a flat file. Use a shell script (or any other script) to read each request and resend it on a schedule.
The quote from https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html#callout_kafka_consumers__reading_data_from_kafka_CO2-1
The drawback is that while commitSync() will retry the commit until it
either succeeds or encounters a non-retriable failure, commitAsync()
will not retry.
This phrase is not clear to me. I suppose that the consumer sends a commit request to the broker, and if the broker doesn't respond within some timeout, the commit has failed. Am I wrong?
Can you clarify the difference between commitSync and commitAsync in detail?
Also, please describe the use cases in which each commit type is preferable.
As the API documentation says:
commitSync
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
That means commitSync is a blocking method. Calling it will block your thread until it either succeeds or fails.
For example,
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
consumer.commitSync();
}
}
For each iteration of the for-loop, your code moves on to the next iteration only after consumer.commitSync() returns successfully or throws an exception.
commitAsync
This is an asynchronous call and will not block. Any errors encountered are either passed to the callback (if provided) or discarded.
That means commitAsync is a non-blocking method. Calling it will not block your thread; your code continues with the following instructions, regardless of whether the commit eventually succeeds or fails.
For example, similar to the previous example, but using commitAsync:
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
consumer.commitAsync(callback);
}
}
For each iteration of the for-loop, your code moves on to the next iteration regardless of what eventually happens to consumer.commitAsync(). The result of the commit is handled by the callback you defined.
Trade-offs: latency vs. data consistency
If you have to ensure data consistency, choose commitSync(), because it makes sure that you know whether the offset commit succeeded or failed before you take any further action. But because it is synchronous and blocking, you spend more time waiting for the commit to finish, which leads to higher latency.
If you can tolerate some data inconsistency and want low latency, choose commitAsync(), because it does not wait for the commit to finish. Instead, it just sends the commit request and handles the response from Kafka (success or failure) later, while your code continues executing.
This is all generally speaking; the actual behaviour will depend on your code and on where you call the method.
Robust Retry handling with commitAsync()
In the book "Kafka - The Definitive Guide", there is a hint on how to mitigate the potential problem of committing lower offsets due to an asynchronous commit:
Retrying Async Commits: A simple pattern to get commit order right for asynchronous retries is to use a monotonically increasing sequence number. Increase the sequence number every time you commit and add the sequence number at the time of the commit to the commitAsync callback. When you’re getting ready to send a retry, check if the commit sequence number the callback got is equal to the instance variable; if it is, there was no newer commit and it is safe to retry. If the instance sequence number is higher, don’t retry because a newer commit was already sent.
The following code depicts a possible solution:
import java.time.Duration
// import only what is needed: a wildcard import of java.util would shadow Scala's List
import java.util.Properties
import java.util.concurrent.atomic.AtomicLong
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer, OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.{KafkaException, TopicPartition}
import scala.collection.JavaConverters._
object AsyncCommitWithCallback extends App {
// define topic
val topic = "myOutputTopic"
// set properties
val props = new Properties()
props.put(ConsumerConfig.GROUP_ID_CONFIG, "AsyncCommitter")
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// [set more properties...]
// create KafkaConsumer and subscribe
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List(topic).asJavaCollection)
// initialize global counter
val atomicLong = new AtomicLong(0)
// consume message
try {
while(true) {
val records = consumer.poll(Duration.ofMillis(1)).asScala
if(records.nonEmpty) {
for (data <- records) {
// do something with the records
}
consumer.commitAsync(new KeepOrderAsyncCommit)
}
}
} catch {
case ex: KafkaException => ex.printStackTrace()
} finally {
consumer.commitSync()
consumer.close()
}
class KeepOrderAsyncCommit extends OffsetCommitCallback {
// keeping position of this callback instance
val position = atomicLong.incrementAndGet()
override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
// retrying only if no other commit incremented the global counter
if(exception != null){
if(position == atomicLong.get) {
consumer.commitAsync(this)
}
}
}
}
}
Both commitSync and commitAsync use Kafka's offset management feature, and both have drawbacks.
If message processing succeeds but the offset commit fails (the two are not atomic) and a partition rebalance happens at the same time, your processed message gets processed again (duplicate processing) by some other consumer. If you are okay with duplicate message processing, you can go for commitAsync (it doesn't block, which gives low latency, and a later commit with a higher offset covers a failed earlier one, so you should be okay). Otherwise, go for custom offset management that makes processing and updating the offset atomic (use an external offset store).
commitAsync will not retry, because retrying would make a mess.
Imagine that you try to commit offset 20 (async) and the commit fails, and then the next poll loop tries to commit offset 40 (async) and succeeds.
Now the commit of offset 20 is still pending; if it is retried and succeeds, it will make a mess.
The mess is that the committed offset should be 40, not 20.
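The 20 versus 40 scenario, and the sequence-number guard that prevents it, can be modelled without a broker. The sketch below is JDK-only: CommitRetryGuard is a hypothetical class, and its committed field merely stands in for the offset stored on the broker.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a committed offset plus the client-side sequence guard.
class CommitRetryGuard {
    private final AtomicLong sequence = new AtomicLong();
    private long committed = -1;

    // Called whenever a commit is issued; the caller remembers the returned number.
    long nextSequence() {
        return sequence.incrementAndGet();
    }

    // A retry is safe only if no newer commit has been issued since this one.
    boolean mayRetry(long sequenceAtCommit) {
        return sequenceAtCommit == sequence.get();
    }

    // Stands in for a commit that reached the broker successfully.
    void commit(long offset) {
        committed = offset;
    }

    long committed() {
        return committed;
    }

    public static void main(String[] args) {
        CommitRetryGuard guard = new CommitRetryGuard();
        long seq20 = guard.nextSequence(); // commit of offset 20 is issued and fails
        guard.nextSequence();              // commit of offset 40 is issued ...
        guard.commit(40);                  // ... and succeeds
        if (guard.mayRetry(seq20)) {
            guard.commit(20);              // skipped: a newer commit exists
        }
        System.out.println("committed offset: " + guard.committed());
    }
}
```

Without the mayRetry check, the late retry would move the committed offset back from 40 to 20, and the records in between would be consumed again after a restart.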