I've looked through a lot of different articles about Apache Kafka transactions, recovery, and the new exactly-once features. I still don't understand one issue with consumer recovery: how can I be sure that every message in the topic will be processed even if one of the consumers dies?
Let's say we have a topic partition assigned to a consumer. The consumer polls a message, starts working on it, and then shuts down due to a power failure without committing. What happens then? Will another consumer from the same group re-poll this message?
Consumers periodically send heartbeats, telling the broker that they are alive. If the broker does not receive heartbeats from a consumer, it considers the consumer dead and reassigns its partitions. So if a consumer dies, its partitions are assigned to another consumer from the group, and the uncommitted messages are delivered to the newly assigned consumer.
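In code, that guarantee comes from committing only after processing. A minimal sketch of the usual at-least-once loop (my illustration, assuming a recent 2.x Java client with auto-commit disabled; broker address, group id, and topic name are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually, after processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if the process dies here, the offset was never committed,
                                     // so the record is redelivered after the rebalance
                }
                consumer.commitSync(); // commit only after all polled records were processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing %s%n", record.value());
    }
}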
One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:
[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
It seems the StreamThread's Kafka Consumer has left the consumer group, but the Kafka Streams App remained in a RUNNING state while not consuming any new records.
I would like to detect that a Kafka Streams App has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams App is in a RUNNING or REPARTITIONING state, but that is not working for this case.
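For context, that kind of state-based probe amounts to something like the following sketch (class and method names are placeholders; the KafkaStreams.State value for the repartitioning phase is REBALANCING):

import org.apache.kafka.streams.KafkaStreams;

public class StreamsHealthCheck {

    private final KafkaStreams streams;

    public StreamsHealthCheck(KafkaStreams streams) {
        this.streams = streams;
    }

    // Backs the Kubernetes health check: healthy while the client reports
    // RUNNING or REBALANCING. As described below, this does not catch the
    // zombie case, because the client can stay in RUNNING.
    public boolean isHealthy() {
        KafkaStreams.State state = streams.state();
        return state == KafkaStreams.State.RUNNING || state == KafkaStreams.State.REBALANCING;
    }
}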
Therefore I have two questions:
Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
It depends on the version. In older versions (2.1.x and older), Kafka Streams would indeed stay in the RUNNING state even if all threads died. This was fixed in v2.2.0 via https://issues.apache.org/jira/browse/KAFKA-7657.
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
Even in older versions, you can register an uncaught exception handler on the KafkaStreams client. This handler is invoked each time a StreamThread dies.
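A minimal sketch of wiring that up (the streams instance, the logging, and the healthy flag consulted by a health endpoint are my assumptions, not part of the original answer):

// streams is an already-built KafkaStreams instance; healthy is a hypothetical
// AtomicBoolean read by the application's health endpoint.
streams.setUncaughtExceptionHandler((thread, throwable) -> {
    // Invoked whenever a StreamThread dies with an unhandled exception.
    System.err.println("StreamThread " + thread.getName() + " died: " + throwable);
    healthy.set(false); // let the health check fail so the instance gets replaced
});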
Btw: in the upcoming 2.6.0 release, a new metric alive-stream-threads is added to track the number of running threads: https://issues.apache.org/jira/browse/KAFKA-9753
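Once on 2.6.0+, that metric can also be read from the client's own metrics registry, roughly like this (my sketch; the lookup by metric name is an assumption based on KAFKA-9753):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class AliveThreadsMetric {

    // Returns the current value of the alive-stream-threads metric, or 0 if absent.
    public static double aliveStreamThreads(KafkaStreams streams) {
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            if ("alive-stream-threads".equals(entry.getKey().name())) {
                return ((Number) entry.getValue().metricValue()).doubleValue();
            }
        }
        return 0.0;
    }
}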
FYI there's a similar discussion going on right now on the user mailing list -- subject line "kafka stream zombie state"
I'll start by telling you guys what I said there, since there seem to be some misconceptions in the conversation so far: basically, the error message is kind of misleading because it implies this is being logged by the consumer itself and that it is currently sending this LeaveGroup / has already noticed that it missed the poll interval. But this message is actually getting logged by the heartbeat thread when it notices that the main consumer thread hasn't polled within the max poll timeout, and it is technically just marking the consumer as "needs to rejoin" so that it knows to send this LeaveGroup when it finally does poll again. However, if the consumer thread is actually stuck somewhere in the user/application code and cannot break out to continue the poll loop, then the consumer will never actually trigger a rebalance, try to rejoin, send a LeaveGroup request, etc. So that's why the state continues to be RUNNING rather than REBALANCING.
For the above reason, metrics like num-alive-stream-threads won't help either, since the thread isn't dying -- it's just stuck. In fact, even if the thread became unstuck, it would just rejoin and then continue as usual; it wouldn't "die" (since that only occurs when encountering a fatal exception).
Long story short: the broker and heartbeat thread have noticed that the consumer is no longer in the group, but the StreamThread is likely stuck somewhere in the topology, and thus the consumer itself actually has no idea that it's been kicked out of the consumer group.
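Since the broker side does know the member has left, one thing you could do (my own suggestion, not something from the mailing list) is to ask the broker from outside the app how many members the group currently has, and alarm when it drops to zero while the app still reports RUNNING. A rough sketch with the admin client (the bootstrap server is a placeholder):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupMembershipCheck {

    // Returns the number of members currently in the consumer group.
    public static int activeMembers(String groupId) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription description = admin
                    .describeConsumerGroups(Collections.singletonList(groupId))
                    .describedGroups()
                    .get(groupId)
                    .get();
            return description.members().size();
        }
    }
}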
I would like to better understand Kafka message retry process.
I have heard that failed processing of consumed messages can be addressed using 2 options:
SeekToCurrentErrorHandler (offset reset)
publishing a message to a Dead Letter Queues (DLQs)
The 2nd option is pretty clear: if a message fails to be processed, it is simply pushed to an error queue. I am more curious about the first option.
AFAIK, the 1st option is the most widely used one, but how does it work when multiple consumers concurrently consume messages from the same topic? Does it work such that if a particular message fails, the offset for that consumer is reset to the message's offset? What will happen to the messages successfully processed at the same time as, or after, the failed one? Will they be re-processed?
How would you advise me to deal with message retries?
Each partition can only be consumed by one consumer.
When you have multiple consumers, you must have at least that number of partitions.
The offset is maintained for each partition; the error handler will (can) only perform seeks on the partitions that are assigned to this consumer.
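To illustrate the idea with the plain client (this is a conceptual analogue of seek-to-current behavior, not the Spring error handler itself; the consumer is assumed to be subscribed with auto-commit disabled): on a processing failure the consumer seeks back to the failed record on its own partition, so the next poll() returns that record again, while partitions owned by the other consumers are untouched.

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekToCurrentIllustration {

    public static void pollLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            boolean batchFailed = false;
            for (ConsumerRecord<String, String> record : records) {
                try {
                    process(record);
                } catch (RuntimeException failure) {
                    // Rewind this partition to the failed record; the next poll()
                    // returns it (and everything after it on this partition) again.
                    // Only partitions assigned to this consumer can be rewound.
                    consumer.seek(new TopicPartition(record.topic(), record.partition()),
                            record.offset());
                    batchFailed = true;
                    break;
                }
            }
            if (!batchFailed) {
                consumer.commitSync(); // commit only fully processed batches
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}

Note that records from other partitions in the failed batch are not committed here either, so they may be reprocessed after a rebalance; this is why such retry handling is usually paired with idempotent processing, or with a dead-letter recoverer after a bounded number of attempts.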
I have a batch job which will be triggered once a day. The requirement is to
consume all the messages available on the Kafka Topic at that point of time
Process the messages
If the process was successfully completed, commit the offsets.
Currently I poll() the messages in a while loop until ConsumerRecords.isEmpty() is true. When ConsumerRecords.isEmpty() is true, I assume all the records available on the topic at that point in time have been consumed. The application maintains the offsets and closes the Kafka consumer.
When the processing of the messages is done and successfully completed, I create a new KafkaConsumer and commit the offsets maintained by the application.
Note I close the KafkaConsumer initially used to read the messages and use another KafkaConsumer instance to commit the offsets, to avoid the consumer rebalance exception.
I am expecting a maximum of 5k messages on the topic. The topic is partitioned and replicated.
Is there any better way to consume all messages on the topic at a specific point in time? Is there anything I am missing or need to take care of? I don't think I need to take care of consumer rebalancing since I poll() for the messages in a loop and process the messages after the polling is done.
I am using the Java Kafka client v0.9 and can change to v0.10 if it helps in the above scenario.
Thanks
Updated:
AtomicBoolean flag = new AtomicBoolean();
flag.set(true);

while (flag.get()) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(timeout);
    if (consumerRecords.isEmpty()) {
        flag.set(false);
        continue;
    }
    // if the ConsumerRecords is not empty, process the messages and continue to poll()
}
kafkaConsumer.close();
You can't assume that after a call to poll() you have read all the messages available in the topic at that moment, because of the max.poll.records configuration parameter on the consumer. This is the maximum number of records returned by a single poll(), and its default value is 500. It means that if at that moment there are, say, 600 messages in the topic, you need two calls to poll() to read all the messages (but consider that meanwhile some other messages could arrive).
The other thing I don't understand is why you are using a different consumer for committing offsets. What's the consumer rebalance exception you are talking about?
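If the goal is "everything that was on the topic when the job started", one alternative (my sketch, not a ready-made recipe; it needs a client of at least 0.10.1 for endOffsets(), and it assumes each batch can be processed as it is polled) is to snapshot the end offsets at startup, poll until the consumer's position reaches them, and then commit with the same consumer:

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DailyBatchDrain {

    // consumer is already subscribed and auto-commit is disabled.
    public static void drainAndCommit(KafkaConsumer<String, String> consumer, long pollTimeoutMs) {
        // Poll once so the consumer joins the group and gets an assignment.
        ConsumerRecords<String, String> first = consumer.poll(pollTimeoutMs);
        Collection<TopicPartition> assignment = consumer.assignment();
        // Snapshot "now": the end offset of each assigned partition at job start.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);

        processAll(first);
        while (!reachedEnd(consumer, endOffsets)) {
            processAll(consumer.poll(pollTimeoutMs));
        }
        // Commit with the same consumer once processing succeeded.
        consumer.commitSync();
    }

    private static boolean reachedEnd(KafkaConsumer<String, String> consumer,
                                      Map<TopicPartition, Long> endOffsets) {
        for (Map.Entry<TopicPartition, Long> entry : endOffsets.entrySet()) {
            if (consumer.position(entry.getKey()) < entry.getValue()) {
                return false;
            }
        }
        return true;
    }

    private static void processAll(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // application-specific processing
        }
    }
}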
I use Spark 2.0.0 with Kafka 0.10.2.
I have an application that is processing messages from Kafka and is a long running job.
From time to time I see the following message in the logs. I understand how I can increase the timeout and so on, but what I wanted to know was: given that I do get this error, how can I recover from it?
ERROR ConsumerCoordinator: Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
This is not about how to avoid this error but how to handle it once it occurs.
Background: In normal situations I will not see commit errors, but if I do get one I should be able to recover from it. I am using an AT_LEAST_ONCE setup, so I am completely happy with reprocessing a few messages.
I am running Java and using the direct Kafka stream with manual commits.
Creating the stream:
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
Committing the offsets
((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges);
My understanding of the situation is that you use the Kafka Direct Stream integration (using spark-streaming-kafka-0-10_2.11 module as described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)).
As said in the error message:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
Kafka manages what topic partition a consumer consumes so the Direct Stream will create a pool of consumers (inside a single consumer group).
As with any consumer group, you should expect rebalancing, which (quoting Chapter 4, "Kafka Consumers - Reading Data from Kafka", from Kafka: The Definitive Guide) works as follows:
consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group it starts consuming messages from partitions which were previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes, it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Reassignment of partitions to consumers also happen when the topics the consumer group is consuming are modified, for example if an administrator adds new partitions.
There are quite a few cases in which rebalancing can occur and should be expected, and that is what you are seeing.
You asked:
how can I recover from it? This is not about how to avoid this error but how to handle it once it occurs.
My answer would be to use the other method of CanCommitOffsets:
def commitAsync(offsetRanges: Array[OffsetRange], callback: OffsetCommitCallback): Unit
that gives you access to Kafka's OffsetCommitCallback:
OffsetCommitCallback is a callback interface that the user can implement to trigger custom actions when a commit request completes. The callback may be executed in any thread calling poll().
I think onComplete gives you a handle on how the async commit has finished, so you can act accordingly.
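A rough Java version of that (it reuses directKafkaStream and offsetRanges from the snippets above; what you do when the exception is non-null is application-specific):

((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges,
    new OffsetCommitCallback() {
        @Override
        public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
            if (exception != null) {
                // The commit failed (e.g. CommitFailedException after a rebalance).
                // With at-least-once semantics, the safe reaction is to accept that
                // these offsets were not committed and will be re-processed.
                System.err.println("Offset commit failed for " + offsets + ": " + exception);
            }
        }
    });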
Something I can't help you with much is how to revert the changes in a Spark Streaming application when some offsets could not be committed. That, I think, requires tracking offsets and accepting that some offsets can't be committed and the corresponding records will be re-processed.
I have set up a JMS queue that is fed by a single producer and consumed by 8 different consumers.
I would like to configure my queue/broker so that one message being delivered to a consumer blocks the queue until the consumer is done processing the message. During the processing of this first message, the following messages may not be delivered to another consumer. It doesn't matter which consumer processes which message, and it is acceptable for the same consumer to consume many messages in a row as long as when it dies another consumer is able to pick up the rest of the unprocessed messages.
In order to do this, I have configured all of my consumers to use the CLIENT acknowledgement mode, and I have coded them so that message.acknowledge() is called only at the end of the message processing.
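For reference, a minimal sketch of that consumer setup (the connection factory and queue name are placeholders):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

public class ClientAckConsumer {

    // factory is the provider-specific ConnectionFactory (e.g. looked up via JNDI).
    public static void consume(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        Queue queue = session.createQueue("my.queue"); // placeholder queue name
        MessageConsumer consumer = session.createConsumer(queue);
        connection.start();

        while (true) {
            Message message = consumer.receive();
            process(message);      // long-running work
            message.acknowledge(); // acknowledge only after processing succeeds
        }
    }

    private static void process(Message message) {
        // application-specific processing
    }
}

Note that in CLIENT_ACKNOWLEDGE mode, acknowledge() acknowledges all messages consumed so far on that session, not just the single message it is called on.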
My understanding was that this should be sufficient to satisfy my requirements.
However I am apparently wrong, because it looks like my broker (OpenMQ) is delivering the messages to consumers as fast as possible, without waiting for the consumer acknowledgement. As a result, I get multiple messages processed in parallel, one for each consumer.
I'm obviously doing something wrong, but I can't figure out what.
As a workaround, I figure I could create a durable subscription with a fixed client ID shared between all my consumers. It would probably work by only allowing one consumer to even connect to the broker, but I can't shake the feeling that this is a rather ugly workaround.
Does anyone have an idea of how I should configure my Broker and/or my Client to make this possible?