How to detect a Kafka Streams app in zombie state - java

One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:
[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
It seems the StreamThread's Kafka consumer has left the consumer group, but the Kafka Streams app remains in the RUNNING state while not consuming any new records.
I would like to detect that a Kafka Streams app has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies the Kafka Streams app is in the RUNNING or REBALANCING state, but that is not working for this case.
Therefore I have two questions:
Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
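For illustration, a state-based liveness check of the kind described above could be as simple as the following sketch (class and method names are illustrative, assuming access to the running KafkaStreams instance):

import org.apache.kafka.streams.KafkaStreams;

public class StreamsHealthCheck {

    private final KafkaStreams streams;

    public StreamsHealthCheck(KafkaStreams streams) {
        this.streams = streams;
    }

    // Healthy while the client reports RUNNING or REBALANCING.
    // As described above, this does not catch the zombie case,
    // because the client never leaves RUNNING.
    public boolean isHealthy() {
        KafkaStreams.State state = streams.state();
        return state == KafkaStreams.State.RUNNING
                || state == KafkaStreams.State.REBALANCING;
    }
}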

Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
It depends on the version. In older versions (2.1.x and older), Kafka Streams would indeed stay in the RUNNING state even if all threads died. This issue was fixed in v2.2.0 via https://issues.apache.org/jira/browse/KAFKA-7657.
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
Even in older versions, you can register an uncaught exception handler on the KafkaStreams client. This handler is invoked each time a StreamThread dies.
Btw: in the upcoming 2.6.0 release, a new metric, alive-stream-threads, is added to track the number of running stream threads: https://issues.apache.org/jira/browse/KAFKA-9753
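For illustration, a minimal sketch combining both suggestions, assuming a pre-2.8 client (where setUncaughtExceptionHandler accepts a Thread.UncaughtExceptionHandler) and that the metric is exposed under the name given in the ticket (verify the exact name and group for your version):

import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.streams.KafkaStreams;

public class StreamsMonitoring {

    private final AtomicBoolean healthy = new AtomicBoolean(true);

    // Must be called before streams.start(): flips a flag the health check can read
    // whenever a StreamThread dies with an uncaught exception.
    public void register(KafkaStreams streams) {
        streams.setUncaughtExceptionHandler((thread, throwable) -> healthy.set(false));
    }

    // 2.6.0+ only: read the alive-stream-threads metric added by KAFKA-9753.
    public Number aliveStreamThreads(KafkaStreams streams) {
        return streams.metrics().entrySet().stream()
                .filter(e -> "alive-stream-threads".equals(e.getKey().name()))
                .map(e -> (Number) e.getValue().metricValue())
                .findFirst()
                .orElse(0);
    }

    public boolean isHealthy() {
        return healthy.get();
    }
}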

FYI there's a similar discussion going on right now on the user mailing list -- subject line "kafka stream zombie state"
I'll start by telling you guys what I said there, since there seem to be some misconceptions in the conversation so far: basically, the error message is kind of misleading because it implies this is being logged by the consumer itself and that it is currently sending this LeaveGroup / has already noticed that it missed the poll interval. But this message is actually logged by the heartbeat thread when it notices that the main consumer thread hasn't polled within the max poll timeout; it is technically just marking the consumer as "needs to rejoin" so that the consumer knows to send the LeaveGroup when it finally does poll again. However, if the consumer thread is actually stuck somewhere in the user/application code and cannot break out to continue the poll loop, then the consumer will never actually trigger a rebalance, try to rejoin, send a LeaveGroup request, etc. So that's why the state continues to be RUNNING rather than REBALANCING.
For the above reason, metrics like alive-stream-threads won't help either, since the thread isn't dying -- it's just stuck. In fact, even if the thread became unstuck, it would just rejoin and then continue as usual; it wouldn't "die" (since that only occurs when encountering a fatal exception).
Long story short: the broker and heartbeat thread have noticed that the consumer is no longer in the group, but the StreamThread is likely stuck somewhere in the topology, and thus the consumer itself actually has no idea that it's been kicked out of the consumer group.
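One additional option (not mentioned in the thread above; an assumption based on the consumer poll metrics added in Kafka 2.4 by KIP-517, such as last-poll-seconds-ago) is to compare that metric against max.poll.interval.ms to spot a stuck poll loop:

import org.apache.kafka.streams.KafkaStreams;

public final class StuckPollDetector {

    // Returns true if any embedded consumer reports that its last poll() happened
    // longer ago than the configured max.poll.interval.ms.
    // The metric name/group ("last-poll-seconds-ago", KIP-517) should be verified
    // against the client version in use.
    public static boolean looksStuck(KafkaStreams streams, long maxPollIntervalMs) {
        return streams.metrics().entrySet().stream()
                .filter(e -> "last-poll-seconds-ago".equals(e.getKey().name()))
                .map(e -> ((Number) e.getValue().metricValue()).doubleValue())
                .anyMatch(secondsAgo -> secondsAgo * 1000 > maxPollIntervalMs);
    }
}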

How to know if max.poll.interval.ms is reached in java kafka application client?

Is there an exception thrown somewhere when kafka max.poll.interval.ms is reached and rebalance happens?
Once your consumer gets kicked out of the consumer group because it took too long between calls to poll(), you will receive a CommitFailedException on the next commit. According to the documentation:
It is also possible that the consumer could encounter a "livelock" situation where it is continuing to send heartbeats, but no progress is being made. To prevent the consumer from holding onto its partitions indefinitely in this case, we provide a liveness detection mechanism using the max.poll.interval.ms setting. Basically if you don't call poll at least as frequently as the configured max interval, then the client will proactively leave the group so that another consumer can take over its partitions. When this happens, you may see an offset commit failure (as indicated by a CommitFailedException thrown from a call to commitSync()). This is a safety mechanism which guarantees that only active members of the group are able to commit offsets. So to stay in the group, you must continue to call poll.
Therefore, you can catch CommitFailedException. After that, keep calling poll() until rebalancing completes and your consumer re-enters the consumer group.
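A minimal sketch of that pattern, assuming a plain KafkaConsumer with manual commits (process() is a placeholder for the application's logic):

import java.time.Duration;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResilientConsumerLoop {

    void pollLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            process(records);   // placeholder for the application's processing logic
            try {
                consumer.commitSync();
            } catch (CommitFailedException e) {
                // The consumer was kicked out for exceeding max.poll.interval.ms.
                // The next poll() rejoins the group; the uncommitted records are
                // redelivered, so processing must tolerate duplicates.
                System.err.println("Commit failed, rejoining on next poll(): " + e.getMessage());
            }
        }
    }

    void process(ConsumerRecords<String, String> records) {
        // placeholder business logic
    }
}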

Spring for Apache Kafka: KafkaTemplate Behavior with Async Requests, Batching, and Max In-flight of 1

Scenario/Use Case:
I have a Spring Boot application using Spring for Kafka to send messages to Kafka topics. Upon completion of a specific event (triggered by an HTTP request), a new thread is created (via Spring @Async) which calls KafkaTemplate.send() and registers a callback on the ListenableFuture that it returns. The original thread which handled the HTTP request returns a response to the calling client and is freed.
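For illustration, such a publish path could look roughly like the following sketch (class, topic, and logger names are illustrative; the API shown is the pre-Spring-Kafka-3.0 one where send() returns a ListenableFuture):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class EventPublisher {

    private static final Logger log = LoggerFactory.getLogger(EventPublisher.class);

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Runs on an @Async thread so the HTTP thread can return immediately.
    @Async
    public void publish(String topic, String payload) {
        kafkaTemplate.send(topic, payload).addCallback(
                result -> log.info("Published to {}", result.getRecordMetadata()),
                ex -> log.error("Publish failed", ex));
    }
}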
Normal Behavior:
Under normal application load, I've verified that the individual messages are all published to the topic as desired (application log entries on callback success or failure, as well as viewing the messages in the topic on the Kafka cluster). If I bring down all Kafka brokers for 3-5 minutes and then bring the cluster back online, the application's publisher immediately re-establishes its connection to Kafka and proceeds with publishing messages.
Problem Behavior:
However, when performing load testing, if I bring down all Kafka brokers for 3-5 minutes and then bring the cluster back online, the Spring application's publisher continues to show failures for all publish attempts. This continues for approximately 7 hours, at which point the publisher is able to successfully re-establish communication with Kafka again (usually preceded by a broken pipe exception, but not always).
Current Findings:
While performing the load test, I connected to the application using JConsole for approx. 10 minutes and monitored the producer metrics exposed via kafka.producer. Within roughly the first 30 seconds of heavy load, buffer-available-bytes decreases until it reaches 0 and stays at 0. waiting-threads remains between 6 and 10 (it alternates every time I hit refresh), and buffer-available-bytes remains at 0 for approx. 6.5 hours. After that, buffer-available-bytes shows all of the originally allocated memory restored, but Kafka publish attempts continue failing for approx. another 30 minutes before the desired behavior finally returns.
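For illustration, the same metrics can also be read in-process instead of via JConsole, e.g. from KafkaTemplate.metrics() (a sketch under that assumption; the metric names come from the producer-metrics group and should be verified for the client version in use):

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.springframework.kafka.core.KafkaTemplate;

public final class ProducerBufferMetrics {

    // Logs buffer-available-bytes and waiting-threads from the underlying producer.
    public static void log(KafkaTemplate<String, String> kafkaTemplate) {
        for (Map.Entry<MetricName, ? extends Metric> e : kafkaTemplate.metrics().entrySet()) {
            String name = e.getKey().name();
            if ("buffer-available-bytes".equals(name) || "waiting-threads".equals(name)) {
                System.out.println(name + " = " + e.getValue().metricValue());
            }
        }
    }
}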
Current Producer Configuration
request.timeout.ms=3000
max.retry.count=2
max.inflight.requests=1
max.block.ms=10000
retry.backoff.ms=3000
All other properties are using their default values
Questions:
Given my use case, would altering batch.size or linger.ms have any positive impact in terms of eliminating the issue encountered under heavy load?
Given that I have separate threads all calling KafkaTemplate.send() with separate messages and callbacks, and I have max.in.flight.requests.per.connection set to 1, are batch.size and linger.ms ignored beyond limiting the size of each message? My understanding is that no batching is actually occurring in this scenario and that each message is sent as a separate request.
Given that I have max.block.ms set to 10 seconds, why does buffer memory remain utilized for so long, and why do all messages continue to fail to be published for so many hours? My understanding is that after 10 seconds each new publish attempt should fail and return the failure callback, which in turn frees up the associated thread.
Update:
To try and clarify thread usage: I'm using a single producer instance, as recommended in the JavaDocs. There are threads such as https-jsse-nio-22443-exec-* which handle incoming HTTPS requests. When a request comes in, some processing occurs, and once all non-Kafka-related logic completes, a call is made to a method in another class decorated with @Async. This method makes the call to KafkaTemplate.send(). The response back to the client is shown in the logs before the publish to Kafka is performed (this is how I'm verifying it's being performed on a separate thread, as the service doesn't wait for the publish before returning a response).
There are task-scheduler-* threads which appear to be handling the callbacks from KafkaTemplate.send(). My guess is that the single kafka-producer-network-thread handles all of the publishing.
My application was making an HTTP request and sending each message to a dead-letter table on a database platform upon failure of each Kafka publish. The same threads spun up to perform the publish to Kafka were being re-used for this call to the database. I moved the database call logic into another class and decorated it with its own @Async and a custom TaskExecutor. After doing this, I've monitored JConsole and can see that the calls to Kafka appear to be re-using the same 10 threads (TaskExecutor: corePoolSize 10, queueCapacity 0, maxPoolSize 80) and the calls to the database service are now using a separate thread pool (TaskExecutor: corePoolSize 10, queueCapacity 0, maxPoolSize 80), which is consistently closing and opening new threads but staying at a relatively constant number of threads. With this new behavior, buffer-available-bytes remains at a healthy constant level and the application's Kafka publisher successfully re-establishes its connection once the brokers are brought back online.
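For illustration, the executor split described in the update could be configured roughly like this (a sketch; bean names and qualifiers are illustrative, pool sizes match the values mentioned above):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Pool used by the Kafka publish methods: annotate them with @Async("kafkaExecutor")
    @Bean("kafkaExecutor")
    public ThreadPoolTaskExecutor kafkaExecutor() {
        return buildExecutor("kafka-publish-");
    }

    // Separate pool for the dead-letter database calls: @Async("deadLetterExecutor")
    @Bean("deadLetterExecutor")
    public ThreadPoolTaskExecutor deadLetterExecutor() {
        return buildExecutor("db-deadletter-");
    }

    private ThreadPoolTaskExecutor buildExecutor(String prefix) {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(80);
        executor.setQueueCapacity(0);
        executor.setThreadNamePrefix(prefix);
        return executor;
    }
}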

Kafka behaviour if a consumer fails

I've looked through a lot of different articles about Apache Kafka transactions, recovery, and the new exactly-once features, but I still don't understand consumer recovery. How can I be sure that every message from the topic will be processed even if one of the consumers dies?
Let's say we have a topic partition assigned to a consumer. The consumer polls a message and starts to work on it, then shuts down due to a power failure without committing. What happens then? Will another consumer from the same group re-poll this message?
Consumers periodically send heartbeats telling the broker that they are alive. If the broker does not receive heartbeats from a consumer, it considers the consumer dead and reassigns its partitions. So if a consumer dies, its partitions are assigned to another consumer from the group, and the uncommitted messages are redelivered to the newly assigned consumer.
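A minimal sketch of the resulting at-least-once pattern (assuming enable.auto.commit=false; the topic name and handle() method are placeholders): offsets are committed only after processing, so anything polled but not committed before a crash is redelivered.

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {

    void consume(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("orders"));   // illustrative topic
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                handle(record);   // placeholder business logic
            }
            consumer.commitSync();   // committed only after successful processing
        }
    }

    void handle(ConsumerRecord<String, String> record) {
        // process the record; duplicates are possible after a crash, so keep this idempotent
    }
}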

Kafka consumers rebalance unexpectedly

We are seeing unexpected rebalances in Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debug techniques to figure out rebalance causes?
Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.
Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause/resume on consumers for backpressure, which should prevent this.
Two processes are reading a topic. Sometimes a rebalance happens when it looks like both processes are reading ok. Afterwards, reading works ok, but it's a hiccup in processing.
We expect partitions would not rebalance without also seeing some cause or failure.
Sometimes poll() gets stuck (exceeds the timeout) and we use wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after consumers are closed (we've seen thousands). The timing seems unrelated to rebalances, so rebalances seem like a separate problem, but maybe heartbeats are hitting an unlogged network problem.
We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
The rebalances are intermittent and hard to reproduce. They happened at a message rate anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.
Our read loop is trivial - basically "while running, poll with timeout and error handling, then enqueue received messages".
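For illustration, the read loop plus the logging rebalance listener mentioned above is roughly this shape (a sketch only, not the exact code; the topic name, logging, and enqueue step are placeholders; poll(long) is used to match the 0.10.x client and the 1000 ms timeout above):

import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReaderLoop {

    private volatile boolean running = true;

    void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("topic-A"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Rebalance: revoked " + partitions);
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Rebalance: assigned " + partitions);
            }
        });
        while (running) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                enqueue(records);   // hand records to the separate processing thread
            } catch (Exception e) {
                System.err.println("Poll failed: " + e);
            }
        }
    }

    void enqueue(ConsumerRecords<String, String> records) {
        // placeholder: push records onto the queue consumed by the processing thread
    }
}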
People have asked good related questions, but the answers didn't help us:
Conditions in which Kafka Consumer (Group) triggers a rebalance
What exactly IS Kafka Rebalancing?
Continuous consumer group rebalancing with more consumers than partitions
Configuration:
Kafka 0.10.1.0 (We've started trying 1.0.0 & don't have test results yet)
Java 8 brokers and clients
2 brokers, 1 zookeeper, stable running processes & no additions
5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
Topic A has 16 partitions and replication 2, and is created before consumers start.
One process writes to topic A; two processes read from topic A.
Each reader process runs 16 consumers. Some consumers are idle when 16 partitions evenly balance.
The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
All the consumers for topic A are in the same consumer group.
The timeout for KafkaConsumer.poll() is 1000 milliseconds.
The configuration that affects rebalance is:
max.poll.interval.ms=50000
max.poll.records=100
request.timeout.ms=40000
session.timeout.ms=20000
We use defaults for these:
heartbeat.interval.ms=3000
(broker) group.max.session.timeout.ms=300000
(broker) group.min.session.timeout.ms=6000
Check the GC log and make sure there are no frequent full GCs, which would prevent the heartbeat thread from working.

How to handle offset commit failures with enable.auto.commit disabled in Spark Streaming with Kafka?

I use Spark 2.0.0 with Kafka 0.10.2.
I have an application that is processing messages from Kafka and is a long running job.
From time to time I see the following message in the logs. I understand how I can increase the timeout and so on, but what I want to know is: given that I do get this error, how can I recover from it?
ERROR ConsumerCoordinator: Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
This is not about how I avoid this error but how to handle it once it occurs.
Background: In normal situations I will not see commit errors, but if I do get one, I should be able to recover from it. I am using an AT_LEAST_ONCE setup, so I am completely happy with reprocessing a few messages.
I am running Java and using DirectKafkaStreams with manual commits.
Creating the stream:
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
Committing the offsets:
((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges);
My understanding of the situation is that you use the Kafka Direct Stream integration (using spark-streaming-kafka-0-10_2.11 module as described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)).
As said in the error message:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
Kafka manages what topic partition a consumer consumes so the Direct Stream will create a pool of consumers (inside a single consumer group).
As with any consumer group, you should expect rebalancing, which (quoting Chapter 4, "Kafka Consumers: Reading Data from Kafka", of Kafka: The Definitive Guide) works like this:
consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group it starts consuming messages from partitions which were previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes, it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Reassignment of partitions to consumers also happen when the topics the consumer group is consuming are modified, for example if an administrator adds new partitions.
There are quite a few cases when rebalancing can occur and should be expected. And you do.
You asked:
how can I recover from it? This is not about how I avoid this error but how to handle it once it occurs?
My answer would be to use the other method of CanCommitOffsets:
def commitAsync(offsetRanges: Array[OffsetRange], callback: OffsetCommitCallback): Unit
that gives you access to Kafka's OffsetCommitCallback:
OffsetCommitCallback is a callback interface that the user can implement to trigger custom actions when a commit request completes. The callback may be executed in any thread calling poll().
I think onComplete gives you a handle on how the async commit has finished so you can act accordingly.
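In Java (to match the snippets in the question), that could look roughly like this sketch (the logging and the reaction to a failed commit are placeholders):

import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetCommitCallback;
import org.apache.kafka.common.TopicPartition;

((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(
    offsetRanges,
    new OffsetCommitCallback() {
        @Override
        public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
            if (exception != null) {
                // Commit failed (e.g. after a rebalance); with at-least-once semantics
                // it is acceptable to let these ranges be re-processed later.
                System.err.println("Offset commit failed: " + exception);
            } else {
                System.out.println("Committed offsets: " + offsets);
            }
        }
    });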
Something I can't help you with much is how to revert the changes in a Spark Streaming application when some offsets could not be committed. I think that requires tracking offsets and accepting that some offsets can't be committed and will be re-processed.
