We are seeing unexpected rebalances in Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debug techniques to figure out rebalance causes?
Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.
Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause/resume on consumers for backpressure, so poll() keeps being called even when processing is slow, which should prevent this.
Two processes are reading a topic. Sometimes a rebalance happens when it looks like both processes are reading ok. Afterwards, reading works ok, but it's a hiccup in processing.
We expect partitions would not rebalance without also seeing some cause or failure.
Sometimes poll() gets stuck (exceeds the timeout) and we use wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after consumers are closed (we've seen thousands). The timing seems unrelated to rebalances, so rebalances seem like a separate problem, but maybe heartbeats are hitting an unlogged network problem.
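The recovery path looks roughly like this (simplified; the watchdog thread and its stuck-detection are ours, not a Kafka API):
// from a watchdog thread: wakeup() is the only KafkaConsumer method that is
// safe to call from another thread; it makes the blocked poll() throw WakeupException
consumer.wakeup();
// the polling thread then catches WakeupException, calls consumer.close(),
// and we create and subscribe a fresh KafkaConsumer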
We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
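The listener is roughly this (simplified; "A" is the busy topic and log is our logger):
consumer.subscribe(Collections.singletonList("A"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        log.info("Rebalance: partitions revoked {}", partitions);
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        log.info("Rebalance: partitions assigned {}", partitions);
    }
});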
The rebalances are intermittent and hard to reproduce. They happened at a message rate anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.
Our read loop is trivial - basically "while running, poll with timeout and error handling, then enqueue received messages".
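In sketch form (simplified; names like running and queue are illustrative):
while (running.get()) {
    try {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(1000); // 1000 ms timeout
        for (ConsumerRecord<byte[], byte[]> record : records) {
            queue.offer(record); // hand off to the asynchronous processing thread
        }
    } catch (WakeupException e) {
        break; // shutdown/recovery path described above
    }
}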
People have asked good related questions, but the answers didn't help us:
Conditions in which Kafka Consumer (Group) triggers a rebalance
What exactly IS Kafka Rebalancing?
Continuous consumer group rebalancing with more consumers than partitions
Configuration:
Kafka 0.10.1.0 (We've started trying 1.0.0 & don't have test results yet)
Java 8 brokers and clients
2 brokers, 1 zookeeper, stable running processes & no additions
5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
Topic A has 16 partitions and replication 2, and is created before consumers start.
One process writes to topic A; two processes read from topic A.
Each reader process runs 16 consumers, so some consumers are idle when the 16 partitions are evenly balanced.
The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
All the consumers for topic A are in the same consumer group.
The timeout for KafkaConsumer.poll() is 1000 milliseconds.
The configuration that affects rebalance is (a consolidated properties sketch follows these two lists):
max.poll.interval.ms=50000
max.poll.records=100
request.timeout.ms=40000
session.timeout.ms=20000
We use defaults for these:
heartbeat.interval.ms=3000
(broker) group.max.session.timeout.ms=300000
(broker) group.min.session.timeout.ms=6000
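Putting those settings together, each consumer is constructed roughly like this (a sketch; bootstrap servers, group id, and deserializers are placeholders, not our actual values):
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder
props.put("group.id", "topic-a-readers");                    // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("max.poll.interval.ms", "50000");
props.put("max.poll.records", "100");
props.put("request.timeout.ms", "40000");
props.put("session.timeout.ms", "20000");
// heartbeat.interval.ms and the broker-side group.*.session.timeout.ms settings stay at their defaults
KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);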
Check the GC log and make sure there are no frequent full GCs, which would prevent the heartbeat thread from working.
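On Java 8 you can enable GC logging with JVM flags such as the line below; long or frequent full-GC pauses there would explain missed heartbeats.
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/consumer-gc.log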
Related
One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:
[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
It seems the StreamThread's Kafka Consumer has left the consumer group, but the Kafka Streams App remained in a RUNNING state while not consuming any new records.
I would like to detect that a Kafka Streams App has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams App is in a RUNNING or REBALANCING state, but that is not working for this case.
Therefore I have two questions:
Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
It depends on the version. In older versions (2.1.x and older), Kafka Streams would indeed stay in the RUNNING state even if all threads died. This issue is fixed in v2.2.0 via https://issues.apache.org/jira/browse/KAFKA-7657.
How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?
Even in older versions, you can register an uncaught exception handler on the KafkaStreams client. This handler is invoked each time a StreamThread dies.
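A sketch (topology and props are your existing Streams setup; healthy is a hypothetical flag, e.g. an AtomicBoolean read by your health endpoint; the handler must be registered before start()):
KafkaStreams streams = new KafkaStreams(topology, props);
streams.setUncaughtExceptionHandler((thread, throwable) -> {
    // a StreamThread died with a fatal error; mark this instance unhealthy
    healthy.set(false);
});
streams.start();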
Btw: in the upcoming 2.6.0 release, a new metric, alive-stream-threads, is added to track the number of running threads: https://issues.apache.org/jira/browse/KAFKA-9753
FYI there's a similar discussion going on right now on the user mailing list -- subject line "kafka stream zombie state"
I'll start by telling you guys what I said there, since there seem to be some misconceptions in the conversation so far. Basically, the error message is kind of misleading, because it implies it is being logged by the consumer itself and that the consumer is currently sending this LeaveGroup / has already noticed that it missed the poll interval. But this message is actually logged by the heartbeat thread when it notices that the main consumer thread hasn't polled within the max poll timeout; technically it is just marking the consumer as "needs to rejoin" so that the consumer knows to send the LeaveGroup when it finally does poll again. However, if the consumer thread is actually stuck somewhere in the user/application code and cannot break out to continue the poll loop, then the consumer will never actually trigger a rebalance, try to rejoin, send a LeaveGroup request, etc. So that's why the state continues to be RUNNING rather than REBALANCING.
For the above reason, metrics like num-alive-stream-threads won't help either, since the thread isn't dying -- it's just stuck. In fact, even if the thread became unstuck, it would just rejoin and then continue as usual; it wouldn't "die" (since that only occurs when encountering a fatal exception).
Long story short: the broker and heartbeat thread have noticed that the consumer is no longer in the group, but the StreamThread is likely stuck somewhere in the topology, and thus the consumer itself actually has no idea that it's been kicked out of the consumer group.
I'm implementing a daily job which gets data from a MongoDB (around 300K documents) and, for each of them, publishes a message on a RabbitMQ queue.
On the other side I have some consumers on the same queue, which ideally should work in parallel.
Everything is working, but not as well as I would like, especially regarding consumer performance.
This is how I declare the queue:
rabbitMQ.getChannel().queueDeclare(QUEUE_NAME, true, false, false, null);
This is how the publishing is done:
rabbitMQ.getChannel().basicPublish("", QUEUE_NAME, null, body.getBytes());
So the channel used to declare the queue is used to publish all the messages.
And this is how the consumers are instantiated in a for loop (10 in total, but it can be any number):
Channel channel = rabbitMQ.getConnection().createChannel();
MyConsumer consumer = new MyConsumer(customMapper, channel, subscriptionUpdater);
channel.basicQos(1); // also tried with 0, 10, 100, ...
channel.basicConsume(QUEUE_NAME, false, consumer);
So for each consumer I create a new channel and this is confirmed by logs:
...
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#bdd2027
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#5d1b9c3d
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#49a26d19
...
As far as I've understood from my very short RabbitMQ experience, this should guarantee that all the consumers are called.
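For reference, the delivery handler in MyConsumer does something along these lines (a simplified sketch, not the actual class: the real constructor and the mapping/update work are elided, and the message is acked manually because basicConsume is called with autoAck=false):
public class MyConsumer extends DefaultConsumer {
    public MyConsumer(Channel channel) {
        super(channel);
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) throws IOException {
        // do the real work here (mapping, MongoDB update, ...)
        getChannel().basicAck(envelope.getDeliveryTag(), false); // ack this single message
    }
}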
By the way, consumers need between 0.5 and 1.2 seconds to complete their task; I have spotted only a very few taking around 3 seconds.
I have two separate queues, and I repeat the setup above twice (using the same RabbitMQ connection).
So, I have tested publishing 100 messages for each queue. Both of them have 10 consumers with qos=1.
I didn't expect a delivery/consume rate of exactly 10/s, but instead I noticed:
actual values are around 0.4 and 1.0.
at least all the consumers bound to the queue have received a message, but it doesn't look like "fair dispatching".
it took about 3 mins 30 secs to consume all the messages on both queues.
Am I missing the main concept of threading within RabbitMQ? Or is there any specific configuration which might still be at its default value?
I've only been at this for a few days, so this is quite possible.
Please notice that I'm in the fortunate position where I can control both publishing and consuming parts :)
I'm using RabbitMQ 3.7.3 locally, so it can't be a network latency issue.
Thanks for your help!
The setup of RabbitMQ channels and consumers was correct in the end: one channel for each consumer.
The problem was having the consumers calling a synchronized method to find and update a MongoDB document.
This was delaying the execution of some consumers; even worse, the more consumers I added (thinking it would speed up processing), the lower the message rate/s I was getting.
I have moved the MongoDB part to the publishing side, where I don't have to care about synchronization because it's done in sequence by one publisher only. I have a slightly decreased delivery rate/s, but now with just 5 consumers I easily reach an ack rate of 50-60/s.
Lessons learnt:
create a separate channel for the publisher.
create a separate channel for each consumer.
let RabbitMQ manage threading for the consumers (--> you can instantiate them on the main thread).
(if possible) back off publishing to give the queues 100% time to deal with consumers.
set a qos > 1 for each consumer channel. But this really depends on your scenario and architecture: you must do some performance testing.
As a general rule:
(1) calculate/estimate delivery time.
(2) calculate/estimate ack time.
(3) calculate/estimate consumer time.
qos = ((1) + (2) + (3)) / (3)
This will give you an initial qos value to test and tweak based on your scenario. The final goal is to have 100% utilization for all the available consumers.
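For example (made-up numbers): with a delivery time of 50 ms, an ack time of 50 ms and a consumer time of 1000 ms, qos = (50 + 50 + 1000) / 1000 ≈ 1.1, so you would round up and start testing with qos = 2, then tweak from there.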
I use Spark 2.0.0 with Kafka 0.10.2.
I have an application that is processing messages from Kafka and is a long running job.
From time to time I see the following message in the logs. I understand how I can increase the timeout and so on, but what I wanted to know is: given that I do get this error, how can I recover from it?
ERROR ConsumerCoordinator: Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
This is not about how I avoid this error, but how to handle it once it occurs.
Background: In normal situations I will not see commit errors, but if I do get one I should be able to recover from it. I am using AT_LEAST_ONCE setup, So I am completely happy with reprocessing a few messages.
I am running Java and using DirectKafkaStreams with manual commits.
Creating the stream:
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
Committing the offsets:
((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges);
My understanding of the situation is that you use the Kafka Direct Stream integration (using spark-streaming-kafka-0-10_2.11 module as described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)).
As said in the error message:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
Kafka manages which topic partitions a consumer consumes, so the Direct Stream will create a pool of consumers (inside a single consumer group).
As with any consumer group you should expect rebalancing, which (quoting Chapter 4, "Kafka Consumers - Reading Data from Kafka", of Kafka: The Definitive Guide):
consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group it starts consuming messages from partitions which were previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes, it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Reassignment of partitions to consumers also happen when the topics the consumer group is consuming are modified, for example if an administrator adds new partitions.
There are quite a few cases in which rebalancing can occur and should be expected. And you do see it.
You asked:
how can I recover from it? This is not about how I avoid this error, but how to handle it once it occurs?
My answer would be to use the other method of CanCommitOffsets:
def commitAsync(offsetRanges: Array[OffsetRange], callback: OffsetCommitCallback): Unit
that gives you access to Kafka's OffsetCommitCallback:
OffsetCommitCallback is a callback interface that the user can implement to trigger custom actions when a commit request completes. The callback may be executed in any thread calling poll().
I think onComplete gives you a handle on how the async commit has finished, so you can act accordingly.
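A sketch of what that could look like (the logging is illustrative and log is an assumed logger; OffsetCommitCallback and its companions come from the Kafka clients library):
((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsetRanges,
    new OffsetCommitCallback() {
        @Override
        public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
            if (exception != null) {
                // e.g. CommitFailedException after a rebalance: these offsets were not committed,
                // so the corresponding records will be re-processed (acceptable with at-least-once)
                log.warn("Offset commit failed for " + offsets, exception);
            }
        }
    });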
Something I can't help you with much is how to revert the changes in a Spark Streaming application when some offsets could not be committed. That, I think, requires tracking offsets and accepting that some offsets can't be committed and their records will be re-processed.
I am working on a distributed solution with 2 consumers running on 2 different servers under the same consumer group, consuming from a topic with 2 partitions and replication factor 3 on a 3-machine Kafka cluster. Inside my consumer class (which is a Callable), the key part looks like this:
@Override
public Object call() throws Exception {
    ConsumerIterator<byte[], byte[]> it = stream.iterator();
    try {
        while (it.hasNext()) {
            byte[] message = it.next().message();
            // other code here
        }
    } catch (Throwable e) {
        e.printStackTrace();
    }
    log.error("Shutting down Thread: " + streamNumber + ", kafka consumer offline!!!");
    return null;
}
My consumer class also spawns 16 other threads to do work with the consumed messages. When I start both my consumers on the 2 different servers, for the first few minutes each of them seems to seamlessly consume messages from the Kafka topic (one partition each). However, after a certain time, each consumer seems to be stuck at the while (it.hasNext()) statement, even though there are thousands of messages left to be consumed in each partition. Below is the screenshot that shows the status of the Kafka consumer offsets at that point.
As you can see, the consumers are far behind the number of messages available in the topic. From my logs, it looks like while this consuming thread is paused, the other threads are running fine and doing their jobs. From a longer run, interestingly, I have also noticed that the consuming thread kind of pauses and resumes after some time. However, each time it pauses, the number of messages consumed next time also decreases ridiculously. For example, after I first started both consumers, each seemed to seamlessly consume some 15,000 messages until getting stuck at the stream iterator, then paused for like 20-25 minutes and consumed like 5,000 more, then again paused for like 30 minutes and consumed like 100 more, and this goes on. If I stop the consumer processes and restart, the whole cycle seems to repeat.
These are the consumer configs I am using:
group.id=ct_job_backfill
zookeeper.session.timeout.ms=1000
zookeeper.sync.time.ms=200
auto.commit.enable=true
auto.offset.reset=smallest
rebalance.max.retries=20
rebalance.backoff.ms=2000
topic.name=contentTaskProd
The consumer servers are each 32-thread 64 GB machines running on Linux.
Any idea what might be causing this? Thanks in advance. Let me know if you need additional information or if anything is unclear.
UPDATE: I have tried increasing the number of partitions from 2 to 32, and inside each of my consumer servers spawning 16 consumer threads, each consuming from a partition. However, that doesn't seem to change the behaviour. I notice the same pause-and-resume cycle.
I have come across exactly the same issue. While browsing for a resolution I came across an issue already reported against Kafka at https://issues.apache.org/jira/browse/KAFKA-2978.
It looks like it was resolved in version 0.9.0.1. I am going to try updating the library to this version and will update here if I am able to resolve the issue with the new jar. In the meantime you can try the same.
~Cheers
I am using ActiveMQ 5.8 with wildcard consumers configured in camel route.
I am using default ActiveMQ configuration, so I have defaults as below
prefetch = 1
dispatch policy= Round Robin
Now I start a consumer JVM with 5 consumers each for 2 queues. Both queues have the same type of message and the same number of messages.
Consumers are doing nothing but printing the messages (so there is no DB blocking or slow-consumer issue).
EDIT
I have set prefetch to 1 for each of the queues.
What I observe is that one of the queues gets drained faster than the other.
What I expect is both queues getting drained at an equal pace, i.e. a kind of load balancing.
One surprising observation is that, though the ActiveMQ web console shows 5 consumers for each of those queues, when I debug my consumer I see only 5 threads/consumers from the Camel flow for the wildcard queue *.processQueue.
What would be the cause of the above behavior?
How do I make sure that all the queues drain at an equal pace?
Does anyone have experience to share on writing a custom dispatch policy or overriding the ActiveMQ defaults?
I was able to find a reference to this behavior:
Message distribution in case of wildcard queue consumers is random.
http://activemq.2283324.n4.nabble.com/Wildcard-and-message-distribution-td2346132.html#a2346133
Though this can be tuned by setting an appropriate prefetch size.
After trial and error, I arrived at the following formula to get fair distribution across the consumers and all the queues dequeued at almost the same pace:
prefetch = number of wildcard consumers
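For example, with the 5 wildcard consumers above, the prefetch can be set on the broker connection URI (a sketch; host and port are placeholders):
ActiveMQConnectionFactory factory =
    new ActiveMQConnectionFactory("tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=5");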
It's probably wrong to compare the rates at which the queues are consumed. The load balancing typically happens between consumers. So the idea is that each of the five consumers on the first queue would get a rather even load (given they are connected to the same broker).
However, I think you might want to double check your load test setup. It rarely gives predictable results when running broker and consumers on the same machine for instance.