I am new to Kafka. I have two Kafka brokers and I am trying to push data through them; one is the primary and the other is a backup.
I am doing a small analysis in which I push data to Kafka through a thread pool executor. I kept the max pool size at 1 and the size of the ArrayBlockingQueue at 2. I triggered 10 requests through JMeter; 7 of them went to the rejection handler (as expected) and 3 went on to be produced to Kafka.
I took a thread dump to analyse the state of the Kafka threads and saw that 4 producer threads had been spun up for the Kafka producer.
I could not understand this: I am using two brokers and only 3 messages are being processed (2 in the blocking queue, 1 in the worker thread), so how did 4 producer threads get spun up?
PS: I cannot share the code here due to security concerns.
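For context, a minimal sketch of the kind of setup described (the topic name, broker addresses and producer wiring are hypothetical placeholders, not my actual code):

import java.util.Properties;
import java.util.concurrent.*;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Max pool size 1 and a bounded queue of 2 => at most 3 requests accepted, the rest rejected.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy()); // rejection handler

        for (int i = 0; i < 10; i++) {
            final String payload = "message-" + i;
            try {
                executor.submit(() -> producer.send(new ProducerRecord<>("my-topic", payload)));
            } catch (RejectedExecutionException e) {
                System.out.println("Rejected: " + payload);
            }
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        producer.close();
    }
}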
How can I limit the event consumption speed of a Kafka consumer so that the service is not impacted? We are getting huge amounts of data in the Kafka topic and I need to process all the events.
We have 8 consumers reading from 12 partitions.
The data volume is huge but the retention is 5 days, so how can I throttle from the consumer side so that the consumer won't go down? Is there any way in Kafka to cap the message reading speed on the consumer side, e.g. 100 events in 10 minutes?
You can limit consumer requests using Kafka quotas.
If your consumer is "going down", there may be other issues in the code, for example not blocking requests for more messages until the previous messages are completely processed. But quotas should help limit the consumers.
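Broker-side quotas are set per client (e.g. a consumer_byte_rate quota applied with the kafka-configs tool). As a complementary client-side sketch, assuming the newer KafkaConsumer API and hypothetical topic/group names, you can also cap the rate inside the poll loop yourself, e.g. roughly 100 events per 10 minutes, by pausing the partitions once the window's budget is spent (pausing keeps the consumer alive in the group, unlike a long sleep):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThrottledConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // hypothetical
        props.put("group.id", "slow-readers");            // hypothetical
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        long windowMs = 10 * 60 * 1000L;   // 10-minute window
        int maxEventsPerWindow = 100;
        long windowStart = System.currentTimeMillis();
        int consumedInWindow = 0;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                long now = System.currentTimeMillis();
                if (now - windowStart >= windowMs) {            // window elapsed: reset budget
                    windowStart = now;
                    consumedInWindow = 0;
                    consumer.resume(consumer.paused());
                }
                if (consumedInWindow >= maxEventsPerWindow) {   // budget spent: stop fetching
                    consumer.pause(consumer.assignment());
                }
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                    consumedInWindow++;
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();
                }
            }
        }
    }

    private static void process(String value) { /* hand off to the real processing */ }
}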
Requirement: in a poller/worker scenario, pollers should stop polling the remote service until a worker is available to take the task.
Background:
Due to throttling constraints on the number of requests to the remote service, I am trying to segregate polling from consuming. The constraint is that once a task is picked up by a poller, it will time out if it is not processed within a certain time, and our consumers can be arbitrarily long running (10 seconds to 10 minutes).
Currently, looking at the direct solutions, SynchronousQueue seems to be the simplest approach.
The problem is that if I have 2 pollers and 4 consumers, then while the 4 consumers are processing, the pollers will first poll two tasks from the remote service and then wait for consumers to become available. This will cause those 2 tasks to time out.
Is there a decent workaround for this, or should I go for a lock-based mechanism (like a Semaphore)?
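A minimal sketch of the Semaphore idea (all names hypothetical): permits represent free workers, and a poller must acquire a permit before it hits the remote service, so nothing is fetched that cannot be started immediately.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class GatedPollerSketch {
    private static final int WORKERS = 4;
    private static final Semaphore freeWorkers = new Semaphore(WORKERS);
    private static final ExecutorService workerPool = Executors.newFixedThreadPool(WORKERS);

    // Each poller thread runs this loop.
    static void pollLoop() throws InterruptedException {
        while (true) {
            freeWorkers.acquire();               // block until a worker is guaranteed to be free
            Task task = pollRemoteService();     // only now hit the remote service
            workerPool.submit(() -> {
                try {
                    task.process();              // may take 10 seconds to 10 minutes
                } finally {
                    freeWorkers.release();       // worker is free again, allow another poll
                }
            });
        }
    }

    // Hypothetical stand-ins for the real remote call and task type.
    interface Task { void process(); }
    static Task pollRemoteService() { return () -> {}; }
}

Compared to a plain SynchronousQueue hand-off, the permit is taken before the remote poll, so a task is never fetched unless a worker can start it right away.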
I'm implementing a daily job which gets data from MongoDB (around 300K documents) and, for each of them, publishes a message on a RabbitMQ queue.
On the other side I have some consumers on the same queue, which ideally should work in parallel.
Everything is working, but not as well as I would like, especially regarding consumer performance.
This is how I declare the queue:
rabbitMQ.getChannel().queueDeclare(QUEUE_NAME, true, false, false, null);
This is how the publishing is done:
rabbitMQ.getChannel().basicPublish("", QUEUE_NAME, null, body.getBytes());
So the channel used to declare the queue is used to publish all the messages.
And this is how the consumers are instantiated in a for loop (10 in total, but it can be any number):
Channel channel = rabbitMQ.getConnection().createChannel();
MyConsumer consumer = new MyConsumer(customMapper, channel, subscriptionUpdater);
channel.basicQos(1); // also tried with 0, 10, 100, ...
channel.basicConsume(QUEUE_NAME, false, consumer);
So for each consumer I create a new channel and this is confirmed by logs:
...
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#bdd2027
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#5d1b9c3d
com.rabbitmq.client.impl.recovery.AutorecoveringChannel#49a26d19
...
As far as I've understood from my very short RabbitMQ experience, this should guarantee that all the consumers are used.
By the way, consumers need between 0.5 and 1.2 seconds to complete their task; I have spotted only a very few taking 3 seconds.
I have two separate queues and I repeat the setup above twice (using the same RabbitMQ connection).
So, I have tested publishing 100 messages to each queue. Both of them have 10 consumers with qos=1.
I didn't expect a delivery/consume rate of exactly 10/s, but I noticed:
the actual rates are around 0.4 and 1.0 per second.
at least all the consumers bound to the queue have received a message, but it doesn't look like "fair dispatching".
it took about 3 mins 30 secs to consume all the messages on both queues.
Am I missing the main concept of threading in RabbitMQ? Or is there some specific configuration that might still be at its default value?
I've only been at this for a few days, so that's quite possible.
Please note that I'm in the fortunate position of controlling both the publishing and the consuming parts :)
I'm using RabbitMQ 3.7.3 locally, so it cannot be a network latency issue.
Thanks for your help!
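For completeness, a generic sketch of what a manual-ack consumer like MyConsumer looks like (this is an illustration only; the real class takes extra collaborators such as the mapper and updater, which are omitted here):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class MyConsumer extends DefaultConsumer {

    public MyConsumer(Channel channel) {
        super(channel);
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) throws IOException {
        try {
            doWork(new String(body, StandardCharsets.UTF_8)); // takes roughly 0.5 to 1.2 seconds
        } finally {
            // manual ack because basicConsume was called with autoAck=false
            getChannel().basicAck(envelope.getDeliveryTag(), false);
        }
    }

    private void doWork(String message) {
        // mapping / document update logic goes here
    }
}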
The setup of RabbitMQ channels and consumers was correct in the end: one channel for each consumer.
The problem was that the consumers were calling a synchronized method to find and update a MongoDB document.
This was delaying the execution of some consumers: even worse, the more consumers I added (thinking it would speed up processing), the lower the message rate/s I got.
I have moved the MongoDB part to the publishing side, where I don't have to care about synchronization because it's done sequentially by a single publisher. The delivery rate/s decreased slightly, but now with just 5 consumers I easily reach an ack rate of 50-60/s.
Lessons learnt:
create a separate channel for the publisher.
create a separate channel for each consumer.
let RabbitMQ manage threading for the consumers (--> you can instantiate them on the main thread).
(if possible) back off publishing to give the queues 100% of their time to serve the consumers.
set a qos > 1 for each consumer channel. But this really depends on your scenario and architecture: you must do some performance tests.
As a general rule:
(1) calculate/estimate delivery time.
(2) calculate/estimate ack time.
(3) calculate/estimate consumer time.
qos = ((1) + (2) + (3)) / (3)
This will give you an initial qos value to test and tweak based on your scenario. The final goal is to have 100% utilization for all the available consumers.
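For example, with a hypothetical delivery time of 50 ms, an ack time of 50 ms and a consumer processing time of 400 ms, qos = (50 + 50 + 400) / 400 = 1.25, so you would start testing with a prefetch of 2 and tweak from there.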
We are seeing unexpected rebalances in Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debug techniques to figure out rebalance causes?
Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.
Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause/resume on consumers for backpressure, which should prevent this.
Two processes are reading a topic. Sometimes a rebalance happens when it looks like both processes are reading ok. Afterwards, reading works ok, but it's a hiccup in processing.
We expect partitions would not rebalance without also seeing some cause or failure.
Sometimes poll() gets stuck (exceeds the timeout) and we use wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after consumers are closed (we've seen thousands). The timing seems unrelated to rebalances, so rebalances seem like a separate problem, but maybe heartbeats are hitting an unlogged network problem.
We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
The rebalances are intermittent and hard to reproduce. They happened at a message rate anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.
Our read loop is trivial - basically "while running, poll with timeout and error handling, then enqueue received messages".
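A simplified, self-contained sketch of that loop plus the rebalance listener we use for logging (names are placeholders, not our production code):

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.WakeupException;

public class ReadLoopSketch implements Runnable {
    private final KafkaConsumer<String, byte[]> consumer;
    private final BlockingQueue<ConsumerRecord<String, byte[]>> handoff = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    ReadLoopSketch(Properties consumerProps) {
        this.consumer = new KafkaConsumer<>(consumerProps);
    }

    @Override
    public void run() {
        consumer.subscribe(Collections.singletonList("A"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("revoked: " + partitions);   // we log/process certain rebalances here
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("assigned: " + partitions);
            }
        });
        try {
            while (running) {
                ConsumerRecords<String, byte[]> records = consumer.poll(1000); // 1000 ms timeout
                for (ConsumerRecord<String, byte[]> record : records) {
                    handoff.put(record);   // processing happens on a separate thread
                }
            }
        } catch (WakeupException | InterruptedException e) {
            // wakeup() is called from shutdown() when poll appears stuck, or we were interrupted
        } finally {
            consumer.close();
        }
    }

    public void shutdown() {
        running = false;
        consumer.wakeup();
    }
}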
People have asked good related questions, but the answers didn't help us:
Conditions in which Kafka Consumer (Group) triggers a rebalance
What exactly IS Kafka Rebalancing?
Continuous consumer group rebalancing with more consumers than partitions
Configuration:
Kafka 0.10.1.0 (We've started trying 1.0.0 & don't have test results yet)
Java 8 brokers and clients
2 brokers, 1 zookeeper, stable running processes & no additions
5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
Topic A has 16 partitions and replication 2, and is created before consumers start.
One process writes to topic A; two processes read from topic A.
Each reader process runs 16 consumers. Some consumers are idle when 16 partitions evenly balance.
The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
All the consumers for topic A are in the same consumer group.
The timeout for KafkaConsumer.poll() is 1000 milliseconds.
The configuration that affects rebalance is:
max.poll.interval.ms=50000
max.poll.records=100
request.timeout.ms=40000
session.timeout.ms=20000
We use defaults for these:
heartbeat.interval.ms=3000
(broker) group.max.session.timeout.ms=300000
(broker) group.min.session.timeout.ms=6000
Check the GC log and make sure there are no frequent full GCs, which would prevent the heartbeat thread from working.
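For example, on Java 8 you can enable GC logging with flags such as -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log; any pause approaching the 20-second session.timeout.ms would be a likely rebalance trigger.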
I have been working with Kafka lately and have a bit of confusion regarding consumers under a consumer group. The center of the confusion is whether to implement consumers as processes or threads. For this question, assume I am using the high-level consumer.
Let's consider a scenario that I have experimented with. In my topic there are 2 partitions (for simplicity let's assume replication factor is just 1). I created a consumer (ConsumerConnector) process consumer1 with group group1, then created a topic count map of size 2 and then spawned 2 consumer threads consumer1_thread1 and consumer1_thread2 under that process. It looks like consumer1_thread1 is consuming partition 0 and consumer1_thread2 is consuming partition 1. Is this behaviour always deterministic? Below is the code snippet. Class TestConsumer is my consumer thread class.
...
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, 2);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
executor = Executors.newFixedThreadPool(2);
int threadNumber = 0;
for (final KafkaStream<byte[], byte[]> stream : streams) {
executor.submit(new TestConsumer(stream, threadNumber));
threadNumber++;
}
...
Now, let's consider another scenario (which I haven't experimented with, but am curious about) where I start 2 consumer processes, consumer1 and consumer2, both with the same group group1, and each of them is a single-threaded process. My questions are:
How will the two independent consumer processes (still under the same group) be related to the partitions in this case? How is it different from the single-process, multi-thread scenario above?
In general, how are consumer threads or processes mapped/related to partitions in the topic?
The Kafka documentation does say that each consumer under a consumer group will consume one partition. However, does that refer to a consumer thread (as in my code example above) or to independent consumer processes?
Is there any subtle thing I am missing here regarding implementing consumers as processes vs threads? Thanks in advance.
A consumer group can have multiple consumer instances running (multiple processes with the same group-id). Each partition is consumed by exactly one consumer instance in the group.
E.g. if your topic contains 2 partitions and you start consumer group group-A with 2 consumer instances, then each of them will consume messages from one particular partition of the topic.
If you start the same 2 consumers with different group ids, group-A and group-B, then the messages from both partitions of the topic will be broadcast to each of them. So in that case the consumer instance running under group-A will get messages from both partitions of the topic, and the same is true for group-B.
Read more on this in the Kafka documentation.
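For example, with the high-level consumer the group is just a property, so the only difference between the two cases above is the group.id value. A minimal sketch (the ZooKeeper address and other settings are hypothetical):

import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class GroupIdSketch {
    static ConsumerConnector createConnector(String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // hypothetical
        props.put("group.id", groupId);
        props.put("auto.offset.reset", "smallest");
        return Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    }

    public static void main(String[] args) {
        // Same group: the topic's 2 partitions are split between these two connectors.
        ConsumerConnector a1 = createConnector("group-A");
        ConsumerConnector a2 = createConnector("group-A");

        // Different group: this connector additionally receives every message from both partitions.
        ConsumerConnector b = createConnector("group-B");
    }
}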
EDIT: Based on your comment, which says:
I was wondering what is the effective difference between having 2 consumer threads under the same process as opposed to 2 consumer processes (group being the same in both cases)
The consumer group-id is global across the cluster. Suppose you start process-one with 2 threads and then spawn another process (maybe on a different machine) with the same groupId and 2 more threads; Kafka will add these 2 new threads to the set consuming from the topic, so eventually there will be 4 threads responsible for consuming from it. Kafka will then trigger a rebalance to reassign partitions to threads, so a partition that was being consumed by thread T1 of process P1 may end up being consumed by thread T2 of process P2. The lines below are taken from the wiki page:
When a new process is started with the same Consumer Group name, Kafka will add that processes' threads to the set of threads available to consume the Topic and trigger a 're-balance'. During this re-balance Kafka will assign available partitions to available threads, possibly moving a partition to another process. If you have a mixture of old and new business logic, it is possible that some messages go to the old logic.
The main design decision in opting for multiple consumer instances with the same group id vs a single consumer instance is resiliency. For example, if you have a single consumer with two threads and that machine goes down, you lose all consumers. If you have two separate consumer instances with the same group id, each on a different host, they can survive a failure. Ideally each instance should still have two threads in the example above, so that if one host goes down the other instance uses its dormant thread to take over the orphaned partition. Indeed, it is generally desirable to have more threads than partitions to cover this case.
You can run each consumer instance on a different host. A single consumer instance for a given group name/id will only ever run on a single host, as it manages all its threads in a single runtime environment.
Kafka has an algorithm to determine which threads/consumer instances read the various topic partitions. Kafka tries to distribute these evenly and in a resilient fashion: when a consumer instance fails, other threads in other instances can take over reading the affected partitions.
It refers to a single thread in the consumer group. If there are more threads than partitions, some of them will simply remain dormant until other threads fail, in order to provide resiliency.
The preference relates to resilience: with multiple consumer instances set up with the same group id, I can run on multiple hosts, making my application tolerant to failure.
Thanks for the detailed answer from #user2720864, but I think the re-allocation case #user2720864 mentioned in that answer is not quite right: one partition cannot be consumed by two consumers in the same group.
When there are more consumers than partitions, each partition is allocated exclusively to one consumer, while the leftover consumers stay idle until some working consumer dies or is removed from the group.
Based on the Kafka Consumers document:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
And also from its API specification, in the "Consumer Groups and Topic Subscriptions" section:
This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group.