I'm implementing a daily job that reads data from MongoDB (around 300K documents) and publishes one message per document to a RabbitMQ queue.
On the other side I have several consumers on the same queue, which ideally should work in parallel.
Everything works, but not as well as I would like, especially regarding consumer performance.
This is how I declare the queue:
rabbitMQ.getChannel().queueDeclare(QUEUE_NAME, true, false, false, null);
This is how the publishing is done:
rabbitMQ.getChannel().basicPublish("", QUEUE_NAME, null, body.getBytes());
So the channel used to declare the queue is used to publish all the messages.
And this is how the consumers are instantiated in a for loop (10 in total, but it can be any number):
Channel channel = rabbitMQ.getConnection().createChannel();
MyConsumer consumer = new MyConsumer(customMapper, channel, subscriptionUpdater);
channel.basicQos(1); // also tried with 0, 10, 100, ...
channel.basicConsume(QUEUE_NAME, false, consumer);
So for each consumer I create a new channel and this is confirmed by logs:
...
com.rabbitmq.client.impl.recovery.AutorecoveringChannel@bdd2027
com.rabbitmq.client.impl.recovery.AutorecoveringChannel@5d1b9c3d
com.rabbitmq.client.impl.recovery.AutorecoveringChannel@49a26d19
...
As far as I understand from my very short RabbitMQ experience, this should guarantee that all the consumers are called.
By the way, each consumer needs between 0.5 and 1.2 seconds to complete its task. I have spotted only a few that took 3 seconds.
I have two separate queues, and the setup above is repeated for each of them (using the same RabbitMQ connection).
So, I tested publishing 100 messages to each queue. Both queues have 10 consumers with qos=1.
I didn't expect a delivery/consume rate of exactly 10/s, but instead I noticed:
the actual rates are around 0.4 to 1.0/s.
at least all the consumers bound to the queue have received a message, but it doesn't look like "fair dispatching".
it took about 3 mins 30 secs to consume all the messages on both queues.
Am I missing the main concept of threading within RabbitMQ? Or is there a specific setting that might still be at its default value?
I have only been working with it for a few days, so that is quite possible.
Please note that I'm in the fortunate position where I can control both the publishing and the consuming parts :)
I'm using RabbitMQ 3.7.3 locally, so network latency cannot be the issue.
Thanks for your help!
The setup of RabbitMQ channels and consumers was correct in the end: one channel for each consumer.
The problem was that the consumers were calling a synchronized method to find and update a MongoDB document.
This delayed the execution of some consumers; even worse, the more consumers I added (hoping to speed up processing), the lower the message rate became.
I have moved the MongoDB part to the publishing side, where I don't have to care about synchronization because it is done in sequence by a single publisher. The delivery rate is slightly lower, but now with just 5 consumers I easily reach an ack rate of 50-60/s.
Lessons learnt:
create a separate channel for the publisher.
create a separate channel for each consumer.
let RabbitMQ manage threading for the consumers (so you can instantiate them on the main thread).
(if possible) back off publishing to give the queues 100% of their time to deal with consumers.
set a qos > 1 for each consumer channel. But this really depends on your scenario and architecture: you must run some performance tests.
As a general rule:
(1) calculate/estimate delivery time.
(2) calculate/estimate ack time.
(3) calculate/estimate consumer time.
qos = ((1) + (2) + (3)) / (3)
This will give you an initial qos value to test and tweak based on your scenario. The final goal is to have 100% utilization for all the available consumers.
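To make the rule concrete, here is a minimal sketch that reads it as (delivery time + ack time + consumer time) divided by consumer time. The class and method names are made up for illustration, and rounding up to a whole number is my assumption, since a prefetch count has to be an integer.

```java
final class QosEstimate {
    private QosEstimate() {}

    /**
     * Initial prefetch guess: full round trip (delivery + ack + processing)
     * divided by the processing time, rounded up and never below 1.
     * All three inputs must use the same time unit.
     */
    static int initialPrefetch(double deliveryMs, double ackMs, double consumerMs) {
        if (consumerMs <= 0) {
            throw new IllegalArgumentException("consumerMs must be positive");
        }
        double ratio = (deliveryMs + ackMs + consumerMs) / consumerMs;
        return Math.max(1, (int) Math.ceil(ratio));
    }

    public static void main(String[] args) {
        // Hypothetical timings: 125 ms delivery, 125 ms ack, 500 ms of work.
        System.out.println(initialPrefetch(125, 125, 500));
    }
}
```

The result is only a starting point for channel.basicQos(...); measure and adjust from there.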
The RabbitMQ documentation states:
Section 4.7 of the AMQP 0-9-1 core specification explains the conditions under which ordering is guaranteed: messages published in one channel, passing through one exchange and one queue and one outgoing channel will be received in the same order that they were sent. RabbitMQ offers stronger guarantees since release 2.7.0.
But what if the bindings go Exchange 1 -> Exchange 2 -> Queue 1?
Is the ordering still guaranteed?
We assumed it was, but we found in our application that this might not be the case. We use spring-rabbit-2.1.6-RELEASE (which uses amqp-client-5.4.3).
The publishers, bindings and consumers are as follows:
Client 1 publishes to Exchange 1 -> Exchange 2 -> Queue 1 - consumed by Client 2
                                               -> Queue 2 - consumed by Client 3
We can see that Client 1 publishes 3 messages in the following order:
Message 1
Message 2
Message 3
But both Client 2 and Client 3 receive the messages in the following order:
Message 3
Message 1
Message 2
EDIT 1 (Spring configuration)
For the publisher (Client 1) the following XML configuration is used (no extra properties set on rabbit's ConnectionFactory):
<rabbit:connection-factory channel-cache-size="1" cache-mode="CHANNEL" id="respConnFactory" addresses="..." virtual-host="..." username="..." password="..." executor="connExec"/>
<!-- the executor has no meaning for such usage, as mentioned by Gary -->
The publishing is done via:
AmqpTemplate::send(String exchange, String routingKey, Message message)
in a dedicated thread.
Client 2 uses default spring configuration with SimpleMessageListenerContainer.
Client 3 isn't actually our application, so I don't know its real setup. They were the ones who reported the bug to us, saying that the messages aren't ordered properly.
Of course there is still a possibility that our logging of the message publishing has a bug. But I triple-checked it: publishing happens from a single thread, and each message carries a sequence number in a custom header which is incremented correctly on Client 1.
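For reference, a check on the receiving side based on that sequence number could look roughly like the sketch below. This is not our actual code: the class name is hypothetical, and "late" is defined here as any sequence number that arrives after a higher one has already been seen.

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceOrderCheck {
    private SequenceOrderCheck() {}

    /**
     * Returns the sequence numbers that arrived after a higher sequence
     * number had already been received, in arrival order.
     */
    static List<Long> lateArrivals(List<Long> receivedSequence) {
        List<Long> late = new ArrayList<>();
        long highestSeen = Long.MIN_VALUE;
        for (long seq : receivedSequence) {
            if (seq < highestSeen) {
                late.add(seq);       // overtaken by a later message
            } else {
                highestSeen = seq;   // in order so far
            }
        }
        return late;
    }
}
```

Fed with the receive order from above (3, 1, 2), it flags messages 1 and 2.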
EDIT 2
I did further analysis to find out how often the wrong message ordering happens. Here are the results:
I took the logs and data from +/- 2 hours around the incident (4 hours in total). There were 42706 messages sent and only 3 of them were in the wrong order on Client 2. All 3 messages were sent within an interval of 7 ms.
Then I randomly took another time window of 14 hours. There were 531904 messages sent and all of them were received by Client 2 in the correct order. The average message rate is ~11 messages per second.
The messages aren't distributed evenly, so 3 messages within 7 ms is nothing special - quite the opposite. It's common for multiple messages to be sent within 3-5 ms.
From this analysis I assume there was something weird going on in the rabbit cluster. Unfortunately I don't have its logs anymore.
From my point of view, the chance of some kind of race condition is very low.
Thank you,
Frank
Spring AMQP uses a cache for channels; in a multi-threaded environment, there is no guarantee that the same thread will always use the same channel; hence ordering is not guaranteed.
With the current releases, the solution is to use scoped operations which will guarantee that a series of publications will occur on the same channel and guarantee order.
In the next release (2.3, available later this year), we have also added the ThreadChannelConnectionFactory which does the same thing.
It happened again and this time we were able to figure it out.
All along it was the rabbit health indicator that was responsible for channel recreation and therefore for the wrong ordering. There was a job which periodically called the health endpoint.
As Gary correctly mentioned:
Spring AMQP uses a cache for channels; in a multi-threaded environment, there is no guarantee that the same thread will always use the same channel; hence ordering is not guaranteed.
The health status is checked from a different thread and it uses the producer's channel.
As a short-term solution this will work:
management.health.rabbit.enabled=false
The ordering is guaranteed if the producer is really single-threaded and the connection factory is set up as in the description.
Another (and maybe more proper) solution is to create a separate ConnectionFactory and not use the auto-configuration for the rabbit health check.
@Bean("rabbitHealthIndicator")
public HealthIndicator rabbitHealthIndicator(ConnectionFactory healthCheckConnectionFactory) {
    // Make sure this is a different connection factory than the one with guaranteed ordering.
    RabbitTemplate rabbitTemplate = new RabbitTemplate(healthCheckConnectionFactory);
    return new RabbitHealthIndicator(rabbitTemplate);
}
That did the trick.
Cheers and thank you Gary for your help.
Frank
We are seeing unexpected rebalances in Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debug techniques to figure out rebalance causes?
Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.
Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause/resume on consumers for backpressure, which should prevent this.
Two processes are reading a topic. Sometimes a rebalance happens when it looks like both processes are reading ok. Afterwards, reading works ok, but it's a hiccup in processing.
We expect partitions would not rebalance without also seeing some cause or failure.
Sometimes poll() gets stuck (exceeds the timeout) and we use wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after consumers are closed (we've seen thousands). The timing seems unrelated to rebalances, so rebalances seem like a separate problem, but maybe heartbeats are hitting an unlogged network problem.
We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
The rebalances are intermittent and hard to reproduce. They happened at a message rate anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.
Our read loop is trivial - basically "while running, poll with timeout and error handling, then enqueue received messages".
People have asked good related questions, but the answers didn't help us:
Conditions in which Kafka Consumer (Group) triggers a rebalance
What exactly IS Kafka Rebalancing?
Continuous consumer group rebalancing with more consumers than partitions
Configuration:
Kafka 0.10.1.0 (We've started trying 1.0.0 & don't have test results yet)
Java 8 brokers and clients
2 brokers, 1 zookeeper, stable running processes & no additions
5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
Topic A has 16 partitions and replication 2, and is created before consumers start.
One process writes to topic A; two processes read from topic A.
Each reader process runs 16 consumers. Some consumers are idle when 16 partitions evenly balance.
The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
All the consumers for topic A are in the same consumer group.
The timeout for KafkaConsumer.poll() is 1000 milliseconds.
The configuration that affects rebalance is:
max.poll.interval.ms=50000
max.poll.records=100
request.timeout.ms=40000
session.timeout.ms=20000
We use defaults for these:
heartbeat.interval.ms=3000
(broker) group.max.session.timeout.ms=300000
(broker) group.min.session.timeout.ms=6000
Check the GC log and make sure full GCs are not happening frequently, since they would stop the heartbeat thread from running.
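Besides GC, it may be worth sanity-checking how the timeouts listed in the question relate to each other. The sketch below is only an illustration: the class name is made up, the broker settings are put into the same Properties object purely for the range check, and the rule that request.timeout.ms should be larger than max.poll.interval.ms is my reading of the 0.10.1.0 upgrade notes, so please verify it against the documentation for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class RebalanceTimeoutCheck {
    private RebalanceTimeoutCheck() {}

    private static long ms(Properties p, String key) {
        return Long.parseLong(p.getProperty(key));
    }

    /** Values copied from the question (consumer settings plus broker session limits). */
    static Properties questionSettings() {
        Properties p = new Properties();
        p.setProperty("max.poll.interval.ms", "50000");
        p.setProperty("request.timeout.ms", "40000");
        p.setProperty("session.timeout.ms", "20000");
        p.setProperty("heartbeat.interval.ms", "3000");
        p.setProperty("group.min.session.timeout.ms", "6000");
        p.setProperty("group.max.session.timeout.ms", "300000");
        return p;
    }

    /** Returns one warning per relationship that does not hold. */
    static List<String> warnings(Properties p) {
        List<String> out = new ArrayList<>();
        long session = ms(p, "session.timeout.ms");
        long heartbeat = ms(p, "heartbeat.interval.ms");
        long request = ms(p, "request.timeout.ms");
        long maxPoll = ms(p, "max.poll.interval.ms");
        if (session < ms(p, "group.min.session.timeout.ms")
                || session > ms(p, "group.max.session.timeout.ms")) {
            out.add("session.timeout.ms is outside the broker's allowed range");
        }
        if (heartbeat * 3 > session) {
            out.add("heartbeat.interval.ms is above one third of session.timeout.ms");
        }
        if (request <= session) {
            out.add("request.timeout.ms is not larger than session.timeout.ms");
        }
        // Assumption: rule taken from the 0.10.1.0 upgrade notes.
        if (request <= maxPoll) {
            out.add("request.timeout.ms is not larger than max.poll.interval.ms");
        }
        return out;
    }

    public static void main(String[] args) {
        warnings(questionSettings()).forEach(System.out::println);
    }
}
```

With the values from the question only the last check fires (40000 vs 50000). Whether that explains the rebalances I can't say, but it is cheap to rule out.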
Sometimes, due to external problems, I need to requeue a message with basic.reject and requeue = true.
But I don't want to consume it again immediately, because it will probably fail again within a short time. If I requeue it continuously, this may result in an infinite requeue loop.
1. So I need to consume it later, say one minute later.
2. And I need to know how many times the message has been requeued, so that I can stop requeuing it and just reject it to declare that it failed to be consumed.
PS: I am using Java client.
There are multiple solutions to point 1.
First one is the one chosen by Celery (a Python producer/consumer library that can use RabbitMQ as broker). Inside your message, add a timestamp at which the task should be executed. When your consumer gets the message, do not ack it and check its timestamp. As soon as the timestamp is reached, the worker can execute the task. (Note that the worker can continue working on other tasks instead of waiting)
This technique has some drawbacks. You have to increase the QoS per channel to an arbitrary value. And if your worker is already working on a long-running task, the delayed task won't be executed until the first task has finished.
A second technique is RabbitMQ-only and is much more elegant. It takes advantage of dead-letter exchanges and message TTL. You create a new queue which isn't consumed by anybody. This queue has a dead-letter exchange that will forward the messages to the consumer queue. When you want to defer a message, ack it (or reject it without requeue) from the consumer queue and copy the message into the dead-lettering queue with a TTL equal to the delay you want (say, one minute). At (roughly) the end of the TTL, the deferred message will land in the consumer queue again, ready to be consumed. The RabbitMQ team has also made the Delayed Message Plugin. The plugin is marked as experimental yet fairly stable, and potentially suitable for production use as long as the user is aware of its limitations; it has serious limitations in terms of scalability and reliability in case of failover, so you have to decide whether you really want to use it in production or prefer to stick to the manual way, which is limited to one TTL per queue.
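As a rough sketch of the manual way, the holding queue only needs a few queue arguments. The x-* keys below are the standard RabbitMQ argument names; the class name, queue names and the one-minute delay are placeholders, and this variant sets the TTL on the queue (x-message-ttl) rather than per message.

```java
import java.util.HashMap;
import java.util.Map;

final class DelayQueueArguments {
    private DelayQueueArguments() {}

    /**
     * Arguments for a queue that nobody consumes: messages expire after
     * delayMs and are then dead-lettered to targetExchange using
     * targetRoutingKey.
     */
    static Map<String, Object> forDelay(String targetExchange, String targetRoutingKey, long delayMs) {
        Map<String, Object> args = new HashMap<>();
        args.put("x-dead-letter-exchange", targetExchange);
        args.put("x-dead-letter-routing-key", targetRoutingKey);
        args.put("x-message-ttl", delayMs);
        return args;
    }

    public static void main(String[] args) {
        // "" is the default exchange, so the routing key is the name of the
        // consumer queue ("work" here is a placeholder).
        Map<String, Object> queueArgs = forDelay("", "work", 60_000L);
        System.out.println(queueArgs);
        // With the Java client this map would be passed as the last parameter:
        // channel.queueDeclare("work.delay", true, false, false, queueArgs);
    }
}
```

To defer a message you would then ack it on the consumer queue and publish a copy to the holding queue ("work.delay" in this sketch).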
Point 2 just requires putting a counter in your message and handling it inside your app. You can choose to put this counter in a header or directly in the body.
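A minimal sketch of the header variant, assuming a header called x-retry-count (a name I made up, not something RabbitMQ defines). Keep in mind that basic.reject with requeue = true puts the message back unchanged, so the counter can only be bumped when you republish a copy, as in the delay-queue approach.

```java
import java.util.HashMap;
import java.util.Map;

final class RetryCounter {
    /** Hypothetical header name; any application-defined key works. */
    static final String HEADER = "x-retry-count";

    private RetryCounter() {}

    /** Number of retries recorded so far; 0 if the header is missing. */
    static int attemptsSoFar(Map<String, Object> headers) {
        Object value = headers == null ? null : headers.get(HEADER);
        return value instanceof Number ? ((Number) value).intValue() : 0;
    }

    /** True while the message may still be retried. */
    static boolean shouldRetry(Map<String, Object> headers, int maxRetries) {
        return attemptsSoFar(headers) < maxRetries;
    }

    /** Copy of the headers with the counter incremented, for the republished message. */
    static Map<String, Object> withIncrementedCount(Map<String, Object> headers) {
        Map<String, Object> copy = headers == null ? new HashMap<>() : new HashMap<>(headers);
        copy.put(HEADER, attemptsSoFar(headers) + 1);
        return copy;
    }
}
```

In the consumer you would call shouldRetry(...) first: if it returns true, republish with withIncrementedCount(...); otherwise reject without requeue and treat the message as failed.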
I had a question on how rabbitmq works with batching acknowledgements. I understand that the Prefetch value is the max number of messages that will get queued before reaching its limit. However, I wasn't sure if the ack's manage themselves or if I have to manage this in code.
Which method is correct?
Send each basicAck with multiple set to true
or
Wait until 10 acks are due and send only the last one, and AMQP will automatically acknowledge all the previous ones in the queue (with multiple set to true).
TL;DR: multiple = true is faster in some cases but requires much more careful bookkeeping and batch-like processing.
The consumer gets messages that have a monotonically growing id specific to that consumer's channel. The id is a 64-bit number (a long in Java) called the delivery tag. The prefetch is the maximum number of unacked messages a consumer will hold at any one time.
When you ack the highest delivery tag with multiple = true, it will acknowledge all the unacked messages with a lower delivery tag (smaller number) that the consumer has outstanding. Obviously, if you have a high prefetch this is faster than acking each message.
Now RabbitMQ knows the consumer received the messages (the unacked ones), but it doesn't know whether all those messages have been correctly consumed. So the burden is on you, the developer, to make sure all the previous messages have been consumed. The client library will deliver the messages in order (I believe internally the client uses a BlockingQueue), but depending on the library/client used downstream, the messages might not be processed in order.
Thus this really only works well when you are batching the messages together in a single go (e.g. a transaction, or sending a group of messages off to some other system) or buffering reliably. Often this is done with a blocking queue that is periodically drained to send a group of messages to a downstream system.
On the other hand, if you are streaming each message in real time, then you can't really do this and have to ack each message individually (i.e. multiple = false).
There is also the case where one of the messages in the group is bad (e.g. drained from the internal queue... not the rabbit queue) and you want to nack that bad one. If that is the case, you can't use multiple = true either.
Finally, if you wait for a certain number of messages (instead of, say, a time limit) and that number is larger than the prefetch, you will wait indefinitely, which is not a good idea. You need to wait on time as well, and the number of messages must be <= prefetch.
As you can see, it's fairly nontrivial to use multiple = true correctly.
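To illustrate the bookkeeping, here is a minimal sketch of a batcher that decides when to send one ack with multiple = true. It is not taken from any library: the class is hypothetical, it assumes messages finish processing in delivery order on a single channel, and time is passed in explicitly so the logic stays easy to test.

```java
import java.util.OptionalLong;

final class AckBatcher {
    private final int batchSize;
    private final long maxWaitMs;
    private int pending = 0;
    private long highestTag = -1;
    private long oldestPendingAt = 0;

    AckBatcher(int prefetch, int batchSize, long maxWaitMs) {
        // A batch larger than the prefetch could never fill up.
        if (batchSize < 1 || batchSize > prefetch) {
            throw new IllegalArgumentException("batchSize must be between 1 and prefetch");
        }
        this.batchSize = batchSize;
        this.maxWaitMs = maxWaitMs;
    }

    /** Record a fully processed message; returns the tag to ack with multiple = true, if any. */
    OptionalLong onProcessed(long deliveryTag, long nowMs) {
        if (pending == 0) {
            oldestPendingAt = nowMs;
        }
        pending++;
        highestTag = Math.max(highestTag, deliveryTag);
        return flushIfDue(nowMs);
    }

    /** Call periodically so that a partial batch is still acked after maxWaitMs. */
    OptionalLong onTick(long nowMs) {
        return flushIfDue(nowMs);
    }

    private OptionalLong flushIfDue(long nowMs) {
        boolean full = pending >= batchSize;
        boolean stale = pending > 0 && nowMs - oldestPendingAt >= maxWaitMs;
        if (!full && !stale) {
            return OptionalLong.empty();
        }
        pending = 0;
        return OptionalLong.of(highestTag);
    }
}
```

Whenever a tag comes back, the caller would do channel.basicAck(tag, true). If a single message in the batch has to be nacked, this scheme no longer applies and you are back to acking individually.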
First, one correction regarding "Prefetch value is the max number of messages that will get queued before reaching its limit": this is not what the prefetch value is. The prefetch value is the number of unacked messages that a consumer "gets" from the queue. So they are kind of assigned to the consumer but remain in the queue until they are acknowledged. To quote the RabbitMQ work queues tutorial, for a prefetch of 1:
This tells RabbitMQ not to give more than one message to a worker at a
time. Or, in other words, don't dispatch a new message to a worker
until it has processed and acknowledged the previous one.
And for your question:
I wasn't sure if the ack's manage themselves or if I have to manage
this in code.
You can set the auto ack flag to true, and then you could say that the acks manage themselves.
I am using ActiveMQ 5.8 with wildcard consumers configured in camel route.
I am using default ActiveMQ configuration, so I have defaults as below
prefetch = 1
dispatch policy= Round Robin
Now I start a consumer JVM with 5 consumers for each of 2 queues. Both queues hold the same type of message and the same number of messages.
The consumers do nothing but print the message (so there is no DB blocking or slow consumer issue).
EDIT
I have set preFetch to 1 for each of the queue
What I observe is that one of the queues is drained faster than the other.
What I expect is that both queues are drained at an equal pace, a kind of load balancing.
One surprising observation:
Although the ActiveMQ web console shows 5 consumers for each of those queues,
when I debug my consumer, I see only 5 threads / consumers from the Camel flow for the wildcard queue *.processQueue
What could be the cause of the above behavior?
How do I make sure that all the queues drain at an equal pace?
Does anyone have experience to share on writing a custom dispatch policy or overriding the defaults of ActiveMQ?
I was able to find a reference to this behavior
Message distribution in case of wildcard queue consumers is random.
http://activemq.2283324.n4.nabble.com/Wildcard-and-message-distribution-td2346132.html#a2346133
This can, however, be tuned by setting an appropriate prefetch size.
After some trial and error, I arrived at the following formula, which gives a fair distribution across the consumers, with all the queues being dequeued at almost the same pace.
prefetch = number of wildcard consumers
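One way to apply the formula is ActiveMQ's consumer.prefetchSize destination option. The helper below is only a sketch with a made-up class name; it builds the destination name, and I have not verified how a Camel endpoint URI passes this option through, so treat that part as an open question.

```java
final class WildcardPrefetch {
    private WildcardPrefetch() {}

    /** Appends the prefetch destination option, using prefetch = number of wildcard consumers. */
    static String destinationWithPrefetch(String destination, int wildcardConsumers) {
        if (wildcardConsumers < 1) {
            throw new IllegalArgumentException("wildcardConsumers must be at least 1");
        }
        return destination + "?consumer.prefetchSize=" + wildcardConsumers;
    }

    public static void main(String[] args) {
        // 5 wildcard consumers, as in the question.
        System.out.println(destinationWithPrefetch("*.processQueue", 5));
    }
}
```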
It's probably wrong to compare the rate the queues are consumed. The load balancing typically happens between consumers. So, the idea is that each of the five consumers on the first queue would get rather even load (given they are connected to the same broker).
However, I think you might want to double check your load test setup. It rarely gives predictable results when running broker and consumers on the same machine for instance.