I have a problem: my queue contains messages from different source systems.
For example, the first message has the source system name 'X' and the second one has 'Y'.
Currently I have a JMS listener with the concurrency level set to 1, so all messages are processed one by one as expected. Now I want to process messages concurrently, such that only one message per source system is processed at a time, while messages from different source systems are processed in parallel.
Source systems are created dynamically, which is why I can't have a separate queue and consumer for each source system.
It would be great if someone could push me in the right direction.
It sounds like your problem is about maintaining ordered delivery for messages originating from a given source, while still being able to process messages from different sources in parallel.
You can do this using message groups.
The broker allows messaging applications to classify a set of related
messages as belonging to a group. This allows a message producer to
indicate to the consumer that a group of messages should be considered
a single logical operation with respect to the application.
To make this work, have the producer systems set the JMSXGroupID header to the producer system name:
TextMessage message = session.createTextMessage("<message />");
// Group all messages from this source system so they are consumed in order
message.setStringProperty("JMSXGroupID", "SourceSystem1Name");
Then the broker will enforce consumption ordering among messages belonging to that group.
Addendum
There might be N source systems as they are created dynamically, and there is a producer which puts all the messages from these N source systems onto the queue
So the message producer can set the JMSXGroupID header to the name of the source system.
The problem I'm facing is: if a message for one source system is being processed, messages for other source systems have to wait until that message finishes processing
So once the group header is set as described, the broker will ensure that it only releases messages for a given source system to the consumer sequentially. It does this by forcing the consumer to send an acknowledgement that a previous message has been processed, before releasing the next message in the group.
So by setting concurrency to some appropriate value, the consumer can process messages from different source systems in parallel but will be forced to sequentially process messages from any given source system, which is the behaviour you require.
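For illustration, a minimal consumer-side sketch assuming a Spring JMS setup (the queue name, listener class, and concurrency value are placeholders, not anything from the original post):

import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class SourceSystemListener {

    // Up to 5 concurrent consumers on the same queue. Because the producer sets
    // JMSXGroupID per source system, the broker pins each group to a single
    // consumer, so messages from one source system are still processed
    // sequentially while different source systems run in parallel.
    @JmsListener(destination = "SOURCE_SYSTEM_QUEUE", concurrency = "5")
    public void onMessage(String body) {
        // process the message here
    }
}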
Because the concurrency is declared on the consumer side, you cannot (to my knowledge) change it dynamically according to another variable.
The solution I would suggest is to use selectors and use two consumer connections like the following:
The consumer endpoint for parallel processing (5 concurrent consumers in this case) will look like:
activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=5&selector=MySourceHeader<>'X'
The consumer endpoint for sequential processing will look like:
activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=1&selector=MySourceHeader='X'
It would also be good to define a prefetch size of one for the queue (though I am not sure it is mandatory).
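For what it's worth, a rough Camel route sketch wiring up the two endpoints above (the processor bean name is an assumption, and the selector values may need URI encoding depending on your Camel version):

import org.apache.camel.builder.RouteBuilder;

public class SelectorRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Parallel consumption for everything that is NOT from source system 'X'
        from("activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=5&selector=MySourceHeader<>'X'")
            .to("bean:messageProcessor");

        // Sequential consumption for source system 'X'
        from("activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=1&selector=MySourceHeader='X'")
            .to("bean:messageProcessor");
    }
}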
Related
I am building a high volume system that will be processing up to a hundred million messages every day. I have a microservice that reads from a Kafka topic and does some basic processing on the messages before forwarding them to the next microservice.
Kafka Topic -> Normalizer Microservice -> Ordering Microservice
Below is what the processing would look like:
Normalizer would be concurrently picking up messages from the Kafka topic.
Normalizer would read the messages from the topic and post them to an in-memory seda queue from where the message would be subsequently picked up, normalized and validated.
This normalization, validation and processing is expected to take around 1 second per message. Within this one second, the message will be stored to the database and will become persistent in the system.
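For context, a rough Camel sketch of the flow described above (the topic, endpoint options, and bean names are only illustrative, not part of the actual system):

import org.apache.camel.builder.RouteBuilder;

public class NormalizerRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Read from the Kafka topic and hand off to an in-memory seda queue
        from("kafka:incoming-topic?brokers=localhost:9092")
            .to("seda:normalize?size=1000&blockWhenFull=true");

        // Pick up from the seda queue, normalize/validate, then persist
        from("seda:normalize?concurrentConsumers=5")
            .bean("normalizationService", "normalizeAndValidate")
            .bean("messageRepository", "save");
    }
}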
My concern is that during this processing, if a message has already been read from the topic and posted to the seda queue and has either
not yet been picked up from the seda queue, or
been picked up from the seda queue and is currently being processed but has not yet been persisted to the database
and the Normalizer JVM crashes or is force-killed (kill -9), how do I ensure that I do NOT lose the message?
It is critical that I do NOT drop/lose any messages and even in case of a crash/failure, I should be able to retain the message such that I can trigger re-processing of that message if required.
One naïve approach that comes to mind is to push the message to a cache (which will be a very fast operation).
Read from topic -> Push to cache -> Push to seda queue
Needless to say, the problem still exists; it just makes it less probable that I will lose the message. Also, this is certainly not the smartest solution out there.
Please share your thoughts on how I can design this system such that I can preserve messages on my side once the messages have been read off of the Kafka topic even in the event of the Normalizer JVM crashing.
Our program uses a queue.
Multiple consumers are processing messages.
Consumers do the following:
Receive an on/off status message from the queue.
Get the latest status from the repository.
Compare the state in the repository with the state received in the message.
If the on/off status is different, update the data. (At this time, other related data are also updated.)
Assuming that this process is handled by multiple consumers, the following problem is expected.
Producer sends messages 1: on, 2: off, and 3: on.
Consumer A receives message #1 and stores it in the storage because there is no existing data.
Consumer A receives message #2.
At this time, consumer B receives message #3 at the same time.
Consumers A and B read the latest data from the storage at the same time (message 1).
Consumer B finishes processing first. It does not update the repository because the on/off state is unchanged (1: on, 3: on).
Then consumer A finishes processing. The on/off state has changed, so it updates and saves the data (1: on, 2: off).
In the normal case, the latest data remaining in the DB should be on.
(This is because the message was sent in the order of on -> off -> on.)
However, according to the above scenario, off remains the latest data.
Is there any good way to solve this problem?
For reference, the queue we use is AWS Amazon MQ, the storage is AWS DynamoDB, and we are using Spring Boot.
The fundamental problem here is that you need to consume these "status" messages in order, but you're using concurrent consumers, which leads to race conditions and out-of-order message processing. In short, your basic architecture using concurrent consumers is causing this problem.
You could possibly work up some kind of solution in the database with timestamps as suggested in the comments, but that would be extra work for the clients and extra data stored in the database that isn't strictly necessary.
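That said, if you did go the timestamp route, the usual trick with DynamoDB is a conditional write, so an older message can never overwrite a newer one. A rough sketch assuming the AWS SDK v2 (the table and attribute names are made up):

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class StatusWriter {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Writes the status only if the stored timestamp is older than the message's timestamp.
    public void updateStatus(String entityId, String status, long messageTimestamp) {
        try {
            dynamo.putItem(PutItemRequest.builder()
                .tableName("entity_status")
                .item(Map.of(
                    "entityId", AttributeValue.builder().s(entityId).build(),
                    "status", AttributeValue.builder().s(status).build(),
                    "ts", AttributeValue.builder().n(Long.toString(messageTimestamp)).build()))
                .conditionExpression("attribute_not_exists(ts) OR ts < :newTs")
                .expressionAttributeValues(Map.of(
                    ":newTs", AttributeValue.builder().n(Long.toString(messageTimestamp)).build()))
                .build());
        } catch (ConditionalCheckFailedException stale) {
            // An out-of-order (older) message arrived; ignore it.
        }
    }
}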
The simplest way to solve the problem is to just consume the messages serially rather than concurrently. There are a handful of different ways to do this, e.g.:
Define just 1 consumer for the queue with the "status" messages.
Use ActiveMQ's "exclusive consumer" feature to ensure that only one consumer receives messages (see the sketch after this list).
Use message groups to group all the "status" messages together to ensure they are processed serially (i.e. in order).
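As an illustration of the exclusive-consumer option, ActiveMQ lets you flag exclusivity via a destination option when the consumer creates the queue (the queue name here is just an example):

import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

public class ExclusiveConsumerExample {
    // Only one of the attached consumers will receive messages from this queue;
    // the others stay connected as hot standbys.
    public static MessageConsumer createExclusiveConsumer(Session session) throws JMSException {
        Queue queue = session.createQueue("STATUS.QUEUE?consumer.exclusive=true");
        return session.createConsumer(queue);
    }
}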
As mentioned in the answer,
A message queue is a one-way pipe: one process writes to the queue, and another reads the data in the order it was written
SysV message queue is one example
So, my understanding is,
one message queue is used by two processes, where one process (the producer) inserts an item into the queue and another process (the consumer) consumes the item from the queue.
1) Are RabbitMQ or Kafka message queues a 1:1 messaging system, used by only two processes, where one process writes and the other process reads?
2) After the consumer consumes the item, does the item get deleted? If not, why do we need a queue data structure? Why not just shared memory?
Kafka is not strictly 1:1 messaging system. Multiple producers can write into a topic and multiple consumers can read from it. Moreover, in Kafka, multiple consumers can be assigned same or different consumer groups. Every message is consumed by only one consumer from every consumer group (load balancing) and all consumer groups receive a copy of every message (of course, if they are subscribed to corresponding topics and no messages are lost). A good description of this process can be found in this article: Scalability of Kafka Messaging using Consumer Groups.
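For example, a minimal sketch of a consumer joining a consumer group (the topic and group names are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing this group.id split the topic's partitions between them;
        // a different group.id gets its own full copy of every message.
        props.put("group.id", "my-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}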
In Kafka all messages are persisted on disk and stored until compaction reaps them, the retention.ms passes, or the log size is exceeded. That's a very high-level point of view and there are a lot of nuances here. For example: the messages are stored in segments, and every segment contains multiple messages. When the retention period passes for a message, it is not removed from the segment at that moment; instead Kafka waits until all messages in that segment have expired and deletes the whole segment at once. Also, retention could kick in before the log exceeds the maximum size, or vice versa: the log can exceed the size limit even before the retention period passes. And so on. Just read the docs and pay attention to the topics about the "log cleaner" and "retention".
After a Kafka consumer reads a message, the message is neither compacted nor expired. So it's not removed from the log and stays there. This also means that every message can be re-read by a consumer if needed (until it is deleted completely). This can be useful if some of your consumers went offline for some reason and were not able to process the messages as they came in. It also allows interesting features like transaction replays and so on. Persistence is one of Kafka's features.
Shared memory? Well, strictly speaking, shared memory is confined to processes running on a single host, so there is absolutely no way to have "shared memory" when your app runs on multiple hosts. However, there are in-memory brokers: Redis, for example, can be used as a message broker, and it's all in-memory. However, if such a broker restarts for some reason, you lose everything. Speaking of Redis: it has two persistence configurations specifically to handle restarts.
I am not sure about RabbitMQ, but by default it probably deletes messages after the consumer acknowledges them. So it's closer to a 1:1 mental model. However, RabbitMQ employs disk persistence as well.
We're using ActiveMQ (5.14.5).
We have a single producer, and multiple consumers on the same queue.
From time to time we set JMSXGroupID to group several messages together to be consumed on a single consumer. This works as expected.
In parallel, the producer continues to send non-grouped messages (i.e. without JMSXGroupID)
The problem:
We noticed that once a consumer was selected to process a specific group, it no longer gets the non-grouped messages. Even if it is completely idle. The non-grouped messages are always sent to the other consumers.
The rogue consumer returns to consume non-grouped messages only after we close the group that was assigned to it (by setting JMSXGroupSeq=-1).
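(For reference, closing a group looks roughly like this with a plain JMS producer; the group name is illustrative:)

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;
import javax.jms.Session;

public class GroupCloser {
    // Sending a message with JMSXGroupSeq = -1 tells ActiveMQ to close the group,
    // freeing the consumer that was bound to it.
    public static void closeGroup(Session session, MessageProducer producer, String groupId)
            throws JMSException {
        Message close = session.createMessage();
        close.setStringProperty("JMSXGroupID", groupId);
        close.setIntProperty("JMSXGroupSeq", -1);
        producer.send(close);
    }
}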
Is this normal behavior? We expected that non-grouped messages would continue to be delivered in the usual round-robin fashion, to all consumers.
We were unable to find a clear reference to this in ActiveMQ documentation.
There's a bit of a no-win situation for the message broker here. If there are active message groups in play, the broker has to assume that further messages will be produced that fall into those groups. So a message consumer that has become bound to a particular group needs to remain available to consume later messages of that group, rather than ungrouped messages. After all, an ungrouped message can be handled elsewhere, while a grouped message can't.
However, we also want to have a fair-ish distribution of messages between consumers. So it makes sense that a consumer that is bound to a group, or groups, could take some work when it is idle.
But how do we know it is idle? What happens if a consumer takes a bunch of ungrouped messages (and don't forget the default pre-fetch behaviour), and then new messages arrive that match its specific group?
The fact that closing a group restores the "group consumer" to default behaviour suggests to me that this is not a bug, but a deliberate attempt to make a reasonable compromise in a tricky situation. It seems reasonable to me to ask for a feature to be added, where "group consumers" can take part in ungrouped workload, but I would be inclined to see that as an enhancement.
Just my $0.02, of course.
I have a JMS Queue that is populated at a very high rate ( > 100,000/sec ).
It can happen that there are multiple messages pertaining to the same entity every second as well (several updates to an entity, with each update as a different message).
On the other end, I have one consumer that processes these messages and sends them to other applications.
Now, the whole setup is slowing down since the consumer is not able to cope with the rate of incoming messages.
Since there is an SLA on the rate at which the consumer processes messages, I have been toying with the idea of having multiple consumers acting in parallel to speed up the process.
So, what I'm thinking of doing is:
Multiple consumers acting independently on the queue.
Each consumer is free to grab any message.
After grabbing a message, make sure it's the latest version of the entity. For this part, I can check with the application that processes this entity.
If it's not the latest, bump the version up and try again.
I have been looking up the Integration patterns, JMS docs so far without success.
I would welcome ideas to tackle this problem in a more elegant way along with any known APIs, patterns in Java world.
ActiveMQ solves this problem with a concept called "Message Groups". While it's not part of the JMS standard, several JMS-related products work similarly. The basic idea is that you assign each message to a "group" which indicates messages that are related and have to be processed in order. Then you set it up so that each group is delivered only to one consumer. Thus you get load balancing between groups but guarantee in-order delivery within a group.
Most EIP frameworks and ESBs have customizable resequencers. If the number of entities is not too large, you can have a queue per entity and resequence at the beginning.
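For example, a batch resequencer in Camel looks roughly like this (the queue names and the sequence header are assumptions):

import org.apache.camel.builder.RouteBuilder;

public class ResequenceRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Collect messages for up to 2 seconds, reorder them by the
        // "entityVersion" header, then pass them on in sequence.
        from("jms:queue:entity-updates")
            .resequence(header("entityVersion")).batch().timeout(2000)
            .to("jms:queue:entity-updates-ordered");
    }
}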
For those interested in a way to solve this:
Use the Recipient List EAI pattern.
As the question is about JMS, we can take a look into an example from Apache Camel website.
This approach is different from other patterns like CBR and Selective Consumer because the consumer is not aware of what message it should process.
Let me put this into a real-world example:
We have an Order Management System (OMS) which sends off Orders to be processed by the ERP. The Order then goes through 6 steps, and each of those steps publishes an event on the Order_queue, informing the new Order's status. Nothing special here.
The OMS consumes the events from that queue, but MUST process the events of each Order in the very same sequence they were published. The rate of messages published per minute is much greater than the consumer's throughput, hence the delay increases over time.
The solution requirements:
Consume in parallel, including as many consumers as needed to keep queue size in a reasonable amount.
Guarantee that events for each Order are processed in the same publish order.
The implementation:
On the OMS side
The OMS process responsible for sending Orders to the ERP, determines the consumer that will process all events of a certain Order and sends the Recipient name along with the Order.
How does this process know which Recipient to use? Well, you can use different approaches, but we used a very simple one: Round Robin.
On ERP
As it keeps the Recipient's name for each Order, it simply sets up the message to be delivered to the desired Recipient.
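A rough sketch of that routing with Camel's recipientList (the header name and the per-Recipient queue naming are assumptions):

import org.apache.camel.builder.RouteBuilder;

public class OrderEventRoute extends RouteBuilder {
    @Override
    public void configure() {
        // The ERP stamps each event with the Recipient chosen by the OMS,
        // and the route forwards the event to that Recipient's own queue.
        from("activemq:queue:Order_queue")
            .recipientList(simple("activemq:queue:oms-consumer-${header.recipientName}"));
    }
}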
On OMS Consumer
We've deployed 4 instances, each one using a different Recipient name and concurrently processing messages.
One could say that we created another bottleneck: the database. But that is not the case, since no two consumers ever work on the same Order concurrently.
One drawback is that the OMS process which sends the Orders to the ERP must keep knowledge about how many Recipients are working.