concurrent consumers yet ensure order - java

I have a JMS Queue that is populated at a very high rate ( > 100,000/sec ).
There can be multiple messages pertaining to the same entity every second (several updates to an entity, with each update sent as a separate message).
On the other end, I have one consumer that processes this message and sends it to other applications.
Now the whole setup is slowing down, since the consumer cannot cope with the rate of incoming messages.
Since there is an SLA on the rate at which the consumer processes messages, I have been toying with the idea of having multiple consumers acting in parallel to speed up the process.
So, what I'm thinking of doing is:
Multiple consumers acting independently on the queue.
Each consumer is free to grab any message.
After grabbing a message, make sure it's the latest version of the entity. For this part, I can check with the application that processes this entity.
If it's not the latest, bump the version up and try again.
I have been looking up the Integration patterns, JMS docs so far without success.
I would welcome ideas to tackle this problem in a more elegant way along with any known APIs, patterns in Java world.

ActiveMQ solves this problem with a concept called "Message Groups". While it's not part of the JMS standard, several JMS-related products work similarly. The basic idea is that you assign each message to a "group" which indicates messages that are related and have to be processed in order. Then you set it up so that each group is delivered only to one consumer. Thus you get load balancing between groups but guarantee in-order delivery within a group.
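In ActiveMQ the group is named by the JMSXGroupID string property set on each message. To show only the delivery rule (one group, one consumer) without a broker, here is a minimal in-process sketch. GroupedDispatcher is a hypothetical class, not part of ActiveMQ, and hashing the group ID onto a fixed set of single-threaded executors is an assumption made for illustration; the broker assigns groups to consumers in its own way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-process stand-in for broker-side message groups. */
final class GroupedDispatcher {
    private final List<ExecutorService> consumers = new ArrayList<>();

    GroupedDispatcher(int consumerCount) {
        for (int i = 0; i < consumerCount; i++) {
            // One thread per consumer: work handed to it runs in submission order.
            consumers.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Index of the consumer that owns a group; stable for a given group id. */
    int consumerFor(String groupId) {
        return Math.floorMod(groupId.hashCode(), consumers.size());
    }

    /** Hands the work to the single consumer that owns the group. */
    void dispatch(String groupId, Runnable work) {
        consumers.get(consumerFor(groupId)).execute(work);
    }

    /** Stops accepting work and waits for queued work; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutSeconds) {
        for (ExecutorService consumer : consumers) {
            consumer.shutdown();
        }
        try {
            boolean allDone = true;
            for (ExecutorService consumer : consumers) {
                allDone &= consumer.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
            return allDone;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Different entities spread over the consumers, while all updates to one entity go through the same thread and so keep their order.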

Most EIP frameworks and ESBs have customizable resequencers. If the number of entities is not too large, you can have a queue per entity and resequence at the beginning.

For those interested in a way to solve this:
Use the Recipient List EAI pattern.
As the question is about JMS, we can take a look at an example from the Apache Camel website.
This approach is different from other patterns like CBR and Selective Consumer because the consumer is not aware of what message it should process.
Let me put this on a real world example:
We have an Order Management System (OMS) which sends off Orders to be processed by the ERP. The Order then goes through 6 steps, and each of those steps publishes an event on the Order_queue, informing the new Order's status. Nothing special here.
The OMS consumes the events from that queue, but MUST process the events of each Order in the very same sequence they were published. The rate of messages published per minute is much greater than the consumer's throughput, hence the delay increases over time.
The solution requirements:
Consume in parallel, including as many consumers as needed to keep queue size in a reasonable amount.
Guarantee that events for each Order are processed in the same publish order.
The implementation:
On the OMS side
The OMS process responsible for sending Orders to the ERP determines the consumer that will process all events of a certain Order and sends the Recipient name along with the Order.
How does this process know which Recipient to use? Well, you can use different approaches, but we used a very simple one: Round Robin.
On ERP
As it keeps the Recipient's name for each Order, it simply sets up each message to be delivered to the desired Recipient.
On OMS Consumer
We've deployed 4 instances, each one using a different Recipient name and concurrently processing messages.
One could say that we created another bottleneck: the database. But it is not true, since there is no concurrency on the order line.
One drawback is that the OMS process which sends the Orders to the ERP must keep knowledge about how many Recipients are working.
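The round-robin assignment described above can be sketched as follows. RecipientAssigner and the recipient names are hypothetical; in the setup described, the ERP stores the Recipient name with the Order, whereas this sketch keeps the mapping in an in-memory map purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: picks a Recipient per Order once, round robin, then sticks to it. */
final class RecipientAssigner {
    private final List<String> recipients;
    private final AtomicInteger next = new AtomicInteger();
    private final Map<String, String> recipientByOrder = new ConcurrentHashMap<>();

    RecipientAssigner(List<String> recipients) {
        this.recipients = new ArrayList<>(recipients);
    }

    /** Returns the Recipient for an Order, taking the next one in rotation on first sight. */
    String recipientFor(String orderId) {
        return recipientByOrder.computeIfAbsent(orderId, id ->
                recipients.get(Math.floorMod(next.getAndIncrement(), recipients.size())));
    }
}
```

Because an Order keeps its Recipient for its whole life, all of its events reach the same consumer instance, which is what preserves the per-Order sequence.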

Related

How to process messages in the same order from queue using multiple servers in java

We have an ActiveMQ queue which will receive 100k stock order messages per second (each message contains the stock name, sell price and bid price in JSON format).
Out of those 100k messages/sec there can be any number of messages for a single stock. If we receive multiple messages for the same stock, we need to process all of them in the order they arrived, using Java.
We can't process 100k messages/second using a single listener on one server.
We need to process them using multiple listeners and servers, but display the result in the UI in the same order in which the messages were placed on the queue.
Read Stock Queue--> Validate the request -->Update the Stock price in UI
Example message:
{
  "stockName": "TCS",
  "sellPrice": "102",
  "bidPrice": "100"
}
Can you suggest a solution for the above problem?
Here is my proposal:
You need to split the queue into sub-queues based on the stock name. You can split based on the first letter(s) of the stock name. This will give you ample parallelism while ensuring that all messages for the same stock land on one queue.
There will need to be one reader on the main queue, but all it does is forward the messages to the sub-queues.
I would suggest using non-persistent publication to topics instead of queues. Topics give you the flexibility of
choosing subscription wildcards and
possibly adding other services later to this architecture. You might not require this now, but maybe in 5 years you will need another GUI or some form of monitoring or replay service. If you use topics, you can just plug in new subscribers; you don't have to change your publication side for those.
You can use durable subscriptions if you need more persistence.
Message order is guaranteed within the same publication topic so you should make the stock name part of the topic. You could publish on something like ORDER.STOCK.TCS.
But having a balanced load based on stock names is tricky because some letters like Z are very rare, while others are frequent. So in addition to the stock name add the stock name's hash%100 to the topic. For Example if the hashcode of TCS is 12357 and you do modulo 100, you publish this on ORDER.STOCK.TCS.57
Let's say you have 10 subscribers; each subscriber could then make 10 subscriptions. For example, subscriber 1 would subscribe to ORDER.STOCK.*.0, ORDER.STOCK.*.1, ... ORDER.STOCK.*.9
Subscriber 2 would subscribe to ORDER.STOCK.*.10, ORDER.STOCK.*.11, ... ORDER.STOCK.*.19
If you have 5 subscribers, each one does 20 subscriptions (you get the idea).
The reason for this is that
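The hash-bucket topic naming above can be sketched as follows. StockTopics and its method names are hypothetical, and the sketch assumes the number of subscribers divides 100 evenly. Math.floorMod is used instead of % because String.hashCode() can be negative in Java.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers for the hash-bucket topic scheme. */
final class StockTopics {
    static final int BUCKETS = 100;

    private StockTopics() {}

    /** Bucket in [0, BUCKETS); floorMod keeps it non-negative for negative hash codes. */
    static int bucketOf(String stockName) {
        return Math.floorMod(stockName.hashCode(), BUCKETS);
    }

    /** Topic that updates for this stock are published on: ORDER.STOCK.<name>.<bucket>. */
    static String topicFor(String stockName) {
        return "ORDER.STOCK." + stockName + "." + bucketOf(stockName);
    }

    /** Wildcard subscriptions for subscriber `index` (0-based) out of `subscriberCount`. */
    static List<String> subscriptionsFor(int index, int subscriberCount) {
        int perSubscriber = BUCKETS / subscriberCount; // assumes subscriberCount divides BUCKETS
        List<String> topics = new ArrayList<>();
        for (int b = index * perSubscriber; b < (index + 1) * perSubscriber; b++) {
            topics.add("ORDER.STOCK.*." + b);
        }
        return topics;
    }
}
```

With Java's String.hashCode(), "TCS" hashes to 82884, so topicFor("TCS") gives ORDER.STOCK.TCS.84, which falls to the ninth of ten subscribers (index 8). Since String.hashCode() is specified by the language, every publisher JVM computes the same bucket.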
We had a similar requirement and we used an open source framework called LMAX Disruptor, which is reportedly a high-performance concurrency framework. You can experiment with it: https://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started.
On a very high level:
Put the Stocks received into a ring buffer (the core data structure that the framework is built upon). This component is the consumer for ActiveMQ and the producer for the ring buffer.
The consumers/workers (in your case multiple: one worker thread for each unique stock name) pick up the Stocks from the ring buffer in order. In the worker/listener, you can handle each event based on a condition.
I've just committed sample code trying to demonstrate your use case, for your reference:
https://github.com/reddy73/Disruptor-Example

one consumer for many queues

In my system, there are many users who write blogs. I need to subscribe to different users. There is no centralized system (it's a Swing application).
I am using JMS.
The user may follow one user, two users or 100 users.
m_destination1 = m_session.createQueue("USER.DEVID");
m_consumer1 = m_session.createConsumer(m_destination1);
m_destination2 = m_session.createQueue("USER.HARRY");
m_consumer2 = m_session.createConsumer(m_destination2);
Is there a generic way to write the above lines of code for an unknown number of users, so that one consumer can receive messages from many users?
Here wildcard will not work.
The best thing you can use is the Mirrored Queue feature of ActiveMQ;
you can read the documentation here:
http://activemq.apache.org/mirrored-queues.html
What a mirrored queue basically does is forward all the messages sent to a queue to a similarly named topic; this topic can then be subscribed to by multiple consumers.
If you use mirrored queues, your consumers will need to subscribe to different topics.
Your design cries out for the publish-subscribe (topic) domain rather than a point-to-point (i.e. queue) architecture. As you already have an architecture that creates a queue for each person writing blogs, no change to that system will be required, but your requirement will be catered for.
In addition to this, if 2 consumers listen on a queue, they will pick up messages from it in parallel, i.e. if there are 2 messages on the queue, each consumer will process 1 message independently. I don't think that's what you want.
Hope this helps!
Good luck!
@Vihar's answer is right that you should be using the publish-subscribe paradigm by using a topic, to allow multiple consumers to both be notified of new blog posts. It sounds like your primary pain point is that you've got one destination per author and users that want to consume messages have to subscribe to each of them individually.
Instead, have all new-post messages published into a single topic (let's call it NewPostNotificationTopic). Clients can then subscribe to all messages but immediately check them against the list of authors they care about and immediately stop processing any notification for an author they're not following. (This puts the filtering into the message handler rather than into the ActiveMQ network.) This does mean that each message will be passed to each client, but as long as the messages are small and your network is fast and your users are usually connected to the network, this might be a workable solution. But if you can't afford the network bandwidth of sending all messages to all clients, or if your consumers will be offline for long periods of time and you can't afford to hold a copy of all messages till they come back online, this may not work for you.
Alternatively, publish all messages into that same topic, but set the author's ID as a header on the message and use message selectors to tell ActiveMQ to only deliver messages matching a given author ID. This will be more efficient, but you're back to needing to explicitly tell ActiveMQ which authors you care about, either with a single subscription with a selector that contains ORs or with one subscription per author. The latter is cleaner but gets you back to your problem of one subscription per author per reader; the former results in only one subscription but it has to be updated each time you add/remove an author for a reader, and you'll need to make sure you handle the race conditions inherent in removing the subscription and adding another one. I'd go with the first solution I proposed (doing the filtering in the message handler instead of in the ActiveMQ subscriptions) if the performance concerns I raised there aren't a problem; otherwise I'd probably go with one subscription per author per reader, rather than having a single subscription with an ORed selector and needing to redo the subscription each time something changed.
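For the single-subscription variant, the selector string has to be rebuilt whenever the set of followed authors changes. A minimal sketch of building it is below. AuthorSelector is a hypothetical helper and "authorId" is an assumed property name that the publisher would have to set on each message; the sketch uses the selector syntax's IN operator in place of a chain of ORs.

```java
import java.util.Collection;
import java.util.StringJoiner;

/** Hypothetical helper that builds a JMS message selector for a set of followed authors. */
final class AuthorSelector {
    private AuthorSelector() {}

    /**
     * Builds e.g. "authorId IN ('DEVID', 'HARRY')".
     * "authorId" is an assumed message property name, not something JMS defines.
     */
    static String forAuthors(Collection<String> authorIds) {
        if (authorIds.isEmpty()) {
            throw new IllegalArgumentException("a reader must follow at least one author");
        }
        StringJoiner in = new StringJoiner(", ", "authorId IN (", ")");
        for (String id : authorIds) {
            // Selector string literals use single quotes; a quote inside is doubled.
            in.add("'" + id.replace("'", "''") + "'");
        }
        return in.toString();
    }
}
```

The resulting string would be passed as the message selector when the consumer is created, for example through Session.createConsumer(destination, selector), and the consumer would have to be closed and recreated with a new selector when the list changes.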

Is there an API that allows ordering event in clustered application?

Given the following facts, is there an existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment:
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with its source ID, makes it uniquely identifiable.
3) The IDs can be used to sort the events.
4) There are M application servers receiving those events. M is normally 3.
5) The events can arrive at any one or more of the application servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events to process.
8) Each event has an earliest and a latest batch ID in which it must be processed.
9) Events must not be processed earlier, and are "failed" if they cannot be processed before the deadline.
10) The batches are based on the real clock time. For example, one batch per second.
11) The events of a batch are processed when 2 of the 3 servers agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the required events before it can process that batch too.
13) Once an event was processed or failed, the source has to be informed.
14) [EDIT] Events from one source must be processed (or failed) in the order of their ID/timestamp, but there is no causality between different sources.
Less formally, I have those servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline. But I'm not sure if that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably "wasn't" an API to satisfy your requirements at the time. Today you could take a look at Kafka (originally from LinkedIn):
Apache Kafka
And the general concept of "a log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
Actually, for your question, I'd begin with the blog post about "the log". In my terms, the way it works (and Kafka isn't the only package doing log handling) is as follows:
Instead of queue-based message passing / publish-subscribe, Kafka uses a "log" of messages
Subscribers (or end-points) can consume the log
The log is guaranteed to be in order within a partition; it handles very large volumes of data and is fast
Double-check the guarantee; there's usually a trade-off for reliability
You just read the log. Reads are not destructive: entries stay in the log until the configured retention limit is reached, whether or not they have been read
Each consumer group tracks its own position in the log, so several groups can read the same entries independently
The basic handling (compute) process for the log, is a Map-Reduce-Filter model so you read-everything really fast; keep only stuff you want; process it (reduce) produce outcome(s).
The downside seems to be that you need clusters and so on to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it a bit finicky to get up and running with the Apache downloads because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
Which would need you to do the plumbing for connecting different servers. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper organised message/log software(s). Good luck and Enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, some "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first to run the job for a batch fixes the set of events, so that the others just get to use the list that the first one defined. That way we have a guarantee that all batches contain the same set of events on all servers. And if we can use a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.
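The "first job to run fixes the batch" rule can be sketched in memory as follows. Event and BatchRegistry are hypothetical names, and a ConcurrentHashMap stands in for the shared database table; with a real DB the same effect would need a unique key on the batch ID or an equivalent atomic insert, which is an assumption not spelled out above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical event: ordered by timestamp, then by source id to break ties. */
final class Event {
    final String sourceId;
    final long timestamp;

    Event(String sourceId, long timestamp) {
        this.sourceId = sourceId;
        this.timestamp = timestamp;
    }

    static final Comparator<Event> TOTAL_ORDER = (a, b) -> {
        int byTime = Long.compare(a.timestamp, b.timestamp);
        return byTime != 0 ? byTime : a.sourceId.compareTo(b.sourceId);
    };
}

/** In-memory stand-in for the shared table that records each batch's contents. */
final class BatchRegistry {
    private final Map<Long, List<Event>> fixedBatches = new ConcurrentHashMap<>();

    /**
     * Proposes the contents of a batch. The first proposal for a batch id wins;
     * every caller, including later ones with a different proposal, gets that
     * same sorted list back.
     */
    List<Event> fix(long batchId, List<Event> proposal) {
        List<Event> sorted = new ArrayList<>(proposal);
        sorted.sort(Event.TOTAL_ORDER);
        List<Event> candidate = Collections.unmodifiableList(sorted);
        List<Event> previous = fixedBatches.putIfAbsent(batchId, candidate);
        return previous != null ? previous : candidate;
    }
}
```

Sorting by timestamp and then source ID gives every server the same total order within a batch, and it also keeps the events of a single source in timestamp order.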

Would a JMS Topic suffice in this situation? Or should I look elsewhere?

There is one controlling entity and several 'worker' entities. The controlling entity requests certain data from the worker entities, which they will fetch and return in their own manner.
Since the controlling entity can be agnostic about the worker entities (and the worker entities can be added/removed at any point), putting a JMS provider in between them sounds like a good idea. That's the assumption at least.
Since it is a one-to-many relation (controller -> workers), a JMS Topic would be the right solution. But, since the controlling entity depends on the return values of the workers, request/reply functionality would be nice as well (somewhere, I read about the TopicRequestor but I cannot seem to find a working example). Request/reply is typical Queue functionality.
As an attempt to use topics in a request/reply sort of way, I created two JMS topics: request and response. The controller publishes to the request topic and is subscribed to the response topic. Every worker is subscribed to the request topic and publishes to the response topic. To match requests and responses, the controller will subscribe to the response topic for each request with a filter (using a session ID as the value). The messages workers publish to the response topic have the session ID associated with them.
Now this does not feel like a solution (rather it uses JMS as a hammer and treats the problem (and some more) as a nail). Is JMS in this situation a solution at all? Or are there other solutions I'm overlooking?
Your approach sort of makes sense to me. I think a messaging system could work. I think using topics is wrong. Take a look at the wiki page for Enterprise Service Bus. It's a little more complicated than you need, but the basic idea for your use case is that you have a worker that is capable of reading from one queue, doing some processing and adding the processed data back to another queue.
The problem with a topic is that all workers will get the message at the same time and they will all work on it independently. It sounds like you only want one worker at a time working on each request. I think you have it as a topic so different types of workers can also listen to the same queue and only respond to certain requests. For that, you are better off just creating a new queue for each type of work. You could potentially have them in pairs, so you have a work_a_request queue and work_a_response queue. Or if your controller is capable of figuring out the type of response from the data, they can all write to a single response queue.
If you haven't chosen a message queue vendor yet, I would recommend RabbitMQ, as it's easy to set up, easy to add new queues (especially dynamically) and has really good Spring support (although most major messaging systems have Spring support and you may not even be using Spring).
I'm also not sure what you are accomplishing with the filters. If you ensure the messages to the workers contain all the information needed to do the work and the response messages back contain all the information your controller needs to finish the processing, I don't think you need them.
I would simply use two JMS queues.
The first one is the one that all of the requests go on. The workers will listen to the queue, and process them in their own time, in their own way.
Once complete, they will bundle the request with the response and put that on another queue for the final process to handle. This way there's no need for the submitting process to retain the requests; they just follow along with the entire procedure. A final process will listen to the second queue, and handle the request/response pairs appropriately.
If there's no need for the message to be reliable, or if there's no need for the actual processes to span JVMs or machines, then this can all be done with a single process and standard java threading (such as BlockingQueues and ExecutorServices).
If there's a need to accumulate related responses, then you'll need to capture whatever grouping data is necessary and have the Queue 2 listening process accumulate results. Or you can persist the results in a database.
For example, if you know your working set has five elements, you can queue up the requests with that information (1 of 5, 2 of 5, etc.). As each one finishes, the final process can update the database, counting elements. When it sees all of the pieces have been completed (in any order), it marks the result as complete. Later you would have some audit process scan for incomplete jobs that have not finished within some time (perhaps one of the messages erred out), so you can handle them better. Or the original processors can write the request to a separate "this one went bad" queue for mitigation and resubmission.
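The "1 of 5, 2 of 5" bookkeeping can be sketched as follows. CompletionTracker is a hypothetical name and the state is held in memory for brevity; the answer keeps the counts in a database so they survive restarts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory tracker for "n of total" pieces; a database could play this role. */
final class CompletionTracker {
    private final Map<String, Set<Integer>> partsByJob = new HashMap<>();

    /**
     * Records that part `partNumber` of `totalParts` finished for a job.
     * Returns true if, after this call, every part has been seen. Parts may
     * arrive in any order, and a redelivered part is counted only once.
     */
    synchronized boolean partFinished(String jobId, int partNumber, int totalParts) {
        Set<Integer> done = partsByJob.computeIfAbsent(jobId, id -> new HashSet<>());
        done.add(partNumber);
        return done.size() == totalParts;
    }
}
```

Tracking part numbers in a set rather than a bare counter means a message that is redelivered after a rollback does not make the job look complete too early.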
If you use JMS with transaction, if one of the processors fails, the transaction will roll back and the message will be retained on the queue for processing by one of the surviving processors, so that's another advantage of JMS.
The trick with this kind of processing is to try and push the state along with the message, or externalize it and send references to the state, thus making each component effectively stateless. This aids scaling and reliability, since any component can fail (besides catastrophic JMS failure, naturally) and you can just pick up where you left off once the problem is resolved and the components are restarted.
If you're in a request/response mode (such as a servlet needing to respond), you can use Servlet 3.0 async servlets to easily put things on hold, or you can put a local object in an internal map, keyed by something such as the session ID, and then Object.wait() on that object. Then your Queue 2 listener will get the response, finalize the processing, and use the session ID (sent with the message and retained throughout the pipeline) to look up the object that you're waiting on; it can then simply Object.notify() it to tell the servlet to continue.
Yes, this ties up a thread in the servlet container while waiting, which is why the new async support is better, but you work with the hand you're dealt. You can also add a timeout to the Object.wait(); if it times out, the processing took too long, and you can gracefully alert the client.
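A minimal sketch of that rendezvous is below. It swaps Object.wait()/notify() for a CompletableFuture per request, which gives the same blocking-with-timeout behaviour without hand-written monitor code; PendingReplies and its method names are hypothetical, and replies are modelled as plain strings.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical rendezvous between a waiting request thread and the Queue 2 listener. */
final class PendingReplies {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Called by the requester before the request is sent, so a fast reply is not lost. */
    void register(String correlationId) {
        pending.put(correlationId, new CompletableFuture<>());
    }

    /** Called by the Queue 2 listener; returns false if nobody is waiting for this id. */
    boolean complete(String correlationId, String reply) {
        CompletableFuture<String> waiter = pending.get(correlationId);
        return waiter != null && waiter.complete(reply);
    }

    /** Blocks until the reply arrives; empty if it timed out or the wait was interrupted. */
    Optional<String> await(String correlationId, long timeout, TimeUnit unit) {
        CompletableFuture<String> waiter = pending.get(correlationId);
        if (waiter == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(waiter.get(timeout, unit));
        } catch (TimeoutException | ExecutionException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            pending.remove(correlationId); // never leave stale entries behind
        }
    }
}
```

The servlet thread would call register, send the request with the session ID attached, then call await; an empty result is the cue to tell the client that processing took too long.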
This basically frees you from filters and such, and reply queues, etc. It's pretty simple to set it all up.
Well, the actual answer depends on whether your worker entities are external parties, whether they are physically located outside your network, how long a worker is expected to take to finish its work, and so on. But the problem you are trying to solve is one-to-many communication. Did you add JMS to your system because you want all entities to be able to talk JMS, or because you need asynchronous communication? The former reason does not make sense. If it is the latter, you can choose another communication protocol, such as one-way web service calls.
You can use the latest Java concurrency APIs to make multi-threaded, asynchronous, one-way web service calls to the different worker entities.

Is MQ publish/subscribe domain-specific interface generally faster than point-to-point?

I'm working on an existing application whose transport layer uses point-to-point MQ communication.
For each of the given list of accounts we need to retrieve some information.
Currently we have something like this to communicate with MQ:
responseObject getInfo(requestObject) {
    // code to send message to MQ
    // code to retrieve message from MQ
}
As you can see we wait until it finishes completely before proceeding to the next account.
Due to performance issues we need to rework it.
There are 2 possible scenarios that I can think of at the moment.
1) Within an application to create a bunch of threads that would execute transport adapter for each account. Then get data from each task. I prefer this method, but some of the team members argue that transport layer is a better place for such change and we should place extra load on MQ instead of our application.
2) Rework transport layer to use publish/subscribe model.
Ideally I want something like this:
void send(requestObject) {
    // code to send message to MQ
}
responseObject receive() {
    // code to retrieve message from MQ
}
Then I will just send requests in the loop, and later retrieve data in the loop. The idea is that while first request is being processed by the back end system we don't have to wait for the response, but instead send next request.
My question, is it going to be a lot faster than current sequential retrieval?
The question title frames this as a choice between P2P and pub/sub but the question body frames it as a choice between threaded and pipelined processing. These are two completely different things.
Either code snippet provided could just as easily use P2P or pub/sub to put and get messages. The decision should not be based on speed but rather whether the interface in question requires a single message to be delivered to multiple receivers. If the answer is no then you probably want to stick with point-to-point, regardless of your application's threading model.
And, incidentally, the answer to the question posed in the title is "no." When you use the point-to-point model your messages resolve immediately to a destination or transmit queue and WebSphere MQ routes them from there. With pub/sub your message is handed off to an internal broker process that resolves zero to many possible destinations. Only after this step does the published message get put on a queue where, for the remainder of its journey, it is handled like any other point-to-point message. Although pub/sub is not normally noticeably slower than point-to-point, the code path is longer and therefore, all other things being equal, it will add a bit more latency.
The other part of the question is about parallelism. You proposed either spinning up many threads or breaking the app up so that requests and replies are handled separately. A third option is to have multiple application instances running. You can combine any or all of these in your design. For example, you can spin up multiple request threads and multiple reply threads and then have application instances processing against multiple queue managers.
The key to this question is whether the messages have affinity to each other, to order dependencies or to the application instance or thread which created them. For example, if I am responding to an HTTP request with a request/reply then the thread attached to the HTTP session probably needs to be the one to receive the reply. But if the reply is truly asynchronous and all I need to do is update a database with the response data then having separate request and reply threads is helpful.
In either case, the ability to dynamically spin up or down the number of instances is helpful in managing peak workloads. If this is accomplished with threading alone then your performance scalability is bound to the upper limit of a single server. If this is accomplished by spinning up new application instances on the same or different server/QMgr then you get both scalability and workload balancing.
Please see the following article for more thoughts on these subjects: Mission:Messaging: Migration, failover, and scaling in a WebSphere MQ cluster
Also, go to the WebSphere MQ SupportPacs page and look for the Performance SupportPac for your platform and WMQ version. These are the ones with names beginning with MP**. These will show you the performance characteristics as the number of connected application instances varies.
It doesn't sound like you're thinking about this the right way. Regardless of the model you use (point-to-point or publish/subscribe), if your performance is bounded by a slow back-end system, neither will help speed up the process. If, however, you could theoretically issue more than one request at a time against the back-end system and expect to see a speed up, then you still don't really care if you do point-to-point or publish/subscribe. What you really care about is synchronous vs. asynchronous.
Your current approach for retrieving the data is clearly synchronous: you send the request message, and wait for the corresponding response message. You could do your communication asynchronously if you simply sent all the request messages in a row (perhaps in a loop) in one method, and then had a separate method (preferably on a different thread) monitoring the incoming queue for responses. This would ensure that your code would no longer block on individual requests. (This roughly corresponds to option 2, though without pub/sub.)
I think option 1 could get pretty unwieldy, depending on how many requests you actually have to make, though it, too, could be implemented without switching to a pub/sub channel.
The reworked approach will use fewer threads. Whether that makes the application faster depends on whether the overhead of managing a lot of threads is currently slowing you down. If you have fewer than 1000 threads (this is a very, very rough order-of-magnitude estimate!), I would guess it probably isn't. If you have more than that, it might well be.
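The send-everything-then-collect shape can be sketched without an MQ as follows. PipelinedClient is a hypothetical name, and the backend function stands in for the whole round trip (put on the request queue, back-end processing, reply). The thread pool only simulates a back end that can work on several requests at once; with a real queue manager the sends would be plain puts and the replies would be matched by correlation ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch: issue every request first, then collect the replies. */
final class PipelinedClient {
    private PipelinedClient() {}

    /**
     * `backend` simulates the round trip for one account.
     * Replies are returned in the same order as the accounts.
     */
    static List<String> fetchAll(List<String> accounts, Function<String, String> backend, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Phase 1: send everything without waiting for any reply.
            List<CompletableFuture<String>> inFlight = new ArrayList<>();
            for (String account : accounts) {
                inFlight.add(CompletableFuture.supplyAsync(() -> backend.apply(account), pool));
            }
            // Phase 2: collect replies; join() waits only for that one reply.
            List<String> replies = new ArrayList<>();
            for (CompletableFuture<String> reply : inFlight) {
                replies.add(reply.join());
            }
            return replies;
        } finally {
            pool.shutdown();
        }
    }
}
```

The total time approaches that of the slowest request rather than the sum of all requests, but only if the back end can actually process several requests concurrently.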
