I initially asked this question here, but I've realized that my question is not about a while-true loop. What I want to know is, what's the proper way to do high-performance asynchronous message-passing in Java?
What I'm trying to do...
I have ~10,000 consumers, each consuming messages from their private queues. I have one thread that's producing messages one by one and putting them in the correct consumer's queue. Each consumer loops indefinitely, checking for a message to appear in its queue and processing it.
I believe the term is "single-producer/single-consumer", since there's one producer, and each consumer only works on their private queue (multiple consumers never read from the same queue).
Inside Consumer.java:
@Override
public void run() {
    while (true) {
        Message msg = messageQueue.poll();
        if (msg != null) {
            // ... do something with the message
        }
    }
}
The Producer is putting messages inside Consumer message queues at a rapid pace (several million messages per second). Consumers should process these messages as fast as possible!
Note: the while (true) { ... } is terminated by a KILL message sent by the Producer as its last message.
However, my question is about the proper way to design this message-passing. What kind of queue should I use for messageQueue? Should it be synchronous or asynchronous? How should Message be designed? Should I use a while-true loop? Should Consumer be a thread, or something else? Will 10,000 threads slow down to a crawl? What's the alternative to threads?
So, what's the proper way to do high-performance message-passing in Java?
I would say that the context-switching overhead of 10,000 threads is going to be very high, not to mention the memory overhead. By default, on 32-bit platforms, each thread uses a default stack size of 256 KB, so that's 2.5 GB just for your stacks. Obviously you're talking 64-bit, but even so, that's quite a large amount of memory. With that much memory in use, the caches are going to thrash a lot and the CPU will be throttled by memory bandwidth.
I would look for a design that avoids using so many threads, both to avoid allocating large amounts of stack and to avoid the context-switching overhead. You cannot process 10,000 threads concurrently anyway; current hardware typically has fewer than 100 cores.
I would create one queue per hardware thread and dispatch messages in a round-robin fashion. If the processing times vary considerably, there is the danger that some threads finish processing their queue before they are given more work, while other threads never get through their allotted work. This can be avoided by using work stealing, as implemented in the JSR-166 ForkJoin framework.
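As a rough illustration of that layout, here is a minimal sketch with one bounded queue per hardware thread and a trivial round-robin dispatch; the Message type is the one from the question, everything else (class and method names, queue capacity) is illustrative, and work stealing is not shown:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class RoundRobinDispatcher {
    private final int workers = Runtime.getRuntime().availableProcessors();
    private final BlockingQueue<Message>[] queues;
    private int next = 0; // only the single producer thread touches this

    @SuppressWarnings("unchecked")
    RoundRobinDispatcher(int capacity) {
        queues = new BlockingQueue[workers];
        for (int i = 0; i < workers; i++) {
            queues[i] = new ArrayBlockingQueue<>(capacity);
        }
    }

    // Called by the single producer thread.
    void dispatch(Message msg) throws InterruptedException {
        queues[next].put(msg);          // blocks if that worker falls behind
        next = (next + 1) % workers;    // round-robin over the hardware threads
    }

    BlockingQueue<Message> queueFor(int worker) {
        return queues[worker];
    }
}

Each worker thread then takes from its own queue, so there is exactly one producer and one consumer per queue.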
Since communication is one way from the publisher to the subscribers, then Message does not need any special design, assuming the subscriber doesn't change the message once it has been published.
EDIT: Reading the comments, if you have 10,000 symbols, then create a handful of generic subscriber threads (one subscriber thread per core) that asynchronously receive messages from the publisher (e.g. via their message queue). The subscriber pulls a message from its queue, retrieves the symbol from the message, looks the symbol up in a Map of message handlers, retrieves the handler, and invokes the handler to process the message synchronously. Once done, it repeats, fetching the next message from the queue. If messages for the same symbol have to be processed in order (which is why I'm guessing you wanted 10,000 queues), you need to map symbols to subscribers. E.g. if there are 10 subscribers, then symbols 0-999 go to subscriber 0, 1000-1999 to subscriber 1, etc. A more refined scheme is to map symbols according to their frequency distribution, so that each subscriber gets roughly the same load. For example, if 10% of the traffic is symbol 0, then subscriber 0 will deal with just that one symbol and the other symbols will be distributed amongst the other subscribers.
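A tiny sketch of the simple range mapping described above (the method name and the 10,000/10 split are only illustrative; the frequency-weighted scheme would replace the division with a precomputed lookup table):

// With 10,000 symbols and 10 subscribers: symbols 0-999 -> subscriber 0,
// 1000-1999 -> subscriber 1, and so on.
int subscriberFor(int symbolId, int numSubscribers, int numSymbols) {
    int symbolsPerSubscriber = numSymbols / numSubscribers; // 1000 in this example
    return Math.min(symbolId / symbolsPerSubscriber, numSubscribers - 1);
}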
You could use this (credit goes to Which ThreadPool in Java should I use?):
class Main {
    static ExecutorService threadPool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors() * 2);

    public static void main(String[] args) {
        Set<Consumer> consumers = getConsumers(threadPool);
        for (Consumer consumer : consumers) {
            threadPool.execute(consumer);
        }
    }
}
and
class Consumer implements Runnable {
    private final ExecutorService tp;
    private final MessageQueue messageQueue;

    Consumer(ExecutorService tp, MessageQueue queue) {
        this.tp = tp;
        this.messageQueue = queue;
    }

    @Override
    public void run() {
        Message msg = messageQueue.poll();
        if (msg != null) {
            try {
                // ... do something with the message
            } finally {
                // re-submit ourselves so other consumers get a turn;
                // note: if poll() returned null, something (e.g. the producer)
                // has to re-schedule this consumer when new work arrives
                this.tp.execute(this);
            }
        }
    }
}
This way, you can have okay scheduling with very little hassle.
First of all, there's no single correct answer unless you either post a complete design doc or try the different approaches yourself.
I'm assuming your processing is not computationally intensive, otherwise you wouldn't be thinking of processing 10,000 queues at the same time. One possible solution is to minimise context switching by having one or two threads per CPU. Unless your system has to process data in strict real time, this may give you slightly bigger delays on each queue but better overall throughput.
For example, have your producer thread run on its own CPU and hand batches of messages to the consumer threads. Each consumer thread would then distribute the messages to its N private queues, perform a processing step, receive a new batch, and so on. Again, depending on your delay tolerance, the processing step may mean processing all of its queues, a fixed number of queues, or as many queues as it can until a time threshold is reached. Being able to easily tell which queue belongs to which consumer thread (e.g. if queues are numbered sequentially: int consumerThreadNum = queueNum & 0x03) would be beneficial, as looking them up in a hash table each time may be slow.
To minimise memory thrashing it may not be such a good idea to create and destroy queues all the time, so you may want to pre-allocate (max number of queues / number of cores) queue objects per thread. When a queue is finished, instead of being destroyed it can be cleared and reused. You don't want the GC to get in your way too often or for too long.
Another unknown is whether your producer produces complete sets of data for each queue or sends data in chunks until the KILL command is received. If your producer sends complete data sets, you may do away with the queue concept completely and just process the data as it arrives at a consumer thread. A rough sketch of the batching idea is given below.
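To make the batching idea more concrete, here is a rough sketch assuming four consumer threads, the queue-number mask from above, and a hypothetical Message.queueNum() accessor; it only shows the shape of the design, not a tuned implementation:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchConsumer implements Runnable {
    static final int CONSUMER_THREADS = 4;                 // power of two, so the mask trick works

    private final BlockingQueue<List<Message>> inbox = new ArrayBlockingQueue<>(64);
    private final Queue<Message>[] privateQueues;          // this thread's slice of the logical queues

    @SuppressWarnings("unchecked")
    BatchConsumer(int queuesPerThread) {
        privateQueues = new Queue[queuesPerThread];
        for (int i = 0; i < queuesPerThread; i++) {
            privateQueues[i] = new ArrayDeque<>();
        }
    }

    // Producer side: pick the thread with (msg.queueNum() & (CONSUMER_THREADS - 1)),
    // then hand it a whole batch at once.
    void submitBatch(List<Message> batch) throws InterruptedException {
        inbox.put(batch);
    }

    @Override
    public void run() {
        try {
            while (true) {
                List<Message> batch = inbox.take();                          // receive the next batch
                for (Message m : batch) {
                    privateQueues[m.queueNum() / CONSUMER_THREADS].add(m);   // distribute to private queues
                }
                for (Queue<Message> q : privateQueues) {                     // processing step: drain every queue
                    Message m;
                    while ((m = q.poll()) != null) {
                        // process m
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}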
Have a pool of consumer threads sized relative to the hardware and OS capacity. These consumer threads could poll your message queue.
I would either have the Messages know how to process themselves or register processors with the consumer thread classes when they are initialized.
In the absence of more detail about the constraints of processing the symbols, it's hard to give very specific advice.
You should take a look at this slashdot article:
http://developers.slashdot.org/story/10/07/27/1925209/Java-IO-Faster-Than-NIO
It has quite a bit of discussion and actual measured data about the many-threads vs. single-select vs. thread-pool arguments.
I am playing around with QuickFIX and I have a design question.
I process received messages in the function below:
void processFixMessage(Message message){
//do stuff here
}
It's almost certain that I consume (process) messages more slowly than they arrive.
My question is: is there a way to handle the situation where, if I haven't finished processing one message and another message arrives, a different thread picks it up and starts processing it?
You can hand the message off to another thread inside processFixMessage(Message message). Depending on the rate of incoming messages and the time needed to process a single message, you can choose how many threads you would like to create.
One way is to create a ThreadPool of n threads and submit your message parsing to that pool.
You can refer to this code example: https://www.journaldev.com/1069/threadpoolexecutor-java-thread-pool-example-executorservice
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
You can size the number of threads dynamically based on the machine:
int cores = Runtime.getRuntime().availableProcessors();
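As a minimal sketch of that idea (the pool size and the Message type from the question are taken as given; the class name is illustrative), the handler just hands each message to the pool so that a slow message does not block the next one:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FixMessageProcessor {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    void processFixMessage(Message message) {
        pool.submit(() -> {
            // do the (possibly slow) work here
        });
    }

    void shutdown() {
        pool.shutdown();
    }
}

Note that a plain pool like this no longer guarantees per-session ordering; if ordering matters, route messages for the same session to the same thread.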
I need a blocking queue that has a size of 1, and every time put is applied it removes the last value and adds the next one. The consumers would be a thread pool in which each thread needs to read the message as it gets put on the queue and decide what to do with it, but they shouldn't be able to take from the queue since all of them need to read from it.
I was considering just taking and putting every time the producer sends out a new message, but having only peek in the run method of the consumers will result in them constantly peeking, won't it? Ideally the message will disappear as soon as the peeking stops, but I don't want to use a timed poll as it's not guaranteed that every consumer will peek the message in time.
My other option at the moment is to iterate over the collection of consumers and call a public method on them with the message, but I really don't want to do that since the system relies on real time updates, and a large collection will take a while to iterate through completely if I'm going through each method call on the stack.
After some consideration, I think you're best off with each consumer having its own queue and the producer putting its messages on all queues.
If there are few consumers, then putting the messages on those few queues will not take too long (except when the producer blocks because a consumer can't keep up).
If there are many consumers, this situation will be highly preferable to one where many consumers are in contention with each other.
At the very least this would be a good measure to compare alternate solutions against.
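A minimal sketch of that arrangement, assuming bounded LinkedBlockingQueues so the producer blocks when a consumer cannot keep up (the class name, capacity and registration method are illustrative):

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class Broadcaster<T> {
    private final List<BlockingQueue<T>> queues = new CopyOnWriteArrayList<>();

    BlockingQueue<T> register() {                 // each consumer calls this once and keeps its queue
        BlockingQueue<T> q = new LinkedBlockingQueue<>(1024);
        queues.add(q);
        return q;
    }

    void publish(T message) throws InterruptedException {
        for (BlockingQueue<T> q : queues) {
            q.put(message);                       // every consumer sees every message
        }
    }
}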
Previously, when I used the single-producer mode of the disruptor, e.g.
new Disruptor<ValueEvent>(ValueEvent.EVENT_FACTORY,
2048, moranContext.getThreadPoolExecutor(), ProducerType.Single,
new BlockingWaitStrategy())
the performance was good. Now I am in a situation where multiple threads would write to a single ring buffer. What I found is that ProducerType.Multi makes the code several times slower than single-producer mode. That performance hit is not acceptable to me. So should I use single-producer mode while multiple threads invoke the same event-publish method with locks? Is that OK? Thanks.
I'm somewhat new to the Disruptor, but after extensive testing and experimenting, I can say that ProducerType.MULTI is more accurate and faster for 2 or more producer threads.
With 14 producer threads on a MacBook, ProducerType.SINGLE shows more events published than consumed, even though my test code is waiting for all producers to end (which they do after a 10s run), and then waiting for the disruptor to end. Not very accurate: Where do those additional published events go?
Driver start: PID=38619 Processors=8 RingBufferSize=1024 Connections=Reuse Publishers=14[SINGLE] Handlers=1[BLOCK] HandlerType=EventHandler<Event>
Done: elpased=10s eventsPublished=6956894 eventsProcessed=4954645
Stats: events/sec=494883.36 sec/event=0.0000 CPU=82.4%
Using ProducerType.MULTI, fewer events are published than with SINGLE, but more events are actually consumed in the same 10 seconds than with SINGLE. And with MULTI, all of the published events are consumed, just what I would expect due to the careful way the driver shuts itself down after the elapsed time expires:
Driver start: PID=38625 Processors=8 RingBufferSize=1024 Connections=Reuse Publishers=14[MULTI] Handlers=1[BLOCK] HandlerType=EventHandler<Event>
Done: elpased=10s eventsPublished=6397109 eventsProcessed=6397109
Stats: events/sec=638906.33 sec/event=0.0000 CPU=30.1%
Again: 2 or more producers: Use ProducerType.MULTI.
By the way, each Producer publishes directly to the ring buffer by getting the next slot, updating the event, and then publishing the slot. And the handler gets the event whenever its onEvent method is called. No extra queues. Very simple.
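For reference, the publish sequence described there looks roughly like this with the Disruptor 3.x RingBuffer API; the ValueEvent type and its setter are assumptions based on the question:

import com.lmax.disruptor.RingBuffer;

void publish(RingBuffer<ValueEvent> ringBuffer, long payload) {
    long seq = ringBuffer.next();              // claim the next slot
    try {
        ValueEvent event = ringBuffer.get(seq);
        event.setValue(payload);               // update the event in place
    } finally {
        ringBuffer.publish(seq);               // make the slot visible to the handlers
    }
}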
IMHO, a single producer accessed by multiple threads with a lock won't solve your problem, because it simply shifts the locking from the disruptor side to your own program.
The solution to your problem depends on the kind of event model you need, i.e. whether the events have to be consumed chronologically, merged, or meet any other special requirement. Since you are dealing with a disruptor and multiple producers, that sounds to me very much like an FX trading system :-) Anyway, based on my experience, assuming you need chronological order per producer but don't care about mixing events between producers, I would recommend a queue-merging thread. The structure is:
Each producer produces data and puts it into its own named queue.
A worker thread constantly examines the queues. For each queue it removes one or several items and puts them to the single producer of your single-producer disruptor.
Note that in the above scenario,
Each producer queue is a single producer single consumer queue.
The disruptor is a single producer multi consumer disruptor.
Depending on your needs, to avoid a forever-running thread: if the worker thread finds, say, 100 consecutive runs in which all queues are empty, it can set some variable and go to wait(), and the event producers can notify() it when they see it is waiting.
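A minimal sketch of that merging thread, assuming the producer queues are simple poll()-able queues and the publisher callback is the only code that touches the single-producer disruptor (all names are illustrative, and a simple yield is used instead of the wait()/notify() variant described above):

import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

class QueueMerger implements Runnable {
    private final List<Queue<ValueEvent>> producerQueues;   // one queue per producer (ideally SPSC queues)
    private final Consumer<ValueEvent> publisher;            // the only caller of the single-producer disruptor
    private volatile boolean running = true;

    QueueMerger(List<Queue<ValueEvent>> producerQueues, Consumer<ValueEvent> publisher) {
        this.producerQueues = producerQueues;
        this.publisher = publisher;
    }

    @Override
    public void run() {
        while (running) {
            boolean idle = true;
            for (Queue<ValueEvent> q : producerQueues) {
                ValueEvent e = q.poll();          // take one item per queue per pass
                if (e != null) {
                    publisher.accept(e);          // ring buffer stays in ProducerType.SINGLE mode
                    idle = false;
                }
            }
            if (idle) {
                Thread.yield();                   // or park/wait-notify as suggested above
            }
        }
    }

    void stop() {
        running = false;
    }
}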
I think this resolves your problem. If not, please post your required event-processing pattern and let's see.
I just read RabbitMQ's Java API docs, and found it very informative and straight-forward. The example for how to set up a simple Channel for publishing/consuming is very easy to follow and understand. But it's a very simple/basic example, and it left me with an important question: How can I set up 1+ Channels to publish/consume to and from multiple queues?
Let's say I have a RabbitMQ server with 3 queues on it: logging, security_events and customer_orders. So we'd either need a single Channel to have the ability to publish/consume to all 3 queues, or more likely, have 3 separate Channels, each dedicated to a single queue.
On top of this, RabbitMQ's best practices dictate that we set up 1 Channel per consumer thread. For this example, let's say security_events is fine with only 1 consumer thread, but logging and customer_orders both need 5 threads to handle the volume. So, if I understand correctly, does that mean we need:
1 Channel and 1 consumer thread for publishing/consuming to and from security_events; and
5 Channels and 5 consumer threads for publishing/consuming to and from logging; and
5 Channels and 5 consumer threads for publishing/consuming to and from customer_orders?
If my understanding is misguided here, please begin by correcting me. Either way, could some battle-weary RabbitMQ veteran help me "connect the dots" with a decent code example for setting up publishers/consumers that meet my requirements here?
I think you have several issues with your initial understanding. Frankly, I'm a bit surprised to see the following: both need 5 threads to handle the volume. How did you identify that you need exactly that number? Do you have any guarantees 5 threads will be enough?
RabbitMQ is tuned and time tested, so it is all about proper design
and efficient message processing.
Let's try to review the problem and find a proper solution. BTW, a message queue by itself will not guarantee that you have a really good solution. You have to understand what you are doing and also do some additional testing.
As you definitely know, there are many possible layouts. I will use layout B as the simplest way to illustrate the 1 producer, N consumers problem, since you are so worried about throughput. BTW, as you might expect, RabbitMQ behaves quite well (source). Pay attention to prefetchCount; I'll address it later.
So it is likely that the message-processing logic is the right place to make sure you'll have enough throughput. Naturally you can spawn a new thread every time you need to process a message, but eventually such an approach will kill your system. Basically, the more threads you have, the bigger the latency you'll get (you can check Amdahl's law if you want).
(see Amdahl’s law illustrated)
Tip #1: Be careful with threads, use ThreadPools (details)
A thread pool can be described as a collection of Runnable objects (the work queue) and a collection of running threads. These threads are constantly running and checking the work queue for new work. If there is new work to be done, they execute this Runnable. The thread pool itself provides a method, e.g. execute(Runnable r), to add a new Runnable object to the work queue.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    private static final int NTHREDS = 10;

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 500; i++) {
            Runnable worker = new MyRunnable(10000000L + i); // MyRunnable is the tutorial's worker task
            executor.execute(worker);
        }
        // This will make the executor accept no new threads
        // and finish all existing threads in the queue
        executor.shutdown();
        // Wait until all threads are finished
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        System.out.println("Finished all threads");
    }
}
Tip #2: Be careful with message processing overhead
I would say this is an obvious optimization technique. It is likely you'll send small, easy-to-process messages. The whole approach is about small messages being continuously sent and processed. Big messages will eventually hurt you, so it is better to avoid that.
So it is better to send tiny pieces of information, but what about processing? There is an overhead every time you submit a job. Batch processing can be very helpful in case of a high incoming message rate.
For example, let's say we have simple message-processing logic and we do not want to pay thread-specific overheads every time a message is processed. In order to optimize that, a very simple CompositeRunnable can be introduced:
import java.util.LinkedList;
import java.util.Queue;

class CompositeRunnable implements Runnable {
    protected Queue<Runnable> queue = new LinkedList<>();

    public void add(Runnable a) {
        queue.add(a);
    }

    @Override
    public void run() {
        for (Runnable r : queue) {
            r.run();
        }
    }
}
Or do the same in a slightly different way, by collecting messages to be processed:
import java.util.LinkedList;
import java.util.Queue;

class CompositeMessageWorker<T> implements Runnable {
    protected Queue<T> queue = new LinkedList<>();

    public void add(T message) {
        queue.add(message);
    }

    @Override
    public void run() {
        for (T message : queue) {
            // process a message
        }
    }
}
In this way you can process messages more effectively.
Tip #3: Optimize message processing
Despite the fact that you now can process messages in parallel (Tip #1) and reduce the processing overhead (Tip #2), you still have to do everything fast. Redundant processing steps, heavy loops and so on might affect performance a lot. Please see an interesting case study:
Improving Message Queue Throughput tenfold by choosing the right XML Parser
Tip #4: Connection and Channel Management
Starting a new channel on an existing connection involves one network round trip - starting a new connection takes several.
Each connection uses a file descriptor on the server. Channels don't.
Publishing a large message on one channel will block a connection while it goes out. Other than that, the multiplexing is fairly transparent.
Connections which are publishing can get blocked if the server is overloaded - it's a good idea to separate publishing and consuming connections.
Be prepared to handle message bursts.
(source)
Please note, all these tips work perfectly together. Feel free to let me know if you need additional details.
Complete consumer example (source)
Please note the following:
channel.basicQos(prefetch) - As you saw earlier prefetchCount might be very useful:
This command allows a consumer to choose a prefetch window that
specifies the amount of unacknowledged messages it is prepared to
receive. By setting the prefetch count to a non-zero value, the broker
will not deliver any messages to the consumer that would breach that
limit. To move the window forwards, the consumer has to acknowledge
the receipt of a message (or a group of messages).
ExecutorService threadExecutor - you can specify a properly configured executor service.
Example:
static class Worker extends DefaultConsumer {
    String name;
    Channel channel;
    String queue;
    int processed;
    ExecutorService executorService;

    public Worker(int prefetch, ExecutorService threadExecutor,
                  Channel c, String q) throws Exception {
        super(c);
        channel = c;
        queue = q;
        channel.basicQos(prefetch);
        channel.basicConsume(queue, false, this);
        executorService = threadExecutor;
    }

    @Override
    public void handleDelivery(String consumerTag,
                               Envelope envelope,
                               AMQP.BasicProperties properties,
                               byte[] body) throws IOException {
        Runnable task = new VariableLengthTask(this,
                                               envelope.getDeliveryTag(),
                                               channel);
        executorService.submit(task);
    }
}
You can also check the following:
Solution Architecting Using Queues?
Some queuing theory: throughput, latency and bandwidth
A quick message queue benchmark: ActiveMQ, RabbitMQ, HornetQ, QPID, Apollo…
How can I set up 1+ Channels to publish/consume to and from multiple queues?
You can implement this using threads and channels. All you need is a way to categorize things, i.e. all the queue items from logging, all the queue items from security_events, etc. The categorization can be achieved using a routingKey.
i.e.: every time you add an item to the queue you specify the routing key. It will be appended as a property element. By this you can get the values for a particular event, say logging.
The following code sample explains how to do it on the client side.
Eg:
The routing key is used to identify the type of the channel and retrieve the types.
For example, if you need to get all the channels of the type Login, then you must specify the routing key as login or some other keyword to identify that.
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
channel.exchangeDeclare(EXCHANGE_NAME, "direct");
String routingKey = "login";
channel.basicPublish(EXCHANGE_NAME, routingKey, null, message.getBytes());
You can look here for more details about the categorization.
Threads Part
Once the publishing part is over you can run the threads part.
In this part you can get the published data on the basis of category, i.e. the routing key, which in your case is logging, security_events, customer_orders, etc.
Look at the example to see how to retrieve the data in threads.
Eg :
ConnectionFactory factory = new ConnectionFactory();
factory.setHost("localhost");
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();

//**The threads part is as follows**

channel.exchangeDeclare(EXCHANGE_NAME, "direct");
String queueName = channel.queueDeclare().getQueue();

// This part will bind the queue with the severity (login, for example)
for (String severity : argv) {
    channel.queueBind(queueName, EXCHANGE_NAME, severity);
}

boolean autoAck = false;
channel.basicConsume(queueName, autoAck, "myConsumerTag",
    new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String consumerTag,
                                   Envelope envelope,
                                   AMQP.BasicProperties properties,
                                   byte[] body)
            throws IOException
        {
            String routingKey = envelope.getRoutingKey();
            String contentType = properties.getContentType();
            long deliveryTag = envelope.getDeliveryTag();
            // (process the message components here ...)
            channel.basicAck(deliveryTag, false);
        }
    });
Now a thread that processes the data in the queue of the type login (routing key) is created. In this way you can create multiple threads, each serving a different purpose.
Look here for more details about the threads part.
Straight answer
For your particular situation (logging and customer_orders both need 5 threads) I would create 1 Channel with 1 Consumer for logging and 1 Channel with 1 Consumer for customer_orders. I would also create 2 thread pools (5 threads each): one to be used by the logging Consumer and the other by the customer_orders Consumer.
See Consumption below for why this should work.
PS: do not create the thread pool inside the Consumer; also be aware that Channel.basicConsume(...) is not blocking.
Publish
According to Channels and Concurrency Considerations (Thread Safety):
Concurrent publishing on a shared channel is best avoided entirely,
e.g. by using a channel per thread. ... Consuming in one thread and publishing in another thread on a shared channel can be safe.
pretty clear ...
Consumption
The Channel might (I say might because of this) run all its Consumers in the same thread; this idea is almost explicitly conveyed by Receiving Messages by Subscription ("Push API"):
Each Channel has its own dispatch thread. For the most common use case
of one Consumer per Channel, this means Consumers do not hold up other
Consumers. If you have multiple Consumers per Channel be aware that a
long-running Consumer may hold up dispatch of callbacks to other
Consumers on that Channel.
This means that in certain conditions many Consumers pertaining to the same Channel will run on the same thread, such that the first one would hold up the dispatch of callbacks to the next ones. The word dispatch is very confusing because it sometimes refers to "thread work dispatching" while here it refers mainly to calling Consumer.handleDelivery (see this again).
But what is that own dispatch thread about? It is one of the threads of the pool used by the connection (see Channels and Concurrency Considerations (Thread Safety)):
Server-pushed deliveries ... uses a
java.util.concurrent.ExecutorService, one per connection.
Conclusion
If one has 1 Channel with 1 Consumer but wants to process the incoming messages in parallel, then he'd better create (outside the Consumer) and use (inside the Consumer) his own thread pool; this way each message received by the Consumer will be processed on the user's thread pool instead of on the Channel's own dispatch thread.
Is this approach (a user thread pool used from the Consumer) even possible/valid/acceptable at all? It is, see Channels and Concurrency Considerations (Thread Safety):
thread that received the delivery (e.g. Consumer#handleDelivery
delegated delivery handling to a different thread) ...
In our multithreaded Java app, we are using a separate LinkedBlockingDeque instance for each thread; assume threads (c1, c2, ..., c200).
Threads T1 & T2 receive data from a socket and add the object to the specific consumer's queue among c1 to c200.
There is an infinite loop inside run() which calls LinkedBlockingDeque.take().
Under load, the CPU usage for the java.exe process itself is 40%. When we sum up the other processes in the system, the overall CPU usage reaches 90%.
Using Java VisualVM, run() is taking more CPU and we suspect LinkedBlockingDeque.take().
So we tried alternatives like wait/notify and Thread.sleep(0), but there was no change.
The reasons why each consumer has a separate queue are:
1. there might be more than one request for consumer c1 from T1 or T2
2. if we dump all requests into a single queue, the search time for c1 to c200 will be longer and the search criteria will grow
3. each consumer gets a separate queue to process its own requests
We are trying to reduce the CPU usage and need your inputs...
SD
Do profiling and make sure that the queue methods really take a significant share of CPU time. Is your message processing so simple that it is comparable in cost to putting/taking to/from the queue?
How many messages are processed per second? How many CPUs are there? If each CPU is processing fewer than 100K messages per second, then it's likely that the cause is not the access to the queues, but the message handling itself.
Putting into a LinkedBlockingDeque creates an instance of a helper node object, and I suspect each new message is also allocated from the heap, so that's two allocations per message. Try to use a pool of preallocated messages and circular buffers (see the sketch at the end of this answer).
200 threads is way too many. This means too many context switches. Try to use actor libraries and thread pools, for example https://github.com/rfqu/df4j (yes, it's mine).
Check if http://code.google.com/p/disruptor/ would fit for your needs.
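Regarding the pool of preallocated messages mentioned above, a minimal sketch could look like this (the Message type, its no-arg constructor and the clear() reset method are assumptions made for illustration):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class MessagePool {
    private final BlockingQueue<Message> free;

    MessagePool(int size) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            free.offer(new Message());       // allocate everything up front
        }
    }

    Message acquire() throws InterruptedException {
        return free.take();                  // blocks if every message is in flight
    }

    void release(Message msg) {
        // msg.clear();                      // reset fields before reuse (assumed method)
        free.offer(msg);
    }
}

Receivers acquire() a message, fill it from the socket, hand it to the consumer, and the consumer release()s it when done, so steady-state allocation (and GC pressure) drops to almost zero.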