In our multithreaded Java app, we are using a separate LinkedBlockingDeque instance for each consumer thread; assume consumer threads (c1, c2, ..., c200).
Threads T1 and T2 receive data from a socket and add each object to the queue of the specific consumer among c1 to c200.
Each consumer runs an infinite loop inside run() that calls LinkedBlockingDeque.take().
During the load run, the CPU usage for java.exe itself is 40%. When we add the other processes in the system, the overall CPU usage reaches 90%.
According to Java VisualVM, run() is taking most of the CPU, and we suspect LinkedBlockingDeque.take().
So we tried alternatives like wait()/notify() and Thread.sleep(0), but nothing changed.
The reasons each consumer has its own queue are:
1. There might be more than one request for consumer c1 from T1 or T2.
2. If we dump all requests into a single queue, the search time across c1 to c200 will grow and the search criteria will become more complex.
3. Each consumer gets a separate queue so it can process its own requests.
We are trying to reduce the CPU usage and need your input.
SD
Do profiling and make sure that the queue methods really take a relatively large share of CPU time. Is your message processing so simple that it is comparable in cost to putting/taking to/from the queue?
How many messages are processed per second? How many CPUs are there? If each CPU is processing fewer than 100K messages per second, then it's likely that the cause is not access to the queues, but the message handling itself.
Putting into a LinkedBlockingDeque creates an instance of a helper node object, and I suspect each new message is also allocated from the heap, so that is two allocations per message. Try to use a pool of preallocated messages and circular buffers.
200 threads is way too many; it means too many context switches. Try to use actor libraries and thread pools, for example https://github.com/rfqu/df4j (yes, it's mine).
Check whether http://code.google.com/p/disruptor/ would fit your needs.
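To make the "pool of preallocated messages" idea concrete, here is a minimal sketch. It assumes a hypothetical Message class with a clear() method for resetting state before reuse; ArrayBlockingQueue is used because it is backed by a fixed circular array.

import java.util.concurrent.ArrayBlockingQueue;

class MessagePool {
    private final ArrayBlockingQueue<Message> free;

    MessagePool(int capacity) {
        free = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            free.offer(new Message());          // preallocate everything up front
        }
    }

    Message acquire() throws InterruptedException {
        return free.take();                     // blocks if all messages are in use
    }

    void release(Message m) {
        m.clear();                              // hypothetical reset method
        free.offer(m);
    }
}

T1/T2 would acquire() a message, fill it from the socket, hand it to the consumer, and the consumer would release() it after processing, so in steady state there is no per-message heap allocation.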
I'm trying to see the difference between DirectMessageListener and SimpleMessageListener. I have this drawing, just to ask if it is correct.
Let me try to describe how I understood it, and maybe you can tell me if it is correct.
In front of spring-rabbit there is the rabbit-client Java library, which connects to the RabbitMQ server and delivers messages to the spring-rabbit library. This client has a ThreadPoolExecutor (which in this case has, I think, 16 threads). So it does not matter how many queues there are in RabbitMQ: if there is a single connection, I get 16 threads. These same threads are reused if I use DirectMessageListener, and the handler method listen is executed on all of these 16 threads when messages arrive. So if I do something complex in the handler, the rabbit-client must wait for a thread to become free in order to get the next message using that thread. Also, if I increase setConsumersPerQueue to, let's say, 20, it will create 20 consumers per queue, but not threads. These 20*5 consumers in my case will all reuse the 16 threads offered by the ThreadPoolExecutor?
SimpleMessageListener, on the other hand, has its own threads. If concurrentConsumers == 1 (the default, I guess, as in my case), it has only one thread. Whenever there is a message on any of the secondUseCase* queues, the rabbit-client Java library will use one of its 16 threads (in my case) to forward the message to the single internal thread that I have in SimpleMessageListener. As soon as it is forwarded, the rabbit-client thread is freed and can go back to fetching more messages from the RabbitMQ server.
Your understanding is correct.
The main difference is that, with the DMLC, all listeners in all listener containers are called on the shared thread pool in the amqp-client (you can increase the 16 if needed). You need to ensure the pool is large enough to handle your expected concurrency across all containers, otherwise you will get starvation.
It's more efficient because threads are shared.
With the SMLC, you don't have to worry about that, but at the expense of having a thread per concurrency. In that case, a small pool in the amqp-client will generally be sufficient.
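As a rough illustration of the two container types (a sketch, not production configuration; the queue names are made up and error handling is omitted):

import org.springframework.amqp.core.Message;
import org.springframework.amqp.core.MessageListener;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.listener.DirectMessageListenerContainer;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

public class Containers {
    public static void main(String[] args) {
        CachingConnectionFactory cf = new CachingConnectionFactory("localhost");
        MessageListener listener = new MessageListener() {
            public void onMessage(Message message) {
                // handle the message
            }
        };

        // DMLC: listeners are invoked directly on the amqp-client's thread pool;
        // consumersPerQueue adds consumers, not dedicated container threads.
        DirectMessageListenerContainer dmlc = new DirectMessageListenerContainer(cf);
        dmlc.setQueueNames("firstUseCase");
        dmlc.setConsumersPerQueue(20);
        dmlc.setMessageListener(listener);
        dmlc.start();

        // SMLC: each concurrent consumer gets its own container-managed thread;
        // the amqp-client threads only hand messages over to it.
        SimpleMessageListenerContainer smlc = new SimpleMessageListenerContainer(cf);
        smlc.setQueueNames("secondUseCase1");
        smlc.setConcurrentConsumers(1);
        smlc.setMessageListener(listener);
        smlc.start();
    }
}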
I'm chasing some memory issues in an app that pulls file names from a Kafka queue and does some processing on each. This app runs in Docker with one instance per partition.
Each instance has a single consumer handle that retrieves the next file name and puts it into an ArrayBlockingQueue. Meanwhile, several threads take the next file from this queue and do the processing. I'm using this secondary queuing because each file can take some time to copy and process (there are instances of "exponential backoff", i.e. a thread may be sleeping), so it seemed prudent to have several files 'in the pipeline' simultaneously.
My question is about the relative benefits (with regard to memory management) of doing it this way (several 'permanent' threads reading from a shared queue) versus launching a new thread for each file as it gets pulled from the queue. For this alternative I imagine a FixedThreadPool to which a new task is submitted as each file is pulled from Kafka.
Is there any advantage to one approach over the other?
Edit:
My primary concern is minimizing GC time. I want to avoid having anything substantial promoted to the old generation. This makes me think the second model is the better way to go.
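For the second model, a minimal sketch (assuming a KafkaConsumer of file-name strings; processFile is a placeholder for the copy-and-process step). Note that a FixedThreadPool reuses a fixed set of long-lived threads rather than spawning a new one per file; only the small task objects and the file-name strings are allocated per file:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class FileDispatcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(8); // reused threads, not one per file

    void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                String fileName = record.value();
                pool.submit(() -> processFile(fileName)); // each file becomes a short-lived task
            }
        }
    }

    private void processFile(String fileName) {
        // copy and process the file, with whatever backoff/sleeping is needed
    }
}

The pool's internal work queue plays roughly the role that the ArrayBlockingQueue plays in the first model.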
I'm really new to programming and I'm having performance problems with my software. Basically, I get some data and run a 100-iteration loop on it (i = 0; i < 100; i++), and during that loop my program makes one of three decisions: keep the data it's working on, discard it, or send a version of it back to the queue to process. The individual work each thread does is very small, but there's a lot of it (which is why I'm using a queue server to scale horizontally).
My problem is that it never comes close to using my entire CPU; my program runs at around 40% per core. After profiling, it seems the majority of the time is spent sending/receiving data from the queue (approximately 64% in a part called com.rabbitmq.client.impl.Frame.readFrom(DataInputStream) and com.rabbitmq.client.impl.SocketFrameHandler.readFrame(), approximately 17% is getting it into the format for the queue (which I brought down from 40%), and the rest is spent on my program's logic). Obviously, I want my work to be done faster and I don't want it to spend so much time in the queue, so I'm wondering if there's a better design I can use.
My code is actually quite large, but here's an overview of what it does:
I create a connection to the queue server (RabbitMQ, from Java).
I fork as many threads as I have CPU cores (using the same connection).
The data flow in each thread is:
Each thread creates its own channel to the queue server using the shared connection.
There's a while loop that polls the server and gets X messages without acknowledging them.
Once I get a message, I use a thread executor to send the acknowledgment while my job is running.
I parse the message and run my loop.
If data needs to go back to the queue, I hand it to a thread executor that sends it back, so my program can proceed with the next data set.
One odd thing I did: although I use a thread executor for acknowledgments and for sending back to the queue, my main worker threads are plain forked threads (using public void run()). Because my program is dedicated to this single process, I did that to make sure there were always X threads ready to work (with no shutting down/respawning of them). The rest runs in executor threads because I figured that work could wait/be queued while my main program runs.
I'm not sure how to design it better so it spends less time gathering/sending data. Are there any designs, RabbitMQ features, or Java techniques I can use to help?
If it's not IO wait, then I suspect that it's down to some locking going on inside those methods.
It looks to me like your threads are spending a significant amount of time waiting for them to return. Somewhat counter-intuitively, you might well be able to increase your performance by cutting down on the number of threads, since they'll spend less time tripping over each other and more time actively doing something.
Give it a try and see what effect it has on the profile.
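One way to cut down on threads, as a rough sketch: drop the hand-rolled polling loop and the acknowledgment executor, and let the client library's own dispatch threads drive the work via push-based delivery with a prefetch window. The queue name and handler below are illustrative and error handling is omitted:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import java.io.IOException;

class Workers {
    static void start(Connection connection, int channels) throws IOException {
        for (int i = 0; i < channels; i++) {
            final Channel channel = connection.createChannel();   // channels share the one connection
            channel.basicQos(50);                                  // prefetch keeps a batch in flight without polling
            channel.basicConsume("work", false, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body) throws IOException {
                    process(body);                                 // the 100-iteration loop goes here
                    channel.basicAck(envelope.getDeliveryTag(), false);
                }
            });
        }
    }

    static void process(byte[] body) {
        // keep, discard, or republish the data
    }
}

handleDelivery is invoked on the client's consumer dispatch threads, so there is no separate pool of forked worker threads or acknowledgment executor to contend with; throughput is then governed mostly by the prefetch count and the cost of process().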
I have a 1-producer, M-consumer threaded pattern. The producer reads a raw document from disk and places it in a LinkedBlockingQueue. Each consumer thread then takes a raw document and parses it using a class:
ParsedDoc article = parseDoc(rawDocument);
The parseDoc class is a set of approximately 20 methods with the following pattern:
public String clearContent(String document) {
Pattern regex = Pattern.compile(pattern);
Matcher matcher = regex.matcher(document);
matcher.find();
....
}
public String removeHTML(String document) {
Pattern regex = Pattern.compile(pattern);
Matcher matcher = regex.matcher(document);
matcher.replaceAll("");
....
}
The problem I am facing is that the code runs reasonably fast on my local (2-core) machine. But when I run the same code on an 8-core machine, the consumer performance degrades almost to a grinding halt. I've tried to tune the JVM options to no avail. Removing the regex processing step resulted in the expected 4x performance gain on the 8-core machine. So the problem is the regexes. I realize Pattern is thread-safe and a Matcher may need to be reset(). But the problem is how to redesign the bank of regexes (in the parseDoc class) so that it is thread-safe across M consumers.
Any help would be much appreciated
Thank you
Compiling the regex is slow. You should only do it once for a given pattern. Unless the pattern variable shown in your sample is really different for each invocation, the Pattern instances could probably be static class members. A Pattern is explicitly safe for concurrent use by multiple threads. (All of the modifiable state is held by the Matcher.)
Since the Matcher is confined to the stack of a single thread, you don't have any threading issues to worry about. Don't try to reuse a Matcher. It can be done, but I'd be surprised if recycling them saves much time compared to regex compilation.
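A minimal sketch of that advice applied to the parseDoc-style methods (the patterns here are made up; the point is that Pattern.compile happens once per pattern, and each call gets its own Matcher):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocCleaner {
    // compiled once, safely shared by all consumer threads
    private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");
    private static final Pattern EXTRA_SPACE = Pattern.compile("\\s{2,}");

    public String removeHTML(String document) {
        return HTML_TAG.matcher(document).replaceAll("");
    }

    public String clearContent(String document) {
        Matcher matcher = EXTRA_SPACE.matcher(document);   // a fresh Matcher per call: no shared mutable state
        return matcher.replaceAll(" ");
    }
}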
The producer-consumer pattern doesn't scale well.
The more producers or consumers you have, the worse the performance you get. The reason is that the common queue becomes a bottleneck for the whole system. I hope you see how.
A better approach is to have no common queue; give each consumer its own queue. When a request comes in, it goes to a load balancer. The load balancer places the request in the consumer queue that is currently smallest. The balancer becomes the bottleneck, but it doesn't do a lot of operations - it just picks the right queue for the incoming request - so it should be very fast.
Here is an edit to answer your questions:
Problem (in more depth): the more cores you have, the slower it gets. Why? Shared memory.
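A minimal sketch of such a balancer, assuming one BlockingQueue per consumer (the shortest-queue scan is deliberately simple, and queue sizes are only approximate under concurrency):

import java.util.List;
import java.util.concurrent.BlockingQueue;

class SmallestQueueBalancer<T> {
    private final List<BlockingQueue<T>> consumerQueues;   // one private queue per consumer

    SmallestQueueBalancer(List<BlockingQueue<T>> consumerQueues) {
        this.consumerQueues = consumerQueues;
    }

    void dispatch(T request) throws InterruptedException {
        BlockingQueue<T> shortest = consumerQueues.get(0);
        for (BlockingQueue<T> q : consumerQueues) {
            if (q.size() < shortest.size()) {
                shortest = q;                               // pick the queue with the fewest pending requests
            }
        }
        shortest.put(request);
    }
}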
@Peyman, use ConcurrentLinkedQueue (a non-blocking, lock-free queue where an enqueue and a dequeue can proceed concurrently). Try it in your initial design too and benchmark both designs. I expect your revised design to perform better, because you can have only one enqueue and one dequeue at the same time, as opposed to one enqueue and n dequeues as in your initial design (but this is just my speculation).
There is a great paper on scalable producer-consumer designs using balancers.
Read this page (or you can look only at the section "migrate from the common worker queue approach to the queue-per-thread approach").
Here's a list from http://www.javaperformancetuning.com/news/newtips128.shtml. I think the last 3 points are more applicable to you:
Most server applications use a common worker queue and thread pool; a shared worker queue holds short tasks that arrive from remote sources; a pool of threads retrieves tasks from the queue and processes the tasks; threads are blocked on the queue if there is no task to process.
A feeder queue shared amongst threads is an access bottleneck (from contention) when the number of tasks is high and the task time is very short. The bottleneck gets worse the more cores that are used.
Solutions available for overcoming contention in accessing a shared queue include: Using lock-free data structures; Using concurrent data structures with multiple locks; Maintaining multiple queues to isolate the contention.
A queue-per-thread approach eliminates queue access contention, but is not optimal when a queue is emptied while there is unprocessed queued data in other queues. To improve this, idle threads should have the ability to steal work from other queues. To keep contention to a minimum, the 'steal' should be done from the tail of the other queue (where normal dequeuing from the thread's own queue is done from the head of the queue).
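A minimal sketch of the queue-per-thread plus stealing idea from the last point, using LinkedBlockingDeque-style deques so a worker takes from the head of its own deque and steals from the tail of a random victim (a real implementation would back off instead of spinning when everything is empty; names are illustrative):

import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ThreadLocalRandom;

class StealingWorker implements Runnable {
    private final BlockingDeque<Runnable> own;
    private final BlockingDeque<Runnable>[] all;   // every worker's deque, including our own

    StealingWorker(BlockingDeque<Runnable> own, BlockingDeque<Runnable>[] all) {
        this.own = own;
        this.all = all;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Runnable task = own.pollFirst();       // normal dequeue from the head of our own queue
            if (task == null) {
                BlockingDeque<Runnable> victim = all[ThreadLocalRandom.current().nextInt(all.length)];
                task = victim.pollLast();          // steal from the tail to keep contention low
            }
            if (task != null) {
                task.run();
            }
        }
    }
}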
If you're not getting the concurrency you expect due to synchronization that is out of your control, one solution that comes to mind is to dispatch work to sub-processes (additional JVMs), up to the number of cores minus one.
I initially asked this question here, but I've realized that my question is not about a while-true loop. What I want to know is, what's the proper way to do high-performance asynchronous message-passing in Java?
What I'm trying to do...
I have ~10,000 consumers, each consuming messages from their private queues. I have one thread that's producing messages one by one and putting them in the correct consumer's queue. Each consumer loops indefinitely, checking for a message to appear in its queue and processing it.
I believe the term is "single-producer/single-consumer", since there's one producer, and each consumer only works on their private queue (multiple consumers never read from the same queue).
Inside Consumer.java:
@Override
public void run() {
    while (true) {
        Message msg = messageQueue.poll();
        if (msg != null) {
            ... // do something with the message
        }
    }
}
The Producer is putting messages inside Consumer message queues at a rapid pace (several million messages per second). Consumers should process these messages as fast as possible!
Note: the while (true) { ... } is terminated by a KILL message sent by the Producer as its last message.
However, my question is about the proper way to design this message-passing. What kind of queue should I use for messageQueue? Should it be synchronous or asynchronous? How should Message be designed? Should I use a while-true loop? Should Consumer be a thread, or something else? Will 10,000 threads slow down to a crawl? What's the alternative to threads?
So, what's the proper way to do high-performance message-passing in Java?
I would say that the context-switching overhead of 10,000 threads is going to be very high, not to mention the memory overhead. By default, on 32-bit platforms, each thread uses a default stack size of 256 KB, so that's 2.5 GB just for your stacks. Obviously you're talking 64-bit, but even so, that's quite a large amount of memory. Because of the amount of memory used, the cache is going to thrash a lot, and the CPU will be throttled by memory bandwidth.
I would look for a design that avoids using so many threads, to avoid allocating large amounts of stack and incurring context-switching overhead. You cannot run 10,000 threads concurrently anyway; current hardware typically has fewer than 100 cores.
I would create one queue per hardware thread and dispatch messages in a round-robin fashion. If the processing times vary considerably, there is the danger that some threads finish processing their queue before they are given more work, while other threads never get through their allotted work. This can be avoided by using work stealing, as implemented in the JSR-166 ForkJoin framework.
Since communication is one way from the publisher to the subscribers, then Message does not need any special design, assuming the subscriber doesn't change the message once it has been published.
EDIT: Reading the comments, if you have 10,000 symbols, then create a handful of generic subscriber threads (one subscriber thread per core) that asynchronously receive messages from the publisher (e.g. via their message queues). A subscriber pulls a message from its queue, retrieves the symbol from the message, looks the symbol up in a Map of message handlers, retrieves the handler, and invokes it to handle the message synchronously. Once done, it repeats, fetching the next message from the queue. If messages for the same symbol have to be processed in order (which is why I'm guessing you wanted 10,000 queues), you need to map symbols to subscribers. E.g. if there are 10 subscribers, then symbols 0-999 go to subscriber 0, 1000-1999 to subscriber 1, etc. A more refined scheme is to map symbols according to their frequency distribution, so that each subscriber gets roughly the same load. For example, if 10% of the traffic is symbol 0, then subscriber 0 will deal with just that one symbol and the other symbols will be distributed amongst the other subscribers.
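A minimal sketch of that EDIT: a handful of subscriber threads, each owning one queue, with symbols hashed to a fixed subscriber so that messages for the same symbol are always processed in order by the same thread. The Message shape and the handler lookup are illustrative:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SymbolDispatcher {
    private final BlockingQueue<Message>[] queues;

    @SuppressWarnings("unchecked")
    SymbolDispatcher(int subscribers) {
        queues = new BlockingQueue[subscribers];
        for (int i = 0; i < subscribers; i++) {
            queues[i] = new ArrayBlockingQueue<>(10_000);
            int idx = i;
            new Thread(() -> consume(queues[idx]), "subscriber-" + idx).start();
        }
    }

    void publish(Message msg) throws InterruptedException {
        int idx = Math.floorMod(msg.symbol().hashCode(), queues.length); // same symbol, same subscriber
        queues[idx].put(msg);
    }

    private void consume(BlockingQueue<Message> queue) {
        try {
            while (true) {
                Message msg = queue.take();        // blocks instead of spinning
                handle(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void handle(Message msg) {
        // look up the symbol's handler in a Map and invoke it synchronously
    }

    interface Message { String symbol(); }         // hypothetical message shape
}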
You could use this (credit goes to Which ThreadPool in Java should I use?):
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class Main {
    static ExecutorService threadPool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors() * 2);

    public static void main(String[] args) {
        Set<Consumer> consumers = getConsumers(threadPool); // build your consumers however you like
        for (Consumer consumer : consumers) {
            threadPool.execute(consumer);
        }
    }
}
and
import java.util.concurrent.ExecutorService;

class Consumer implements Runnable {
    private final ExecutorService tp;
    private final MessageQueue messageQueue;

    Consumer(ExecutorService tp, MessageQueue queue) {
        this.tp = tp;
        this.messageQueue = queue;
    }

    @Override
    public void run() {
        Message msg = messageQueue.poll();
        if (msg != null) {
            try {
                ... // do something with the message
            } finally {
                this.tp.execute(this); // re-submit this consumer so others get a turn
            }
        }
    }
}
This way, you can have okay scheduling with very little hassle.
First of all, there's no single correct answer unless you either post a complete design document or try different approaches yourself.
I'm assuming your processing is not going to be computationally intensive, otherwise you wouldn't be thinking of processing 10,000 queues at the same time. One possible solution is to minimise context switching by having one or two threads per CPU. Unless your system has to process data in strict real time, this may give you bigger delays on each queue, but better overall throughput.
For example, have your producer thread run on its own CPU and hand batches of messages to the consumer threads. Each consumer thread would then distribute the messages to its N private queues, perform the processing step, receive a new batch of data, and so on. Again, it depends on your delay tolerance, so the processing step may mean processing all of the queues, a fixed number of queues, or as many queues as it can until a time threshold is reached. Being able to tell easily which queue belongs to which consumer thread (e.g. if queues are numbered sequentially: int consumerThreadNum = queueNum & 0x03) would be beneficial, as looking them up in a hash table each time may be slow.
To minimise memory thrashing, it may not be such a good idea to create and destroy queues all the time, so you may want to pre-allocate (max number of queues / number of cores) queue objects per thread. When a queue is finished, instead of being destroyed, it can be cleared and reused. You don't want GC to get in your way too often or for too long.
Another unknown is whether your producer produces complete sets of data for each queue or sends data in chunks until the KILL command is received. If your producer sends complete data sets, you may do away with the queue concept completely and just process the data as it arrives at a consumer thread.
Have a pool of consumer threads sized relative to the hardware and OS capacity. These consumer threads can poll your message queue.
I would either have the Messages know how to process themselves, or register processors with the consumer thread classes when they are initialized.
In the absence of more detail about the constraints of processing the symbols, it's hard to give very specific advice.
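For the "messages that know how to process themselves" option, a minimal sketch (type names are illustrative):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

interface SelfProcessingMessage {
    void process();                                  // each message carries its own handling logic
}

class ConsumerWorker implements Runnable {
    private final BlockingQueue<SelfProcessingMessage> queue = new LinkedBlockingQueue<>();

    void submit(SelfProcessingMessage msg) {
        queue.offer(msg);
    }

    public void run() {
        try {
            while (true) {
                queue.take().process();              // the worker never needs to know concrete message types
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}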
You should take a look at this slashdot article:
http://developers.slashdot.org/story/10/07/27/1925209/Java-IO-Faster-Than-NIO
It has quite a bit of discussion and actual measured data about the many-threads vs. single-select vs. thread-pool arguments.