Multithreading Java regex - java

I have a 1 producer, M consumer threaded pattern. The producer reads a raw document from disc and places it in a LinkedBlockingQueue. Each consumer thread then takes a raw document and parses the document using a class
ParsedDoc article = parseDoc(rawDocument);
The parseDoc class is a set of approximately 20 methods with the following pattern:
public String clearContent(String document) {
Pattern regex = Pattern.compile(pattern);
Matcher matcher = regex.matcher(document);
matcher.find();
....
}
public String removeHTML(String document) {
Pattern regex = Pattern.compile(pattern);
Matcher matcher = regex.matcher(document);
matcher.replaceAll("");
....
}
The problem I am facing is that the code runs reasonably fast on my local (2 core) machine. But when I run the same code on a 8 core machine the consumer performance degrades to almost a grinding halt. I've tried to optimize the jvm options to no avail. Removing the regex processing step resulted in expected x4 performance gains on the 8 core. So the problem is the regexes. I realize Pattern is thread safe and matcher may need to be reset(). But the problem is how to redesign the bank of regexes (in parseDoc class) so that it is thread-safe across M consumers.
Any help would be much appreciated
Thank you

Compiling the regex is slow. You should only do it once for a given pattern. Unless the pattern variable shown in your sample is really different for each invocation, the Pattern instances could probably be static class members. A Pattern is explicitly safe for concurrent use by multiple threads. (All of the modifiable state is held by the Matcher.)
Since the Matcher is confined to the stack of a single thread, you don't have any threading issues to worry about. Don't try to reuse a Matcher. It can be done, but I'd be surprised if recycling them saves much time compared to regex compilation.

Producer consumer pattern doesn't scale well.
The more producers or consumers you have the worse the performance you get. The reason is that the common queue becomes a bottleneck for the whole system. I hope you see how.
A better approach is to have no common queue; have each consumer have its own queue. When a request comes in have it goes to a load balancer. The load balancer will place the request in the consumer queue that is smallest. The balancer becomes the bottleneck, but it doesn't do a lot of Operations -just picks the right queue to send the incoming request - so it should be darn fast.
Here is an edit to answer your questions:
Problem (more in depth): the more cores you have the slower it gets. Why? Shared memory.
#Peyman use ConcurrentLinkedQueue (which is a non blocking wait free queue where one enqueue and one dequeue can proceed concurrently). Even try it in your initial design and benchmark both designs. I expect your revised design to perform better because you can have only 1 enqueue and 1 dequeue at the same time, as opposed one enqueue and n dequeue as in your initial design (but this is just my speculation).
A great paper on scalable consumer producer by using balancers
Read this page (or can look only at "migrate from the common worker queue approach to the queue-per-thread approach")
Here's a list from http://www.javaperformancetuning.com/news/newtips128.shtml. I think the last 3 points are more applicable to you:
Most server applications use a common worker queue and thread pool; a shared worker queue holds short tasks that arrive from remote sources; a pool of threads retrieves tasks from the queue and processes the tasks; threads are blocked on the queue if there is no task to process.
A feeder queue shared amongst threads is an access bottleneck (from contention) when the number of tasks is high and the task time is very short. The bottleneck gets worse the more cores that are used.
Solutions available for overcoming contention in accessing a shared queue include: Using lock-free data structures; Using concurrent data structures with multiple locks; Maintaining multiple queues to isolate the contention.
A queue-per-thread approach eliminates queue access contention, but is not optimal when a queue is emptied while there is unprocessed queued data in other queues. To improve this, idle threads should have the ability to steal work from other queues. To keep contention to a minimum, the 'steal' should be done from the tail of the other queue (where normal dequeuing from the thread's own queue is done from the head of the queue).

If your not getting the concurrency you expect due to synchronization that is out of your control, one solution that comes to mind is to dispatch work to sub-processes (additional JVMs) upto the number of cores-1.

Related

ArrayBlockingQueue with priority waiting list

I currently have a Spring dispatcher ensuring various concurrency limitation policies based on bounded queues.
Basically, multiple request types are handled, some memory expensive, other less, and the request threads happening to hit the memory expensive tasks put a token in a bounded blocking queue (ArrayBlockingQueue), so that only N of them end up actually running, while the other end up waiting.
Now, the waiting list is internally managed by a ReentrantLock, which in turns leverages a Condition implementation fund in AbstractQueuedLongSynchronizer that uses a linked list, which notifies the longest waiting thread when a token is removed from the queue.
Now I need a different behavior, so that the list maintained by the Condition is sorted by a user defined priority too (straight one, no counter-starvation measures needed for lower priority requests).
Unfortunately the classes in question have a wall of "final" declarations making it hard to inject this seemingly small behavioral change.
Is there any concurrent data structure out there providing the behavior I'm looking for, or that would allow customization?
Alternatively, suggestions to implement it without rewriting ArrayBlockinQueue/ReentrantLock/Condition from scratch?
Note: really looking for a bounded blocking queue with priority in the waiting list, other approaches requiring a redesign of the whole application, secondary execution thread pools and the like are unfortunately not feasible (time and material limitations)

lmax disruptor is too slow in multi-producer mode compared to single-producer mode

Previously, when I use single-producer mode of disruptor, e.g.
new Disruptor<ValueEvent>(ValueEvent.EVENT_FACTORY,
2048, moranContext.getThreadPoolExecutor(), ProducerType.Single,
new BlockingWaitStrategy())
the performance is good. Now I am in a situation that multiple threads would write to a single ring buffer. What I found is that ProducerType.Multi make the code several times slower than single producer mode. That poor performance is not going to be accepted by me. So should I use single producer mode while multiple threads invoke the same event publish method with locks, is that OK? Thanks.
I'm somewhat new to the Disruptor, but after extensive testing and experimenting, I can say that ProducerType.MULTI is more accurate and faster for 2 or more producer threads.
With 14 producer threads on a MacBook, ProducerType.SINGLE shows more events published than consumed, even though my test code is waiting for all producers to end (which they do after a 10s run), and then waiting for the disruptor to end. Not very accurate: Where do those additional published events go?
Driver start: PID=38619 Processors=8 RingBufferSize=1024 Connections=Reuse Publishers=14[SINGLE] Handlers=1[BLOCK] HandlerType=EventHandler<Event>
Done: elpased=10s eventsPublished=6956894 eventsProcessed=4954645
Stats: events/sec=494883.36 sec/event=0.0000 CPU=82.4%
Using ProducerType.MULTI, fewer events are published than with SINGLE, but more events are actually consumed in the same 10 seconds than with SINGLE. And with MULTI, all of the published events are consumed, just what I would expect due to the careful way the driver shuts itself down after the elapsed time expires:
Driver start: PID=38625 Processors=8 RingBufferSize=1024 Connections=Reuse Publishers=14[MULTI] Handlers=1[BLOCK] HandlerType=EventHandler<Event>
Done: elpased=10s eventsPublished=6397109 eventsProcessed=6397109
Stats: events/sec=638906.33 sec/event=0.0000 CPU=30.1%
Again: 2 or more producers: Use ProducerType.MULTI.
By the way, each Producer publishes directly to the ring buffer by getting the next slot, updating the event, and then publishing the slot. And the handler gets the event whenever its onEvent method is called. No extra queues. Very simple.
IMHO, single producer accessed by multi threads with lock won't resolve your problem, because it simply shift the locking from the disruptor side to your own program.
The solution to your problem varies from the type of event model you need. I.e. do you need the events to be consumed chronologically; merged; or any special requirement. Since you are dealing with disruptor and multi producers, that sounds to me very much like FX trading systems :-) Anyway, based on my experience, assuming you need chronological order per producer but don't care about mixing events between producers, I would recommend you to do a queue merging thread. The structure is
Each producer produces data and put them into its own named queue
A worker thread constantly examine the queues. For each queue it remove one or several items and put it to the single producer of your single producer disruptor.
Note that in the above scenario,
Each producer queue is a single producer single consumer queue.
The disruptor is a single producer multi consumer disruptor.
Depends on your need, to avoid a forever running thread, if the thread examine for, say, 100 runs and all queues are empty, it can set some variable and go wait() and the event producers can yield() it when seeing it's waiting.
I think this resolve your problem. If not please post your need of event processing pattern and let's see.

LinkedBlockingDeque.take() - separate instance in multipe thread

In our multithreaded java app, we are using LinkedBlockingDeque separate instance for each thread, assume threads (c1, c2, .... c200)
Threads T1 & T2 receive data from socket & add the object to the specific consumer's Q between c1 to c200.
Infinite loop inside the run(), which calls LinkedBlockingDeque.take()
In the load run the CPU usage for the javae.exe itself is 40%. When we sum up the other process in the system the overall CPU usage reaches 90%.
By using JavaVisualVM the run() is taking more CPU and we suspect the LinkedBlockingDeque.take()
So tried alternatives like thread.wait and notify and thread.sleep(0) but no change.
The reason why each consumer having separate Q is for two reason,
1.there might be more than one request for consumer c1 from T1 or T2
2.if we dump all req in single q, the seach time for c1 to c200 will be more and search criteria will extend.
3.and let the consumer have the separate Q to process thier requests
Trying to reduce the CPU usage and in need of your inputs...
SD
do profiling and make sure that the queue methods take relatively much CPU time. Is your message processing so simple that is compared to putting/taking to/from queue?
How many messages are processed per second? How many CPUs are there? If each CPU is processing less than 100K messages per second, then it's likely that the reason is not the access to the queues, but message handling itself.
Putting in LinkedBlockingDeque creates an instance of a helper object. And I suspect, each new message is allocated from heap, so 2 creation per message. Try to use a pool of preallocated messages and circular buffers.
200 threads is a way too many. This means, too many context switches. Try to use actor libraries and thread pools, for example, https://github.com/rfqu/df4j (yes, it's mine).
Check if http://code.google.com/p/disruptor/ would fit for your needs.

Are regular Queues inappropriate to use when multithreading in Java?

I am trying to add asynchronous output to a my program.
Currently, I have an eventManager class that gets notified each frame of the position of any of the moveable objects currently present in the main loop (It's rendering a scene; some objects change from frame to frame, others are static and present in every frame). I am looking to record the state of each frame so I can add in the functionality to replay the scene.
This means that I need to store the changing information from frame to frame, and either hold it in memory or write it to disk for later retrieval and parsing.
I've done some timing experiments, and recording the state of each object to memory increased the time per frame by about 25% (not to mention the possibility of eventually hitting a memory limit). Directly writing each frame to disk takes (predictably) even longer, close to twice as long as not recording the frames at all.
Needless to say, I'd like to implement multithreading so that I won't lose frames per second in my main rendering loop because the process is constantly writing to disk.
I was wondering whether it was okay to use a regular queue for this task, or if I needed something more dedicated like the queues discussed in this question.
In my situation, there is only one producer (the main thread), and one consumer (the thread I want to asynchronously write to disk). The producer will never remove from the queue, and the consumer will never add to it - so do I need a specialized queue at all?
Is there an advantage to using a more specialized queue anyway?
Yes, a regular Queue is inappropriate. Since you have two threads you need to worry about boundary conditions like an empty queue, full queue (assuming you need to bound it for memory considerations), or anomalies like visibility.
A LinkedBlockingQueue is best suited for your application. The put and take methods use different locks so you will not have lock contention. The take method will automatically block the consumer writing to disk if it somehow magically caught up with the producer rendering frames.
It sounds like you don't need a special queue, but if you want the thread removing from the queue to wait until there's something to get, try the BlockingQueue. It's in the java.util.concurrent package, so it's threadsafe for sure. Here are some relevant quotes from that page:
A Queue that additionally supports operations that wait for the queue
to become non-empty when retrieving an element, and wait for space to
become available in the queue when storing an element.
...
BlockingQueue implementations are designed to be used primarily for
producer-consumer queues, but additionally support the Collection
interface.
...
BlockingQueue implementations are thread-safe.
As long as you're already profiling your code, try dropping a BlockingQueue in there and see what happens!
Good luck!
I don't think it will matter much.
If you have 25% overhead serializing a state in memory, that will still be there with a queue.
Disk will be even more expensive.
The queue blocking mechanism will be cheap in comparison.
One thing to watch for is your queue growing out of control: disk is slow no matter what, if it can't consume queue events fast enough you're in trouble.

Java: High-performance message-passing (single-producer/single-consumer)

I initially asked this question here, but I've realized that my question is not about a while-true loop. What I want to know is, what's the proper way to do high-performance asynchronous message-passing in Java?
What I'm trying to do...
I have ~10,000 consumers, each consuming messages from their private queues. I have one thread that's producing messages one by one and putting them in the correct consumer's queue. Each consumer loops indefinitely, checking for a message to appear in its queue and processing it.
I believe the term is "single-producer/single-consumer", since there's one producer, and each consumer only works on their private queue (multiple consumers never read from the same queue).
Inside Consumer.java:
#Override
public void run() {
while (true) {
Message msg = messageQueue.poll();
if (msg != null) {
... // do something with the message
}
}
}
The Producer is putting messages inside Consumer message queues at a rapid pace (several million messages per second). Consumers should process these messages as fast as possible!
Note: the while (true) { ... } is terminated by a KILL message sent by the Producer as its last message.
However, my question is about the proper way to design this message-passing. What kind of queue should I use for messageQueue? Should it be synchronous or asynchronous? How should Message be designed? Should I use a while-true loop? Should Consumer be a thread, or something else? Will 10,000 threads slow down to a crawl? What's the alternative to threads?
So, what's the proper way to do high-performance message-passing in Java?
I would say that the context switching overhead of 10,000 threads is going to be very high, not to mention the memory overhead. By default, on 32-bit platforms, each thread uses a default stack size of 256kb, so that's 2.5GB just for your stack. Obviously you're talking 64-bit but even so, that quite a large amount of memory. Due to the amount of memory used, the cache is going to be thrashing lots, and the cpu will be throttled by the memory bandwidth.
I would look for a design that avoids using so many threads to avoid allocating large amounts of stack and context switching overhead. You cannot process 10,000 threads concurrently. Current hardware has typically less than 100 cores.
I would create one queue per hardware thread and dispatch messages in a round-robin fashion. If the processing times vary considerably, there is the danger that some threads finish processing their queue before they are given more work, while other threads never get through their allotted work. This can be avoided by using work stealing, as implemented in the JSR-166 ForkJoin framework.
Since communication is one way from the publisher to the subscribers, then Message does not need any special design, assuming the subscriber doesn't change the message once it has been published.
EDIT: Reading the comments, if you have 10,000 symbols, then create a handful of generic subscriber threads (one subscriber thread per core), that asynchornously recieve messages from the publisher (e.g. via their message queue). The subscriber pulls the message from the queue, retrieves the symbol from the message, and looks this up in a Map of message handlers, retrieves the handler, and invokes the handler to synchronously handle the message. Once done, it repeats, fetching the next message from the queue. If messages for the same symbol have to be processed in order (which is why I'm guessing you wanted 10,000 queues.), you need to map symbols to subscribers. E.g. if there are 10 subscribers, then symbols 0-999 go to subscriber 0, 1000-1999 to subscriber 1 etc.. A more refined scheme is to map symbols according to their frequency distribution, so that each subscriber gets roughly the same load. For example, if 10% of the traffic is symbol 0, then subscriber 0 will deal with just that one symbol and the other symbols will be distributed amongst the other subscribers.
You could use this (credit goes to Which ThreadPool in Java should I use?):
class Main {
ExecutorService threadPool = Executors.newFixedThreadPool(
Runtime.availableProcessors()*2);
public static void main(String[] args){
Set<Consumer> consumers = getConsumers(threadPool);
for(Consumer consumer : consumers){
threadPool.execute(consumer);
}
}
}
and
class Consumer {
private final ExecutorService tp;
private final MessageQueue messageQueue;
Consumer(ExecutorService tp,MessageQueue queue){
this.tp = tp;
this.messageQueue = queue;
}
#Override
public void run(){
Message msg = messageQueue.poll();
if (msg != null) {
try{
... // do something with the message
finally{
this.tp.execute(this);
}
}
}
}
}
This way, you can have okay scheduling with very little hassle.
First of all, there's no single correct answer unless you either put a complete design doc or you try different approaches for yourself.
I'm assuming your processing is not going to be computationally intensive otherwise you wouldn't be thinking of processing 10000 queues at the same time. One possible solution is to minimise context switching by having one-two threads per CPU. Unless your system is going to be processing data in strict real time that may possibly give you bigger delays on each queue but overall better throughput.
For example -- have your producer thread run on its own CPU and put batches of messages to consumer threads. Each consumer thread would then distribute messages to its N private queues, perform the processing step, receive new data batch and so on. Again, depends on your delay tolerance so the processing step may mean either processing all the queues, a fixed number of queues, as many queues it can unless a time threshold is reached. Being able to easily tell which queue belongs to which consumer thread (e.g. if queues are numbered sequentially: int consumerThreadNum = queueNum & 0x03) would be beneficial as looking them up in a hash table each time may be slow.
To minimise memory thrashing it may not be such a good idea to create/destroy queues all the time so you may want to pre-allocate a (max number of queues/number of cores) queue objects per thread. When a queue is finished instead of being destroyed it can be cleared and reused. You don't want gc to get in your way too often and for too long.
Another unknown is if your producer produces complete sets of data for each queue or will send data in chunks until the KILL command is received. If your producer sends complete data sets you may do away with the queue concept completely and just process the data as it arrives to a consumer thread.
Have a pool of consumer threads relative to the hardware and os capacity. These consumer threads could poll your message queue.
I would either have the Messages know how to process themselves or register processors with the consumer thread classes when they are initialized.
In the absence of more detail about the constraints of processing the symbols, its hard to give very specific advice.
You should take a look at this slashdot article:
http://developers.slashdot.org/story/10/07/27/1925209/Java-IO-Faster-Than-NIO
It has quite a bit of discussions and actual measured data about the many thread vs. single select vs. thread pool arguments.

Categories

Resources