I recently used an ArrayBlockingQueue for my multi-threaded process, but it seemed to slow things down rather than speed them up. Can you guys help me out? I'm basically importing a file (about 300k rows), parsing the rows, and storing them in the DB.
public class CellPool {
    private static class RejectedHandler implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable arg0, ThreadPoolExecutor arg1) {
            System.err.println(Thread.currentThread().getName() + " execution rejected: " + arg0);
        }
    }

    private static class Task implements Runnable {
        private JSONObject obj;

        public Task(JSONObject obj) {
            this.obj = obj;
        }

        @Override
        public void run() {
            try {
                Thread.sleep(1);
                runThis(obj);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        public void runThis(JSONObject obj) {
            // where the rows are parsed and stored in the DB, etc.
        }
    }

    public static void executeCellPool(String filename) throws InterruptedException {
        // fixed pool, fixed queue
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(300000, true);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(90, 100, 1, TimeUnit.MINUTES, queue);
        DataSet ds = CommonDelimitedParser.getDataSet(filename);
        final String[] colNames = ds.getColumns();
        while (ds.next()) {
            JSONObject obj = new JSONObject();
            // build the JSON object from the row
            Task t = new Task(obj);
            executor.execute(t);
        }
    }
}
tl;dr Big queue sizes can have a negative impact, as can large thread counts. Ideally, you want your consumers and producers to be working at a similar rate.
The reason the addition of the queue is causing issues is that you're using a very large queue (which is not necessary) that is taking up resources. Typically, a blocking queue blocks producers when there is no space left in the queue, and blocks consumers when there are no objects left in the queue. By creating such a large one with a fixed size, Java assigns that space in memory even though you almost certainly aren't using all of it. It would be more effective to force your producer to wait for space in the queue to clear up if your consumers are consuming too slowly. You don't need to store all of the lines from your file in the queue at the same time.
ThreadPoolExecutor queues are discussed in the javadoc here.
Bounded queues. A bounded queue (for example, an ArrayBlockingQueue) helps prevent resource exhaustion when used with finite maximumPoolSizes, but can be more difficult to tune and control. Queue sizes and maximum pool sizes may be traded off for each other: Using large queues and small pools minimizes CPU usage, OS resources, and context-switching overhead, but can lead to artificially low throughput. If tasks frequently block (for example if they are I/O bound), a system may be able to schedule time for more threads than you otherwise allow. Use of small queues generally requires larger pool sizes, which keeps CPUs busier but may encounter unacceptable scheduling overhead, which also decreases throughput.
Your large thread count of 90, combined with your very large queue capacity of 300,000, is most likely using a lot of memory and resulting in additional thread scheduling overhead. I would drop both considerably. I don't know what hardware you are running on, but since it sounds like you're writing an IO-intensive program, I would try double the number of threads your CPU can handle, and play around with sizes for your blocking queue to see what works (note: I haven't researched this; it's based on my experience running queues and executors. Happy for others to suggest a different count!).
Of note, though, is that the execute() method will throw a RejectedExecutionException on failure to add to the queue if your queue is too small. One way of monitoring the queue is to check its remaining capacity before scheduling a task. You can do this by calling:
executor.getQueue().remainingCapacity()
Don't use the executor.getQueue() method to alter the queue in any way, but it can be used for monitoring.
An alternative is to use an unbounded queue, such as a LinkedBlockingQueue without a defined capacity. This way, you won't need to deal with queue sizes. However, if your producers are running much faster than your consumers, you will once again have the issue of consuming too much memory.
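Putting those suggestions together, here is a sketch of a smaller configuration. The numbers are illustrative rather than tuned, and CallerRunsPolicy is one possible way (not mentioned above) to make the producer slow down when the queue fills, instead of having tasks rejected:

    int threads = Runtime.getRuntime().availableProcessors() * 2; // rough starting point for IO-bound work
    BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(threads * 10); // small bounded queue
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
            threads, threads, 1, TimeUnit.MINUTES, queue,
            new ThreadPoolExecutor.CallerRunsPolicy()); // on a full queue, the producer runs the task itself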
Also, kostya is right, a JDBC batch insert would be faster.
If you want to persist records from a file into a relational database as fast as possible you should use JDBC batch insert rather than inserting records one by one.
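For illustration, a minimal sketch of a JDBC batch insert (the table, columns, and Row type are made up; the Connection is assumed to exist already):

    try (PreparedStatement ps = connection.prepareStatement(
            "INSERT INTO cells (col_a, col_b) VALUES (?, ?)")) {
        connection.setAutoCommit(false);  // commit once per batch instead of once per row
        int count = 0;
        for (Row row : rows) {
            ps.setString(1, row.getColA());
            ps.setString(2, row.getColB());
            ps.addBatch();                // queue the insert client-side
            if (++count % 1000 == 0) {
                ps.executeBatch();        // send 1000 inserts in one round trip
            }
        }
        ps.executeBatch();                // flush the remainder
        connection.commit();
    }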
I'm currently in the process of making various performance improvements in a piece of software. As it uses SWT for its GUI, I have come across a problem where, under certain circumstances, a lot of UI elements are created in the Display Thread.
Since the person before me didn't really take care to do any calculations outside of the Display Thread, the whole software can be unresponsive for several seconds on startup.
I've now isolated the code that needs to be performed in the Display Thread, and I'm calculating everything else in Runnables that I submit to a fixed thread pool.
I'm using the pool like this:
public abstract class AbstractChartComposite {
    private static ExecutorService pool = Executors.newFixedThreadPool(8);
    private List<String> currentlyProcessingChartItems = new ArrayList<>();

    protected void doCalculate(List<IMERuntimeConstraint> constraints) {
        for (IMERuntimeConstraint c : constraints) {
            if (!currentlyProcessingChartItems.contains(c.getId())) {
                currentlyProcessingChartItems.add(c.getId());
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            createChartItem(c);
                            currentlyProcessingChartItems.remove(c.getId());
                        } catch (Throwable e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
        }
    }
}
I'm now wondering if there are any drawbacks to leaving the thread pool running idle once all the UI elements have been created. I cannot really shut it down for garbage collection, because it will be needed again on user input when a new element needs to be created.
So are there any major drawbacks to leaving a thread pool running with no submitted Runnables?
No, there are no drawbacks.
The threads won't actually be running; they'll be parked until a new task is submitted, so the pool does not affect CPU. Also, you say you will use this pool again, so in your case there's no point in shutting it down and recreating it.
As to memory: yes, idle threads will consume some amount of memory, but that's not an issue either, until you have hundreds (thousands?) of threads.
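A quick way to see this for yourself (a throwaway sketch; the thread names assume the default ThreadFactory naming):

    ExecutorService pool = Executors.newFixedThreadPool(8);
    pool.submit(() -> {});   // force the pool to create at least one worker thread
    Thread.sleep(100);       // give the task time to finish
    Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t.getName().startsWith("pool-"))
            .forEach(t -> System.out.println(t.getName() + ": " + t.getState())); // WAITING, not RUNNABLE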
Also, a piece of advice: do not do premature optimization; that's the root of all evil. Analyze the problem once you have real performance issues, using profiling tools to detect the bottlenecks.
ThreadPoolExecutor inherits the submit(Callable<T> task) method.
The constructor of ThreadPoolExecutor accepts an instance of BlockingQueue<Runnable>. This blocking queue can hold Runnable instances only.
Javadoc for ThreadPoolExecutor constructor says:
The queue to use for holding tasks before they are executed. This queue will hold only the Runnable tasks submitted by the execute method.
So, my question is: how are tasks submitted through submit(Callable<T> task) queued?
It's wrapped into a Runnable (specifically a RunnableFuture) using newTaskFor(Callable). See the source code.
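For reference, the inherited AbstractExecutorService.submit does roughly this (simplified from the OpenJDK source):

    public <T> Future<T> submit(Callable<T> task) {
        if (task == null) throw new NullPointerException();
        RunnableFuture<T> ftask = newTaskFor(task); // wraps the Callable in a FutureTask
        execute(ftask);  // a FutureTask is a Runnable, so it fits in the BlockingQueue<Runnable>
        return ftask;
    }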
You can find the explanation in the queuing section of the ThreadPoolExecutor documentation:
Queuing
Any BlockingQueue may be used to transfer and hold submitted tasks. The use of this queue interacts with pool sizing:
If fewer than corePoolSize threads are running, the Executor always prefers adding a new thread rather than queuing.
If corePoolSize or more threads are running, the Executor always prefers queuing a request rather than adding a new thread.
If a request cannot be queued, a new thread is created unless this would exceed maximumPoolSize, in which case, the task will be rejected.
There are three general strategies for queuing:
Direct handoffs. A good default choice for a work queue is a SynchronousQueue that hands off tasks to threads without otherwise holding them. Here, an attempt to queue a task will fail if no threads are immediately available to run it, so a new thread will be constructed. This policy avoids lockups when handling sets of requests that might have internal dependencies. Direct handoffs generally require unbounded maximumPoolSizes to avoid rejection of new submitted tasks. This in turn admits the possibility of unbounded thread growth when commands continue to arrive on average faster than they can be processed.
Unbounded queues. Using an unbounded queue (for example a LinkedBlockingQueue without a predefined capacity) will cause new tasks to wait in the queue when all corePoolSize threads are busy. Thus, no more than corePoolSize threads will ever be created. (And the value of the maximumPoolSize therefore doesn't have any effect.) This may be appropriate when each task is completely independent of others, so tasks cannot affect each others execution; for example, in a web page server. While this style of queuing can be useful in smoothing out transient bursts of requests, it admits the possibility of unbounded work queue growth when commands continue to arrive on average faster than they can be processed.
Bounded queues. A bounded queue (for example, an ArrayBlockingQueue) helps prevent resource exhaustion when used with finite maximumPoolSizes, but can be more difficult to tune and control. Queue sizes and maximum pool sizes may be traded off for each other: Using large queues and small pools minimizes CPU usage, OS resources, and context-switching overhead, but can lead to artificially low throughput. If tasks frequently block (for example if they are I/O bound), a system may be able to schedule time for more threads than you otherwise allow. Use of small queues generally requires larger pool sizes, which keeps CPUs busier but may encounter unacceptable scheduling overhead, which also decreases throughput.
Some examples can be found in the Executors class which offers methods to create several types of ThreadPoolExecutor.
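For instance, two of those factory methods map directly onto the unbounded-queue and direct-handoff strategies above (simplified from the JDK source):

    // Executors.newFixedThreadPool(n): fixed pool, unbounded LinkedBlockingQueue
    new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>());

    // Executors.newCachedThreadPool(): unbounded pool, direct handoff via SynchronousQueue
    new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>());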
Using a bounded queue with submit(Callable) is a pain because there's no default way to make it block (that I know of), even when using a BlockingQueue.
One thing that can help is to simply catch the RejectedExecutionException from the submit, then wait and try again. This will be suitable if your application should block on submit until capacity is available in the blocking queue. For example, wrap:
futures.add(executor.submit(callable));
like this:
boolean submitted = false;
while (!submitted) {
    try {
        futures.add(executor.submit(callable));
        submitted = true;
    } catch (RejectedExecutionException e) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e2) {
            throw new RuntimeException("Interrupted", e2);
        }
    }
}
I hope that helps.
There is no need to wrap anything. You can create an ExecutorService via the Executors factory methods or directly as a new ThreadPoolExecutor. Use the submit(Callable) method and ThreadPoolExecutor will wrap the Callable in a Runnable for you.
Just pay attention if you are using a rejection handler: remember that you will get the wrapper object, a FutureTask containing your Callable:
@Override
public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
    FutureTask ft = (FutureTask) r; // the rejected Runnable is the FutureTask wrapping your Callable
}
I am interested in a data structure identical to the Java BlockingQueue, with the exception that it must be able to batch objects in the queue. In other words, I would like the producer to be able to put objects into the queue, but have the consumer block on take() until the queue reaches a certain size (the batch size).
Then, once the queue has reached the batch size, the producer must block on put() until the consumer has consumed all of the elements in the queue (at which point the producer will start producing again and the consumer will block until the batch is reached again).
Does a similar data structure exist? Or should I write it (which I don't mind), I just don't want to waste my time if there is something out there.
UPDATE
Maybe to clarify things a bit:
The situation will always be as follows. There can be multiple producers adding items to the queue, but there will never be more than one consumer taking items from the queue.
Now, the problem is that there are multiple of these setups in parallel and serial. In other words, producers produce items for multiple queues, while consumers in their own right can also be producers. This can be more easily thought of as a directed graph of producers, consumer-producers, and finally consumers.
The reason that producers should block until the queues are empty (@Peter Lawrey) is because each of these will be running in a thread. If you leave them to simply produce as space becomes available, you will end up with a situation where you have too many threads trying to process too many things at once.
Maybe coupling this with an execution service could solve the problem?
I would suggest you use BlockingQueue.drainTo(Collection, int). You can use it with take() to ensure you get a minimum number of elements.
The advantage of using this approach is that your batch size grows dynamically with the workload and the producer doesn't have to block when the consumer is busy. i.e. it self optimises for latency and throughput.
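A sketch of the consumer loop this suggests (all names here — Task, queue, maxBatchSize, process, running — are illustrative):

    List<Task> batch = new ArrayList<>();
    while (running) {
        batch.clear();
        batch.add(queue.take());                 // block until at least one element is available
        queue.drainTo(batch, maxBatchSize - 1);  // then grab, without blocking, whatever else is ready
        process(batch);                          // the batch size adapts to the current workload
    }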
To implement exactly as asked (which I think is a bad idea) you can use a SynchronousQueue with a busy consuming thread.
i.e. the consuming thread does a
list.clear();
while(list.size() < required) list.add(queue.take());
// process list.
The producer will block when ever the consumer is busy.
Here is a quick (= simple but not fully tested) implementation that I think may be suitable for your requirements; you should be able to extend it to support the full queue interface if you need to.
To increase performance, you can switch to a ReentrantLock instead of using the synchronized keyword.
import java.util.ArrayList;
import java.util.concurrent.Semaphore;

public class BatchBlockingQueue<T> {
    private final ArrayList<T> queue;
    private final Semaphore readerLock;
    private final Semaphore writerLock;
    private final int batchSize;

    public BatchBlockingQueue(int batchSize) {
        this.queue = new ArrayList<>(batchSize);
        this.readerLock = new Semaphore(0);          // no permits until a full batch is ready
        this.writerLock = new Semaphore(batchSize);  // producers may add up to batchSize items
        this.batchSize = batchSize;
    }

    public void put(T e) throws InterruptedException {
        // Acquire outside the monitor: blocking inside a synchronized method
        // would deadlock with poll(), which needs the same monitor to release permits.
        writerLock.acquire();
        synchronized (this) {
            queue.add(e);
            if (queue.size() == batchSize) {
                readerLock.release(batchSize);       // wake the consumer for a full batch
            }
        }
    }

    public T poll() throws InterruptedException {
        readerLock.acquire();
        synchronized (this) {
            T ret = queue.remove(0);
            if (queue.isEmpty()) {
                writerLock.release(batchSize);       // batch drained; let producers refill
            }
            return ret;
        }
    }
}
Hope you find it useful.
I recently developed this utility, which batches BlockingQueue elements using a flush timeout in case the queued elements don't reach the batch size. It also supports a fan-out pattern, using multiple instances to process the same set of data:
// Instantiate the registry
FQueueRegistry registry = new FQueueRegistry();

// Build FQueue consumer
registry.buildFQueue(String.class)
        .batch()
        .withChunkSize(5)
        .withFlushTimeout(1)
        .withFlushTimeUnit(TimeUnit.SECONDS)
        .done()
        .consume(() -> (broadcaster, elms) -> System.out.println("elms batched are: " + elms.size()));

// Push data into queue
for (int i = 0; i < 10; i++) {
    registry.sendBroadcast("Sample" + i);
}
More info here!
https://github.com/fulmicotone/io.fulmicotone.fqueue
Not that I am aware of. If I understand correctly, you want either the producer to work (while the consumer is blocked) until it fills the queue, or the consumer to work (while the producer blocks) until it clears the queue. If that's the case, may I suggest that you don't need a data structure but rather a mechanism to block one party while the other one is working, in a mutex fashion. You can lock on an object and internally keep the full/empty logic that releases the lock and hands it to the other party. So, in short, you should write it yourself :)
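A rough sketch of that mechanism for one consumer and multiple producers (all names are illustrative, and error handling is omitted):

    import java.util.ArrayList;
    import java.util.List;

    class BatchExchange<T> {
        private final List<T> items = new ArrayList<>();
        private final int batchSize;
        private boolean draining = false; // true while the consumer empties a full batch

        BatchExchange(int batchSize) {
            this.batchSize = batchSize;
        }

        synchronized void put(T e) throws InterruptedException {
            while (draining) wait();          // producers block while the batch is consumed
            items.add(e);
            if (items.size() == batchSize) {
                draining = true;
                notifyAll();                  // wake the consumer
            }
        }

        synchronized T take() throws InterruptedException {
            while (!draining) wait();         // consumer blocks until a full batch is ready
            T ret = items.remove(0);
            if (items.isEmpty()) {
                draining = false;
                notifyAll();                  // let producers refill the batch
            }
            return ret;
        }
    }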
This sounds like how the RingBuffer works in the LMAX Disruptor pattern. See http://code.google.com/p/disruptor/ for more.
A very rough explanation: your main data structure is the RingBuffer. Producers put data into the ring buffer in sequence, and consumers can pull off as much data as the producer has put into the buffer (so, essentially, batching). If the buffer is full, the producer blocks until the consumer has finished and freed up slots in the buffer.
I have a for loop where the computation at iteration i does not depend on the computations done in previous iterations.
I want to parallelize the for loop (my code is in Java) so that the computation of multiple iterations can run concurrently on multiple processors. Should I create a thread for the computation of each iteration, i.e. should the number of threads created equal the number of iterations (the number of iterations in the loop is large)? How do I do this?
Here's a small example that you might find helpful to get started with parallelization. It assumes that:
You create an Input object that contains the input for each iteration of your computation.
You create an Output object that contains the output from computing the input of each iteration.
You want to pass in a list of inputs and get back a list of outputs all at once.
Your input is a reasonable chunk of work to do, so overhead isn't too high.
If your computation is really simple, then you'll probably want to consider processing the iterations in batches. You could do that by putting, say, 100 in each input. The example uses as many threads as there are processors in your system. If you're dealing with purely CPU-intensive tasks, then that's probably the number you want. You'd want to go higher if the tasks are blocked waiting for something else (disk, network, database, etc.).
public List<Output> processInputs(List<Input> inputs)
        throws InterruptedException, ExecutionException {

    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService service = Executors.newFixedThreadPool(threads);

    List<Future<Output>> futures = new ArrayList<Future<Output>>();
    for (final Input input : inputs) {
        Callable<Output> callable = new Callable<Output>() {
            public Output call() throws Exception {
                Output output = new Output();
                // process your input here and compute the output
                return output;
            }
        };
        futures.add(service.submit(callable));
    }

    service.shutdown();

    List<Output> outputs = new ArrayList<Output>();
    for (Future<Output> future : futures) {
        outputs.add(future.get());
    }
    return outputs;
}
You should not do the thread handling manually. Instead:
Create a reasonably sized thread pool executor service (if your computations do no IO, use as many threads as you have cores).
Run a loop that submits each individual computation to the executor service and keeps the resulting Future objects. Note that if each computation consists of only a small amount of work, this will create a lot of overhead and possibly even be slower than a single-threaded program. In that case, submit jobs that do packets of computation, as mdma suggests.
Run a second loop that collects the results from all the Futures (it will implicitly wait until all computations have finished).
Shut down the executor service.
No, you should not create one thread for each iteration. The optimum number of threads is related to the number of processors available - too many threads, and you waste too much time context switching for no added performance.
If you're not totally attached to Java, you might want to try a parallel high-performance C system like OpenMPI. OpenMPI is suitable for this kind of problem.
Don't create the threads yourself. I recommend you use the fork/join framework (jsr166y) and create tasks that iterate over a given range of items. It will take care of the thread management for you, using as many threads as the hardware supports.
Task granularity is the main issue here. If each iteration is a relatively small computation (say, fewer than 100 operations), then having each iteration executed as a separate task will introduce a lot of task-scheduling overhead. It's better to have each task accept a list of arguments to compute and return the results as a list. That way, each task can compute 1, 10, or thousands of elements, keeping task granularity at a reasonable level that balances keeping work available against reducing task-management overhead.
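A rough sketch of such a range-splitting task using the fork/join API (in java.util.concurrent since Java 7; the threshold and the per-element work are placeholders for your own):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    class RangeTask extends RecursiveTask<List<Integer>> {
        private static final int THRESHOLD = 1000; // tune: work per task vs. scheduling overhead
        private final int[] input;
        private final int from, to;

        RangeTask(int[] input, int from, int to) {
            this.input = input;
            this.from = from;
            this.to = to;
        }

        @Override
        protected List<Integer> compute() {
            if (to - from <= THRESHOLD) {             // small enough: compute directly
                List<Integer> results = new ArrayList<>();
                for (int i = from; i < to; i++) {
                    results.add(input[i] * input[i]); // stand-in for the real per-element computation
                }
                return results;
            }
            int mid = (from + to) >>> 1;              // otherwise split the range in half
            RangeTask left = new RangeTask(input, from, mid);
            RangeTask right = new RangeTask(input, mid, to);
            left.fork();                              // run the left half asynchronously
            List<Integer> rightResults = right.compute();
            List<Integer> results = new ArrayList<>(left.join());
            results.addAll(rightResults);             // preserve the original order
            return results;
        }
    }
    // Usage: List<Integer> all = new ForkJoinPool().invoke(new RangeTask(data, 0, data.length));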
There is also a ParallelArray class in jsr166z, that allows repeated computation over an array. That may work for you, if the values you are computing are primitive types.
I have a computing map (with soft values) that I am using to cache the results of an expensive computation.
Now I have a situation where I know that a particular key is likely to be looked up within the next few seconds. That key is also more expensive to compute than most.
I would like to compute the value in advance, in a minimum-priority thread, so that when the value is eventually requested it will already be cached, improving the response time.
What is a good way to do this such that:
I have control over the thread (specifically its priority) in which the computation is performed.
Duplicate work is avoided, i.e. the computation is only done once. If the computation task is already running then the calling thread waits for that task instead of computing the value again (FutureTask implements this. With Guava's computing maps this is true if you only call get but not if you mix it with calls to put.)
The "compute value in advance" method is asynchronous and idempotent. If a computation is already in progress it should return immediately without waiting for that computation to finish.
Avoid priority inversion, e.g. if a high-priority thread requests the value while a medium-priority thread is doing something unrelated but the computation task is queued on a low-priority thread, the high-priority thread must not be starved. Maybe this could be achieved by temporarily boosting the priority of the computing thread(s) and/or running the computation on the calling thread.
How could this be coordinated between all the threads involved?
Additional info
The computations in my application are image filtering operations, which means they are all CPU-bound. These operations include affine transforms (ranging from 50µs to 1ms) and convolutions (up to 10ms.) Of course the effectiveness of varying thread priorities depends on the ability of the OS to preempt the larger tasks.
You can arrange for "once only" execution of the background computation by using a Future with the ComputedMap. The Future represents the task that computes the value. The future is created by the ComputedMap and at the same time, passed to an ExecutorService for background execution. The executor can be configured with your own ThreadFactory implementation that creates low priority threads, e.g.
class LowPriorityThreadFactory implements ThreadFactory {
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        t.setPriority(Thread.MIN_PRIORITY);
        return t;
    }
}
When the value is needed, your high-priority thread then fetches the future from the map, and calls the get() method to retrieve the result, waiting for it to be computed if necessary. To avoid priority inversion you add some additional code to the task:
class HandlePriorityInversionTask<ResultType> extends FutureTask<ResultType> {
    Integer priority;         // non-null once a priority boost has been requested
    Integer originalPriority; // the worker thread's priority before the boost
    Thread thread;            // the thread currently running the task, if any

    HandlePriorityInversionTask(Callable<ResultType> callable) {
        super(callable);
    }

    public ResultType get() throws InterruptedException, ExecutionException {
        if (!isDone())
            setPriority(Thread.currentThread().getPriority()); // boost to the caller's priority
        return super.get();
    }

    public void run() {
        synchronized (this) {
            thread = Thread.currentThread();
            originalPriority = thread.getPriority();
            if (priority != null) setPriority(priority);
        }
        super.run();
    }

    protected synchronized void done() {
        if (originalPriority != null) setPriority(originalPriority);
        thread = null;
    }

    synchronized void setPriority(int priority) {
        this.priority = Integer.valueOf(priority);
        if (thread != null)
            thread.setPriority(priority);
    }
}
This takes care of raising the priority of the task to the priority of the thread calling get() if the task has not completed, and returns the priority to the original when the task completes, normally or otherwise. (To keep it brief, the code doesn't check if the priority is indeed greater, but that's easy to add.)
When the high-priority task calls get(), the future may not yet have begun executing. You might be tempted to avoid this by setting a large upper bound on the number of threads used by the executor service, but this may be a bad idea, since each thread could be running at high priority, consuming as much CPU as it can before the OS switches it out. The pool should probably be the same size as the number of hardware threads, e.g. sized to Runtime.getRuntime().availableProcessors(). If the task has not started executing, rather than waiting for the executor to schedule it (which is a form of priority inversion, since your high-priority thread is waiting for the low-priority threads to complete), you may choose to cancel it in the current executor and re-submit it on an executor running only high-priority threads.
One common way of coordinating this type of situation is to have a map whose values are FutureTask objects. So, stealing as an example some code I wrote from a web server of mine, the essential idea is that for a given parameter, we see if there is already a FutureTask (meaning that the calculation with that parameter has already been scheduled), and if so we wait for it. In this example, we otherwise schedule the lookup, but that could be done elsewhere with a separate call if that was desirable:
private final ConcurrentMap<WordLookupJob, Future<CharSequence>> cache = ...

private Future<CharSequence> getOrScheduleLookup(final WordLookupJob word) {
    Future<CharSequence> f = cache.get(word);
    if (f == null) {
        Callable<CharSequence> ex = new Callable<CharSequence>() {
            public CharSequence call() throws Exception {
                return doCalculation(word);
            }
        };
        Future<CharSequence> ft = executor.submit(ex);
        f = cache.putIfAbsent(word, ft);
        if (f != null) {
            // somebody slipped in with the same word -- cancel the
            // lookup we've just started and return the previous one
            ft.cancel(true);
        } else {
            f = ft;
        }
    }
    return f;
}
In terms of thread priorities: I wonder if this will achieve what you think it will? I don't quite understand your point about raising the priority of the lookup above the waiting thread: if the thread is waiting, then it's waiting, whatever the relative priorities of other threads... (You might want to have a look at some articles I've written on thread priorities and thread scheduling, but to cut a long story short, I'm not sure that changing the priority will necessarily buy you what you're expecting.)
I suspect that you are heading down the wrong path by focusing on thread priorities. Usually the data that a cache holds is expensive to compute due to I/O (out-of-memory data) rather than being CPU-bound (logic computation). If you're prefetching to guess a user's future action, such as looking at unread emails, then that indicates to me that your work is likely I/O-bound. This means that as long as thread starvation does not occur (which schedulers disallow), playing games with thread priority won't offer much of a performance improvement.
If the cost is an I/O call, then the background thread is blocked waiting for the data to arrive, and processing that data should be fairly cheap (e.g. deserialization). As the change in thread priority won't offer much of a speed-up, performing the work asynchronously on a background thread pool should be sufficient. If the cache-miss penalty is too high, then using multiple layers of caching tends to help further reduce the user-perceived latency.
As an alternative to thread priorities, you could perform a low-priority task only if no high-priority tasks are in progress. Here's a simple way to do that:
AtomicInteger highPriorityCount = new AtomicInteger();

void highPriorityTask() {
    highPriorityCount.incrementAndGet();
    try {
        highPriorityImpl();
    } finally {
        highPriorityCount.decrementAndGet();
    }
}

void lowPriorityTask() {
    if (highPriorityCount.get() == 0) {
        lowPriorityImpl();
    }
}
In your use case, both Impl() methods would call get() on the computing map, highPriorityImpl() in the same thread and lowPriorityImpl() in a different thread.
You could write a more sophisticated version that defers low-priority tasks until the high-priority tasks complete and limits the number of concurrent low-priority tasks.
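As a rough sketch of that more sophisticated version (all names are illustrative; the single-threaded pool doubles as the limit on concurrent low-priority tasks):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;

    class DeferringScheduler {
        private final AtomicInteger highPriorityCount = new AtomicInteger();
        private final Queue<Runnable> deferred = new ConcurrentLinkedQueue<>();
        private final ExecutorService lowPriorityPool = Executors.newSingleThreadExecutor();

        void runHighPriority(Runnable task) {
            highPriorityCount.incrementAndGet();
            try {
                task.run();
            } finally {
                if (highPriorityCount.decrementAndGet() == 0) {
                    drainDeferred(); // the last high-priority task out drains the backlog
                }
            }
        }

        void submitLowPriority(Runnable task) {
            deferred.add(task);
            if (highPriorityCount.get() == 0) {
                drainDeferred(); // no high-priority work in flight: run it now
            }
        }

        private void drainDeferred() {
            Runnable r;
            while ((r = deferred.poll()) != null) {
                lowPriorityPool.submit(r);
            }
        }
    }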