ExecutorService slow multi thread performance

ExecutorService slow multi thread performance - java

I am trying to execute a simple calculation (it calls Math.random() 10000000 times). Surprisingly running it in simple method performs much faster than using ExecutorService.
I have read another thread at ExecutorService's surprising performance break-even point --- rules of thumb? and tried to follow the answer by executing the Callable using batches, but the performance is still bad
How do I improve the performance based on my current code?
import java.util.*;
import java.util.concurrent.*;
public class MainTest {
public static void main(String[]args) throws Exception {
new MainTest().start();;
}
final List<Worker> workermulti = new ArrayList<Worker>();
final List<Worker> workersingle = new ArrayList<Worker>();
final int count=10000000;
public void start() throws Exception {
int n=2;
workersingle.add(new Worker(1));
for (int i=0;i<n;i++) {
// worker will only do count/n job
workermulti.add(new Worker(n));
}
ExecutorService serviceSingle = Executors.newSingleThreadExecutor();
ExecutorService serviceMulti = Executors.newFixedThreadPool(n);
long s,e;
int tests=10;
List<Long> simple = new ArrayList<Long>();
List<Long> single = new ArrayList<Long>();
List<Long> multi = new ArrayList<Long>();
for (int i=0;i<tests;i++) {
// simple
s = System.currentTimeMillis();
simple();
e = System.currentTimeMillis();
simple.add(e-s);
// single thread
s = System.currentTimeMillis();
serviceSingle.invokeAll(workersingle); // single thread
e = System.currentTimeMillis();
single.add(e-s);
// multi thread
s = System.currentTimeMillis();
serviceMulti.invokeAll(workermulti);
e = System.currentTimeMillis();
multi.add(e-s);
}
long avgSimple=sum(simple)/tests;
long avgSingle=sum(single)/tests;
long avgMulti=sum(multi)/tests;
System.out.println("Average simple: "+avgSimple+" ms");
System.out.println("Average single thread: "+avgSingle+" ms");
System.out.println("Average multi thread: "+avgMulti+" ms");
serviceSingle.shutdown();
serviceMulti.shutdown();
}
long sum(List<Long> list) {
long sum=0;
for (long l : list) {
sum+=l;
}
return sum;
}
private void simple() {
for (int i=0;i<count;i++){
Math.random();
}
}
class Worker implements Callable<Void> {
int n;
public Worker(int n) {
this.n=n;
}
#Override
public Void call() throws Exception {
// divide count with n to perform batch execution
for (int i=0;i<(count/n);i++) {
Math.random();
}
return null;
}
}
}
The output for this code
Average simple: 920 ms
Average single thread: 1034 ms
Average multi thread: 1393 ms
EDIT: performance suffer due to Math.random() being a synchronised method.. after changing Math.random() with new Random object for each thread, the performance improved
The output for the new code (after replacing Math.random() with Random for each thread)
Average simple: 928 ms
Average single thread: 1046 ms
Average multi thread: 642 ms

Math.random() is synchronized. Kind of the whole point of synchronized is to slow things down so they don't collide. Use something that isn't synchronized and/or give each thread its own object to work with, like a new Random.

You'd do well to read the contents of the other thread. There's plenty of good tips in there.
Perhaps the most significant issue with your benchmark is that according to the Math.random() contract, "This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator"
Read this as: the method is synchronized, so only one thread is likely to be able to usefully use it at the same time. So you do a bunch of overhead to distribute the tasks, only to force them again to run serially.

When you use multiple threads, you need to be aware of the overhead of using additional threads. You also need to determine if your algorithm has work which can be preformed in parallel or not. So you need to have work which can be run concurrently which is large enough that it will exceed the overhead of using multiple threads.
In this case, the simplest workaround is to use a separate Random in each thread. The problem you have is that as a micro-benchmark, your loop doesn't actually do anything and the JIT is very good at discarding code which doesn't do anything. A workaround for this is to sum the random results and return it from the call() as this is usually enough to prevent the JIT from discarding the code.
Lastly if you want to sum lots of numbers, you don't need to save them and sum them later. You can sum them as you go.

Related

Massive tasks alternative pattern for Runnable or Callable

For massive parallel computing I tend to use executors and callables. When I have thousand of objects to be computed I feel not so good to instantiate thousand of Runnables for each object.
So I have two approaches to solve this:
I. Split the workload into a small amount of x-workers giving y-objects each. (splitting the object list into x-partitions with y/x-size each)
public static <V> List<List<V>> partitions(List<V> list, int chunks) {
final ArrayList<List<V>> lists = new ArrayList<List<V>>();
final int size = Math.max(1, list.size() / chunks + 1);
final int listSize = list.size();
for (int i = 0; i <= chunks; i++) {
final List<V> vs = list.subList(Math.min(listSize, i * size), Math.min(listSize, i * size + size));
if(vs.size() == 0) break;
lists.add(vs);
}
return lists;
}
II. Creating x-workers which fetch objects from a queue.
Questions:
Is creating thousand of Runnables really expensive and to be avoided?
Is there a generic pattern/recommendation how to do it by solution II?
Are you aware of a different approach?

Creating thousands of Runnable (objects implementing Runnable) is not more expensive than creating a normal object.
Creating and running thousands of Threads can be very heavy, but you can use Executors with a pool of threads to solve this problem.

As for the different approach, you might be interested in java 8's parallel streams.

Combining various answers here :
Is creating thousand of Runnables really expensive and to be avoided?
No, it's not in and of itself. It's how you will make them execute that may prove costly (spawning a few thousand threads certainly has its cost).
So you would not want to do this :
List<Computation> computations = ...
List<Thread> threads = new ArrayList<>();
for (Computation computation : computations) {
Thread thread = new Thread(new Computation(computation));
threads.add(thread);
thread.start();
}
// If you need to wait for completion:
for (Thread t : threads) {
t.join();
}
Because it would 1) be unnecessarily costly in terms of OS ressource (native threads, each having a stack on the heap), 2) spam the OS scheduler with a vastly concurrent workload, most certainly leading to plenty of context switchs and associated cache invalidations at the CPU level 3) be a nightmare to catch and deal with exceptions (your threads should probably define an Uncaught exception handler, and you'd have to deal with it manually).
You'd probably prefer an approach where a finite Thread pool (of a few threads, "a few" being closely related to your number of CPU cores) handles many many Callables.
List<Computation> computations = ...
ExecutorService pool = Executors.newFixedSizeThreadPool(someNumber)
List<Future<Result>> results = new ArrayList<>();
for (Computation computation : computations) {
results.add(pool.submit(new ComputationCallable(computation));
}
for (Future<Result> result : results {
doSomething(result.get);
}
The fact that you reuse a limited number threads should yield a really nice improvement.
Is there a generic pattern/recommendation how to do it by solution II?
There are. First, your partition code (getting from a List to a List<List>) can be found inside collection tools such as Guava, with more generic and fail-proofed implementations.
But more than this, two patterns come to mind for what you are achieving :
Use the Fork/Join Pool with Fork/Join tasks (that is, spawn a task with your whole list of items, and each task will fork sub tasks with half of that list, up to the point where each task manages a small enough list of items). It's divide and conquer. See: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ForkJoinTask.html
If your computation were to be "add integers from a list", it could look like (there might be a boundary bug in there, I did not really check) :
public static class Adder extends RecursiveTask<Integer> {
protected List<Integer> globalList;
protected int start;
protected int stop;
public Adder(List<Integer> globalList, int start, int stop) {
super();
this.globalList = globalList;
this.start = start;
this.stop = stop;
System.out.println("Creating for " + start + " => " + stop);
}
#Override
protected Integer compute() {
if (stop - start > 1000) {
// Too many arguments, we split the list
Adder subTask1 = new Adder(globalList, start, start + (stop-start)/2);
Adder subTask2 = new Adder(globalList, start + (stop-start)/2, stop);
subTask2.fork();
return subTask1.compute() + subTask2.join();
} else {
// Manageable size of arguments, we deal in place
int result = 0;
for(int i = start; i < stop; i++) {
result +=i;
}
return result;
}
}
}
public void doWork() throws Exception {
List<Integer> computation = new ArrayList<>();
for(int i = 0; i < 10000; i++) {
computation.add(i);
}
ForkJoinPool pool = new ForkJoinPool();
RecursiveTask<Integer> masterTask = new Adder(computation, 0, computation.size());
Future<Integer> future = pool.submit(masterTask);
System.out.println(future.get());
}
Use Java 8 parallel streams in order to launch multiple parallel computations easily (under the hood, Java parallel streams can fall back to the Fork/Join pool actually).
Others have shown how this might look like.
Are you aware of a different approach?
For a different take at concurrent programming (without explicit task / thread handling), have a look at the actor pattern. https://en.wikipedia.org/wiki/Actor_model
Akka comes to mind as a popular implementation of this pattern...

#Aaron is right, you should take a look into Java 8's parallel streams:
void processInParallel(List<V> list) {
list.parallelStream().forEach(item -> {
// do something
});
}
If you need to specify chunks, you could use a ForkJoinPool as described here:
void processInParallel(List<V> list, int chunks) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> {
// do something with each item
});
});
}
You could also have a functional interface as an argument:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> processor.accept(item));
});
}
Or in shorthand notation:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
new ForkJoinPool(chunks).submit(() -> list.parallelStream().forEach(processor::accept));
}
And then you would use it like:
processInParallel(myList, 2, item -> {
// do something with each item
});
Depending on your needs, the ForkJoinPool#submit() returns an instance of ForkJoinTask, which is a Future and you may use it to check for the status or wait for the end of your task.
You'd most probably want the ForkJoinPool instantiated only once (not instantiate it on every method call) and then reuse it to prevent CPU choking if the method is called multiple times.

Is creating thousand of Runnables really expensive and to be avoided?
Not at all, the runnable/callable interfaces have only one method to implement each, and the amount of "extra" code in each task depends on the code you are running. But certainly no fault of the Runnable/Callable interfaces.
Is there a generic pattern/recommendation how to do it by solution II?
Pattern 2 is more favorable than pattern 1. This is because pattern 1 assumes that each worker will finish at the exact same time. If some workers finish before other workers, they could just be sitting idle since they only are able to work on the y/x-size queues you assigned to each of them. In pattern 2 however, you will never have idle worker threads (unless the end of the work queue is reached and numWorkItems < numWorkers).
An easy way to use the preferred pattern, pattern 2, is to use the ExecutorService invokeAll(Collection<? extends Callable<T>> list) method.
Here is an example usage:
List<Callable<?>> workList = // a single list of all of your work
ExecutorService es = Executors.newCachedThreadPool();
es.invokeAll(workList);
Fairly readable and straightforward usage, and the ExecutorService implementation will automatically use solution 2 for you, so you know that each worker thread has their use time maximized.
Are you aware of a different approach?
Solution 1 and 2 are two common approaches for generic work. Now, there are many different implementation available for you choose from (such as java.util.Concurrent, Java 8 parallel streams, or Fork/Join pools), but the concept of each implementation is generally the same. The only exception is if you have specific tasks in mind with non-standard running behavior.

Java concurrency counter not properly clean up

This is a java concurrency question. 10 jobs need to be done, each of them will have 32 worker threads. Worker thread will increase a counter . Once the counter is 32, it means this job is done and then clean up counter map. From the console output, I expect that 10 "done" will be output, pool size is 0 and counterThread size is 0.
The issues are :
most of time, "pool size: 0 and countThreadMap size:3" will be
printed out. even those all threads are gone, but 3 jobs are not
finished yet.
some time, I can see nullpointerexception in line 27. I have used ConcurrentHashMap and AtomicLong, why still have concurrency
exception.
Thanks
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.atomic.AtomicLong;
public class Test {
final ConcurrentHashMap<Long, AtomicLong[]> countThreadMap = new ConcurrentHashMap<Long, AtomicLong[]>();
final ExecutorService cachedThreadPool = Executors.newCachedThreadPool();
final ThreadPoolExecutor tPoolExecutor = ((ThreadPoolExecutor) cachedThreadPool);
public void doJob(final Long batchIterationTime) {
for (int i = 0; i < 32; i++) {
Thread workerThread = new Thread(new Runnable() {
#Override
public void run() {
if (countThreadMap.get(batchIterationTime) == null) {
AtomicLong[] atomicThreadCountArr = new AtomicLong[2];
atomicThreadCountArr[0] = new AtomicLong(1);
atomicThreadCountArr[1] = new AtomicLong(System.currentTimeMillis()); //start up time
countThreadMap.put(batchIterationTime, atomicThreadCountArr);
} else {
AtomicLong[] atomicThreadCountArr = countThreadMap.get(batchIterationTime);
atomicThreadCountArr[0].getAndAdd(1);
countThreadMap.put(batchIterationTime, atomicThreadCountArr);
}
if (countThreadMap.get(batchIterationTime)[0].get() == 32) {
System.out.println("done");
countThreadMap.remove(batchIterationTime);
}
}
});
tPoolExecutor.execute(workerThread);
}
}
public void report(){
while(tPoolExecutor.getActiveCount() != 0){
//
}
System.out.println("pool size: "+ tPoolExecutor.getActiveCount() + " and countThreadMap size:"+countThreadMap.size());
}
public static void main(String[] args) throws Exception {
Test test = new Test();
for (int i = 0; i < 10; i++) {
Long batchIterationTime = System.currentTimeMillis();
test.doJob(batchIterationTime);
}
test.report();
System.out.println("All Jobs are done");
}
}

Let’s dig through all the mistakes of thread related programming, one man can make:
Thread workerThread = new Thread(new Runnable() {
…
tPoolExecutor.execute(workerThread);
You create a Thread but don’t start it but submit it to an executor. It’s a historical mistake of the Java API to let Thread implement Runnable for no good reason. Now, every developer should be aware, that there is no reason to treat a Thread as a Runnable. If you don’t want to start a thread manually, don’t create a Thread. Just create the Runnable and pass it to execute or submit.
I want to emphasize the latter as it returns a Future which gives you for free what you are attempting to implement: the information when a task has been finished. It’s even easier when using invokeAll which will submit a bunch of Callables and return when all are done. Since you didn’t tell us anything about your actual task, it’s not clear whether you can let your tasks simply implement Callable (may return null) instead of Runnable.
If you can’t use Callables or don’t want to wait immediately on submission, you have to remember the returned Futures and query them at a later time:
static final ExecutorService cachedThreadPool = Executors.newCachedThreadPool();
public static List<Future<?>> doJob(final Long batchIterationTime) {
final Random r=new Random();
List<Future<?>> list=new ArrayList<>(32);
for (int i = 0; i < 32; i++) {
Runnable job=new Runnable() {
public void run() {
// pretend to do something
LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(r.nextInt(10)));
}
};
list.add(cachedThreadPool.submit(job));
}
return list;
}
public static void main(String[] args) throws Exception {
Test test = new Test();
Map<Long,List<Future<?>>> map=new HashMap<>();
for (int i = 0; i < 10; i++) {
Long batchIterationTime = System.currentTimeMillis();
while(map.containsKey(batchIterationTime))
batchIterationTime++;
map.put(batchIterationTime,doJob(batchIterationTime));
}
// print some statistics, if you really need
int overAllDone=0, overallPending=0;
for(Map.Entry<Long,List<Future<?>>> e: map.entrySet()) {
int done=0, pending=0;
for(Future<?> f: e.getValue()) {
if(f.isDone()) done++;
else pending++;
}
System.out.println(e.getKey()+"\t"+done+" done, "+pending+" pending");
overAllDone+=done;
overallPending+=pending;
}
System.out.println("Total\t"+overAllDone+" done, "+overallPending+" pending");
// wait for the completion of all jobs
for(List<Future<?>> l: map.values())
for(Future<?> f: l)
f.get();
System.out.println("All Jobs are done");
}
But note that if you don’t need the ExecutorService for subsequent tasks, it’s much easier to wait for all jobs to complete:
cachedThreadPool.shutdown();
cachedThreadPool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
System.out.println("All Jobs are done");
But regardless of how unnecessary the manual tracking of the job status is, let’s delve into your attempt, so you may avoid the mistakes in the future:
if (countThreadMap.get(batchIterationTime) == null) {
The ConcurrentMap is thread safe, but this does not turn your concurrent code into sequential one (that would render multi-threading useless). The above line might be processed by up to all 32 threads at the same time, all finding that the key does not exist yet so possibly more than one thread will then be going to put the initial value into the map.
AtomicLong[] atomicThreadCountArr = new AtomicLong[2];
atomicThreadCountArr[0] = new AtomicLong(1);
atomicThreadCountArr[1] = new AtomicLong(System.currentTimeMillis());
countThreadMap.put(batchIterationTime, atomicThreadCountArr);
That’s why this is called the “check-then-act” anti-pattern. If more than one thread is going to process that code, they all will put their new value, being confident that this was the right thing as they have checked the initial condition before acting but for all but one thread the condition has changed when acting and they are overwriting the value of a previous put operation.
} else {
AtomicLong[] atomicThreadCountArr = countThreadMap.get(batchIterationTime);
atomicThreadCountArr[0].getAndAdd(1);
countThreadMap.put(batchIterationTime, atomicThreadCountArr);
Since you are modifying the AtomicInteger which is already stored into the map, the put operation is useless, it will put the very array that it retrieved before. If there wasn’t the mistake that there can be multiple initial values as described above, the put operation had no effect.
}
if (countThreadMap.get(batchIterationTime)[0].get() == 32) {
Again, the use of a ConcurrentMap doesn’t turn the multi-threaded code into sequential code. While it is clear that the only last thread will update the atomic integer to 32 (when the initial race condition doesn’t materialize), it is not guaranteed that all other threads have already passed this if statement. Therefore more than one, up to all threads can still be at this point of execution and see the value of 32. Or…
System.out.println("done");
countThreadMap.remove(batchIterationTime);
One of the threads which have seen the 32 value might execute this remove operation. At this point, there might be still threads not having executed the above if statement, now not seeing the value 32 but producing a NullPointerException as the array supposed to contain the AtomicInteger is not in the map anymore. This is what happens, occasionally…

After creating your 10 jobs, your main thread is still running - it doesn't wait for your jobs to complete before it calls report on the test. You try to overcome this with the while loop, but tPoolExecutor.getActiveCount() is potentially coming out as 0 before the workerThread is executed, and then the countThreadMap.size() is happening after the threads were added to your HashMap.
There are a number of ways to fix this - but I will let another answer-er do that because I have to leave at the moment.

Synchronized HashMap vs ConcurrentHashMap write test

I learn java.util.concurrency and I have found one article about performance (http://www.javamex.com/tutorials/concurrenthashmap_scalability.shtml). And I decided to repeat one little part of those performance test for studying purpose. I have written the write test for HashMap and ConcurrentHashMap.
I have two questions about it:
Is it true, that for the best performance I should use number threads equal number CPU core?
I understand, that performance vary form platform to platform. But the result has shown that HashMap a little more faster then ConcurrentHashMap. I think it should be the same or vise versa. Maybe I have made a mistake in my code.
Any criticism is welcome.
package Concurrency;
import java.util.concurrent.*;
import java.util.*;
class Writer2 implements Runnable {
private Map<String, Integer> map;
private static int index;
private int nIteration;
private Random random = new Random();
char[] chars = "abcdefghijklmnopqrstuvwxyz".toCharArray();
private final CountDownLatch latch;
public Writer2(Map<String, Integer> map, int nIteration, CountDownLatch latch) {
this.map = map;
this.nIteration = nIteration;
this.latch = latch;
}
private synchronized String getNextString() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
char c = chars[random.nextInt(chars.length)];
sb.append(c);
}
sb.append(index);
if(map.containsKey(sb.toString()))
System.out.println("dublicate:" + sb.toString());
return sb.toString();
}
private synchronized int getNextInt() { return index++; }
#Override
public void run() {
while(nIteration-- > 0) {
map.put(getNextString(), getNextInt());
}
latch.countDown();
}
}
public class FourtyTwo {
static final int nIteration = 100000;
static final int nThreads = 4;
static Long testMap(Map<String, Integer> map) throws InterruptedException{
String name = map.getClass().getSimpleName();
CountDownLatch latch = new CountDownLatch(nThreads);
long startTime = System.currentTimeMillis();
ExecutorService exec = Executors.newFixedThreadPool(nThreads);
for(int i = 0; i < nThreads; i++)
exec.submit(new Writer2(map, nIteration, latch));
latch.await();
exec.shutdown();
long endTime = System.currentTimeMillis();
System.out.format(name + ": that took %,d milliseconds %n", (endTime - startTime));
return (endTime - startTime);
}
public static void main(String[] args) throws InterruptedException {
ArrayList<Long> result = new ArrayList<Long>() {
#Override
public String toString() {
Long result = 0L;
Long size = new Long(this.size());
for(Long i : this)
result += i;
return String.valueOf(result/size);
}
};
Map<String, Integer> map1 = Collections.synchronizedMap(new HashMap<String, Integer>());
Map<String, Integer> map2 = new ConcurrentHashMap<>();
System.out.println("Rinning test...");
for(int i = 0; i < 5; i++) {
//result.add(testMap(map1));
result.add(testMap(map2));
}
System.out.println("Average time:" + result + " milliseconds");
}
}
/*
OUTPUT:
ConcurrentHashMap: that took 5 727 milliseconds
ConcurrentHashMap: that took 2 349 milliseconds
ConcurrentHashMap: that took 9 530 milliseconds
ConcurrentHashMap: that took 25 931 milliseconds
ConcurrentHashMap: that took 1 056 milliseconds
Average time:8918 milliseconds
SynchronizedMap: that took 6 471 milliseconds
SynchronizedMap: that took 2 444 milliseconds
SynchronizedMap: that took 9 678 milliseconds
SynchronizedMap: that took 10 270 milliseconds
SynchronizedMap: that took 7 206 milliseconds
Average time:7213 milliseconds
*/

One
How many threads varies, not by CPU, but by what you are doing. If, for example, what you are doing with your threads is highly disk intensive, your CPU isn't likely to be maxed out, so doing 8 threads may just cause heavy thrashing. If, however, you have huge amounts of disk activity, followed by heavy computation, followed by more disk activity, you would benefit from staggering the threads, splitting out your activities and grouping them, as well as using more threads. You would, for example, in such a case, likely want to group together file activity that uses a single file, but maybe not activity where you are pulling from a bunch of files (unless they are written contiguously on the disk). Of course, if you overthink disk IO, you could seriously hurt your performance, but I'm making a point of saying that you shouldn't just shirk it, either. In such a program, I would probably have threads dedicated to disk IO, threads dedicated to CPU work. Divide and conquer. You'd have fewer IO threads and more CPU threads.
It is common for a synchronous server to run many more threads than cores/CPUs because most of those threads either do work for only a short time or don't do much CPU intensive work. It's not useful to have 500 threads, though, if you will only ever have 2 clients and the context switching of those excess threads hampers performance. It's a balancing act that often requires a little bit of tuning.
In short
Think about what you are doing
Network activity is light,so more threads are generally good
CPU intensive things don't do much good if you have 2x more of those threads than cores... usually a little more than 1x or a little less than 1x is optimum, but you have to test, test, test
Having 10 disk IO intensive threads may hurt all 10 threads, just like having 30 CPU intensive threads... the thrashing hurts them all
Try to spread out the pain
See if it helps to spread out the CPU, IO, etc, work or if clustering is better... it will depend on what you are doing
Try to group things up
If you can, separate out your disk, IO, and network tasks and give them their own threads that are tuned to those tasks
Two
In general, thread-unsafe methods run faster. Similarly using localized synchronization runs faster than synchronizing the entire method. As such, HashMap is normally significantly faster than ConcurrentHashMap. Another example would be StringBuffer compared to StringBuilder. StringBuffer is synchronized and is not only slower, but the synchronization is heavier (more code, etc); it should rarely be used. StringBuilder, however, is unsafe if you have multiple threads hitting it. With that said, StringBuffer and ConcurrentHashMap can race, too. Being "thread-safe" doesn't mean that you can just use it without thought, particularly the way that these two classes operate. For example, you can still have a race condition if you are reading and writing at the same time (say, using contains(Object) as you are doing a put or remove). If you want to prevent such things, you have to use your own class or synchronize your calls to your ConcurrentHashMap.
I generally use the non-concurrent maps and collections and just use my own locks where I need them. You'll find that it's much faster that way and the control is great. Atomics (e.g. AtomicInteger) are nice sometimes, but really not generally useful for what I do. Play with the classes, play with synchronization, and you'll find that you can master than more efficiently than the shotgun approach of ConcurrentHashMap, StringBuffer, etc. You can have race conditions whether or not you use those classes if you don't do it right... but if you do it yourself, you can also be much more efficient and more careful.
Example
Note that we have a new Object that we are locking on. Use this instead of synchronized on a method.
public final class Fun {
private final Object lock = new Object();
/*
* (non-Javadoc)
*
* #see java.util.Map#clear()
*/
#Override
public void clear() {
// Doing things...
synchronized (this.lock) {
// Where we do sensitive work
}
}
/*
* (non-Javadoc)
*
* #see java.util.Map#put(java.lang.Object, java.lang.Object)
*/
#Override
public V put(final K key, #Nullable final V value) {
// Doing things...
synchronized (this.lock) {
// Where we do sensitive work
}
// Doing things...
}
}
And From Your Code...
I might not put that sb.append(index) in the lock or might have a separate lock for index calls, but...
private final Object lock = new Object();
private String getNextString() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
char c = chars[random.nextInt(chars.length)];
sb.append(c);
}
synchronized (lock) {
sb.append(index);
if (map.containsKey(sb.toString()))
System.out.println("dublicate:" + sb.toString());
}
return sb.toString();
}
private int getNextInt() {
synchronized (lock) {
return index++;
}
}

The article you linked doesn't state that ConcurrentHashMap is generally faster than a synchronized HashMap, only that it scales better; i.e. it is faster for a large number of threads. As you can see in their graph, for 4 threads the performance is very similar.
Apart from that, you should test with more items, as my results show, they can vary quite a bit:
Rinning test...
SynchronizedMap: that took 13,690 milliseconds
SynchronizedMap: that took 8,210 milliseconds
SynchronizedMap: that took 11,598 milliseconds
SynchronizedMap: that took 9,509 milliseconds
SynchronizedMap: that took 6,992 milliseconds
Average time:9999 milliseconds
Rinning test...
ConcurrentHashMap: that took 10,728 milliseconds
ConcurrentHashMap: that took 7,227 milliseconds
ConcurrentHashMap: that took 6,668 milliseconds
ConcurrentHashMap: that took 7,071 milliseconds
ConcurrentHashMap: that took 7,320 milliseconds
Average time:7802 milliseconds
Note that your code isn't clearing the Map between loops, adding nIteration more items every time... is that what you wanted?
synchronized isn't necessary on your getNextInt/getNextString, as they aren't being called from multiple threads.

Is it possible to use multithreading without creating Threads over and over again?

First and once more, thanks to all that already answered my question. I am not a very experienced programmer and it is my first experience with multithreading.
I got an example that is working quite like my problem. I hope it could ease our case here.
public class ThreadMeasuring {
private static final int TASK_TIME = 1; //microseconds
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
#Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
public static void main(String[] args) {
ThreadFactory threadFactory = new ThreadFactory() {
int counter = 1;
#Override
public Thread newThread(Runnable r) {
Thread t = new Thread(r, "Executor thread " + (counter++));
return t;
}
};
// the total duty to be divided in tasks is fixed (problem dependent).
// Increase ntasks will mean decrease the task time proportionally.
// 4 Is an arbitrary example.
// This tasks will be executed thousands of times, inside a loop alternating
// with serial processing that needs their result and prepare the next ones.
int ntasks = 4;
int nthreads = 2;
int ncores = Runtime.getRuntime().availableProcessors();
if (nthreads<ncores) ncores = nthreads;
Batch serial = new Batch(null);
long serialTime = System.nanoTime();
serial.run();
serialTime = System.nanoTime() - serialTime;
ExecutorService executor = Executors.newFixedThreadPool( nthreads, threadFactory );
CountDownLatch countDown = new CountDownLatch(ntasks);
ArrayList<Batch> batches = new ArrayList<Batch>();
for (int i = 0; i < ntasks; i++) {
batches.add(new Batch(countDown));
}
long start = System.nanoTime();
for (Batch r : batches){
executor.execute(r);
}
// wait for all threads to finish their task
try {
countDown.await();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long tmeasured = (System.nanoTime() - start);
System.out.println("Task time= " + TASK_TIME + " ms");
System.out.println("Number of tasks= " + ntasks);
System.out.println("Number of threads= " + nthreads);
System.out.println("Number of cores= " + ncores);
System.out.println("Measured time= " + tmeasured);
System.out.println("Theoretical serial time= " + TASK_TIME*1000000*ntasks);
System.out.println("Theoretical parallel time= " + (TASK_TIME*1000000*ntasks)/ncores);
System.out.println("Speedup= " + (serialTime*ntasks)/(double)tmeasured);
executor.shutdown();
}
}
Instead of doing the calculations, each batch just waits for some given time. The program calculates the speedup, that would allways be 2 in theory but can get less than 1 (actually a speed down) if the 'TASK_TIME' is small.
My calculations take at the top 1 ms and are commonly faster. For 1 ms I find a little speedup of around 30%, but in practice, with my program, I notice a speed down.
The structure of this code is very similar to my program, so if you could help me to optimise the thread handling I would be very grateful.
Kind regards.
Below, the original question:
Hi.
I would like to use multithreading on my program, since it could increase its efficiency considerably, I believe. Most of its running time is due to independent calculations.
My program has thousands of independent calculations (several linear systems to solve), but they just happen at the same time by minor groups of dozens or so. Each of this groups would take some miliseconds to run. After one of these groups of calculations, the program has to run sequentially for a little while and then I have to solve the linear systems again.
Actually, it can be seen as these independent linear systems to solve are inside a loop that iterates thousands of times, alternating with sequential calculations that depends on the previous results. My idea to speed up the program is to compute these independent calculations in parallel threads, by dividing each group into (the number of processors I have available) batches of independent calculation. So, in principle, there isn't queuing at all.
I tried using the FixedThreadPool and CachedThreadPool and it got even slower than serial processing. It seems to takes too much time creating new Treads each time I need to solve the batches.
Is there a better way to handle this problem? These pools I've used seem to be proper for cases when each thread takes more time instead of thousands of smaller threads...
Thanks!
Best Regards!

Thread pools don't create new threads over and over. That's why they're pools.
How many threads were you using and how many CPUs/cores do you have? What is the system load like (normally, when you execute them serially, and when you execute with the pool)? Is synchronization or any kind of locking involved?
Is the algorithm for parallel execution exactly the same as the serial one (your description seems to suggest that serial was reusing some results from previous iteration).

From what i've read: "thousands of independent calculations... happen at the same time... would take some miliseconds to run" it seems to me that your problem is perfect for GPU programming.
And i think it answers you question. GPU programming is becoming more and more popular. There are Java bindings for CUDA & OpenCL. If it is possible for you to use it, i say go for it.

I'm not sure how you perform the calculations, but if you're breaking them up into small groups, then your application might be ripe for the Producer/Consumer pattern.
Additionally, you might be interested in using a BlockingQueue. The calculation consumers will block until there is something in the queue and the block occurs on the take() call.
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
CountDownLatch getLatch(){
return countDown;
}
#Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
class CalcProducer implements Runnable {
private final BlockingQueue queue;
CalcProducer(BlockingQueue q) { queue = q; }
public void run() {
try {
while(true) {
CountDownLatch latch = new CountDownLatch(ntasks);
for(int i = 0; i < ntasks; i++) {
queue.put(produce(latch));
}
// don't need to wait for the latch, only consumers wait
}
} catch (InterruptedException ex) { ... handle ...}
}
CalcGroup produce(CountDownLatch latch) {
return new Batch(latch);
}
}
class CalcConsumer implements Runnable {
private final BlockingQueue queue;
CalcConsumer(BlockingQueue q) { queue = q; }
public void run() {
try {
while(true) { consume(queue.take()); }
} catch (InterruptedException ex) { ... handle ...}
}
void consume(Batch batch) {
batch.Run();
batch.getLatch().await();
}
}
class Setup {
void main() {
BlockingQueue<Batch> q = new LinkedBlockingQueue<Batch>();
int numConsumers = 4;
CalcProducer p = new CalcProducer(q);
Thread producerThread = new Thread(p);
producerThread.start();
Thread[] consumerThreads = new Thread[numConsumers];
for(int i = 0; i < numConsumers; i++)
{
consumerThreads[i] = new Thread(new CalcConsumer(q));
consumerThreads[i].start();
}
}
}
Sorry if there are any syntax errors, I've been chomping away at C# code and sometimes I forget the proper java syntax, but the general idea is there.

If you have a problem which does not scale to multiple cores, you need to change your program or you have a problem which is not as parallel as you think. I suspect you have some other type of bug, but cannot say based on the information given.
This test code might help.
Time per million tasks 765 ms
code
ExecutorService es = Executors.newFixedThreadPool(4);
Runnable task = new Runnable() {
#Override
public void run() {
// do nothing.
}
};
long start = System.nanoTime();
for(int i=0;i<1000*1000;i++) {
es.submit(task);
}
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
long time = System.nanoTime() - start;
System.out.println("Time per million tasks "+time/1000/1000+" ms");
EDIT: Say you have a loop which serially does this.
for(int i=0;i<1000*1000;i++)
doWork(i);
You might assume that changing to loop like this would be faster, but the problem is that the overhead could be greater than the gain.
for(int i=0;i<1000*1000;i++) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
doWork(i2);
}
}
}
So you need to create batches of work (at least one per thread) so there are enough tasks to keep all the threads busy, but not so many tasks that your threads are spending time in overhead.
final int batchSize = 10*1000;
for(int i=0;i<1000*1000;i+=batchSize) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
for(int i3=i2;i3<i2+batchSize;i3++)
doWork(i3);
}
}
}
EDIT2: RUnning atest which copied data between threads.
for (int i = 0; i < 20; i++) {
ExecutorService es = Executors.newFixedThreadPool(1);
final double[] d = new double[4 * 1024];
Arrays.fill(d, 1);
final double[] d2 = new double[4 * 1024];
es.submit(new Runnable() {
#Override
public void run() {
// nothing.
}
}).get();
long start = System.nanoTime();
es.submit(new Runnable() {
#Override
public void run() {
synchronized (d) {
System.arraycopy(d, 0, d2, 0, d.length);
}
}
});
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
// get a the values in d2.
for (double x : d2) ;
long time = System.nanoTime() - start;
System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
starts badly but warms up to ~50 us.
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.

Hmm, CachedThreadPool seems to be created just for your case. It does not recreate threads if you reuse them soon enough, and if you spend a whole minute before you use new thread, the overhead of thread creation is comparatively negligible.
But you can't expect parallel execution to speed up your calculations unless you can also access data in parallel. If you employ extensive locking, many synchronized methods, etc you'll spend more on overhead than gain on parallel processing. Check that your data can be efficiently processed in parallel and that you don't have non-obvious synchronizations lurkinb in the code.
Also, CPUs process data efficiently if data fully fit into cache. If data sets of each thread is bigger than half the cache, two threads will compete for cache and issue many RAM reads, while one thread, if only employing one core, may perform better because it avoids RAM reads in the tight loop it executes. Check this, too.

Here's a psuedo outline of what I'm thinking
class WorkerThread extends Thread {
Queue<Calculation> calcs;
MainCalculator mainCalc;
public void run() {
while(true) {
while(calcs.isEmpty()) sleep(500); // busy waiting? Context switching probably won't be so bad.
Calculation calc = calcs.pop(); // is it pop to get and remove? you'll have to look
CalculationResult result = calc.calc();
mainCalc.returnResultFor(calc,result);
}
}
}
Another option, if you're calling external programs. Don't put them in a loop that does them one at a time or they won't run in parallel. You can put them in a loop that PROCESSES them one at a time, but not that execs them one at a time.
Process calc1 = Runtime.getRuntime.exec("myCalc paramA1 paramA2 paramA3");
Process calc2 = Runtime.getRuntime.exec("myCalc paramB1 paramB2 paramB3");
Process calc3 = Runtime.getRuntime.exec("myCalc paramC1 paramC2 paramC3");
Process calc4 = Runtime.getRuntime.exec("myCalc paramD1 paramD2 paramD3");
calc1.waitFor();
calc2.waitFor();
calc3.waitFor();
calc4.waitFor();
InputStream is1 = calc1.getInputStream();
InputStreamReader isr1 = new InputStreamReader(is1);
BufferedReader br1 = new BufferedReader(isr1);
String resultStr1 = br1.nextLine();
InputStream is2 = calc2.getInputStream();
InputStreamReader isr2 = new InputStreamReader(is2);
BufferedReader br2 = new BufferedReader(isr2);
String resultStr2 = br2.nextLine();
InputStream is3 = calc3.getInputStream();
InputStreamReader isr3 = new InputStreamReader(is3);
BufferedReader br3 = new BufferedReader(isr3);
String resultStr3 = br3.nextLine();
InputStream is4 = calc4.getInputStream();
InputStreamReader isr4 = new InputStreamReader(is4);
BufferedReader br4 = new BufferedReader(isr4);
String resultStr4 = br4.nextLine();

Are tasks parallelized when executed via an ExecutorCompletionService?

I submitted 5 jobs to an ExecutorCompletionService, but it seems like the jobs are executed in sequence. The ExecutorService that is passed to the constructor of ExecutorCompletionService is created using newCacheThreadPool form. Am I doing anything wrong ?
UPDATE Each job is basically doing a database query & some calculation. The code for the ExecutorCompletionService is lifted as-is off the javadoc. I just replaced the Callables with my own custom Callable implementations.

The ExecutorCompletionService has nothing to do with how jobs are executed, it's simply a convenient way of retrieving the results.
Executors.newCachedThreadPool by default executes tasks in separate threads, which can be parallel, given that:
tasks are independent, and don't e.g. synchronize on the same object inside;
you have multiple hardware CPU threads.
The last point deserves an explanation. Although there are no guarantees, in practice the Sun JVM favours the currently executing thread so it's never swapped out in favour of another one. That means that your 5 tasks might end up being executed serially due to the JVM implementation and not having e.g. a multi-core machine.

I assume you meant Executors.newCachedThreadPool(). If so, execution should be parallelized as you expect.

Each job is basically doing a database query & some calculation. The code for the ExecutorCompletionService is lifted as-is off the javadoc. I just replaced the Callables with my own custom Callable implementations.
In that case, are you sure you're not mistaken in thinking they're executed sequentially because you're retrieving the results sequentially?
Throw in some debug logging lines in your callables to rule this out, and/or have a look at this limited usage scenario:
public static void main(String... args) throws InterruptedException, ExecutionException {
List<Callable<String>> list = new ArrayList<Callable<String>>();
list.add(new PowersOfX(2));
list.add(new PowersOfX(3));
list.add(new PowersOfX(5));
solve(Executors.newCachedThreadPool(), list);
}
static void solve(Executor e, Collection<Callable<String>> solvers) throws InterruptedException, ExecutionException {
CompletionService<String> ecs = new ExecutorCompletionService<String>(e);
for (Callable<String> s : solvers)
ecs.submit(s);
int n = solvers.size();
for (int i = 0; i < n; ++i) {
String r = ecs.take().get();
if (r != null)
System.out.println("Retrieved: " + r);
}
}
static class PowersOfX implements Callable<String> {
int x;
public PowersOfX(int x) {this.x = x;}
#Override
public String call() throws Exception {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 10; i++) {
sb.append(Math.pow(2, i)).append('\t');
System.out.println(Math.pow(x, i));
Thread.sleep(2000);
}
return sb.toString();
}
}
Executing this you'll see the numbers are generated intermixed (and thus executed concurrently), but retrieving the results alone wont show you this level detail..

The execution will depend on a number of things. For example:
the length of time it takes to complete a job
the number of threads in the thread pool (a cached thread pool will only create threads if it thinks they are needed)
Executing in sequence is not necessarily wrong.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.