I am having trouble understanding why parallelization in Java seems inefficient in some cases. In the following code, I build 4 identical tasks that are executed using a thread pool.
On my Core i5 (2 cores, 4 threads), if I set the number of workers to 1, the computation takes around 5700 ms and uses 25% of the processor.
If I set the number of workers to 4, I observe 100% CPU usage, but the computation time is the same: 5700 ms, whereas I expected it to be 4 times lower.
Why? Is this normal?
(Of course my real task is more complicated, but the example seems to reproduce the problem.) Thank you in advance for your answers.
Here is the code:
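import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Test {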
public static void main(String[] args) {
int nb_workers=1;
ExecutorService executor=Executors.newFixedThreadPool(nb_workers);
long tic=System.currentTimeMillis();
for(int i=0; i<4;i++){
WorkerTest wt=new WorkerTest();
executor.execute(wt);
}
executor.shutdown();
try {
executor.awaitTermination(1000, TimeUnit.SECONDS);
} catch (InterruptedException e) {e.printStackTrace();}
System.out.println(System.currentTimeMillis()-tic);
}
public static class WorkerTest implements Runnable {
@Override
public void run() {
double[] array=new double[10000000];
for (int i=0;i<array.length;i++){
array[i]=Math.tanh(Math.random());
}
}
}
}
The cause is that you are calling Math.random(), which uses a single shared instance of Random, so all 4 threads compete for that one resource.
Giving each task its own Random object makes the execution truly parallel:
Random random = new Random();
double[] array = new double[10000000];
for (int i = 0; i < array.length; i++) {
array[i] = Math.tanh(random.nextDouble());
}
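A variant of the same idea, as a rough sketch: java.util.concurrent.ThreadLocalRandom hands each thread its own generator, so nothing is shared and no Random has to be allocated per task. The class name RandomFillDemo and the helper methods fill and timeRun are made up for this illustration; the timings will of course depend on the machine.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class RandomFillDemo {

    // Fills an array using the calling thread's own generator, so no state is shared.
    public static double[] fill(int size) {
        double[] array = new double[size];
        ThreadLocalRandom random = ThreadLocalRandom.current();
        for (int i = 0; i < array.length; i++) {
            array[i] = Math.tanh(random.nextDouble());
        }
        return array;
    }

    // Runs `tasks` fill jobs on `workers` threads and returns the elapsed milliseconds.
    public static long timeRun(int workers, int tasks, int size) {
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        long tic = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> fill(size));
        }
        executor.shutdown();
        try {
            executor.awaitTermination(1000, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - tic) / 1_000_000;
    }

    public static void main(String[] args) {
        // Same workload as the question: 4 tasks of 10,000,000 elements each.
        System.out.println("1 worker:  " + timeRun(1, 4, 10_000_000) + " ms");
        System.out.println("4 workers: " + timeRun(4, 4, 10_000_000) + " ms");
    }
}
```

If contention on the shared generator was the bottleneck, the 4-worker run should now be noticeably faster than the 1-worker run.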
I'm trying to simulate a non-thread safe counter class by incrementing the count in an executor service task and using countdown latches to wait for all threads to start and then stop before reading the value in the main thread.
The issue is that when I run it the System.out at the end always returns 10 as the correct count value. I was expecting to see some other value when I run this as the 10 threads may see different values.
My code is below. Any idea what is happening here? I'm running it in Java 17 and from Intellij IDEA.
Counter.java
public class Counter {
private int counter = 0;
public void incrementCounter() {
counter += 1;
}
public int getCounter() {
return counter;
}
}
Main.java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Main {
public static void main(String[] args) throws InterruptedException {
ExecutorService executorService = Executors.newFixedThreadPool(10);
CountDownLatch startSignal = new CountDownLatch(10);
CountDownLatch doneSignal = new CountDownLatch(10);
Counter counter = new Counter();
for (int i=0; i<10; i++) {
executorService.submit(() -> {
try {
startSignal.countDown();
startSignal.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
counter.incrementCounter();
doneSignal.countDown();
});
}
doneSignal.await();
System.out.println("Finished: " + counter.getCounter());
executorService.shutdownNow();
}
}
It's worth remembering that code that isn't synchronised correctly can still behave correctly under some circumstances; it just isn't guaranteed to do so in every situation, on every JVM, on every hardware.
In other words, there is no reverse guarantee. Optimisers, for example, are free to replace your code with a correctly synchronised implementation if that comes at little to no cost.
(Whether that is what's actually happening here isn't obvious to me at first glance.)
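One way to make a lost update more likely to show up is to let each task increment many times instead of once. The following is only a sketch under that assumption: CounterRaceDemo, UnsafeCounter and the helper methods are invented for illustration, and the unsynchronised result is still not guaranteed to be wrong on any given run. The AtomicInteger variant is the one whose result is guaranteed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CounterRaceDemo {

    // Same shape as the Counter in the question: an unsynchronised read-modify-write.
    static class UnsafeCounter {
        private int counter = 0;
        void increment() { counter += 1; }
        int get() { return counter; }
    }

    // Runs `threads` tasks that each call `increment` `perThread` times, then waits for all of them.
    public static void runAll(int threads, int perThread, Runnable increment) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment.run();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // May return less than threads * perThread because increments can be lost.
    public static int unsafeCount(int threads, int perThread) {
        UnsafeCounter counter = new UnsafeCounter();
        runAll(threads, perThread, counter::increment);
        return counter.get();
    }

    // Always returns threads * perThread because incrementAndGet is atomic.
    public static int atomicCount(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        runAll(threads, perThread, counter::incrementAndGet);
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("unsafe: " + unsafeCount(10, 100_000));
        System.out.println("atomic: " + atomicCount(10, 100_000));
    }
}
```

With a single increment per task, as in the question, each task is so short that the increments rarely overlap, which is consistent with always seeing 10.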
I have a problem which I would like to solve using Java's ExecutorService and Future classes. I am currently taking many samples from a function that is very expensive for me to compute (each sample can take several minutes) using a for loop. I have a class FunctionEvaluator that evaluates this function for me and this class is quite expensive to instantiate, since it contains a lot of internal memory, so I have made this class easily reusable with some internal counters and a reset() method. So my current situation looks like this:
int numSamples = 100;
int amountOfData = 1000000;
double[] data = new double[amountOfData];//Data comes from somewhere...
double[] results = new double[numSamples];
//a lot of memory contained inside the FunctionEvaluator class,
//expensive to initialise
FunctionEvaluator fe = new FunctionEvaluator();
for(int i=0; i<numSamples; i++) {
results[i] = fe.sampleAt(i, data);//very expensive computation
}
but I would like to get some multithreading going to speed things up. It should be easy enough, because while each sample will share whatever is inside of data, it is a read-only operation and each sample is independent of any other. Now I wouldn't be having any trouble with this since I've used Java's Future and ExecutorService before, but never in a context where the Callable had to be re-used. So in general, how would I go about setting this scenario up given that I can afford to run n instantiations of FunctionEvaluator? Something (very roughly) like this:
int numSamples = 100;
int amountOfData = 1000000;
int N = 10;
double[] data = new double[amountOfData];//Data comes from somewhere...
double[] results = new double[numSamples];
//a lot of memory contained inside the FunctionEvaluator class,
//expensive to initialise
FunctionEvaluator[] fe = new FunctionEvaluator[N];
for(int i=0; i<numSamples; i++) {
//Somehow add available FunctionEvaluators to an ExecutorService
//so that N FunctionEvaluators can run in parallel. When a
//FunctionEvaluator is finished, reset then compute a new sample
//until numSamples samples have been taken.
}
Any help would be greatly appreciated! Many thanks.
EDIT
So here is a toy example (which doesn't work :P). In this case the "expensive function" that I want to sample is just squaring an integer and the "expensive to instantiate class" that does it for me is called CallableComputation:
In TestConc.java:
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
public class TestConc {
public static void main(String[] args) {
SquareCalculator squareCalculator = new SquareCalculator();
int numFunctionEvaluators = 2;
int numSamples = 10;
ExecutorService executor = Executors.newFixedThreadPool(2);
CallableComputation c1 = new CallableComputation(2);
CallableComputation c2 = new CallableComputation(3);
CallableComputation[] callables = new CallableComputation[numFunctionEvaluators];
Future<Integer>[] futures = (new Future[numFunctionEvaluators]);
int[] results = new int[numSamples];
for(int i=0; i<numFunctionEvaluators; i++) {
callables[i] = new CallableComputation(i);
futures[i] = executor.submit(callables[i]);
}
futures[0] = executor.submit(c1);
futures[1] = executor.submit(c2);
for(int i=numFunctionEvaluators; i<numSamples; ) {
for(int j=0; j<futures.length; j++) {
if(futures[j].isDone()) {
try {
results[i] = futures[j].get();
}
catch (InterruptedException e) {
e.printStackTrace();
}
catch (ExecutionException e) {
e.printStackTrace();
}
callables[j].set(i);
System.out.printf("Function evaluator %d given %d\n", j, i+1);
executor.submit(callables[j]);
i++;
}
}
}
executor.shutdown();
try {
executor.awaitTermination(1, TimeUnit.MINUTES);
}
catch (InterruptedException e) {
e.printStackTrace();
}
for (int i=0; i<results.length; i++) {
System.out.printf("res%d=%d, ", i, results[i]);
}
System.out.println();
}
private static boolean areDone(Future<Integer>[] futures) {
for(int i=0; i<futures.length; i++) {
if(!futures[i].isDone()) {
return false;
}
}
return true;
}
private static void printFutures(Future<Integer>[] futures) {
for (int i=0; i<futures.length; i++) {
System.out.printf("f%d=%s | ", i, futures[i].isDone()?"done" : "not done");
}System.out.printf("\n");
}
}
In CallableComputation.java:
import java.util.concurrent.Callable;
public class CallableComputation implements Callable<Integer>{
int input = 0;
public CallableComputation(int input) {
this.input = input;
}
public void set(int i) {
input = i;
}
@Override
public Integer call() throws Exception {
System.out.printf("currval=%d\n", input);
Thread.sleep(500);
return input * input;
}
}
In Java8:
double[] result = IntStream.range(0, numSamples)
.parallel()
.mapToDouble(i->fe.sampleAt(i, data))
.toArray();
The question asks how to execute computationally heavy functions in parallel while loading as many CPUs as possible.
Excerpt from the Parallelism tutorial:
Parallel computing involves dividing a problem into subproblems,
solving those problems simultaneously (in parallel, with each
subproblem running in a separate thread), and then combining the
results of the solutions to the subproblems. Java SE provides the
fork/join framework, which enables you to more easily implement
parallel computing in your applications. However, with this framework,
you must specify how the problems are subdivided (partitioned). With
aggregate operations, the Java runtime performs this partitioning and
combining of solutions for you.
The actual solution consists of:
IntStream.range will generate the stream of integers from 0 (inclusive) to numSamples (exclusive).
parallel() will make the stream parallel, so the work is split across the available CPUs on the box.
mapToDouble() will convert the stream of integers to a stream of doubles by applying the lambda expression that does the actual work.
toArray() is a terminal operation that will aggregate the results and return them as an array.
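For reference, a self-contained sketch of that pipeline. The sampleAt method here is a hypothetical stand-in for FunctionEvaluator.sampleAt and is assumed to be thread safe (it only reads data); with a real evaluator that keeps internal state, a single shared instance would not be safe to call from a parallel stream.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelSamplingDemo {

    // Hypothetical stand-in for the expensive sampleAt call.
    // It only reads `data` and keeps its working state in local variables.
    public static double sampleAt(int i, double[] data) {
        double sum = 0.0;
        for (double d : data) {
            sum += d * i;
        }
        return sum;
    }

    // Computes all samples in parallel; results[i] holds the sample for index i.
    public static double[] sampleAll(int numSamples, double[] data) {
        return IntStream.range(0, numSamples)
                .parallel()
                .mapToDouble(i -> sampleAt(i, data))
                .toArray();
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0};
        System.out.println(Arrays.toString(sampleAll(5, data)));
    }
}
```

Because the stream is ordered, toArray() places each result at the index it was computed for, even though the samples are evaluated in parallel.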
No special code change is required; you can submit the same Callable again and again without any issue. Also, since you say that creating an instance of FunctionEvaluator is expensive, you can improve efficiency by using only one instance, provided that sampleAt is thread safe. One option is to keep all working state in local variables inside the method and never modify any of the passed arguments while any thread is running.
Please find a quick example below:
Code Snippet:
ExecutorService executor = Executors.newFixedThreadPool(2);
Callable<String> task1 = new Callable<String>(){public String call(){System.out.println(Thread.currentThread()+"currentThread");return null;}};
executor.submit(task1);
executor.submit(task1);
executor.shutdown();
You can wrap each FunctionEvaluator's actual work in a Callable/Runnable and use a fixed thread pool with a queue (Executors.newFixedThreadPool); then you just need to submit the target Callable/Runnable to the thread pool.
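If the evaluators are stateful and only N of them can be afforded, one possible arrangement (a sketch, not the only way) is to keep the N instances in a BlockingQueue and let every task borrow one, reset it, compute, and hand it back. The Evaluator class below is a hypothetical stand-in for FunctionEvaluator, and EvaluatorPoolDemo and sampleAll are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EvaluatorPoolDemo {

    // Hypothetical stand-in for the expensive-to-create FunctionEvaluator.
    static class Evaluator {
        private double scratch; // mutable internal state: one instance serves one thread at a time

        void reset() { scratch = 0.0; }

        double sampleAt(int i, double[] data) {
            for (double d : data) {
                scratch += d * i;
            }
            return scratch;
        }
    }

    // Computes numSamples samples using exactly n evaluator instances and n threads.
    public static double[] sampleAll(int numSamples, int n, double[] data) {
        BlockingQueue<Evaluator> idle = new ArrayBlockingQueue<>(n);
        for (int k = 0; k < n; k++) {
            idle.add(new Evaluator()); // the expensive construction happens only n times
        }

        ExecutorService executor = Executors.newFixedThreadPool(n);
        List<Future<Double>> futures = new ArrayList<>();
        for (int i = 0; i < numSamples; i++) {
            final int sample = i;
            futures.add(executor.submit(() -> {
                Evaluator fe = idle.take(); // borrow an idle evaluator
                try {
                    fe.reset();
                    return fe.sampleAt(sample, data);
                } finally {
                    idle.put(fe); // hand it back for the next task
                }
            }));
        }
        executor.shutdown(); // already submitted tasks keep running

        double[] results = new double[numSamples];
        try {
            for (int i = 0; i < numSamples; i++) {
                results[i] = futures.get(i).get(); // blocks until sample i is ready
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return results;
    }

    public static void main(String[] args) {
        double[] results = sampleAll(100, 10, new double[] {1.0, 2.0, 3.0});
        System.out.println("last sample: " + results[99]);
    }
}
```

Since the pool has n threads and the queue holds n evaluators, take() never has to wait long: there is always at least one idle evaluator per running task.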
I would like to get some multithreading going to speed things up.
Sounds like a good idea, but your code is massively overcomplicated. @Pavel has a dead simple Java 8 solution, but even without Java 8 you can make it a lot easier.
All you need to do is to submit the jobs into the executor and then call get() on each one of the Futures that are returned. A Callable class is not needed although it does make the code a lot cleaner. But you certainly don't need the arrays which are a bad pattern anyway because a typo can easily generate out-of-bounds exceptions. Stick to collections or Java 8 streams.
ExecutorService executor = Executors.newFixedThreadPool(2);
List<Future<Integer>> futureList = new ArrayList<Future<Integer>>();
for (int i = 0; i < numSamples; i++ ) {
// start the jobs running in the background
futureList.add(executor.submit(new CallableComputation(i)));
}
// shutdown executor if done submitting tasks, submitted jobs will keep running
executor.shutdown();
for (Future<Integer> future : futureList) {
// this will wait for the future to finish, it also throws some exceptions
Integer result = future.get();
// add result to a collection or something here
}
I am kind of confused about the Semaphore class in the java.util.concurrent package. Here is my code snippet:
import java.util.concurrent.Semaphore;
public class TestSemaphore {
public static void main(String[] args){
Semaphore limit = new Semaphore(2);
SemaphoreAA s = new SemaphoreAA(limit);
AAThread a = new AAThread(s);
Thread[] sThread = new Thread[100];
for(int i = 0; i<100; i++){
sThread[i] = new Thread(a,"[sThread"+i+"]");
sThread[i].start();
}
}
}
class SemaphoreAA{
private static int counter;
private Semaphore limit;
public SemaphoreAA(Semaphore limit){
this.limit = limit;
}
public void increment() throws InterruptedException{
System.out.printf("%-15s%-25s%5d%n",Thread.currentThread().getName()," : Before Increment. Current counter: ",counter);
limit.acquire();
System.out.printf("%-15s%-25s%n",Thread.currentThread().getName()," : Get the resource. Start to increment.");
counter++;
System.out.printf("%-20s%-40s%5d%n",Thread.currentThread().getName()," : Increment is done. Current counter: ",counter );
limit.release();
}
}
class AAThread implements Runnable{
private SemaphoreAA s;
public AAThread(SemaphoreAA s){
this.s = s;
}
public void run() {
try {
s.increment();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I understand it can be used to control access to resources. If I set the limit to one, like this: "Semaphore limit = new Semaphore(1);", it behaves like a lock, and my test confirmed that. If I set the limit to two, I expect two threads to be able to enter the increment() method at the same time, which might cause a data race. The output might look like this:
[sThread3] : Before Increment. Current counter: 2
[sThread4] : Before Increment. Current counter: 2
[sThread3] : Get the resource. Start to increment.
[sThread4] : Get the resource. Start to increment.
[sThread3] : Increment is done. Current counter: 3
[sThread4] : Increment is done. Current counter: 3
However, though I had tried several times, the result expected didn't occur. So I want to know if I misunderstood it. Thanks.
You understood it right.
However, though I had tried several times, the result expected didn't occur.
Just because it can appear doesn't mean it will. This is the problem with most concurrency bugs: they sometimes appear, sometimes not.
If you want to increase the likelihood of an error you can increase the number of Threads or create/start them in two different loops after each other.
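As a rough sketch of that suggestion: the version below creates all threads first, starts them in a second loop, and lets each thread increment many times, which gives the race far more chances to occur. SemaphoreRaceDemo and its run method are invented for this illustration. With one permit the result is always exact; with two permits it may or may not come out lower on a given run.

```java
import java.util.concurrent.Semaphore;

public class SemaphoreRaceDemo {

    private int counter; // shared and deliberately not protected beyond the semaphore

    // Returns the final counter value after `threads` threads each increment `perThread` times.
    public static int run(int permits, int threads, int perThread) {
        SemaphoreRaceDemo demo = new SemaphoreRaceDemo();
        Semaphore limit = new Semaphore(permits);
        Thread[] workers = new Thread[threads];

        // First loop: only create the threads.
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    limit.acquireUninterruptibly();
                    try {
                        demo.counter++; // not atomic: read, add, write
                    } finally {
                        limit.release();
                    }
                }
            });
        }
        // Second loop: start them close together so they actually overlap.
        for (Thread worker : workers) {
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return demo.counter;
    }

    public static void main(String[] args) {
        System.out.println("1 permit:  " + run(1, 100, 10_000));
        System.out.println("2 permits: " + run(2, 100, 10_000));
    }
}
```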
I'm trying out this code and I'm a bit confused/surprised at the output I'm getting. I'm still new to Java but I'm aware that threads should normally run concurrently. It seems my "printB" thread here waits for the "printA" thread before it starts executing. I've run the program several times (hoping to get a mixture of both threads' outcome i.e. something like: a, a, b, a, b, a...) but still I get the same output (i.e. "A" getting printed first, before "B"). Why is this happening and how can I alter the code to start behaving normally?
Any inputs/suggestions would be much appreciated. Thanks.
Also, I'm trying out the same code using the extends Thread method and it doesn't work.
class PrintChars implements Runnable{
private char charToPrint;
private int times;
public PrintChars(char c, int t){
charToPrint = c;
times = t;
}
public void run(){
for (int i=0; i<times; i++)
System.out.println(charToPrint);
}
public static void main(String[] args){
PrintChars charA = new PrintChars('a', 7);
PrintChars charB = new PrintChars('b', 5);
Thread printA = new Thread(charA);
Thread printB = new Thread(charB);
printA.start();
printB.start();
}
}
Extends Thread method below:
class PrintChars extends Thread {
private Char charToPrint;
private int times;
public PrintChars(char c, int t){
charToPrint = c;
times = t;
}
public void run (){
for(int i =0; i<times; i++)
System.out.println(charToPrint);
}
PrintChars printA = new PrintChars('a', 7);
PrintChars printB = new PrintChars('a', 5);
printA.start();
printB.start();
}
In multithreading, usually you can't make any assumption about the output.
Perhaps the time used to create the thread is very long, hence the previous thread has time to complete entirely, since its execution is very short.
Try with 7000 and 5000 instead of 7 and 5.
Each thread takes time to start and can run to completion very quickly. I suggest you add Thread.sleep(500); after each line printed.
try {
for(int i =0; i<times; i++) {
System.out.println(charToPrint);
Thread.sleep(500);
}
} catch(InterruptedException ie) {
}
Try running it a few more times. When I tried it with 700/500 I noticed some interweaving.
Thread scheduling is not deterministic. It's perfectly fine for the OS to schedule one thread, and only schedule the second after the first has completed.
If you think about it from the OS' point of view, it makes sense.. If somebody asked you to do two tasks, it may be more efficient to do one and then the other.
When a task takes too long to execute, the OS will probably want to switch tasks and make some progress on the other one; otherwise the other task won't progress at all, and the app that issued it will be starved.
You can see this by making your task run longer, e.g. by adding Thread.sleep statements or calculating PI or something (or just loop for more than 7, like 70000).
I think the execution times for your threads are too short to notice an effect. You can try higher values for times. I would try something >10000. Another option is to increase the execution time by making the method slower:
public void run(){
for (int i = 0; i < times; i++) {
System.out.println(charToPrint);
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Your code is behaving normally. If you expect a mixture of a's and b's to be printed, then you should print far more characters than just a handful, or use Thread.sleep(), or do a busy wait by running an empty for loop a million times.
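To see the effect without reading thousands of console lines, here is a small sketch that holds both threads at a CountDownLatch so they begin at the same moment, collects the characters in a StringBuffer, and counts how often the output switches between 'a' and 'b'. InterleaveDemo and its methods are names made up for this example, and the number of switches will differ from run to run.

```java
import java.util.concurrent.CountDownLatch;

public class InterleaveDemo {

    // Appends `c` to `out` `times` times once the start gate opens.
    private static void print(char c, int times, StringBuffer out, CountDownLatch startGate) {
        try {
            startGate.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        for (int i = 0; i < times; i++) {
            out.append(c);
        }
    }

    // Runs both printers concurrently and returns everything they wrote.
    public static String run(int timesA, int timesB) {
        StringBuffer out = new StringBuffer(); // StringBuffer appends are synchronized
        CountDownLatch startGate = new CountDownLatch(1);
        Thread printA = new Thread(() -> print('a', timesA, out, startGate));
        Thread printB = new Thread(() -> print('b', timesB, out, startGate));
        printA.start();
        printB.start();
        startGate.countDown(); // release both threads together
        try {
            printA.join();
            printB.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = run(7000, 5000);
        int switches = 0;
        for (int i = 1; i < s.length(); i++) {
            if (s.charAt(i) != s.charAt(i - 1)) {
                switches++;
            }
        }
        System.out.println("switches between a and b: " + switches);
    }
}
```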
First and once more, thanks to all that already answered my question. I am not a very experienced programmer and it is my first experience with multithreading.
I got an example that is working quite like my problem. I hope it could ease our case here.
public class ThreadMeasuring {
private static final int TASK_TIME = 1; // milliseconds (the busy-wait below runs for TASK_TIME*1e6 ns)
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
@Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
public static void main(String[] args) {
ThreadFactory threadFactory = new ThreadFactory() {
int counter = 1;
@Override
public Thread newThread(Runnable r) {
Thread t = new Thread(r, "Executor thread " + (counter++));
return t;
}
};
// the total duty to be divided in tasks is fixed (problem dependent).
// Increase ntasks will mean decrease the task time proportionally.
// 4 Is an arbitrary example.
// This tasks will be executed thousands of times, inside a loop alternating
// with serial processing that needs their result and prepare the next ones.
int ntasks = 4;
int nthreads = 2;
int ncores = Runtime.getRuntime().availableProcessors();
if (nthreads<ncores) ncores = nthreads;
Batch serial = new Batch(null);
long serialTime = System.nanoTime();
serial.run();
serialTime = System.nanoTime() - serialTime;
ExecutorService executor = Executors.newFixedThreadPool( nthreads, threadFactory );
CountDownLatch countDown = new CountDownLatch(ntasks);
ArrayList<Batch> batches = new ArrayList<Batch>();
for (int i = 0; i < ntasks; i++) {
batches.add(new Batch(countDown));
}
long start = System.nanoTime();
for (Batch r : batches){
executor.execute(r);
}
// wait for all threads to finish their task
try {
countDown.await();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long tmeasured = (System.nanoTime() - start);
System.out.println("Task time= " + TASK_TIME + " ms");
System.out.println("Number of tasks= " + ntasks);
System.out.println("Number of threads= " + nthreads);
System.out.println("Number of cores= " + ncores);
System.out.println("Measured time= " + tmeasured);
System.out.println("Theoretical serial time= " + TASK_TIME*1000000*ntasks);
System.out.println("Theoretical parallel time= " + (TASK_TIME*1000000*ntasks)/ncores);
System.out.println("Speedup= " + (serialTime*ntasks)/(double)tmeasured);
executor.shutdown();
}
}
Instead of doing the calculations, each batch just waits for a given time. The program calculates the speedup, which should always be 2 in theory but can drop below 1 (actually a slowdown) if 'TASK_TIME' is small.
My calculations take at most 1 ms and are commonly faster. For 1 ms I find a small speedup of around 30%, but in practice, with my program, I notice a slowdown.
The structure of this code is very similar to my program, so if you could help me to optimise the thread handling I would be very grateful.
Kind regards.
Below, the original question:
Hi.
I would like to use multithreading on my program, since it could increase its efficiency considerably, I believe. Most of its running time is due to independent calculations.
My program has thousands of independent calculations (several linear systems to solve), but they just happen at the same time in small groups of dozens or so. Each of these groups would take some milliseconds to run. After one of these groups of calculations, the program has to run sequentially for a little while and then I have to solve the linear systems again.
In effect, these independent linear systems sit inside a loop that iterates thousands of times, alternating with sequential calculations that depend on the previous results. My idea to speed up the program is to compute these independent calculations in parallel threads, by dividing each group into as many batches of independent calculations as I have processors available. So, in principle, there isn't any queuing at all.
I tried using the FixedThreadPool and CachedThreadPool and it got even slower than serial processing. It seems to take too much time creating new threads each time I need to solve the batches.
Is there a better way to handle this problem? These pools I've used seem to be proper for cases when each thread takes more time instead of thousands of smaller threads...
Thanks!
Best Regards!
Thread pools don't create new threads over and over. That's why they're pools.
How many threads were you using and how many CPUs/cores do you have? What is the system load like (normally, when you execute them serially, and when you execute with the pool)? Is synchronization or any kind of locking involved?
Is the algorithm for parallel execution exactly the same as the serial one (your description seems to suggest that serial was reusing some results from previous iteration).
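To illustrate the point about pools, here is a minimal sketch of the loop structure described in the question: the pool is created once, each iteration submits its batches with invokeAll (which blocks until all of them are done), and the sequential step then uses the results. ReusedPoolDemo, solveBatch and iterate are hypothetical names, and solveBatch is just a placeholder for solving one batch of linear systems.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ReusedPoolDemo {

    // Hypothetical placeholder for solving one batch of linear systems.
    static double solveBatch(int batch, double input) {
        return input + batch;
    }

    // Alternates a parallel phase and a sequential phase, reusing the same worker threads.
    public static double iterate(int iterations, int batches) {
        ExecutorService pool = Executors.newFixedThreadPool(batches); // created once, outside the loop
        double state = 0.0;
        try {
            for (int it = 0; it < iterations; it++) {
                final double input = state;
                List<Callable<Double>> tasks = new ArrayList<>();
                for (int b = 0; b < batches; b++) {
                    final int batch = b;
                    tasks.add(() -> solveBatch(batch, input));
                }
                // invokeAll blocks until every batch of this iteration has finished.
                double sum = 0.0;
                for (Future<Double> future : pool.invokeAll(tasks)) {
                    sum += future.get();
                }
                state = sum / batches; // sequential step that depends on the parallel results
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(iterate(1000, 4));
    }
}
```

Whether this beats the serial version still depends on how long each batch runs; if a batch takes well under a millisecond, the hand-off to the pool can cost more than the work itself.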
From what I've read ("thousands of independent calculations... happen at the same time... would take some milliseconds to run"), it seems to me that your problem is perfect for GPU programming.
And I think that answers your question. GPU programming is becoming more and more popular. There are Java bindings for CUDA and OpenCL. If it is possible for you to use them, I say go for it.
I'm not sure how you perform the calculations, but if you're breaking them up into small groups, then your application might be ripe for the Producer/Consumer pattern.
Additionally, you might be interested in using a BlockingQueue. The calculation consumers will block until there is something in the queue and the block occurs on the take() call.
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
CountDownLatch getLatch(){
return countDown;
}
@Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
class CalcProducer implements Runnable {
    // Assumed group size; mirrors ntasks in the question's example.
    private static final int NTASKS = 4;
    private final BlockingQueue<Batch> queue;
    CalcProducer(BlockingQueue<Batch> q) { queue = q; }
    public void run() {
        try {
            while (true) {
                CountDownLatch latch = new CountDownLatch(NTASKS);
                for (int i = 0; i < NTASKS; i++) {
                    queue.put(produce(latch));
                }
                // don't need to wait for the latch, only consumers wait
            }
        } catch (InterruptedException ex) {
            // ... handle ...
        }
    }
    Batch produce(CountDownLatch latch) {
        return new Batch(latch);
    }
}
class CalcConsumer implements Runnable {
    private final BlockingQueue<Batch> queue;
    CalcConsumer(BlockingQueue<Batch> q) { queue = q; }
    public void run() {
        try {
            while (true) { consume(queue.take()); }
        } catch (InterruptedException ex) {
            // ... handle ...
        }
    }
    void consume(Batch batch) throws InterruptedException {
        batch.run();
        // wait until every batch in the same group has finished
        batch.getLatch().await();
    }
}
class Setup {
void main() {
BlockingQueue<Batch> q = new LinkedBlockingQueue<Batch>();
int numConsumers = 4;
CalcProducer p = new CalcProducer(q);
Thread producerThread = new Thread(p);
producerThread.start();
Thread[] consumerThreads = new Thread[numConsumers];
for(int i = 0; i < numConsumers; i++)
{
consumerThreads[i] = new Thread(new CalcConsumer(q));
consumerThreads[i].start();
}
}
}
Sorry if there are any syntax errors, I've been chomping away at C# code and sometimes I forget the proper java syntax, but the general idea is there.
If you have a problem which does not scale to multiple cores, you need to change your program or you have a problem which is not as parallel as you think. I suspect you have some other type of bug, but cannot say based on the information given.
This test code might help.
Time per million tasks 765 ms
code
ExecutorService es = Executors.newFixedThreadPool(4);
Runnable task = new Runnable() {
@Override
public void run() {
// do nothing.
}
};
long start = System.nanoTime();
for(int i=0;i<1000*1000;i++) {
es.submit(task);
}
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
long time = System.nanoTime() - start;
System.out.println("Time per million tasks "+time/1000/1000+" ms");
EDIT: Say you have a loop which serially does this.
for(int i=0;i<1000*1000;i++)
doWork(i);
You might assume that changing to loop like this would be faster, but the problem is that the overhead could be greater than the gain.
for(int i=0;i<1000*1000;i++) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
doWork(i2);
}
});
}
So you need to create batches of work (at least one per thread) so there are enough tasks to keep all the threads busy, but not so many tasks that your threads are spending time in overhead.
final int batchSize = 10*1000;
for(int i=0;i<1000*1000;i+=batchSize) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
for(int i3=i2;i3<i2+batchSize;i3++)
doWork(i3);
}
});
}
EDIT2: Running a test which copied data between threads.
for (int i = 0; i < 20; i++) {
ExecutorService es = Executors.newFixedThreadPool(1);
final double[] d = new double[4 * 1024];
Arrays.fill(d, 1);
final double[] d2 = new double[4 * 1024];
es.submit(new Runnable() {
@Override
public void run() {
// nothing.
}
}).get();
long start = System.nanoTime();
es.submit(new Runnable() {
@Override
public void run() {
synchronized (d) {
System.arraycopy(d, 0, d2, 0, d.length);
}
}
});
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
// read the values in d2.
for (double x : d2) ;
long time = System.nanoTime() - start;
System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
starts badly but warms up to ~50 us.
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
Hmm, CachedThreadPool seems to be made for exactly your case. It does not recreate threads if you reuse them soon enough, and if a whole minute passes before you need a new thread, the overhead of thread creation is comparatively negligible.
But you can't expect parallel execution to speed up your calculations unless you can also access data in parallel. If you employ extensive locking, many synchronized methods, etc., you'll spend more on overhead than you gain from parallel processing. Check that your data can be efficiently processed in parallel and that you don't have non-obvious synchronization lurking in the code.
Also, CPUs process data efficiently if the data fit fully into cache. If the data set of each thread is bigger than half the cache, two threads will compete for cache and issue many RAM reads, while a single thread running on one core may perform better because it avoids RAM reads in the tight loop it executes. Check this, too.
Here's a pseudocode outline of what I'm thinking
class WorkerThread extends Thread {
Queue<Calculation> calcs;
MainCalculator mainCalc;
public void run() {
while(true) {
while(calcs.isEmpty()) sleep(500); // busy waiting? Context switching probably won't be so bad.
Calculation calc = calcs.pop(); // is it pop to get and remove? you'll have to look
CalculationResult result = calc.calc();
mainCalc.returnResultFor(calc,result);
}
}
}
Another option, if you're calling external programs: don't put them in a loop that runs them one at a time, or they won't run in parallel. You can put them in a loop that PROCESSES their results one at a time, but not one that execs them one at a time.
Process calc1 = Runtime.getRuntime().exec("myCalc paramA1 paramA2 paramA3");
Process calc2 = Runtime.getRuntime().exec("myCalc paramB1 paramB2 paramB3");
Process calc3 = Runtime.getRuntime().exec("myCalc paramC1 paramC2 paramC3");
Process calc4 = Runtime.getRuntime().exec("myCalc paramD1 paramD2 paramD3");
calc1.waitFor();
calc2.waitFor();
calc3.waitFor();
calc4.waitFor();
InputStream is1 = calc1.getInputStream();
InputStreamReader isr1 = new InputStreamReader(is1);
BufferedReader br1 = new BufferedReader(isr1);
String resultStr1 = br1.readLine();
InputStream is2 = calc2.getInputStream();
InputStreamReader isr2 = new InputStreamReader(is2);
BufferedReader br2 = new BufferedReader(isr2);
String resultStr2 = br2.readLine();
InputStream is3 = calc3.getInputStream();
InputStreamReader isr3 = new InputStreamReader(is3);
BufferedReader br3 = new BufferedReader(isr3);
String resultStr3 = br3.readLine();
InputStream is4 = calc4.getInputStream();
InputStreamReader isr4 = new InputStreamReader(is4);
BufferedReader br4 = new BufferedReader(isr4);
String resultStr4 = br4.readLine();