Fibonacci on Java ExecutorService runs faster sequentially than in parallel

Fibonacci on Java ExecutorService runs faster sequentially than in parallel - java

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
#Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??

It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When n == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).

Related

Performance of executorService multithreading pool

I am using Java's concurrency library ExecutorService to run my tasks. The threshold for writing to the database is 200 QPS, however, this program can only reach 20 QPS with 15 threads. I tried 5, 10, 20, 30 threads, and they were even slower than 15 threads. Here is the code:
ExecutorService executor = Executors.newFixedThreadPool(15);
List<Callable<Object>> todos = new ArrayList<>();
for (final int id : ids) {
todos.add(Executors.callable(() -> {
try {
TestObject test = testServiceClient.callRemoteService();
SaveToDatabase();
} catch (Exception ex) {}
}));
}
try {
executor.invokeAll(todos);
} catch (InterruptedException ex) {}
executor.shutdown();
1) I checked the CPU usage of the linux server on which this program is running, and the usage was 90% and 60% (it has 4 CPUs). The memory usage was only 20%. So the CPU & memory were still fine. The database server's CPU usage was low (around 20%). What could prevent the speed from reaching 200 QPS? Maybe this service call: testServiceClient.callRemoteService()? I checked the server configuration for that call and it allows high number of calls per seconds.
2) If the count of id in ids is more than 50000, is it a good idea to use invokeAll? Should we split it to smaller batches, such as 5000 each batch?

There is nothing in this code which prevents this query rate, except creating and destroying a thread pool repeately is very expensive. I suggest using the Streams API which is not only simpler but reuses a built in thread pool
int[] ids = ....
IntStream.of(ids).parallel()
.forEach(id -> testServiceClient.callRemoteService(id));
Here is a benchmark using a trivial service. The main overhead is the latency in creating the connection.
public static void main(String[] args) throws IOException {
ServerSocket ss = new ServerSocket(0);
Thread service = new Thread(() -> {
try {
for (; ; ) {
try (Socket s = ss.accept()) {
s.getOutputStream().write(s.getInputStream().read());
}
}
} catch (Throwable t) {
t.printStackTrace();
}
});
service.setDaemon(true);
service.start();
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
IntStream.of(ids).parallel().forEach(id -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
});
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
}
prints
Throughput 12491 connects/sec
Throughput 13138 connects/sec
Throughput 15148 connects/sec
Throughput 14602 connects/sec
Throughput 15807 connects/sec
Using an ExecutorService would be better as #grzegorz-piwowarek mentions.
ExecutorService es = Executors.newFixedThreadPool(8);
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
List<Future> futures = new ArrayList<>(ids.length);
for (int id : ids) {
futures.add(es.submit(() -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
}));
}
for (Future future : futures) {
future.get();
}
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
es.shutdown();
In this case produces much the same results.

Why do you restrict yourself to such a low number of threads?
You're missing performance opportunities this way. It seems that your tasks are really not CPU-bound. The network operations (remote service + database query) may take up the majority of time for each task to finish. During these times, where a single task/thread needs to wait for some event (network,...), another thread can use the CPU. The more threads you make available to the system, the more threads may be waiting for their network I/O to complete while still having some threads use the CPU at the same time.
I suggest you drastically ramp up the number of threads for the executor. As you say that both remote servers are rather under-utilized, I assume the host your program runs at is the bottleneck at the moment. Try to increase (double?) the number of threads until either your CPU utilization approaches 100% or memory or the remote side become the bottleneck.
By the way, you shutdown the executor, but do you actually wait for the tasks to terminate? How do you measure the "QPS"?
One more thing comes to my mind: How are DB connections handled? I.e. how are SaveToDatabase()s synchronized? Do all threads share (and compete for) a single connection? Or, worse, will each thread create a new connection to the DB, do its thing, and then close the connection again? This may be a serious bottleneck because establishing a TCP connection and doing the authentication handshake may take up as much time as running a simple SQL statement.
If the count of id in ids is more than 50000, is it a good idea to use
invokeAll? Should we split it to smaller batches, such as 5000 each
batch?
As #Vaclav Stengl already wrote, the Executors have internal queues in which they enqueue and from which they process the tasks. So no need to worry about that one. You can also just call submit for each single task as soon as you have created it. This allows the first tasks to already start executing while you're still creating/preparing later tasks, which makes sense especially when each task creation takes comparatively long, but won't hurt in all other cases. Think about invokeAll as a convenience method for cases where you already have a collection of tasks. If you create the tasks successively yourself and you already have access to the ExecutorService to run them on, just submit() them a.s.a.p.

About batch spliting:
ExecutorService has inner queue for storing tasks. In your case ExecutorService executor = Executors.newFixedThreadPool(15); has 15 thread so max 15 tasks will run concurrently and others will be stored in queue. Size of queue can be parametrized. By default size will scale up to max int. InvokeAll call inside of method execute and this method will place tasks in to queue when all threads are working.
Imho there are 2 possible scenarios why CPU is not at 100%:
try to enlarge thread pool
thread is waiting for testServiceClient.callRemoteService() to
complete and meanwhile CPU is starwing

The problem of QPS maybe is the bandwidth limit or transaction execution(it will lock the table or row). So you just increase pool size is not worked. Additional, You can try to use the producer-consumer pattern.

Compiler ignore threads priorities

I tried to compile the example from Thinking in Java by Bruce Eckel:
import java.util.concurrent.*;
public class SimplePriorities implements Runnable {
private int countDown = 5;
private volatile double d; // No optimization
private int priority;
public SimplePriorities(int priority) {
this.priority = priority;
}
public String toString() {
return Thread.currentThread() + ": " + countDown;
}
public void run() {
Thread.currentThread().setPriority(priority);
while(true) {
// An expensive, interruptable operation:
for(int i = 1; i < 100000; i++) {
d += (Math.PI + Math.E) / (double)i;
if(i % 1000 == 0)
Thread.yield();
}
System.out.println(this);
if(--countDown == 0) return;
}
}
public static void main(String[] args) {
ExecutorService exec = Executors.newCachedThreadPool();
for(int i = 0; i < 5; i++)
exec.execute(
new SimplePriorities(Thread.MIN_PRIORITY));
exec.execute(
new SimplePriorities(Thread.MAX_PRIORITY));
exec.shutdown();
}
}
According to the book, the output has to look like:
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-6,10,main]: 4
Thread[pool-1-thread-6,10,main]: 3
Thread[pool-1-thread-6,10,main]: 2
Thread[pool-1-thread-6,10,main]: 1
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-1,1,main]: 5
...
But in my case 6th thread doesn't execute its task at first and threads are disordered. Could you please explain me what's wrong? I just copied the source and didn't add any strings of code.

The code is working fine and with the output from the book.
Your IDE probably has console window with the scroll bar - just scroll it up and see the 6th thread first doing its job.
However, the results may differ depending on OS / JVM version. This code runs as expected for me on Windows 10 / JVM 8

There are two issues here:
If two threads with the same priority want to write output, which one goes first?
The order of threads (with the same priority) is undefined, therefore the order of output is undefined. It is likely that a single thread is allowed to write several outputs in a row (because that's how most thread schedulers work), but it could also be completely random, or anything in between.
How many threads will a cached thread pool create?
That depends on your system. If you run on a dual-core system, creating more than 4 threads is pointless, because there hardly won't be any CPU available to execute those threads. In this scenario further tasks will be queued and executed only after earlier tasks are completed.
Hint: there is also a fixed-size thread pool, experimenting with that should change the output.
In summary there is nothing wrong with your code, it is just wrong to assume that threads are executed in any order. It is even technically possible (although very unlikely), that the first task is already completed before the last task is even started. If your book says that the above order is "correct" then the book is simply wrong. On an average system that might be the most likely output, but - as above - with threads there is never any order, unless you enforce it.
One way to enforce it are thread priorities - higher priorities will get their work done first - you can find other concepts in the concurrent package.

Ideas to avoid doing a trick on CyclicBarrier

I was running some tests with parallel processing and made a program that given a matrix of integers re-calcutes each position's value based on the neighbours.
I needed a copy of the matrix so the values wouldn't be overriden and used a CyclicBarrier to merge the results once the partial problems were solved:
CyclicBarrier cyclic_barrier = new CyclicBarrier(n_tasks + 1, new Runnable() {
public void run() {
ParallelProcess.mergeResult();
}
});
ParallelProcess p = new ParallelProcess(cyclic_barrier, n_rows, r_cols); // init
Each task is assigned a portion of the matrix: I'm splitting it in equals pieces by rows. But it might happen that the divisions are not exact so there would be a small piece corresponding to the lasts row that wouldn't be submitted to the thread pool.
Example: if I have 16 rows and n_tasks = 4 no problem, all 4 will be submitted to the pool. But if I had 18 instead, the first 16 ones will be submitted, but not the last two ones.
So I'm forcing a submit if this case happens. Well, I'm not submitting actually, because I am using a fixed thread pool that I created like this ExecutorService e = Executors.newFixedThreadPool(n_tasks). Since all the slots in the pool are occupied and the threads are blocked by the barrier (mybarrier.await() is called in the run method) I couldn't submit it to the pool, so I just used Thread.start().
Let's go to the point. Since I need to take into consideration for the CyclicBarrier the possibility of that chunk remaining, the number of parties must be incremented by 1.
But if this case didn't happen, I would be one party short to trigger the barrier.
What's my solution?:
if (lower_limit != n_rows) { // the remaining chunk to be processed
Thread t = new Thread(new ParallelProcess(lower_limit, n_rows));
t.start();
t.join();
}
else {
cyclic_barrier.await();
}
I feel like I am cheating when using the cyclic_barrier.await() trick to raise the barrier by force.
Is there any other way I could approach this problem so I didn't have to do what I'm doing?

Though this doesn't answer your question about CyclicBarriers, can I recommend using a Phaser? This does have the ability to include the number of parties, and it also allows you to run the mergeResult when a phase is tripped.
So, before you execute an async calculation, simply register. Then inside that calculation have the thread arrive on the phaser. When all threads have arrived, it will advance the phase and can invoke an overriden method onAdvance.
The submission:
ParallelProcess process = new ParallelProcess(lower_limit, n_rows));
phaser.register();
executor.submit(process);
The processor
public void run(){
//do stuff
phaser.arrive();
}
The phaser
Phaser phaser = new Phaser(){
protected boolean onAdvance(int phase, int registeredParties) {
ParallelProcess.mergeResult();
return true;
}
}

Can this Java program ever print a value other than zero?

I have a favorite C# program similar to the one below that shows that if two threads share the same memory address for counting (one thread incrementing n times, one thread decrementing n times) you can get a final result other than zero. As long as n is reasonably large, it's pretty easy to get C# to display some non-zero value between [-n, n]. However, I can't get Java to produce a non-zero result even when increasing the number of threads to 1000 (500 up, 500 down). Is there some memory model or specification difference wrt C# I'm not aware of that guarantees this program will always yield 0 despite the scheduling or number of cores that I am not aware of? Would we agree that this program could produce a non-zero value even if we can not prove that experimentally?
(Not:, I found this exact question over here, but when I run that topic's code I also get zero.)
public class Counter
{
private int _counter = 0;
Counter() throws Exception
{
final int limit = Integer.MAX_VALUE;
Thread add = new Thread()
{
public void run()
{
for(int i = 0; i<limit; i++)
{
_counter++;
}
}
};
Thread sub = new Thread()
{
public void run()
{
for(int i = 0; i<limit; i++)
{
_counter--;
}
}
};
add.run();
sub.run();
add.join();
sub.join();
System.out.println(_counter);
}
public static void main(String[] args) throws Exception
{
new Counter();
}
}

The code you've given only runs on a single thread, so will always give a result of 0. If you actually start two threads, you can indeed get a non-zero result:
// Don't call run(), which is a synchronous call, which doesn't start any threads
// Call start(), which starts a new thread and calls run() *in that thread*.
add.start();
sub.start();
On my box in a test run that gave -2146200243.

Assuming you really meant start, not run.
On most common platforms it will very likely produce non zero, because ++/-- are not atomic operations in case of multiple cores. On single core/single CPU you will most likely get 0 because ++/-- are atomic if compiled to one instruction (add/inc) but that part depends on JVM.
Check result here: http://ideone.com/IzTT2

The problem with your program is that you are not creating an OS thread, so your program is essentially single threaded. In Java you must call Thread.start() to create a new OS thread, not Thread.run(). This has to do with a regrettable mistake made in the initial Java API. That mistake is that the designer made Thread implement Runnable.
add.start();
sub.start();
add.join();
sub.join();

Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
This brings us back to the question - if I have a high volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java, because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.

Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
Thread[] tt = new Thread[threadCount];
final int[] aa = new int[tt.length];
System.out.print("Creating "+tt.length+" Thread objects... ");
long t0 = System.nanoTime(), t00 = t0;
for (int i = 0; i < tt.length; i++) {
final int j = i;
tt[i] = new Thread() {
public void run() {
int k = j;
for (int l = 0; l < workAmountPerThread; l++) {
k += k*k+l;
}
aa[j] = k;
}
};
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Starting "+tt.length+" threads with "+workAmountPerThread+" steps of work per thread... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].start();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Joining "+tt.length+" threads... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].join();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
long totalTime = System.nanoTime()-t00;
int checkSum = 0; //display checksum in order to give the JVM no chance to optimize out the contents of the run() method and possibly even thread creation
for (int a : aa) {
checkSum += a;
}
System.out.println("Checksum: "+checkSum);
System.out.println("Total time: "+totalTime*1E-6+" ms");
System.out.println();
return totalTime;
}
public static void main(String[] kr) throws InterruptedException {
int workAmount = 100000000;
int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
int trialCount = 2;
long[][] time = new long[threadCount.length][trialCount];
for (int j = 0; j < trialCount; j++) {
for (int i = 0; i < threadCount.length; i++) {
time[i][j] = test(threadCount[i], workAmount/threadCount[i]);
}
}
System.out.print("Number of threads ");
for (long t : threadCount) {
System.out.print("\t"+t);
}
System.out.println();
for (int j = 0; j < trialCount; j++) {
System.out.print((j+1)+". trial time (ms)");
for (int i = 0; i < threadCount.length; i++) {
System.out.print("\t"+Math.round(time[i][j]*1E-6));
}
System.out.println();
}
}
}
The results on 64-bit Windows 7 with 32-bit Sun's Java 1.6.0_21 Client VM on Intel Core2 Duo E6400 #2.13 GHz are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected since my computer has two cores. My computer can spawn nearly 10000 threads per second, i. e. thread creation overhead is 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).

First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xssn JVM parameter - you'll want to use the lowest value you can get away with.
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.

for your benchmark you can use JMeter + a profiler, which should give you direct overview on the behaviour in such a heavy-loaded environment. Just let it run for a an hour and monitor memory, cpu, etc. If nothing breaks and the cpu(s) doesn't overheat, it's ok :)
perhaps you can get a thread-pool, or customize (extend) the one you are using by adding some code in order to have the appropriate InheritableThreadLocals set each time a Thread is acquired from the thread-pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You can use these (well, with reflection) in combination with the Thread.currentThread() to have the desired behaviour. However this is a bit ad-hock, and furthermore, I can't tell whether it (with the reflection) won't introduce even bigger overhead than just creating the threads.

I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of Notify/Wait queue where you spawn a given number of (daemon)threads, and they all wait until there's a task to solve. If the task queue gets long, you spawn additional threads, but not on a 1-1 ratio. It will most likely be perform better then spawning hundreds of new threads whose life-cycles are so short.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.