Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise Java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
This brings us back to the question: if I have a high-volume site where user request threads spawn half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.
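For concreteness, here is a minimal sketch (my own illustration, with made-up names) of the behaviour the question relies on: a value set in an InheritableThreadLocal on the request thread is visible in a thread spawned by that request thread, whereas a pre-existing pool thread would not see it.
public class InheritableThreadLocalDemo {
    private static final InheritableThreadLocal<String> REQUEST_ID = new InheritableThreadLocal<>();

    public static void main(String[] args) throws InterruptedException {
        REQUEST_ID.set("req-42"); // set on the "request" thread
        Thread worker = new Thread(() ->
                // inherited automatically because this thread is spawned by the request thread
                System.out.println("worker sees: " + REQUEST_ID.get())); // prints req-42
        worker.start();
        worker.join();
    }
}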

Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
    static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
        Thread[] tt = new Thread[threadCount];
        final int[] aa = new int[tt.length];
        System.out.print("Creating " + tt.length + " Thread objects... ");
        long t0 = System.nanoTime(), t00 = t0;
        for (int i = 0; i < tt.length; i++) {
            final int j = i;
            tt[i] = new Thread() {
                public void run() {
                    int k = j;
                    for (int l = 0; l < workAmountPerThread; l++) {
                        k += k*k + l;
                    }
                    aa[j] = k;
                }
            };
        }
        System.out.println(" Done in " + (System.nanoTime()-t0)*1E-6 + " ms.");
        System.out.print("Starting " + tt.length + " threads with " + workAmountPerThread + " steps of work per thread... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].start();
        }
        System.out.println(" Done in " + (System.nanoTime()-t0)*1E-6 + " ms.");
        System.out.print("Joining " + tt.length + " threads... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].join();
        }
        System.out.println(" Done in " + (System.nanoTime()-t0)*1E-6 + " ms.");
        long totalTime = System.nanoTime() - t00;
        int checkSum = 0; // display checksum in order to give the JVM no chance to optimize out the contents of the run() method and possibly even thread creation
        for (int a : aa) {
            checkSum += a;
        }
        System.out.println("Checksum: " + checkSum);
        System.out.println("Total time: " + totalTime*1E-6 + " ms");
        System.out.println();
        return totalTime;
    }

    public static void main(String[] kr) throws InterruptedException {
        int workAmount = 100000000;
        int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
        int trialCount = 2;
        long[][] time = new long[threadCount.length][trialCount];
        for (int j = 0; j < trialCount; j++) {
            for (int i = 0; i < threadCount.length; i++) {
                time[i][j] = test(threadCount[i], workAmount/threadCount[i]);
            }
        }
        System.out.print("Number of threads ");
        for (long t : threadCount) {
            System.out.print("\t" + t);
        }
        System.out.println();
        for (int j = 0; j < trialCount; j++) {
            System.out.print((j+1) + ". trial time (ms)");
            for (int i = 0; i < threadCount.length; i++) {
                System.out.print("\t" + Math.round(time[i][j]*1E-6));
            }
            System.out.println();
        }
    }
}
The results on 64-bit Windows 7 with the 32-bit Sun Java 1.6.0_21 Client VM on an Intel Core2 Duo E6400 @ 2.13 GHz are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected, since my computer has two cores. My computer can spawn nearly 10000 threads per second, i.e. the thread creation overhead is about 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).

First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xss&lt;size&gt; JVM parameter (e.g. -Xss256k) - you'll want to use the lowest value you can get away with.
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.

For your benchmark you can use JMeter plus a profiler, which should give you a direct overview of the behaviour in such a heavily loaded environment. Just let it run for an hour and monitor memory, CPU, etc. If nothing breaks and the CPU(s) don't overheat, it's OK :)
Perhaps you can use a thread pool, or customize (extend) the one you are using, by adding some code to set the appropriate InheritableThreadLocals each time a Thread is acquired from the thread pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You can use these (well, with reflection) in combination with Thread.currentThread() to get the desired behaviour. However, this is a bit ad hoc, and furthermore I can't tell whether the reflection won't introduce even bigger overhead than just creating the threads.
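For illustration only, here is a rough sketch of that idea (my code, not part of the answer): a wrapper Runnable that captures the submitting thread's inheritableThreadLocals map via reflection and installs it on the pool thread for the duration of the task. It assumes reflective access to java.lang.Thread is permitted (on Java 9+ this typically needs --add-opens java.base/java.lang=ALL-UNNAMED), and note that it shares the parent's map rather than making the per-child copy that InheritableThreadLocal normally creates - exactly the kind of ad hoc compromise mentioned above.
import java.lang.reflect.Field;

public class InheritingRunnable implements Runnable {
    private static final Field INHERITABLE_LOCALS;
    static {
        try {
            INHERITABLE_LOCALS = Thread.class.getDeclaredField("inheritableThreadLocals");
            INHERITABLE_LOCALS.setAccessible(true); // may fail under a security manager or a closed module
        } catch (NoSuchFieldException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final Runnable delegate;
    private final Object capturedLocals; // snapshot taken on the submitting (request) thread

    public InheritingRunnable(Runnable delegate) {
        this.delegate = delegate;
        try {
            this.capturedLocals = INHERITABLE_LOCALS.get(Thread.currentThread());
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void run() {
        Thread worker = Thread.currentThread();
        try {
            Object previous = INHERITABLE_LOCALS.get(worker);
            INHERITABLE_LOCALS.set(worker, capturedLocals); // shares the map, does not copy it
            try {
                delegate.run();
            } finally {
                INHERITABLE_LOCALS.set(worker, previous); // avoid leaking locals into later tasks
            }
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }
}
Tasks would then be submitted as pool.submit(new InheritingRunnable(actualTask)), paying one reflective read per submission plus two reflective writes per execution.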

I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of notify/wait queue where you spawn a given number of (daemon) threads that all wait until there is a task to solve? If the task queue gets long, you spawn additional threads, but not at a 1:1 ratio. This will most likely perform better than spawning hundreds of new threads whose life-cycles are so short; a rough sketch follows below.
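A minimal sketch of that grow-on-demand idea using a standard ThreadPoolExecutor (the sizes, timeout and queue capacity below are illustrative assumptions, not tuned values): a few core daemon threads wait on the queue, and extra threads are added, up to the maximum, only once the bounded queue fills up.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GrowingPoolSketch {
    public static ExecutorService newGrowingPool(int coreThreads, int maxThreads, int queueCapacity) {
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // worker threads should not keep the JVM alive
            return t;
        };
        return new ThreadPoolExecutor(
                coreThreads, maxThreads,
                30, TimeUnit.SECONDS,                       // idle extra threads are retired after 30s
                new LinkedBlockingQueue<>(queueCapacity),   // extra threads start only once this queue is full
                daemonFactory,
                new ThreadPoolExecutor.CallerRunsPolicy()); // degrade gracefully when completely saturated
    }

    public static void main(String[] args) {
        ExecutorService pool = newGrowingPool(4, 32, 100);
        for (int i = 0; i < 1000; i++) {
            final int task = i;
            pool.submit(() -> System.out.println("task " + task + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
    }
}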

Related

Simple Parallel Computation Achieves Better Results When Having More Threads Than Cores

I have a simple Java application that basically increments a counter 600M times. I have separated this task into several threads, each incrementing its own counter, and finally summing these counters.
Oddly, having more threads than cores achieves better performance:
Example (on Intel I7-9850H with six cores) for average calculation time:
Having 6 threads, each with 100M increments yields 97 ms
Having 60 threads, each with 10M increments yields 61 ms
AFAIK Java maps each thread to a real system thread.
Any ideas why is this happening?
EDIT:
Is it possible that the reason is that my computer has many other running processes and threads, so the competition of 60 threads against all the other inhabitants is better than having only 6 threads competing for CPU resources?
The code of the increment method:
private static void incrementWithLockFree(long increments, int threads) throws InterruptedException {
    final long[] numbers = new long[threads];
    ExecutorService threadPool = Executors.newFixedThreadPool(threads);
    for (int task = 0; task < threads; task++) {
        int finalTask = task;
        threadPool.submit(() -> {
            for (long increment = 0; increment < increments; increment++) {
                numbers[finalTask]++;
            }
        });
    }
    threadPool.shutdown();
    threadPool.awaitTermination(1, TimeUnit.DAYS);
    long number = 0;
    for (long num : numbers) {
        number += num;
    }
    System.out.println(number);
}
On my system fewer threads perform better.
Is it possible that the reason is that my computer has many other running processes and threads, so the competition of 60 threads against all the other inhabitants is better than having only 6 threads competing for CPU resources?
Yes.
In case this is a real use-case, take a look at LongAdder, which is optimized for multithreaded counter scenarios.
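A small sketch of the LongAdder variant mentioned above (the thread and task counts here are illustrative, mirroring the question's 600M total): each task increments a shared LongAdder, which internally spreads updates across several cells to avoid contending on a single counter.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        final LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(6);
        for (int task = 0; task < 60; task++) {
            pool.submit(() -> {
                for (long i = 0; i < 10_000_000L; i++) {
                    counter.increment(); // cheap even under heavy contention
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(counter.sum()); // 600000000
    }
}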

Can I trust that System.nanoTime() will return different value each invocation?

My program needs to generate unique labels that consist of a tag, date and time. Something like this:
"myTag__2019_09_05__07_51"
If one tries to generate two labels with the same tag in the same minute, one will receive equal labels, which I cannot allow. I am thinking of adding the result of System.nanoTime() as an additional suffix to make sure that each label will be unique (I cannot access all previously generated labels to look for duplicates):
"myTag__2019_09_05__07_51__{System.nanoTime() result}"
Can I trust that each invocation of System.nanoTime() will produce a different value? I tested it like this on my laptop:
assertNotEquals(System.nanoTime(), System.nanoTime())
and this works. I wonder if I have a guarantee that it will always work.
TLDR; If you only use a single thread on a popular VM on a modern operating system, it may work in practice. But many serious applications use multiple threads and multiple instances of the application, and there won't be any guarantee in that case.
The only guarantee given in the Javadoc for System.nanoTime() is that the resolution of the clock is at least as good as System.currentTimeMillis() - so if you are writing cross-platform code, there is clearly no expectation that the results of nanoTime are unique, as you can call nanoTime() many times per millisecond.
On my OS (Java 11, macOS) I always get at least one nanosecond difference between successive calls on the same thread (and that is after looking at Integer.MAX_VALUE successive return values); it's possible that there is something in the implementation that guarantees it.
However it is simple to generate duplicate results if you use multiple Threads and have more than 1 physical CPU. Here's code that will show you:
public class UniqueNano {
    private static volatile long a = -1, b = -2;

    public static void main(String[] args) {
        long max = 1_000_000;
        new Thread(() -> {
            for (int i = 0; i < max; i++) { a = System.nanoTime(); }
        }).start();
        new Thread(() -> {
            for (int i = 0; i < max; i++) { b = System.nanoTime(); }
        }).start();
        for (int i = 0; i < max; i++) {
            if (a == b) {
                System.out.println("nanoTime not unique");
            }
        }
    }
}
Also, when you scale your application to multiple machines, you will potentially have the same problem.
Relying on System.nanoTime() to get unique values is not a good idea.
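As one possible alternative (a sketch under my own assumptions, not something from the question): keep the human-readable timestamp but append a process-wide atomic sequence number, which is guaranteed unique within a single JVM; across multiple JVMs or machines a UUID suffix would be the safer choice.
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.atomic.AtomicLong;

public class UniqueLabels {
    private static final AtomicLong SEQ = new AtomicLong();
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy_MM_dd__HH_mm");

    public static String label(String tag) {
        // the timestamp keeps the label readable, the counter makes it unique within this JVM
        return tag + "__" + LocalDateTime.now().format(FMT) + "__" + SEQ.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(label("myTag"));
        System.out.println(label("myTag")); // differs even within the same minute
    }
}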

Compiler ignores thread priorities

I tried to compile the example from Thinking in Java by Bruce Eckel:
import java.util.concurrent.*;

public class SimplePriorities implements Runnable {
    private int countDown = 5;
    private volatile double d; // No optimization
    private int priority;

    public SimplePriorities(int priority) {
        this.priority = priority;
    }

    public String toString() {
        return Thread.currentThread() + ": " + countDown;
    }

    public void run() {
        Thread.currentThread().setPriority(priority);
        while (true) {
            // An expensive, interruptable operation:
            for (int i = 1; i < 100000; i++) {
                d += (Math.PI + Math.E) / (double) i;
                if (i % 1000 == 0)
                    Thread.yield();
            }
            System.out.println(this);
            if (--countDown == 0) return;
        }
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newCachedThreadPool();
        for (int i = 0; i < 5; i++)
            exec.execute(new SimplePriorities(Thread.MIN_PRIORITY));
        exec.execute(new SimplePriorities(Thread.MAX_PRIORITY));
        exec.shutdown();
    }
}
According to the book, the output has to look like:
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-6,10,main]: 4
Thread[pool-1-thread-6,10,main]: 3
Thread[pool-1-thread-6,10,main]: 2
Thread[pool-1-thread-6,10,main]: 1
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-1,1,main]: 5
...
But in my case the 6th thread doesn't execute its task first, and the threads are disordered. Could you please explain to me what's wrong? I just copied the source and didn't add any lines of code.
The code works fine and produces the output from the book.
Your IDE's console window probably has a scroll bar - just scroll up and you will see the 6th thread doing its job first.
However, the results may differ depending on OS / JVM version. This code runs as expected for me on Windows 10 / JVM 8
There are two issues here:
If two threads with the same priority want to write output, which one goes first?
The order of threads (with the same priority) is undefined, therefore the order of output is undefined. It is likely that a single thread is allowed to write several outputs in a row (because that's how most thread schedulers work), but it could also be completely random, or anything in between.
How many threads will a cached thread pool create?
That depends on your system. If you run on a dual-core system, creating more than 4 threads is pointless, because there will hardly be any CPU available to execute those threads. In this scenario further tasks will be queued and executed only after earlier tasks are completed.
Hint: there is also a fixed-size thread pool, experimenting with that should change the output.
In summary there is nothing wrong with your code, it is just wrong to assume that threads are executed in any order. It is even technically possible (although very unlikely), that the first task is already completed before the last task is even started. If your book says that the above order is "correct" then the book is simply wrong. On an average system that might be the most likely output, but - as above - with threads there is never any order, unless you enforce it.
One way to enforce it is thread priorities - higher priorities will get their work done first - and you can find other concepts in the concurrent package.
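To make the hint about a fixed-size pool concrete, here is a minimal variation of the example's main() (the pool size of 6 is my assumption): with six threads available, all tasks start immediately, so the MAX_PRIORITY task competes with the others from the beginning and the interleaving of the output changes.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SimplePrioritiesFixedPool {
    public static void main(String[] args) {
        ExecutorService exec = Executors.newFixedThreadPool(6); // all six tasks run concurrently
        for (int i = 0; i < 5; i++)
            exec.execute(new SimplePriorities(Thread.MIN_PRIORITY));
        exec.execute(new SimplePriorities(Thread.MAX_PRIORITY));
        exec.shutdown();
    }
}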

Why is this piece of Java code not running concurrently?

I have written a Sieve of Eratosthenes that is supposed to work in parallel, but it doesn't. When I increase the number of threads, the computation time does not decrease. Any ideas why?
Main class
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentTest {
    public static void main(String[] args) throws InterruptedException {
        Sieve task = new Sieve();
        int x = 1000000;
        int threads = 4;
        task.setArray(x);
        Long beg = new Date().getTime();
        ExecutorService exec = Executors.newCachedThreadPool();
        for (int i = 0; i < threads; i++) {
            exec.execute(task);
        }
        exec.shutdown();
        Long time = 0L;
        // Main thread is waiting until all threads are terminated
        // (it means that computing is done)
        while (true)
            if (exec.isTerminated()) {
                time = new Date().getTime() - beg;
                break;
            }
        System.out.println("Time is " + time);
    }
}
Sieve class
import java.util.concurrent.ConcurrentHashMap;

public class Sieve implements Runnable {
    private ConcurrentHashMap<Integer, Boolean> array =
            new ConcurrentHashMap<Integer, Boolean>();
    private int x;
    private int counter = 2; // next candidate to check, shared by all worker threads (referenced by getCounter below)

    public void run() {
        while (true) {
            // I am getting a synchronized number to check if it's prime
            int n = getCounter();
            // If no more numbers to check, stop loop
            if (n == -1)
                break;
            // Only if the HashMap still contains n do we go further; otherwise it was already removed as composite
            if (!array.containsKey(n)) continue;
            for (int i = 2 * n; i <= x; i += n) {
                // Composite numbers are removed from the HashMap, e.g. 6, 12 and so on.
                array.remove(i);
            }
        }
    }

    private synchronized int getCounter() {
        if (counter < x)
            return counter++;
        else return -1;
    }

    public void setArray(int x) {
        this.x = x;
        for (int i = 2; i <= x; i++)
            array.put(i, false);
    }
}
I ran some tests with different numbers of threads. These are the results:
Nr of threads 1 Time is 1850, 1795, 1825
Nr of threads 2 Time is 1845, 1836, 1814
Nr of threads 3 Time is 1767, 1820, 1756
Nr of threads 4 Time is 1732, 1840, 2083
Nr of threads 5 Time is 1791, 1795, 1803
Nr of threads 6 Time is 1825, 1728, 1707
Nr of threads 7 Time is 1754, 1729, 1686
Nr of threads 8 Time is 1760, 1717, 1817
Nr of threads 9 Time is 1721, 1699, 1673
Nr of threads 10 Time is 1661, 1722, 1718
When I increase the number of threads, the computation time does not decrease
tl;dr: your problem size is too small. If you increase x to 10000000, the differences will become more obvious. They won't be what you're expecting, though.
I tried your code on an eight core machine with two slight modifications:
For timing, I used System.nanoTime() instead of getTime() on a Date.
I used the awaitTermination method of ExecutorService rather than a spinloop to check for the end of run.
I tried launching your Sieve tasks using a fixed thread pool, a cached thread pool and a fork join pool and comparing the results of different values for your thread variable.
I see the following results (in milliseconds) on my machine with x=10000000:
Thread count = 1 2 4 8 16
Fixed thread pool = 5451 3866 3639 3227 3120
Cached thread pool= 5434 3763 3709 3258 3078
Fork-join pool = 6732 3670 3735 3190 3102
What these results show us is a clear benefit of changing from a single thread of execution to two threads. However, the benefit of additional threads drops off rapidly. There's an interesting plateau going from two to four threads and marginal benefits up to 16.
In addition, you can also see that the different threading mechanisms have different initial overhead: I didn't expect the Fork-Join pool to cost that much more to start than the other mechanisms.
So, as written, you shouldn't really expect a benefit past two threads for small but non-trivial problem sets.
If you'd like to increase the benefit of additional threads, you're going to need to look at your current implementation. For example, when I switched from your synchronized getCounter() to an AtomicInteger using incrementAndGet(), I eliminated the overhead of the synchronized method. The result is that all of my four thread numbers dropped on the order of 1000 milliseconds.
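A sketch of that lock-free counter change, written as a drop-in edit to the Sieve class above (that it fits unchanged is an assumption on my part); getAndIncrement() keeps the original return-then-increment semantics while removing the synchronized method and its lock contention:
import java.util.concurrent.atomic.AtomicInteger;

// inside class Sieve, replacing the int counter field and the synchronized getCounter():
private final AtomicInteger counter = new AtomicInteger(2);

private int getCounter() {
    int n = counter.getAndIncrement(); // atomic read-and-bump, no lock needed
    return n < x ? n : -1;
}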

Fibonacci on Java ExecutorService runs faster sequentially than in parallel

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class MainSimpler {
    static int N = 35;
    static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
    static ExecutorService executor;

    public static void main(String[] args) {
        int nThreads = 2;
        System.out.println("Number of threads = " + nThreads);
        executor = Executors.newFixedThreadPool(nThreads);
        Executable.inQueue = new AtomicInteger(nThreads);
        long before = System.currentTimeMillis();
        System.out.println("Fibonacci " + N + " is ... ");
        executor.submit(new FibSimpler(N));
        waitToFinish();
        System.out.println(result.get());
        long after = System.currentTimeMillis();
        System.out.println("Duration: " + (after - before) + " milliseconds\n");
    }

    private static void waitToFinish() {
        while (0 < pendingTasks.get()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        executor.shutdown();
    }
}
class FibSimpler implements Runnable {
    int N;

    FibSimpler(int n) { N = n; }

    @Override
    public void run() {
        compute();
        MainSimpler.pendingTasks.decrementAndGet(); // see text
    }

    void compute() {
        int n = N;
        if (n <= 1) {
            MainSimpler.result.addAndGet(n); // see text
            return;
        }
        MainSimpler.executor.submit(new FibSimpler(n-1));
        MainSimpler.pendingTasks.incrementAndGet(); // see text
        N = n-2;
        compute(); // similar to the F/J counterpart
    }
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??
It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
