Java FutureTask<> without using an ExecutorService?

Java FutureTask<> without using an ExecutorService? - java

Recently a use case came up where I had to kick off several blocking IO tasks at the same time and use them in sequence. I did not want to change the order of operation on the consumption side and since this was a web app and these were short-lived tasks in the request path, I didn't want to bottleneck on a fixed threadpool and was looking to mirror the .Net async/await coding style. The FutureTask<> seemed ideal for this but required an ExecutorService. This is an attempt to remove the need for one.
Order of operation:
Kick off tasks
Do some stuff
Consume Task 1
Do some other stuff
Consume Task 2
Finish up
...
I wanted to spawn a new thread for each FutureTask<> but simplify the thread management. After run() completed, the calling thread could be joined.
The solution I came up with was:
package com.staples.search.util;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
public class FutureWrapper<T> extends FutureTask<T> implements Future<T> {
private Thread myThread;
public FutureWrapper(Callable<T> callable) {
super(callable);
myThread = new Thread(this);
myThread.start();
}
#Override
public T get() {
T val = null;
try {
val = super.get();
myThread.join();
}
catch (Exception ex)
{
this.setException(ex);
}
return val;
}
}
Here are a couple of JUnit tests I created to compare FutureWrapper to CachedThreadPool.
#Test
public void testFutureWrapper() throws InterruptedException, ExecutionException {
long startTime = System.currentTimeMillis();
int numThreads = 2000;
List<FutureWrapper<ValueHolder>> taskList = new ArrayList<FutureWrapper<ValueHolder>>();
System.out.printf("FutureWrapper: Creating %d tasks\n", numThreads);
for (int i = 0; i < numThreads; i++) {
taskList.add(new FutureWrapper<ValueHolder>(new Callable<ValueHolder>() {
public ValueHolder call() throws InterruptedException {
int value = 500;
Thread.sleep(value);
return new ValueHolder(value);
}
}));
}
for (int i = 0; i < numThreads; i++)
{
FutureWrapper<ValueHolder> wrapper = taskList.get(i);
ValueHolder v = wrapper.get();
}
System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);
Assert.assertTrue(true);
}
#Test
public void testCachedThreadPool() throws InterruptedException, ExecutionException {
long startTime = System.currentTimeMillis();
int numThreads = 2000;
List<Future<ValueHolder>> taskList = new ArrayList<Future<ValueHolder>>();
ExecutorService esvc = Executors.newCachedThreadPool();
System.out.printf("CachedThreadPool: Creating %d tasks\n", numThreads);
for (int i = 0; i < numThreads; i++) {
taskList.add(esvc.submit(new Callable<ValueHolder>() {
public ValueHolder call() throws InterruptedException {
int value = 500;
Thread.sleep(value);
return new ValueHolder(value);
}
}));
}
for (int i = 0; i < numThreads; i++)
{
Future<ValueHolder> wrapper = taskList.get(i);
ValueHolder v = wrapper.get();
}
System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);
Assert.assertTrue(true);
}
class ValueHolder {
private int value;
public ValueHolder(int val) { value = val; }
public int getValue() { return value; }
public void setValue(int val) { value = val; }
}
Repeated runs puts the FutureWrapper at ~925ms vs. ~935ms for the CachedThreadPool. Both tests bump into OS thread limits.
Things seem to work and the thread spawning is pretty fast (10k threads with random sleeps in ~4s). Does anyone see something wrong with this implementation?

Creating and starting thousands of threads is usually a very bad idea, because creating threads is expensive, and having more threads than processors will bring no performance gain but cause thread-context-switches that consume CPU-cycles instead. (See notes very below)
So in my opinion, your test-code contains a big error in reasoning: You are simulating CPU load by calling Thread.sleep(500). But in fact, this does not really cause the CPU to do anything. It is possible to have many sleeping threads in parallel - no matter how many processors you have, but it is not possible to run more CPU consuming tasks than processors in (real) parallel.
If you simulate real CPU load, you'll see, that more threads will just increase the overhead due to thread-management, but not decrease the total processing time.
So let's compare different ways to run CPU consuming tasks in parallel!
First, let's assume we've got some CPU consuming task that always takes the same amount of time:
public Integer task() throws Exception {
// do some computations here (e.g. fibonacchi, primes, cipher, ...)
return 1;
}
Our goal is to run this task NUM_TASKS times using different execution strategies. For our tests, we set NUM_TASKS = 2000.
(1) Using a thread-per-task strategy
This strategy is very comparable to your approach, with the difference, that it is not necessary to subclass FutureTask and fiddle around with threads. Instead, you can use FutureTask directly as it is both, a Runnable and a Future:
#Test
public void testFutureTask() throws InterruptedException, ExecutionException {
List<RunnableFuture<Integer>> taskList = new ArrayList<RunnableFuture<Integer>>();
// run NUM_TASKS FutureTasks in NUM_TASKS threads
for (int i = 0; i < NUM_TASKS; i++) {
RunnableFuture<Integer> rf = new FutureTask<Integer>(this::task);
taskList.add(rf);
new Thread(rf).start();
}
// now wait for all tasks
int sum = 0;
for (Future<Integer> future : taskList) {
sum += future.get();
}
Assert.assertEquals(NUM_TASKS, sum);
}
Running this test with JUnitBenchmarks (10 test iterations + 5 warmup iterations) yields the following result:
ThreadPerformanceTest.testFutureTask: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 0.66 [+- 0.01], round.block: 0.00 [+-
0.00], round.gc: 0.00 [+- 0.00], GC.calls: 66, GC.time: 0.06, time.total: 10.59, time.warmup: 4.02, time.bench: 6.57
So one round (execution time of method task()) is about 0.66 seconds.
(2) Using a thread-per-cpu strategy
This strategy uses a fixed number of threads to execute all tasks. Therefore, we create an ExecutorService via Executors.newFixedThreadPool(...). The number of threads should be equal to the number of CPUs (Runtime.getRuntime().availableProcessors()), which is 8 in my case.
To be able to track the results, we simply use a CompletionService. It automatically takes care of the results - no matter in which order they arrive.
#Test
public void testFixedThreadPool() throws InterruptedException, ExecutionException {
ExecutorService exec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
CompletionService<Integer> ecs = new ExecutorCompletionService<Integer>(exec);
// submit NUM_TASKS tasks
for (int i = 0; i < NUM_TASKS; i++) {
ecs.submit(this::task);
}
// now wait for all tasks
int sum = 0;
for (int i = 0; i < NUM_TASKS; i++) {
sum += ecs.take().get();
}
Assert.assertEquals(NUM_TASKS, sum);
}
Again we run this test with JUnitBenchmarks with the same settings. The results are:
ThreadPerformanceTest.testFixedThreadPool: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 0.41 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 22, GC.time: 0.04, time.total: 6.59, time.warmup: 2.53, time.bench: 4.05
Now one round is only 0.41 seconds (almost 40% runtime reduction)! Also not the fewer GC calls.
(3) Sequential execution
For comparison we should also measure the non-parallelized execution:
#Test
public void testSequential() throws Exception {
int sum = 0;
for (int i = 0; i < NUM_TASKS; i++) {
sum += this.task();
}
Assert.assertEquals(NUM_TASKS, sum);
}
The results:
ThreadPerformanceTest.testSequential: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 1.50 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+-0.00], GC.calls: 244, GC.time: 0.15, time.total: 22.81, time.warmup: 7.77, time.bench: 15.04
Note that 1.5 seconds is for 2000 executions, so a single execution of task() takes 0.75 ms.
Interpretation
According to Amdahl's law, the time T(n) to execute an algorithm on n processors, is:
B is the fraction of the algorithm that cannot be parallelized and must run sequentially. For pure sequential algorithms, B is 1, for pure parallel algorithms it would be 0 (but this is not possible as there is always some sequential overhead).
T(1) can be taken from our sequential execution: T(1) = 1.5 s
If we had no overhead (B = 0), on 8 CPUs we'd got: T(8) = 1.5 / 8 = 0.1875 s.
But we do have overhead! So let's compute B for our two strategies:
B(thread-per-task) = 0.36
B(thread-per-cpu) = 0.17
In other words: The thread-per-task strategy has twice the overhead!
Finally, let's compute the speedup S(n). That's the number of times, an algorithm runs faster on n CPUs compared to sequential execution (S(1) = 1):
Applied to our two strategies, we get:
thread-per-task: S(8) = 2.27
thread-per-cpu: S(8) = 3.66
So the thread-per-cpu strategy has about 60% more speedup than thread-per-task.
TODO
We should also measure and compare memory consumption.
Note: This all is only true for CPU consuming tasks. If instead, your tasks perform lots of I/O related stuff, you might benefit from having more threads than CPUs as waiting for I/O will put a thread in idle mode, so the CPU can execute another thread meanwhile. But even in this case, there is a reasonable upper limit which is usually far below 2000 on a PC.

Related

Why this piece of Java code is not running concurrently

I have written Sieve of Eratosthenes which is supposed to work in parallel, but it's not. When I increase number of threads, time of computing is not getting lower. Any ideas why?
Main class
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class ConcurrentTest {
public static void main(String[] args) throws InterruptedException {
Sieve task = new Sieve();
int x = 1000000;
int threads = 4;
task.setArray(x);
Long beg = new Date().getTime();
ExecutorService exec = Executors.newCachedThreadPool();
for (int i = 0; i < threads; i++) {
exec.execute(task);
}
exec.shutdown();
Long time = 0L;
// Main thread is waiting until all threads are terminated
// ( it means that computing is done)
while (true)
if (exec.isTerminated()) {
time = new Date().getTime() - beg;
break;
}
System.out.println("Time is " + time);
}
}
Sieve class
import java.util.concurrent.ConcurrentHashMap;
public class Sieve implements Runnable {
private ConcurrentHashMap<Integer, Boolean> array =
new ConcurrentHashMap<Integer, Boolean>();
private int x;
public void run() {
while(true){
// I am getting synchronized number to check if it's prime
int n = getCounter();
// If no more numbers to check, stop loop
if( n == -1)
break;
// If HashMap contains number, we can further
if(!array.containsKey(n))continue;
for (int i = 2 * n; i <= x; i += n) {
// Compound numbers are removed from HashMap, Eg. 6, 12 and much more.
array.remove(i);
}
}
}
private synchronized int getCounter(){
if( counter < x)
return counter++;
else return -1;
}
public void setArray(int x) {
this.x = x;
for (int i = 2; i <= x; i++)
array.put(i, false);
}
}
I made some tests with different number of threads. These are results:
Nr of threads 1 Time is 1850, 1795, 1825
Nr of threads 2 Time is 1845, 1836, 1814
Nr of threads 3 Time is 1767, 1820, 1756
Nr of threads 4 Time is 1732, 1840, 2083
Nr of threads 5 Time is 1791, 1795, 1803
Nr of threads 6 Time is 1825, 1728, 1707
Nr of threads 7 Time is 1754, 1729, 1686
Nr of threads 8 Time is 1760, 1717, 1817
Nr of threads 9 Time is 1721, 1699, 1673
Nr of threads 10 Time is 1661, 1722, 1718

When I increase number of threads, time of computing is not getting
lower
tl;dr: your problem size is too small. If you increase x to 10000000, the differences will become more obvious. They won't be what you're expecting, though.
I tried your code on an eight core machine with two slight modifications:
For timing, I used System.nanoTime() instead of getTime() on a Date.
I used the awaitTermination method of ExecutorService rather than a spinloop to check for the end of run.
I tried launching your Sieve tasks using a fixed thread pool, a cached thread pool and a fork join pool and comparing the results of different values for your thread variable.
I see the following results (in milliseconds) on my machine with x=10000000:
Thread count = 1 2 4 8 16
Fixed thread pool = 5451 3866 3639 3227 3120
Cached thread pool= 5434 3763 3709 3258 3078
Fork-join pool = 6732 3670 3735 3190 3102
What these results show us is a clear benefit of changing from a single thread of execution to two threads. However, the benefit of additional threads drops off rapidly. There's an interesting plateau going from two to four threads and marginal benefits up to 16.
In addition, you can also see that the different threading mechanisms have different initial overhead: I didn't expect the Fork-Join pool to cost that much more to start than the other mechanisms.
So, as written, you shouldn't really expect a benefit past two threads for small but non-trivial problem sets.
If you'd like to increase the benefit of additional threads, you're going to need to look at your current implementation. For example, when I switched from your synchronized getCounter() to an AtomicInteger using incrementAndGet(), I eliminated the overhead of the synchronized method. The result is that all of my four thread numbers dropped on the order of 1000 milliseconds.

Fibonacci on Java ExecutorService runs faster sequentially than in parallel

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
#Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??

It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When n == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).

Create Multiple Threads in certain time for Java

What is the best way to create 500.000 threads in 5 seconds. (Runnable) I created for loop but it takes lots of time. For example;
startTime = System.currentTimeMills();
for (int i=0;i<500.000; i++){
// create thread
thread.start();
}
resultTime = (System.currentTimeMills() - startTime);
So the resultTime is bigger than 5 seconds. I know it depends on my hardware and os configuration but i just want to know what is the best way to create multiple threads in certain time?
Thanks.

I really can't imagine this is a good idea. Each thread takes a reasonable amount of resource (by default, 512k of heap for each thread) and so even if you create all your threads, your JVM will be fighting for resources.
If you have a requirement for 500,000 work units, I think you're better off creating these as Runnables (and not all at once!) and passing them to a ThreadPool tuned to your environment.machine (e.g. a naive/simple tuning would be one thread per CPU)

The fastest way to create many tasks is to use an ExecutorService
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(processors);
long start = System.nanoTime();
int tasks = 500 * 1000;
for (int i = 0; i < tasks; i++) {
es.execute(new Runnable() {
#Override
public void run() {
// do something.
}
});
}
long time = System.nanoTime() - start;
System.out.printf("Took %.1f ms to create/submit %,d tasks%n", time / 1e6, tasks);
es.shutdown();
prints
Took 143.6 ms to create/submit 500,000 tasks

Maybe you can make a couple of special threads that generates 250000 threads each..

Maybe this one to expect your computer to smoke better:
concept: share the job among each core.
public class Example {
public static void main(String[] args) {
for (int i = 0; i < 3; i++) {
new Thread(new ThreadCreator()).start(); // with 4 cores on your processor
}
}
}
class ThreadCreator implements Runnable {
#Override
public void run() {
for (int i = 0; i < 125000; i++) {
new Thread().start(); // each core creating remaining thread
}
}
}
Took only 0,6 ms !!

Java strange performance inconsistency

I have a simple recursive method, a depth first search. On each call, it checks if it's in a leaf, otherwise it expands the current node and calls itself on the children.
I'm trying to make it parallel, but I notice the following strange (for me) problem.
I measure execution time with System.currentTimeMillis().
When I break the search into a number of subsearches and add the total execution time, I get a bigger number than the sequential search. I only measure execution time, no communication or sync, etc. I would expect to get the same time when I add the times of the subtasks. This happens even if I just run one task after the other, so without threads. If I just break the search into some subtasks and run the subtasks one after the other, I get a bigger time.
If I add the number of method calls for the subtasks, I get the same number as the sequential search. So, basically, in both cases I do the same number of method calls, but I get different times.
I'm guessing there's some overhead on initial method calls or something else caused by a JVM mechanism. Any ideas what could it be?
For example, one sequential search takes around 3300 ms. If I break it into 13 tasks, it takes a total time of 3500ms.
My method looks like this:
private static final int dfs(State state) {
method_calls++;
if(state.isLeaf()){
return 1;
}
State[] children = state.expand();
int result = 0;
for (int i = 0; i < children.length; i++) {
result += dfs(children[i]);
}
return result;
}
Whenever I call it, I do it like this:
for(int i = 0; i < num_tasks; i++){
long start = System.currentTimeMillis();
dfs(tasks[i]);
totalTime += (System.currentTimeMillis() - start);
}
Problem is totalTime increases with num_tasks and I would expect to stay the same because the method_calls variable stays the same.

You should average out the numbers over longer runs. Secondly the precision of currentTimeMillis may not be sufficient, you can try using System.nanoTime().

As in all the programming languages, whenever you call a procedure or a method, you have to push the environment, initialize the new one, execute the programs instructions, return the value on the stack and finally reset the previous environment. It cost a bit! Create a thread cost also more!
I suppose that if you enlarge the researching tree you will have benefit by the parallelization.

Adding system clock time for several threads seems a weird idea. Either you are interested in the time until processing is complete, in which case adding doesn't make sense, or in cpu usage, in which case you should only count when the thread is actually scheduled to execute.
What probably happens is that at least part of the time, more threads are ready to execute than the system has cpu cores, and the scheduler puts one of your threads to sleep, which causes it to take longer to complete. It makes sense that this effect is exacerbated the more threads you use. (Even if your program uses less threads than you have cores, other programs (such as your development environment, ...) might).
If you are interested in CPU usage, you might wish to query ThreadMXBean.getCurrentThreadCpuTime

I'd expect to see Threads used. Something like this:
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Puzzle {
static volatile long totalTime = 0;
private static int method_calls = 0;
/**
* #param args
*/
public static void main(String[] args) {
final int num_tasks = 13;
final State[] tasks = new State[num_tasks];
ExecutorService threadPool = Executors.newFixedThreadPool(5);
for(int i = 0; i < num_tasks; i++){
threadPool.submit(new DfsRunner(tasks[i]));
}
try {
threadPool.shutdown();
threadPool.awaitTermination(1, TimeUnit.SECONDS);
} catch (InterruptedException e) {
System.out.println("Interrupted");
}
System.out.println(method_calls + " Methods in " + totalTime + "msecs");
}
static final int dfs(State state) {
method_calls++;
if(state.isLeaf()){
return 1;
}
State[] children = state.expand();
int result = 0;
for (int i = 0; i < children.length; i++) {
result += dfs(children[i]);
}
return result;
}
}
With the runnable bit like this:
public class DfsRunner implements Runnable {
private State state;
public DfsRunner(State state) {
super();
this.state = state;
}
#Override
public void run() {
long start = System.currentTimeMillis();
Puzzle.dfs(state);
Puzzle.totalTime += (System.currentTimeMillis() - start);
}
}

Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
This brings us back to the question - if I have a high volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java, because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.

Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
Thread[] tt = new Thread[threadCount];
final int[] aa = new int[tt.length];
System.out.print("Creating "+tt.length+" Thread objects... ");
long t0 = System.nanoTime(), t00 = t0;
for (int i = 0; i < tt.length; i++) {
final int j = i;
tt[i] = new Thread() {
public void run() {
int k = j;
for (int l = 0; l < workAmountPerThread; l++) {
k += k*k+l;
}
aa[j] = k;
}
};
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Starting "+tt.length+" threads with "+workAmountPerThread+" steps of work per thread... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].start();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Joining "+tt.length+" threads... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].join();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
long totalTime = System.nanoTime()-t00;
int checkSum = 0; //display checksum in order to give the JVM no chance to optimize out the contents of the run() method and possibly even thread creation
for (int a : aa) {
checkSum += a;
}
System.out.println("Checksum: "+checkSum);
System.out.println("Total time: "+totalTime*1E-6+" ms");
System.out.println();
return totalTime;
}
public static void main(String[] kr) throws InterruptedException {
int workAmount = 100000000;
int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
int trialCount = 2;
long[][] time = new long[threadCount.length][trialCount];
for (int j = 0; j < trialCount; j++) {
for (int i = 0; i < threadCount.length; i++) {
time[i][j] = test(threadCount[i], workAmount/threadCount[i]);
}
}
System.out.print("Number of threads ");
for (long t : threadCount) {
System.out.print("\t"+t);
}
System.out.println();
for (int j = 0; j < trialCount; j++) {
System.out.print((j+1)+". trial time (ms)");
for (int i = 0; i < threadCount.length; i++) {
System.out.print("\t"+Math.round(time[i][j]*1E-6));
}
System.out.println();
}
}
}
The results on 64-bit Windows 7 with 32-bit Sun's Java 1.6.0_21 Client VM on Intel Core2 Duo E6400 #2.13 GHz are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected since my computer has two cores. My computer can spawn nearly 10000 threads per second, i. e. thread creation overhead is 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).

First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xssn JVM parameter - you'll want to use the lowest value you can get away with.
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.

for your benchmark you can use JMeter + a profiler, which should give you direct overview on the behaviour in such a heavy-loaded environment. Just let it run for a an hour and monitor memory, cpu, etc. If nothing breaks and the cpu(s) doesn't overheat, it's ok :)
perhaps you can get a thread-pool, or customize (extend) the one you are using by adding some code in order to have the appropriate InheritableThreadLocals set each time a Thread is acquired from the thread-pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You can use these (well, with reflection) in combination with the Thread.currentThread() to have the desired behaviour. However this is a bit ad-hock, and furthermore, I can't tell whether it (with the reflection) won't introduce even bigger overhead than just creating the threads.

I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of Notify/Wait queue where you spawn a given number of (daemon)threads, and they all wait until there's a task to solve. If the task queue gets long, you spawn additional threads, but not on a 1-1 ratio. It will most likely be perform better then spawning hundreds of new threads whose life-cycles are so short.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.