Create Multiple Threads in a Certain Time in Java

What is the best way to create 500,000 threads in 5 seconds (as Runnables)? I created a for loop, but it takes a lot of time. For example:
startTime = System.currentTimeMillis();
for (int i = 0; i < 500000; i++) {
    // create thread
    Thread thread = new Thread(runnable);
    thread.start();
}
resultTime = System.currentTimeMillis() - startTime;
So resultTime is bigger than 5 seconds. I know it depends on my hardware and OS configuration, but I just want to know: what is the best way to create multiple threads in a certain time?
Thanks.

I really can't imagine this is a good idea. Each thread takes a fair amount of resources (by default, 512k of stack per thread), so even if you manage to create all your threads, the JVM will be fighting for resources.
If you have a requirement for 500,000 work units, you're better off creating them as Runnables (and not all at once!) and passing them to a ThreadPool tuned to your environment/machine (e.g. a naive/simple tuning would be one thread per CPU).

The fastest way to create many tasks is to use an ExecutorService
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(processors);
long start = System.nanoTime();
int tasks = 500 * 1000;
for (int i = 0; i < tasks; i++) {
    es.execute(new Runnable() {
        @Override
        public void run() {
            // do something.
        }
    });
}
long time = System.nanoTime() - start;
System.out.printf("Took %.1f ms to create/submit %,d tasks%n", time / 1e6, tasks);
es.shutdown();
prints
Took 143.6 ms to create/submit 500,000 tasks
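(On Java 8 or later, the anonymous class can be written as a lambda with identical behaviour:)
es.execute(() -> {
    // do something.
});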

Maybe you could create a couple of special threads that each generate 250,000 threads.

Maybe this one will be a bit easier on your computer.
The concept: share the job among the cores.
public class Example {
    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) { // one creator thread per core on a 4-core processor
            new Thread(new ThreadCreator()).start();
        }
    }
}

class ThreadCreator implements Runnable {
    @Override
    public void run() {
        for (int i = 0; i < 125000; i++) {
            new Thread().start(); // each creator thread starts its share of the threads
        }
    }
}
Took only 0.6 ms!

Related

Java FutureTask<> without using an ExecutorService?

Recently a use case came up where I had to kick off several blocking IO tasks at the same time and consume them in sequence. I did not want to change the order of operation on the consumption side, and since this was a web app and these were short-lived tasks in the request path, I didn't want to bottleneck on a fixed thread pool, and I was looking to mirror the .NET async/await coding style. FutureTask<> seemed ideal for this, but it requires an ExecutorService. This is an attempt to remove the need for one.
Order of operation:
Kick off tasks
Do some stuff
Consume Task 1
Do some other stuff
Consume Task 2
Finish up
...
I wanted to spawn a new thread for each FutureTask<> but simplify the thread management. After run() completed, the calling thread could be joined.
The solution I came up with was:
package com.staples.search.util;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
public class FutureWrapper<T> extends FutureTask<T> implements Future<T> {

    private Thread myThread;

    public FutureWrapper(Callable<T> callable) {
        super(callable);
        myThread = new Thread(this);
        myThread.start();
    }

    @Override
    public T get() {
        T val = null;
        try {
            val = super.get();
            myThread.join();
        } catch (Exception ex) {
            this.setException(ex);
        }
        return val;
    }
}
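To make the intended call pattern concrete, here is a minimal usage sketch (fetchA, fetchB, doSomeStuff and doSomeOtherStuff are hypothetical placeholders, not part of the original post):
// each FutureWrapper starts its own thread in the constructor (Java 8+ lambda syntax)
FutureWrapper<String> task1 = new FutureWrapper<String>(() -> fetchA()); // kick off task 1
FutureWrapper<String> task2 = new FutureWrapper<String>(() -> fetchB()); // kick off task 2
doSomeStuff();
String a = task1.get(); // consume task 1: blocks, then joins its thread
doSomeOtherStuff();
String b = task2.get(); // consume task 2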
Here are a couple of JUnit tests I created to compare FutureWrapper to CachedThreadPool.
@Test
public void testFutureWrapper() throws InterruptedException, ExecutionException {
    long startTime = System.currentTimeMillis();
    int numThreads = 2000;
    List<FutureWrapper<ValueHolder>> taskList = new ArrayList<FutureWrapper<ValueHolder>>();
    System.out.printf("FutureWrapper: Creating %d tasks\n", numThreads);
    for (int i = 0; i < numThreads; i++) {
        taskList.add(new FutureWrapper<ValueHolder>(new Callable<ValueHolder>() {
            public ValueHolder call() throws InterruptedException {
                int value = 500;
                Thread.sleep(value);
                return new ValueHolder(value);
            }
        }));
    }
    for (int i = 0; i < numThreads; i++) {
        FutureWrapper<ValueHolder> wrapper = taskList.get(i);
        ValueHolder v = wrapper.get();
    }
    System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);
    Assert.assertTrue(true);
}
@Test
public void testCachedThreadPool() throws InterruptedException, ExecutionException {
    long startTime = System.currentTimeMillis();
    int numThreads = 2000;
    List<Future<ValueHolder>> taskList = new ArrayList<Future<ValueHolder>>();
    ExecutorService esvc = Executors.newCachedThreadPool();
    System.out.printf("CachedThreadPool: Creating %d tasks\n", numThreads);
    for (int i = 0; i < numThreads; i++) {
        taskList.add(esvc.submit(new Callable<ValueHolder>() {
            public ValueHolder call() throws InterruptedException {
                int value = 500;
                Thread.sleep(value);
                return new ValueHolder(value);
            }
        }));
    }
    for (int i = 0; i < numThreads; i++) {
        Future<ValueHolder> wrapper = taskList.get(i);
        ValueHolder v = wrapper.get();
    }
    System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);
    Assert.assertTrue(true);
}
class ValueHolder {
    private int value;
    public ValueHolder(int val) { value = val; }
    public int getValue() { return value; }
    public void setValue(int val) { value = val; }
}
Repeated runs put the FutureWrapper at ~925 ms vs. ~935 ms for the CachedThreadPool. Both tests bump into OS thread limits.
Things seem to work and the thread spawning is pretty fast (10k threads with random sleeps in ~4s). Does anyone see something wrong with this implementation?
Creating and starting thousands of threads is usually a very bad idea, because creating threads is expensive, and having more threads than processors brings no performance gain but causes thread context switches that consume CPU cycles instead. (See the note at the very bottom.)
So in my opinion, your test code contains a big error in reasoning: you are simulating CPU load by calling Thread.sleep(500). But in fact, this does not cause the CPU to do anything. It is possible to have many sleeping threads in parallel, no matter how many processors you have, but it is not possible to run more CPU-consuming tasks than processors in (real) parallel.
If you simulate real CPU load, you'll see that more threads just increase the overhead of thread management, but do not decrease the total processing time.
So let's compare different ways to run CPU consuming tasks in parallel!
First, let's assume we've got some CPU consuming task that always takes the same amount of time:
public Integer task() throws Exception {
    // do some computations here (e.g. fibonacci, primes, cipher, ...)
    return 1;
}
Our goal is to run this task NUM_TASKS times using different execution strategies. For our tests, we set NUM_TASKS = 2000.
(1) Using a thread-per-task strategy
This strategy is very comparable to your approach, with the difference that it is not necessary to subclass FutureTask and fiddle around with threads. Instead, you can use FutureTask directly, as it is both a Runnable and a Future:
@Test
public void testFutureTask() throws InterruptedException, ExecutionException {
    List<RunnableFuture<Integer>> taskList = new ArrayList<RunnableFuture<Integer>>();
    // run NUM_TASKS FutureTasks in NUM_TASKS threads
    for (int i = 0; i < NUM_TASKS; i++) {
        RunnableFuture<Integer> rf = new FutureTask<Integer>(this::task);
        taskList.add(rf);
        new Thread(rf).start();
    }
    // now wait for all tasks
    int sum = 0;
    for (Future<Integer> future : taskList) {
        sum += future.get();
    }
    Assert.assertEquals(NUM_TASKS, sum);
}
Running this test with JUnitBenchmarks (10 test iterations + 5 warmup iterations) yields the following result:
ThreadPerformanceTest.testFutureTask: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 0.66 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 66, GC.time: 0.06, time.total: 10.59, time.warmup: 4.02, time.bench: 6.57
So one round (execution time of method task()) is about 0.66 seconds.
(2) Using a thread-per-cpu strategy
This strategy uses a fixed number of threads to execute all tasks. Therefore, we create an ExecutorService via Executors.newFixedThreadPool(...). The number of threads should be equal to the number of CPUs (Runtime.getRuntime().availableProcessors()), which is 8 in my case.
To be able to track the results, we simply use a CompletionService. It automatically takes care of the results - no matter in which order they arrive.
@Test
public void testFixedThreadPool() throws InterruptedException, ExecutionException {
    ExecutorService exec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    CompletionService<Integer> ecs = new ExecutorCompletionService<Integer>(exec);
    // submit NUM_TASKS tasks
    for (int i = 0; i < NUM_TASKS; i++) {
        ecs.submit(this::task);
    }
    // now wait for all tasks
    int sum = 0;
    for (int i = 0; i < NUM_TASKS; i++) {
        sum += ecs.take().get();
    }
    Assert.assertEquals(NUM_TASKS, sum);
}
Again we run this test with JUnitBenchmarks with the same settings. The results are:
ThreadPerformanceTest.testFixedThreadPool: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 0.41 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 22, GC.time: 0.04, time.total: 6.59, time.warmup: 2.53, time.bench: 4.05
Now one round takes only 0.41 seconds (almost 40% runtime reduction)! Also note the fewer GC calls.
(3) Sequential execution
For comparison we should also measure the non-parallelized execution:
@Test
public void testSequential() throws Exception {
    int sum = 0;
    for (int i = 0; i < NUM_TASKS; i++) {
        sum += this.task();
    }
    Assert.assertEquals(NUM_TASKS, sum);
}
The results:
ThreadPerformanceTest.testSequential: [measured 10 out of 15 rounds, threads: 1 (sequential)]
round: 1.50 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 244, GC.time: 0.15, time.total: 22.81, time.warmup: 7.77, time.bench: 15.04
Note that 1.5 seconds is for 2000 executions, so a single execution of task() takes 0.75 ms.
Interpretation
According to Amdahl's law, the time T(n) to execute an algorithm on n processors is:
T(n) = T(1) * (B + (1 - B) / n)
B is the fraction of the algorithm that cannot be parallelized and must run sequentially. For purely sequential algorithms, B is 1; for purely parallel algorithms it would be 0 (but this is not possible, as there is always some sequential overhead).
T(1) can be taken from our sequential execution: T(1) = 1.5 s
If we had no overhead (B = 0), on 8 CPUs we'd get: T(8) = 1.5 / 8 = 0.1875 s.
But we do have overhead! So let's compute B for our two strategies. Solving Amdahl's formula for B gives B = (n * T(n) / T(1) - 1) / (n - 1), so with n = 8 and T(1) = 1.5 s:
B(thread-per-task) = (8 * 0.66 / 1.5 - 1) / 7 = 0.36
B(thread-per-cpu) = (8 * 0.41 / 1.5 - 1) / 7 = 0.17
In other words: The thread-per-task strategy has twice the overhead!
Finally, let's compute the speedup S(n), i.e. the number of times an algorithm runs faster on n CPUs compared to sequential execution (S(1) = 1):
S(n) = T(1) / T(n) = 1 / (B + (1 - B) / n)
Applied to our two strategies, we get:
thread-per-task: S(8) = 2.27
thread-per-cpu: S(8) = 3.66
So the thread-per-cpu strategy has about 60% more speedup than thread-per-task.
TODO
We should also measure and compare memory consumption.
Note: this all is only true for CPU-consuming tasks. If instead your tasks perform lots of I/O-related work, you might benefit from having more threads than CPUs, as waiting for I/O puts a thread into idle mode so the CPU can execute another thread meanwhile. But even in this case, there is a reasonable upper limit, which is usually far below 2000 on a PC.

Total time taken and Average time taken by all the threads

I am working on a project in which I need to measure the total time taken by the program and the average time taken by the program. It is a multithreaded program.
In this program, each thread works on a particular id range. The input parameters are the number of threads and the number of tasks.
If the number of threads is 2 and the number of tasks is 10, then each thread will perform 10 tasks, so the 2 threads will do 20 tasks in total.
That means the first thread should use ids between 1 and 10, and the second thread should use ids between 11 and 20.
I got the above scenario working. Now I want to measure what is the total time and average time taken by all the threads. So I got the below setup in my program.
Problem Statement:-
Can anyone tell me whether the way I am trying to measure the total time and average time taken by all the threads in the program below is correct?
// create thread pool with given size
ExecutorService service = Executors.newFixedThreadPool(noOfThreads);
long startTime = 0L;
try {
    readPropertyFiles();
    startTime = System.nanoTime();
    // queue some tasks
    for (int i = 0, nextId = startRange; i < noOfThreads; i++, nextId += noOfTasks) {
        service.submit(new XMPTask(nextId, noOfTasks, tableList));
    }
    service.shutdown();
    service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
} finally {
    long estimatedTime = System.nanoTime() - startTime;
    logTimingInfo(estimatedTime, noOfTasks, noOfThreads);
}

private static void logTimingInfo(long elapsedTime, int noOfTasks, int noOfThreads) {
    long timeInMilliseconds = elapsedTime / 1000000L;
    float avg = (float) (timeInMilliseconds) / noOfTasks * noOfThreads;
    LOG.info(CNAME + "::" + "Total Time taken " + timeInMilliseconds + " ms. And Total Average Time taken " + avg + " ms");
}
service.submit is getting executed only noOfThreads times. XMPTask object is created the same number of times.
The time you measure is not the consumed time but the elapsed time.
If the program tested (the JVM) is the only one on the computer, it may be relatively accurate, but in the real world a lot of processes run concurrently.
I have already done this job by using a native call to the OS, on Windows (I'll complete this post Monday at my office) and Linux (/proc).
I think you would need to measure the time within the task class itself (XMPTask). Within that task you should be able to extract the id of the thread that is executing it and log that. Using this approach will require reading the logs and doing some calculations on them.
Another approach would be to keep running totals and averages as time progresses. To do this you could write a simple class that is passed into each task and that has some static (per-JVM) variables for tracking what each thread is doing. Then you could have a single thread outside the thread pool do the calculations. For example, if you wanted to report the average CPU time of each thread every second, this calculation thread could sleep for a second, then calculate and log all the average times, then sleep for a second, and so on.
EDIT: After re-reading the requirements: you don't need a background thread, but I'm not sure whether we are tracking the average time per thread or the average time per task. I have assumed total time and average time per thread and fleshed out the idea in the code below. It has not been tested or debugged, but it should give you a good idea of how to start:
public class Runner
{
    public void startRunning() throws InterruptedException
    {
        // Create your thread pool
        ExecutorService service = Executors.newFixedThreadPool(noOfThreads);
        readPropertyFiles();
        MeasureTime measure = new MeasureTime();
        // queue some tasks
        for (int i = 0, nextId = startRange; i < noOfThreads; i++, nextId += noOfTasks)
        {
            service.submit(new XMPTask(nextId, noOfTasks, tableList, measure));
        }
        service.shutdown();
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        measure.printTotalsAndAverages();
    }
}
public class MeasureTime
{
    HashMap<Long, Long> threadIdToTotalCPUTimeNanos = new HashMap<Long, Long>();
    HashMap<Long, Long> threadIdToStartTimeMillis = new HashMap<Long, Long>();
    HashMap<Long, Long> threadIdToStartTimeNanos = new HashMap<Long, Long>();

    private void addThread(Long threadId)
    {
        threadIdToTotalCPUTimeNanos.put(threadId, 0L);
        // record the wall-clock time at which this thread was first seen
        threadIdToStartTimeMillis.put(threadId, System.currentTimeMillis());
    }

    public void startTimeCount(Long threadId)
    {
        synchronized (threadIdToStartTimeNanos)
        {
            if (!threadIdToStartTimeNanos.containsKey(threadId))
            {
                addThread(threadId);
            }
            long nanos = System.nanoTime();
            threadIdToStartTimeNanos.put(threadId, nanos);
        }
    }

    public void endTimeCount(long threadId)
    {
        synchronized (threadIdToStartTimeNanos)
        {
            long endNanos = System.nanoTime();
            long startNanos = threadIdToStartTimeNanos.get(threadId);
            long nanos = threadIdToTotalCPUTimeNanos.get(threadId);
            nanos = nanos + (endNanos - startNanos);
            threadIdToTotalCPUTimeNanos.put(threadId, nanos);
        }
    }

    public void printTotalsAndAverages()
    {
        long totalForAllThreadsNanos = 0L;
        int numThreads = 0;
        long totalWallTimeMillis = 0;
        synchronized (threadIdToStartTimeNanos)
        {
            numThreads = threadIdToStartTimeMillis.size();
            for (Long threadId : threadIdToStartTimeNanos.keySet())
            {
                totalWallTimeMillis += System.currentTimeMillis() - threadIdToStartTimeMillis.get(threadId);
                long totalCPUTimeNanos = threadIdToTotalCPUTimeNanos.get(threadId);
                totalForAllThreadsNanos += totalCPUTimeNanos;
            }
        }
        long totalCPUMillis = totalForAllThreadsNanos / 1000000;
        System.out.println("Total milli-seconds for all threads: " + totalCPUMillis);
        // cast to double to avoid integer division truncating the averages
        double averageMillis = (double) totalCPUMillis / numThreads;
        System.out.println("Average milli-seconds for all threads: " + averageMillis);
        double averageCPUUtilisation = (double) totalCPUMillis / totalWallTimeMillis;
        System.out.println("Average CPU utilisation for all threads: " + averageCPUUtilisation);
    }
}
public class XMPTask implements Callable<String>
{
    private final MeasureTime measure;

    public XMPTask(// your parameters first
                   MeasureTime measure)
    {
        // Save your things first
        this.measure = measure;
    }

    @Override
    public String call() throws Exception
    {
        measure.startTimeCount(Thread.currentThread().getId());
        try
        {
            // do whatever work here that burns some CPU.
        }
        finally
        {
            measure.endTimeCount(Thread.currentThread().getId());
        }
        return "Your return thing";
    }
}
After writing all this, there is one thing that seems a bit strange: the XMPTask seems to know too much about the list of tasks. I think you should just create an XMPTask for every task you have, give it enough information to do the job, and submit the tasks to the service as you create them.

Fibonacci on Java ExecutorService runs faster sequentially than in parallel

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
    static int N = 35;
    static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
    static ExecutorService executor;

    public static void main(String[] args) {
        int nThreads = 2;
        System.out.println("Number of threads = " + nThreads);
        executor = Executors.newFixedThreadPool(nThreads);
        Executable.inQueue = new AtomicInteger(nThreads);
        long before = System.currentTimeMillis();
        System.out.println("Fibonacci " + N + " is ... ");
        executor.submit(new FibSimpler(N));
        waitToFinish();
        System.out.println(result.get());
        long after = System.currentTimeMillis();
        System.out.println("Duration: " + (after - before) + " milliseconds\n");
    }

    private static void waitToFinish() {
        while (0 < pendingTasks.get()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        executor.shutdown();
    }
}
class FibSimpler implements Runnable {
    int N;

    FibSimpler(int n) { N = n; }

    @Override
    public void run() {
        compute();
        MainSimpler.pendingTasks.decrementAndGet(); // see text
    }

    void compute() {
        int n = N;
        if (n <= 1) {
            MainSimpler.result.addAndGet(n); // see text
            return;
        }
        MainSimpler.executor.submit(new FibSimpler(n - 1));
        MainSimpler.pendingTasks.incrementAndGet(); // see text
        N = n - 2;
        compute(); // similar to the F/J counterpart
    }
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??
It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case, and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching, and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
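A rough sketch of that idea, where each worker accumulates locally and touches the shared total exactly once (this is my illustration of the suggestion, not the poster's code; doWork stands in for the real per-item computation):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkedSum {
    static final AtomicLong grandTotal = new AtomicLong();

    static long doWork(int i) { return i; } // hypothetical per-item work

    public static void main(String[] args) throws InterruptedException {
        final int n = 1000000;
        int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int chunk = n / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int from = t * chunk;
            final int to = (t == nThreads - 1) ? n : from + chunk;
            pool.execute(() -> {
                long local = 0;
                for (int i = from; i < to; i++) {
                    local += doWork(i); // accumulate locally, no contention
                }
                grandTotal.addAndGet(local); // one contended update per thread
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(grandTotal.get());
    }
}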

Java strange performance inconsistency

I have a simple recursive method, a depth first search. On each call, it checks if it's in a leaf, otherwise it expands the current node and calls itself on the children.
I'm trying to make it parallel, but I notice the following strange (for me) problem.
I measure execution time with System.currentTimeMillis().
When I break the search into a number of subsearches and add up their execution times, I get a bigger total than for the sequential search. I measure execution time only, no communication or synchronization, etc. I would expect the sum of the subtask times to equal the sequential time. This happens even without threads: if I just break the search into some subtasks and run them one after the other, I get a bigger total time.
If I add the number of method calls for the subtasks, I get the same number as the sequential search. So, basically, in both cases I do the same number of method calls, but I get different times.
I'm guessing there's some overhead on initial method calls, or something else caused by a JVM mechanism. Any ideas what it could be?
For example, one sequential search takes around 3300 ms. If I break it into 13 tasks, it takes a total time of 3500 ms.
My method looks like this:
private static final int dfs(State state) {
    method_calls++;
    if (state.isLeaf()) {
        return 1;
    }
    State[] children = state.expand();
    int result = 0;
    for (int i = 0; i < children.length; i++) {
        result += dfs(children[i]);
    }
    return result;
}
Whenever I call it, I do it like this:
for (int i = 0; i < num_tasks; i++) {
    long start = System.currentTimeMillis();
    dfs(tasks[i]);
    totalTime += (System.currentTimeMillis() - start);
}
The problem is that totalTime increases with num_tasks, and I would expect it to stay the same, because the method_calls variable stays the same.
You should average the numbers over longer runs. Secondly, the precision of currentTimeMillis may not be sufficient; you can try System.nanoTime().
As in all programming languages, whenever you call a procedure or a method, you have to push the environment, initialize the new one, execute the method's instructions, return the value on the stack, and finally restore the previous environment. That costs a bit! And creating a thread costs even more!
I suppose that if you enlarge the search tree, you will benefit from the parallelization.
Adding system clock time for several threads seems a weird idea. Either you are interested in the time until processing is complete, in which case adding doesn't make sense, or in cpu usage, in which case you should only count when the thread is actually scheduled to execute.
What probably happens is that, at least part of the time, more threads are ready to execute than the system has CPU cores, and the scheduler puts one of your threads to sleep, which causes it to take longer to complete. It makes sense that this effect is exacerbated the more threads you use. (Even if your program uses fewer threads than you have cores, other programs, such as your development environment, might.)
If you are interested in CPU usage, you might wish to query ThreadMXBean.getCurrentThreadCpuTime
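A minimal sketch of querying consumed CPU time this way (assuming the JVM/OS supports per-thread CPU timing):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.out.println("Thread CPU time not supported");
            return;
        }
        long before = bean.getCurrentThreadCpuTime(); // CPU time, not wall-clock time
        long sum = 0;
        for (int i = 0; i < 10000000; i++) {
            sum += i; // some stand-in work
        }
        long cpuNanos = bean.getCurrentThreadCpuTime() - before;
        System.out.println("sum=" + sum + ", consumed CPU: " + cpuNanos / 1000000 + " ms");
    }
}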
I'd expect to see Threads used. Something like this:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Puzzle {
    // note: updated from multiple threads without synchronization, so counts are approximate
    static volatile long totalTime = 0;
    private static int method_calls = 0;

    /**
     * @param args
     */
    public static void main(String[] args) {
        final int num_tasks = 13;
        final State[] tasks = new State[num_tasks];
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        for (int i = 0; i < num_tasks; i++) {
            threadPool.submit(new DfsRunner(tasks[i]));
        }
        try {
            threadPool.shutdown();
            threadPool.awaitTermination(1, TimeUnit.SECONDS); // may be too short for a real search
        } catch (InterruptedException e) {
            System.out.println("Interrupted");
        }
        System.out.println(method_calls + " Methods in " + totalTime + "msecs");
    }

    static final int dfs(State state) {
        method_calls++;
        if (state.isLeaf()) {
            return 1;
        }
        State[] children = state.expand();
        int result = 0;
        for (int i = 0; i < children.length; i++) {
            result += dfs(children[i]);
        }
        return result;
    }
}
With the runnable bit like this:
public class DfsRunner implements Runnable {
    private State state;

    public DfsRunner(State state) {
        super();
        this.state = state;
    }

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        Puzzle.dfs(state);
        Puzzle.totalTime += (System.currentTimeMillis() - start);
    }
}

Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
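(To illustrate what I mean by "explicitly passed in": a capture-and-restore wrapper along these lines, where REQUEST_CONTEXT is just an example variable, not real code from my app:)
public class ContextPassing {
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<String>();

    static Runnable wrap(final Runnable task) {
        final String captured = REQUEST_CONTEXT.get(); // read on the submitting thread
        return new Runnable() {
            public void run() {
                REQUEST_CONTEXT.set(captured); // install on the pool thread
                try {
                    task.run();
                } finally {
                    REQUEST_CONTEXT.remove(); // don't leak into the next pooled task
                }
            }
        };
    }
    // usage: pool.submit(wrap(work));
}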
This brings us back to the question: if I have a high-volume site where user request threads spawn half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java because object creation was expensive; that has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.
Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
    static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
        Thread[] tt = new Thread[threadCount];
        final int[] aa = new int[tt.length];
        System.out.print("Creating " + tt.length + " Thread objects... ");
        long t0 = System.nanoTime(), t00 = t0;
        for (int i = 0; i < tt.length; i++) {
            final int j = i;
            tt[i] = new Thread() {
                public void run() {
                    int k = j;
                    for (int l = 0; l < workAmountPerThread; l++) {
                        k += k * k + l;
                    }
                    aa[j] = k;
                }
            };
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Starting " + tt.length + " threads with " + workAmountPerThread + " steps of work per thread... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].start();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        System.out.print("Joining " + tt.length + " threads... ");
        t0 = System.nanoTime();
        for (int i = 0; i < tt.length; i++) {
            tt[i].join();
        }
        System.out.println(" Done in " + (System.nanoTime() - t0) * 1E-6 + " ms.");
        long totalTime = System.nanoTime() - t00;
        // display a checksum in order to give the JVM no chance to optimize out
        // the contents of the run() method and possibly even the thread creation
        int checkSum = 0;
        for (int a : aa) {
            checkSum += a;
        }
        System.out.println("Checksum: " + checkSum);
        System.out.println("Total time: " + totalTime * 1E-6 + " ms");
        System.out.println();
        return totalTime;
    }

    public static void main(String[] kr) throws InterruptedException {
        int workAmount = 100000000;
        int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
        int trialCount = 2;
        long[][] time = new long[threadCount.length][trialCount];
        for (int j = 0; j < trialCount; j++) {
            for (int i = 0; i < threadCount.length; i++) {
                time[i][j] = test(threadCount[i], workAmount / threadCount[i]);
            }
        }
        System.out.print("Number of threads ");
        for (long t : threadCount) {
            System.out.print("\t" + t);
        }
        System.out.println();
        for (int j = 0; j < trialCount; j++) {
            System.out.print((j + 1) + ". trial time (ms)");
            for (int i = 0; i < threadCount.length; i++) {
                System.out.print("\t" + Math.round(time[i][j] * 1E-6));
            }
            System.out.println();
        }
    }
}
The results on 64-bit Windows 7 with Sun's 32-bit Java 1.6.0_21 Client VM, on an Intel Core 2 Duo E6400 @ 2.13 GHz, are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected, since my computer has two cores. My computer can spawn nearly 10,000 threads per second, i.e. the thread creation overhead is about 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).
First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xssn JVM parameter - you'll want to use the lowest value you can get away with.
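For example, a smaller stack can be requested at startup (the value and class name here are purely illustrative; too small a value causes StackOverflowError):
java -Xss256k com.example.MyServer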
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.
For your benchmark you can use JMeter plus a profiler, which should give you a direct overview of the behaviour in such a heavily loaded environment. Just let it run for an hour and monitor memory, CPU, etc. If nothing breaks and the CPU(s) don't overheat, it's OK :)
Perhaps you could get a thread pool, or customize (extend) the one you are using, by adding some code to set the appropriate InheritableThreadLocals each time a Thread is acquired from the thread pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You could use these (well, with reflection) in combination with Thread.currentThread() to get the desired behaviour. However, this is a bit ad hoc, and furthermore, I can't tell whether the reflection won't introduce an even bigger overhead than just creating the threads.
I am wondering whether it is necessary to spawn new threads on each user request if their typical life cycle is as short as a second. Could you use some kind of notify/wait queue where you spawn a given number of (daemon) threads and they all wait until there's a task to solve? If the task queue gets long, you spawn additional threads, but not at a 1:1 ratio. It will most likely perform better than spawning hundreds of new threads whose life cycles are so short.
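That queue-plus-growing-pool behaviour is close to what java.util.concurrent already offers: a ThreadPoolExecutor with a core size, a bounded maximum and an idle timeout. A hedged sketch (the sizes are made up, not tuned):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GrowingPool {
    public static void main(String[] args) {
        ExecutorService pool = new ThreadPoolExecutor(
                4,                    // standby core workers, kept alive when idle
                64,                   // grow up to this bound under load, not 1:1 with requests
                30, TimeUnit.SECONDS, // extra idle threads die off after this timeout
                new SynchronousQueue<Runnable>());
        // note: with a SynchronousQueue, submissions are rejected once all 64 workers are busy
        pool.execute(new Runnable() {
            public void run() {
                System.out.println("task ran on " + Thread.currentThread().getName());
            }
        });
        pool.shutdown();
    }
}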
