Multithreading a massive file read

Multithreading a massive file read - java

I'm still in the process of wrapping my brain around how concurrency works in Java. I understand that (if you're subscribing to the OO Java 5 concurrency model) you implement a Task or Callable with a run() or call() method (respectively), and it behooves you to parallelize as much of that implemented method as possible.
But I'm still not understanding something inherent about concurrent programming in Java:
How is a Task's run() method assigned the right amount of concurrent work to be performed?
As a concrete example, what if I have an I/O-bound readMobyDick() method that reads the entire contents of Herman Melville's Moby Dick into memory from a file on the local system. And let's just say I want this readMobyDick() method to be concurrent and handled by 3 threads, where:
Thread #1 reads the first 1/3rd of the book into memory
Thread #2 reads the second 1/3rd of the book into memory
Thread #3 reads the last 1/3rd of the book into memory
Do I need to chunk Moby Dick up into three files and pass them each to their own task, or do I I just call readMobyDick() from inside the implemented run() method and (somehow) the Executor knows how to break the work up amongst the threads.
I am a very visual learner, so any code examples of the right way to approach this are greatly appreciated! Thanks!

You have probably chosen by accident the absolute worst example of parallel activities!
Reading in parallel from a single mechanical disk is actually slower than reading with a single thread, because you are in fact bouncing the mechanical head to different sections of the disk as each thread gets its turn to run. This is best left as a single threaded activity.
Let's take another example, which is similar to yours but can actually offer some benefit: assume I want to search for the occurrences of a certain word in a huge list of words (this list could even have come from a disk file, but like I said, read by a single thread). Assume I can use 3 threads like in your example, each searching on 1/3rd of the huge word list and keeping a local counter of how many times the searched word appeared.
In this case you'd want to partition the list in 3 parts, pass each part to a different object whose type implements Runnable and have the search implemented in the run method.
The runtime itself has no idea how to do the partitioning or anything like that, you have to specify it yourself. There are many other partitioning strategies, each with its own strengths and weaknesses, but we can stick to the static partitioning for now.
Let's see some code:
class SearchTask implements Runnable {
private int localCounter = 0;
private int start; // start index of search
private int end;
private List<String> words;
private String token;
public SearchTask(int start, int end, List<String> words, String token) {
this.start = start;
this.end = end;
this.words = words;
this.token = token;
}
public void run() {
for(int i = start; i < end; i++) {
if(words.get(i).equals(token)) localCounter++;
}
}
public int getCounter() { return localCounter; }
}
// meanwhile in main :)
List<String> words = new ArrayList<String>();
// populate words
// let's assume you have 30000 words
// create tasks
SearchTask task1 = new SearchTask(0, 10000, words, "John");
SearchTask task2 = new SearchTask(10000, 20000, words, "John");
SearchTask task3 = new SearchTask(20000, 30000, words, "John");
// create threads for each task
Thread t1 = new Thread(task1);
Thread t2 = new Thread(task2);
Thread t3 = new Thread(task3);
// start threads
t1.start();
t2.start();
t3.start();
// wait for threads to finish
t1.join();
t2.join();
t3.join();
// collect results
int counter = 0;
counter += task1.getCounter();
counter += task2.getCounter();
counter += task3.getCounter();
This should work nicely. Note that in practical cases you would build a more generic partitioning scheme. You could alternatively use an ExecutorService and implement Callable instead of Runnable if you wish to return a result.
So an alternative example using more advanced constructs:
class SearchTask implements Callable<Integer> {
private int localCounter = 0;
private int start; // start index of search
private int end;
private List<String> words;
private String token;
public SearchTask(int start, int end, List<String> words, String token) {
this.start = start;
this.end = end;
this.words = words;
this.token = token;
}
public Integer call() {
for(int i = start; i < end; i++) {
if(words.get(i).equals(token)) localCounter++;
}
return localCounter;
}
}
// meanwhile in main :)
List<String> words = new ArrayList<String>();
// populate words
// let's assume you have 30000 words
// create tasks
List<Callable> tasks = new ArrayList<Callable>();
tasks.add(new SearchTask(0, 10000, words, "John"));
tasks.add(new SearchTask(10000, 20000, words, "John"));
tasks.add(new SearchTask(20000, 30000, words, "John"));
// create thread pool and start tasks
ExecutorService exec = Executors.newFixedThreadPool(3);
List<Future> results = exec.invokeAll(tasks);
// wait for tasks to finish and collect results
int counter = 0;
for(Future f: results) {
counter += f.get();
}

You picked a bad example, as Tudor was so kind to point out. Spinning disk hardware is subject to physical constraints of moving platters and heads, and the most efficient read implementation is to read each block in order, which reduces the need to move the head or wait for the disk to align.
That said, some operating systems don't always store things continuously on disks, and for those who remember, defragmentation could provide a disk performance boost if you OS / filesystem didn't do the job for you.
As you mentioned wanting a program that would benefit, let me suggest a simple one, matrix addition.
Assuming you made one thread per core, you can trivially divide any two matrices to be added into N (one for each thread) rows. Matrix addition (if you recall) works as such:
A + B = C
or
[ a11, a12, a13 ] [ b11, b12, b13] = [ (a11+b11), (a12+b12), (a13+c13) ]
[ a21, a22, a23 ] + [ b21, b22, b23] = [ (a21+b21), (a22+b22), (a23+c23) ]
[ a31, a32, a33 ] [ b31, b32, b33] = [ (a31+b31), (a32+b32), (a33+c33) ]
So to distribute this across N threads, we simply need to take the row count and modulus divide by the number of threads to get the "thread id" it will be added with.
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
Now each thread "knows" which rows it should handle, and the results "per row" can be computed trivially because the results do not cross into other thread's domain of computation.
All that is needed now is a "result" data structure which tracks when the values have been computed, and when last value is set, then the computation is complete. In this "fake" example of a matrix addition result with two threads, computing the answer with two threads takes approximately half the time.
// the following assumes that threads don't get rescheduled to different cores for
// illustrative purposes only. Real Threads are scheduled across cores due to
// availability and attempts to prevent unnecessary core migration of a running thread.
[ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
[ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
[ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
[ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
More complex problems can be solved by multithreading, and different problems are solved with different techniques. I purposefully picked one of the simplest examples.

you implement a Task or Callable with a run() or call() method
(respectively), and it behooves you to parallelize as much of that
implemented method as possible.
A Task represents a discrete unit of work
Loading a file into memory is a discrete unit of work and can therefore this activity can be delegated to a background thread. I.e. a background thread runs this task of loading the file.
It is a discrete unit of work since it has no other dependencies needed in order to do its job (load the file) and has discrete boundaries.
What you are asking is to further divide this into task. I.e. a thread loads 1/3 of the file while another thread the 2/3 etc.
If you were able to divide the task into further subtasks then it would not be a task in the first place by definition. So loading a file is a single task by itself.
To give you an example:
Let's say that you have a GUI and you need to present to the user data from 5 different files. To present them you need also to prepare some data structures to process the actual data.
All these are separate tasks.
E.g. the loading of files is 5 different tasks so could be done by 5 different threads.
The preparation of the data structures could be done a different thread.
The GUI runs of course in another thread.
All these can happen concurrently

If you system supported high-throughput I/O , here is how you can do it:
How to read a file using multiple threads in Java when a high throughput(3GB/s) file system is available
Here is the solution to read a single file with multiple threads.
Divide the file into N chunks, read each chunk in a thread, then merge them in order. Beware of lines that cross chunk boundaries. It is the basic idea as suggested by user
slaks
Bench-marking below implementation of multiple-threads for a single 20 GB file:
1 Thread : 50 seconds : 400 MB/s
2 Threads: 30 seconds : 666 MB/s
4 Threads: 20 seconds : 1GB/s
8 Threads: 60 seconds : 333 MB/s
Equivalent Java7 readAllLines() : 400 seconds : 50 MB/s
Note: This may only work on systems that are designed to support high-throughput I/O , and not on usual personal computers
Here is the essential nits of the code, for complete details , follow the link
public class FileRead implements Runnable
{
private FileChannel _channel;
private long _startLocation;
private int _size;
int _sequence_number;
public FileRead(long loc, int size, FileChannel chnl, int sequence)
{
_startLocation = loc;
_size = size;
_channel = chnl;
_sequence_number = sequence;
}
#Override
public void run()
{
System.out.println("Reading the channel: " + _startLocation + ":" + _size);
//allocate memory
ByteBuffer buff = ByteBuffer.allocate(_size);
//Read file chunk to RAM
_channel.read(buff, _startLocation);
//chunk to String
String string_chunk = new String(buff.array(), Charset.forName("UTF-8"));
System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
}
//args[0] is path to read file
//args[1] is the size of thread pool; Need to try different values to fing sweet spot
public static void main(String[] args) throws Exception
{
FileInputStream fileInputStream = new FileInputStream(args[0]);
FileChannel channel = fileInputStream.getChannel();
long remaining_size = channel.size(); //get the total number of bytes in the file
long chunk_size = remaining_size / Integer.parseInt(args[1]); //file_size/threads
//thread pool
ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt(args[1]));
long start_loc = 0;//file pointer
int i = 0; //loop counter
while (remaining_size >= chunk_size)
{
//launches a new thread
executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i));
remaining_size = remaining_size - chunk_size;
start_loc = start_loc + chunk_size;
i++;
}
//load the last remaining piece
executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i));
//Tear Down
}
}

Related

Rest parellel calls to service -Multithreading in java

I have a rest call api where max count of result return by the api is 1000.start page=1
{
"status": "OK",
"payload": {
"EMPList":[],
count:5665
}
So to get other result I have to change the start page=2 and again hit the service.again will get 1000 results only.
but after first call i want to make it as a parallel call and I want to collect the result and combine it and send it back to calling service in java. Please suggest i am new to java.i tried using callable but it's not working

It seems to me that ideally you should be able to configure your max count to one appropriate for your use case. I'm assuming you aren't able to do that. Here is a simple, lock-less, multi threading scheme that acts as a simple reduction operation for your two network calls:
// online runnable: https://ideone.com/47KsoS
int resultSize = 5;
int[] result = new int[resultSize*2];
Thread pg1 = new Thread(){
public void run(){
System.out.println("Thread 1 Running...");
// write numbers 1-5 to indexes 0-4
for(int i = 0 ; i < resultSize; i ++) {
result[i] = i + 1;
}
System.out.println("Thread 1 Exiting...");
}
};
Thread pg2 = new Thread(){
public void run(){
System.out.println("Thread 2 Running");
// write numbers 5-10 to indexes 5-9
for(int i = 0 ; i < resultSize; i ++) {
result[i + resultSize] = i + 1 + resultSize;
}
System.out.println("Thread 2 Exiting...");
}
};
pg1.start();
pg2.start();
// ensure that pg1 execution finishes
pg1.join();
// ensure that pg2 execution finishes
pg2.join();
// print result of reduction operation
System.out.println(Arrays.toString(result));
There is a very important caveat with this implementation however. You will notice that both of the threads DO NOT overlap in their memory writes. This is very important as if you were to simply change our int[] result to ArrayList<Integer> this could lead to catastrophic failure in our reduction operation between the two threads called a Race Condition (I believe the standard ArrayList implementation in Java is not thread safe). Since we can guarantee how large our result will be I would highly suggest sticking to my usage of an array for this multi-threaded implementation as ArrayLists hide a lot of implementation logic from you that you likely won't understand until you take a basic data-structures course.

Compiler ignore threads priorities

I tried to compile the example from Thinking in Java by Bruce Eckel:
import java.util.concurrent.*;
public class SimplePriorities implements Runnable {
private int countDown = 5;
private volatile double d; // No optimization
private int priority;
public SimplePriorities(int priority) {
this.priority = priority;
}
public String toString() {
return Thread.currentThread() + ": " + countDown;
}
public void run() {
Thread.currentThread().setPriority(priority);
while(true) {
// An expensive, interruptable operation:
for(int i = 1; i < 100000; i++) {
d += (Math.PI + Math.E) / (double)i;
if(i % 1000 == 0)
Thread.yield();
}
System.out.println(this);
if(--countDown == 0) return;
}
}
public static void main(String[] args) {
ExecutorService exec = Executors.newCachedThreadPool();
for(int i = 0; i < 5; i++)
exec.execute(
new SimplePriorities(Thread.MIN_PRIORITY));
exec.execute(
new SimplePriorities(Thread.MAX_PRIORITY));
exec.shutdown();
}
}
According to the book, the output has to look like:
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-6,10,main]: 4
Thread[pool-1-thread-6,10,main]: 3
Thread[pool-1-thread-6,10,main]: 2
Thread[pool-1-thread-6,10,main]: 1
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-1,1,main]: 5
...
But in my case 6th thread doesn't execute its task at first and threads are disordered. Could you please explain me what's wrong? I just copied the source and didn't add any strings of code.

The code is working fine and with the output from the book.
Your IDE probably has console window with the scroll bar - just scroll it up and see the 6th thread first doing its job.
However, the results may differ depending on OS / JVM version. This code runs as expected for me on Windows 10 / JVM 8

There are two issues here:
If two threads with the same priority want to write output, which one goes first?
The order of threads (with the same priority) is undefined, therefore the order of output is undefined. It is likely that a single thread is allowed to write several outputs in a row (because that's how most thread schedulers work), but it could also be completely random, or anything in between.
How many threads will a cached thread pool create?
That depends on your system. If you run on a dual-core system, creating more than 4 threads is pointless, because there hardly won't be any CPU available to execute those threads. In this scenario further tasks will be queued and executed only after earlier tasks are completed.
Hint: there is also a fixed-size thread pool, experimenting with that should change the output.
In summary there is nothing wrong with your code, it is just wrong to assume that threads are executed in any order. It is even technically possible (although very unlikely), that the first task is already completed before the last task is even started. If your book says that the above order is "correct" then the book is simply wrong. On an average system that might be the most likely output, but - as above - with threads there is never any order, unless you enforce it.
One way to enforce it are thread priorities - higher priorities will get their work done first - you can find other concepts in the concurrent package.

Why this piece of Java code is not running concurrently

I have written Sieve of Eratosthenes which is supposed to work in parallel, but it's not. When I increase number of threads, time of computing is not getting lower. Any ideas why?
Main class
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class ConcurrentTest {
public static void main(String[] args) throws InterruptedException {
Sieve task = new Sieve();
int x = 1000000;
int threads = 4;
task.setArray(x);
Long beg = new Date().getTime();
ExecutorService exec = Executors.newCachedThreadPool();
for (int i = 0; i < threads; i++) {
exec.execute(task);
}
exec.shutdown();
Long time = 0L;
// Main thread is waiting until all threads are terminated
// ( it means that computing is done)
while (true)
if (exec.isTerminated()) {
time = new Date().getTime() - beg;
break;
}
System.out.println("Time is " + time);
}
}
Sieve class
import java.util.concurrent.ConcurrentHashMap;
public class Sieve implements Runnable {
private ConcurrentHashMap<Integer, Boolean> array =
new ConcurrentHashMap<Integer, Boolean>();
private int x;
public void run() {
while(true){
// I am getting synchronized number to check if it's prime
int n = getCounter();
// If no more numbers to check, stop loop
if( n == -1)
break;
// If HashMap contains number, we can further
if(!array.containsKey(n))continue;
for (int i = 2 * n; i <= x; i += n) {
// Compound numbers are removed from HashMap, Eg. 6, 12 and much more.
array.remove(i);
}
}
}
private synchronized int getCounter(){
if( counter < x)
return counter++;
else return -1;
}
public void setArray(int x) {
this.x = x;
for (int i = 2; i <= x; i++)
array.put(i, false);
}
}
I made some tests with different number of threads. These are results:
Nr of threads 1 Time is 1850, 1795, 1825
Nr of threads 2 Time is 1845, 1836, 1814
Nr of threads 3 Time is 1767, 1820, 1756
Nr of threads 4 Time is 1732, 1840, 2083
Nr of threads 5 Time is 1791, 1795, 1803
Nr of threads 6 Time is 1825, 1728, 1707
Nr of threads 7 Time is 1754, 1729, 1686
Nr of threads 8 Time is 1760, 1717, 1817
Nr of threads 9 Time is 1721, 1699, 1673
Nr of threads 10 Time is 1661, 1722, 1718

When I increase number of threads, time of computing is not getting
lower
tl;dr: your problem size is too small. If you increase x to 10000000, the differences will become more obvious. They won't be what you're expecting, though.
I tried your code on an eight core machine with two slight modifications:
For timing, I used System.nanoTime() instead of getTime() on a Date.
I used the awaitTermination method of ExecutorService rather than a spinloop to check for the end of run.
I tried launching your Sieve tasks using a fixed thread pool, a cached thread pool and a fork join pool and comparing the results of different values for your thread variable.
I see the following results (in milliseconds) on my machine with x=10000000:
Thread count = 1 2 4 8 16
Fixed thread pool = 5451 3866 3639 3227 3120
Cached thread pool= 5434 3763 3709 3258 3078
Fork-join pool = 6732 3670 3735 3190 3102
What these results show us is a clear benefit of changing from a single thread of execution to two threads. However, the benefit of additional threads drops off rapidly. There's an interesting plateau going from two to four threads and marginal benefits up to 16.
In addition, you can also see that the different threading mechanisms have different initial overhead: I didn't expect the Fork-Join pool to cost that much more to start than the other mechanisms.
So, as written, you shouldn't really expect a benefit past two threads for small but non-trivial problem sets.
If you'd like to increase the benefit of additional threads, you're going to need to look at your current implementation. For example, when I switched from your synchronized getCounter() to an AtomicInteger using incrementAndGet(), I eliminated the overhead of the synchronized method. The result is that all of my four thread numbers dropped on the order of 1000 milliseconds.

Fibonacci on Java ExecutorService runs faster sequentially than in parallel

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
#Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??

It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When n == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).

Java thread creation overhead

Conventional wisdom tells us that high-volume enterprise java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.
There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.
This brings us back to the question - if I have a high volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java, because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.
I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.
Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.

Here is an example microbenchmark:
public class ThreadSpawningPerformanceTest {
static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
Thread[] tt = new Thread[threadCount];
final int[] aa = new int[tt.length];
System.out.print("Creating "+tt.length+" Thread objects... ");
long t0 = System.nanoTime(), t00 = t0;
for (int i = 0; i < tt.length; i++) {
final int j = i;
tt[i] = new Thread() {
public void run() {
int k = j;
for (int l = 0; l < workAmountPerThread; l++) {
k += k*k+l;
}
aa[j] = k;
}
};
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Starting "+tt.length+" threads with "+workAmountPerThread+" steps of work per thread... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].start();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
System.out.print("Joining "+tt.length+" threads... ");
t0 = System.nanoTime();
for (int i = 0; i < tt.length; i++) {
tt[i].join();
}
System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
long totalTime = System.nanoTime()-t00;
int checkSum = 0; //display checksum in order to give the JVM no chance to optimize out the contents of the run() method and possibly even thread creation
for (int a : aa) {
checkSum += a;
}
System.out.println("Checksum: "+checkSum);
System.out.println("Total time: "+totalTime*1E-6+" ms");
System.out.println();
return totalTime;
}
public static void main(String[] kr) throws InterruptedException {
int workAmount = 100000000;
int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
int trialCount = 2;
long[][] time = new long[threadCount.length][trialCount];
for (int j = 0; j < trialCount; j++) {
for (int i = 0; i < threadCount.length; i++) {
time[i][j] = test(threadCount[i], workAmount/threadCount[i]);
}
}
System.out.print("Number of threads ");
for (long t : threadCount) {
System.out.print("\t"+t);
}
System.out.println();
for (int j = 0; j < trialCount; j++) {
System.out.print((j+1)+". trial time (ms)");
for (int i = 0; i < threadCount.length; i++) {
System.out.print("\t"+Math.round(time[i][j]*1E-6));
}
System.out.println();
}
}
}
The results on 64-bit Windows 7 with 32-bit Sun's Java 1.6.0_21 Client VM on Intel Core2 Duo E6400 #2.13 GHz are as follows:
Number of threads 1 2 10 100 1000 10000 100000
1. trial time (ms) 346 181 179 191 286 1229 11308
2. trial time (ms) 346 181 187 189 281 1224 10651
Conclusions: Two threads do the work almost twice as fast as one, as expected since my computer has two cores. My computer can spawn nearly 10000 threads per second, i. e. thread creation overhead is 0.1 milliseconds. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the numbers in the columns for 2 and 100 threads).

First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):
One major factor is the stack memory allocated to each thread, which you can tune using the -Xssn JVM parameter - you'll want to use the lowest value you can get away with.
And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.

for your benchmark you can use JMeter + a profiler, which should give you direct overview on the behaviour in such a heavy-loaded environment. Just let it run for a an hour and monitor memory, cpu, etc. If nothing breaks and the cpu(s) doesn't overheat, it's ok :)
perhaps you can get a thread-pool, or customize (extend) the one you are using by adding some code in order to have the appropriate InheritableThreadLocals set each time a Thread is acquired from the thread-pool.
Each Thread has these package-private properties:
/* ThreadLocal values pertaining to this thread. This map is maintained
* by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;
/*
* InheritableThreadLocal values pertaining to this thread. This map is
* maintained by the InheritableThreadLocal class.
*/
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
You can use these (well, with reflection) in combination with the Thread.currentThread() to have the desired behaviour. However this is a bit ad-hock, and furthermore, I can't tell whether it (with the reflection) won't introduce even bigger overhead than just creating the threads.

I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of Notify/Wait queue where you spawn a given number of (daemon)threads, and they all wait until there's a task to solve. If the task queue gets long, you spawn additional threads, but not on a 1-1 ratio. It will most likely be perform better then spawning hundreds of new threads whose life-cycles are so short.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.