I am trying to run a set of parallel threads in Java. I create them through a higher-order function as follows:
public static void parallelizedMap(Consumer<String> f, List<String> list, int count) {
    List<List<String>> parts = new ArrayList<List<String>>();
    final int N = list.size();
    int L = N / (count - 1);
    for (int i = 0; i < N; i += L) {
        parts.add(new ArrayList<String>(list.subList(i, Math.min(N, i + L))));
    }
    for (List<String> e : parts) {
        Runnable r = new Runnable() {
            public void run() {
                e.forEach(f);
            }
        };
        new Thread(r).start();
    }
}
This method is called every few minutes. After several minutes it has created hundreds of threads. Each thread should only run for about 20 seconds, but my debugging showed that they never terminate, and therefore I get this exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Thanks in advance.
I do not think the threads themselves are the problem. But look what happens if count - 1 is greater than the size of the list:
final int N = list.size();  // N = 10, count = 22
int L = N / (count - 1);    // L = 10 / (22 - 1) = 0 (integer division truncates 0.476 to 0)
for (int i = 0; i < N; i += 0) {
    parts.add(new ArrayList<String>(list.subList(i, Math.min(N, i + 0))));
}
The iteration is an endless loop, and although each added sub-list is empty, memory usage still grows on every pass because a new ArrayList is appended each time. Naturally the result is an OutOfMemoryError.
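One safe way to fix this (a sketch, not the poster's code; the helper name partition is made up) is to compute the chunk size with ceiling division and clamp it to at least 1, so the loop index always advances:

```java
import java.util.ArrayList;
import java.util.List;

public class Partition {
    // Chunk size is computed with ceiling division and clamped to >= 1,
    // so the loop below always advances even when count > list.size().
    static List<List<String>> partition(List<String> list, int count) {
        final int n = list.size();
        final int chunk = Math.max(1, (n + count - 1) / count);
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i += chunk) {
            parts.add(new ArrayList<>(list.subList(i, Math.min(n, i + chunk))));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        System.out.println(partition(data, 22).size()); // 10 chunks of one element each
        System.out.println(partition(data, 3).size());  // chunk size 4 -> 3 chunks
    }
}
```

With 10 elements and count = 22 this yields 10 single-element chunks instead of looping forever.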
For me the solution was to give every thread its own JDBC connection instead of letting them share one. This resolved the deadlock, and the threads terminate as expected.
I am trying to get familiar with multithreaded Java applications, so I tried to think of a simple application that parallelizes well. I thought vector addition would be a good candidate.
However, when running on my Linux server (which has 4 cores) I don't get any speed-up: the execution times with 4, 2, and 1 threads are about the same.
Here is the code I came up with:
public static void main(String[] args) throws InterruptedException {
    final int threads = Integer.parseInt(args[0]);
    final int length = Integer.parseInt(args[1]);
    final int balk = (length / threads);
    Thread[] th = new Thread[threads];
    final double[] result = new double[length];
    final double[] array1 = getRandomArray(length);
    final double[] array2 = getRandomArray(length);
    long startingTime = System.nanoTime();
    for (int i = 0; i < threads; i++) {
        final int current = i;
        th[i] = new Thread(() -> {
            for (int k = current * balk; k < (current + 1) * balk; k++) {
                result[k] = array1[k] + array2[k];
            }
        });
        th[i].start();
    }
    for (int i = 0; i < threads; i++) {
        th[i].join();
    }
    System.out.println("Time needed: " + (System.nanoTime() - startingTime));
}
length is always a multiple of threads, and getRandomArray() creates an array of random doubles between 0 and 1.
Execution Time for 1-Thread: 84579446ns
Execution Time for 2-Thread: 74211325ns
Execution Time for 4-Thread: 89215100ns
length = 10000000
Here is the Code for getRandomArray():
private static double[] getRandomArray(int length) {
    Random random = new Random();
    double[] array = new double[length];
    for (int i = 0; i < length; i++) {
        array[i] = random.nextDouble();
    }
    return array;
}
I would appreciate any help.
The difference is observable for the following code. Try it.
public static void main(String[] args) throws InterruptedException {
    for (int z = 0; z < 10; z++) {
        final int threads = 1;
        final int length = 100_000_000;
        final int balk = (length / threads);
        Thread[] th = new Thread[threads];
        final boolean[] result = new boolean[length];
        final boolean[] array1 = getRandomArray(length);
        final boolean[] array2 = getRandomArray(length);
        long startingTime = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            final int current = i;
            th[i] = new Thread(() -> {
                for (int k = current * balk; k < (current + 1) * balk; k++) {
                    result[k] = array1[k] | array2[k];
                }
            });
            th[i].start();
        }
        for (int i = 0; i < threads; i++) {
            th[i].join();
        }
        System.out.println("Time needed: " + (System.nanoTime() - startingTime) * 1.0 / 1000 / 1000);
        boolean x = false;
        for (boolean d : result) {
            x |= d;
        }
        System.out.println(x);
    }
}
First things first: you need to warm up your code so that you measure compiled code. The first two iterations take approximately the same time, but the following ones differ. I also changed double to boolean because my machine doesn't have much memory; this lets me allocate a huge array, and it also makes the work more CPU-bound.
There is a link in the comments; I suggest you read it.
From my side: if you want to see how your cores share work, you can give every core a very simple task, but make the threads work constantly on something that is not shared across them (basically simulating, say, merge sort, where threads work on something complicated and only use shared resources for a small fraction of the time). Using your code, I did something like this. In such a case you should see almost exactly a 2x and a 4x speed-up.
public static void main(String[] args) throws InterruptedException {
    for (int a = 0; a < 5; a++) {
        final int threads = 2;
        final int length = 10;
        final int balk = (length / threads);
        Thread[] th = new Thread[threads];
        System.out.println(Runtime.getRuntime().availableProcessors());
        final double[] result = new double[length];
        final double[] array1 = getRandomArray(length);
        final double[] array2 = getRandomArray(length);
        long startingTime = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            final int current = i;
            th[i] = new Thread(() -> {
                Random random = new Random();
                int meaningless = 0;
                for (int k = current * balk; k < (current + 1) * balk; k++) {
                    result[k] = array1[k] + array2[k];
                    for (int j = 0; j < 10000000; j++) {
                        meaningless += random.nextInt(10);
                    }
                }
            });
            th[i].start();
        }
        for (int i = 0; i < threads; i++) {
            th[i].join();
        }
        System.out.println("Time needed: " + ((System.nanoTime() - startingTime) * 1.0) / 1000000000 + " s");
    }
}
You see, in your code most of the time is spent building the big arrays, after which the threads finish very quickly; their work is so fast that your time measurement is misleading, because most of the measured time goes into creating the threads. When I ran code that works on the precalculated data in a plain loop, like this:
long startingTime = System.nanoTime();
for (int k = 0; k < length; k++) {
    result[k] = array1[k] | array2[k];
}
System.out.println("Time needed: " + (System.nanoTime() - startingTime));
it ran twice as fast as your code with 2 threads. I hope you see my point now that I have given my threads much more meaningless work.
For my work I have run some tests to build a timing chart, and I came across something that surprised me; I need help understanding it.
I used a few data structures as queues and wanted to know how fast deletion is relative to the number of items. An ArrayList with 10 items, deleting from the front, with no initial capacity set, is much slower than the same list with the initial capacity set (to 15). Why? And why is it the same at 100 items?
Here's the chart:
Data Structures: L - implements List, C - set initial capacity, B - removing from back, Q - implements Queue
EDIT:
Appending the relevant piece of code:
new Thread(new Runnable() {
    @Override
    public void run()
    {
        long time;
        final int[] arr = {10, 100, 1000, 10000, 100000, 1000000};
        for (int anArr : arr)
        {
            final List<Word> temp = new ArrayList<>();
            while (temp.size() < anArr) temp.add(new Item());
            final int top = (int) Math.sqrt(anArr);
            final List<Word> first = new ArrayList<>();
            final List<Word> second = new ArrayList<>(anArr);
            ...
            first.addAll(temp);
            second.addAll(temp);
            ...
            SystemClock.sleep(5000);
            time = System.nanoTime();
            for (int i = 0; i < top; ++i) first.remove(0);
            Log.d("al_l", "rem: " + (System.nanoTime() - time));
            time = System.nanoTime();
            for (int i = 0; i < top; ++i) second.remove(0);
            Log.d("al_lc", "rem: " + (System.nanoTime() - time));
            ...
        }
    }
}).start();
Read this article about Avoiding Benchmarking Pitfalls on the JVM. It explains the impact of the HotSpot VM on test results; if you don't take care of it, your measurements aren't valid, as you found out with your own test.
If you want to do reliable benchmarking, use JMH.
I too was able to replicate this with the code below. However, I noticed that whichever variant runs first (set capacity vs. no set capacity) is the one that takes longest. I assume this is some kind of optimization, maybe by the JVM, or some kind of caching.
public class Test {
    public static void main(String[] args) {
        measure(-1, 10); // switch with line below
        measure(15, 10); // switch with line above
        measure(-1, 100);
        measure(15, 100);
    }

    public static void measure(int capacity, long numItems) {
        ArrayList<String> arr = new ArrayList<>();
        if (capacity >= 1) {
            arr.ensureCapacity(capacity);
        }
        for (int i = 0; i <= numItems; i++) {
            arr.add("T");
        }
        long start = System.nanoTime();
        for (int i = 0; i <= numItems; i++) {
            arr.remove(0);
        }
        long end = System.nanoTime();
        System.out.println("Capacity: " + capacity + ", " + "Runtime: " + (end - start));
    }
}
In our project we used a static Random instance for random number generation in one of the tasks. With the Java 7 release, the new ThreadLocalRandom class appeared for generating random numbers.
From the spec:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
and also:
When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.
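The "form" the javadoc refers to is calling ThreadLocalRandom.current() at every use site rather than storing the instance in a field. A minimal sketch of that pattern (the class name is mine):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrUsage {
    public static void main(String[] args) throws InterruptedException {
        // current() is called at the use site, so each thread transparently
        // gets its own generator and nothing can be shared by accident.
        Runnable task = () -> {
            int n = ThreadLocalRandom.current().nextInt(5000);
            System.out.println(Thread.currentThread().getName() + " drew " + n);
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```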
So I've made my little test:
public class ThreadLocalRandomTest {
    private static final int THREAD_COUNT = 100;
    private static final int GENERATED_NUMBER_COUNT = 1000;
    private static final int INT_RIGHT_BORDER = 5000;
    private static final int EXPERIMENTS_COUNT = 5000;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Number of threads: " + THREAD_COUNT);
        System.out.println("Length of generated numbers chain for each thread: " + GENERATED_NUMBER_COUNT);
        System.out.println("Right border integer: " + INT_RIGHT_BORDER);
        System.out.println("Count of experiments: " + EXPERIMENTS_COUNT);
        int repeats = 0;
        int workingTime = 0;
        long startTime = 0;
        long endTime = 0;
        for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
            startTime = System.currentTimeMillis();
            repeats += calculateRepeatsForSharedRandom();
            endTime = System.currentTimeMillis();
            workingTime += endTime - startTime;
        }
        System.out.println("Average repeats for shared Random instance: " + repeats / EXPERIMENTS_COUNT
                + ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
        repeats = 0;
        workingTime = 0;
        for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
            startTime = System.currentTimeMillis();
            repeats += calculateRepeatsForThreadLocalRandom();
            endTime = System.currentTimeMillis();
            workingTime += endTime - startTime;
        }
        System.out.println("Average repeats for ThreadLocalRandom: " + repeats / EXPERIMENTS_COUNT
                + ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
    }

    private static int calculateRepeatsForSharedRandom() throws InterruptedException {
        final Random rand = new Random();
        final Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < THREAD_COUNT; i++) {
            Thread thread = new Thread() {
                @Override
                public void run() {
                    for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
                        int random = rand.nextInt(INT_RIGHT_BORDER);
                        if (!counts.containsKey(random)) {
                            counts.put(random, 0);
                        }
                        counts.put(random, counts.get(random) + 1);
                    }
                }
            };
            thread.start();
            thread.join();
        }
        int repeats = 0;
        for (Integer value : counts.values()) {
            if (value > 1) {
                repeats += value;
            }
        }
        return repeats;
    }

    private static int calculateRepeatsForThreadLocalRandom() throws InterruptedException {
        final Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < THREAD_COUNT; i++) {
            Thread thread = new Thread() {
                @Override
                public void run() {
                    for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
                        int random = ThreadLocalRandom.current().nextInt(INT_RIGHT_BORDER);
                        if (!counts.containsKey(random)) {
                            counts.put(random, 0);
                        }
                        counts.put(random, counts.get(random) + 1);
                    }
                }
            };
            thread.start();
            thread.join();
        }
        int repeats = 0;
        for (Integer value : counts.values()) {
            if (value > 1) {
                repeats += value;
            }
        }
        return repeats;
    }
}
I've also added a test for a non-shared Random and got the following results:
Number of threads: 100
Length of generated numbers chain for each thread: 100
Right border integer: 5000
Count of experiments: 10000
Average repeats for non-shared Random instance: 8646. Average working time: 13 ms.
Average repeats for shared Random instance: 8646. Average working time: 13 ms.
Average repeats for ThreadLocalRandom: 8646. Average working time: 13 ms.
To me this is a little strange, as I expected at least some speed increase when using ThreadLocalRandom compared to a shared Random instance, but I see no difference at all.
Can someone explain why it works that way? Maybe I haven't done the testing properly. Thank you!
You're not running anything in parallel because you're waiting for each thread to finish immediately after starting it. You need a waiting loop outside the loop that starts the threads:
List<Thread> threads = new ArrayList<Thread>();
for (int i = 0; i < THREAD_COUNT; i++) {
    Thread thread = new Thread() {
        @Override
        public void run() {
            for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
                int random = rand.nextInt(INT_RIGHT_BORDER);
                if (!counts.containsKey(random)) {
                    counts.put(random, 0);
                }
                counts.put(random, counts.get(random) + 1);
            }
        }
    };
    threads.add(thread);
    thread.start();
}
for (Thread thread : threads) {
    thread.join();
}
For one, your testing code is flawed. This is the bane of benchmarkers everywhere.
thread.start();
thread.join();
why not save LOCs and write
thread.run();
the outcome is the same.
EDIT: If you don't see why the outcome is the same, it means you're running single-threaded tests; there's no multithreading going on.
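To make the point concrete, here is a small demonstration (my own sketch): run() executes the body on the calling thread, so start()-then-join-immediately is behaviourally the same as a plain sequential call.

```java
public class RunVsStart {
    public static void main(String[] args) throws InterruptedException {
        // run() does not create a thread; the body executes on the caller.
        Thread t1 = new Thread(() ->
                System.out.println("run() executed on: " + Thread.currentThread().getName()));
        t1.run(); // prints the caller's thread name, i.e. "main"

        // start() hands the body to a freshly created thread.
        Thread t2 = new Thread(() ->
                System.out.println("start() executed on: " + Thread.currentThread().getName()));
        t2.start();
        t2.join();
    }
}
```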
Maybe it would be easier to just look at what actually happens. Here is the source for ThreadLocal.get(), which is also called by ThreadLocalRandom.current().
public T get() {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);
    if (map != null) {
        ThreadLocalMap.Entry e = map.getEntry(this);
        if (e != null)
            return (T) e.value;
    }
    return setInitialValue();
}
Here ThreadLocalMap is a specialized HashMap-like implementation with optimizations.
So what basically happens is that ThreadLocal holds a map Thread->Object (or in this case Thread->Random) which is looked up, and the value is either returned or created. As this is nothing 'magical', the timing will be equal to a HashMap lookup plus the one-time creation overhead of the actual object to be returned. Since a HashMap lookup (in this optimized case) takes constant time, the cost of a lookup is k, where k is essentially the computation cost of the hash function.
So you can make some assumptions:
ThreadLocal will be faster than creating the object each time in each Runnable, unless the creation cost is much smaller than k. So looking up Random is a good thing, putting an int inside might not be so smart.
ThreadLocal will be better than using your own HashMap, as such a generic implementation can be assumed to be equal to k or worse.
ThreadLocal will be slower than using any lookup with a cost < k. Example: store everything in an array first, then do myRandoms[threadID].
But then this assumes that you know which threads will be processing your work in the first place, so this isn't a real candidate for ThreadLocal anyways.
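For illustration, here is what the array-based lookup might look like (a sketch under the assumption that the set of worker slots is known up front; all names here are mine):

```java
import java.util.Random;

public class IndexedRandoms {
    // One generator per known worker slot; an array index lookup is
    // cheaper than any hash-based lookup, but it only works because
    // the slot IDs are fixed in advance.
    static final int WORKERS = 4;
    static final Random[] myRandoms = new Random[WORKERS];
    static {
        for (int i = 0; i < WORKERS; i++) {
            myRandoms[i] = new Random(i); // fixed seeds, for reproducibility
        }
    }

    static int draw(int threadID, int bound) {
        return myRandoms[threadID].nextInt(bound);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            final int threadID = i;
            threads[i] = new Thread(() -> {
                int sum = 0;
                for (int j = 0; j < 1000; j++) {
                    sum += draw(threadID, 10); // each thread touches only its own slot
                }
                System.out.println("worker " + threadID + " sum = " + sum);
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
```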
This question is identical to this one:
Two loop bodies or one (result identical)
but in my case I use Java.
I have two loops that each run a billion times.
int a = 188, b = 144, aMax = 0, bMax = 0;
for (int i = 0; i < 1000000000; i++) {
    int t = a ^ i;
    if (t > aMax)
        aMax = t;
}
for (int i = 0; i < 1000000000; i++) {
    int t = b ^ i;
    if (t > bMax)
        bMax = t;
}
Running these two loops on my machine takes approximately 4 seconds. When I fuse them into a single loop and perform all the operations there, it runs in 2 seconds. As you can see, only trivial constant-time operations make up the loop bodies.
My question is: where does this performance improvement come from?
My guess is that with two separate loops, i is incremented and checked against 1000000000 two billion times, versus only one billion times when the loops are fused. Is anything else going on?
Thanks!
If you don't run a warm-up phase, it is possible that the first loop gets optimised and compiled but not the second one, whereas when you merge them the whole merged loop gets compiled. Also, with the server option and your code as given, most of the work gets optimised away because you don't use the results.
I ran the test below, putting each loop as well as the merged loop in its own method and warming up the JVM to make sure everything gets compiled.
Results (JVM options: -server -XX:+PrintCompilation):
loop 1 = 500ms
loop 2 = 900 ms
merged loop = 1,300 ms
So the merged loop is slightly faster, but not that much.
public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < 3; i++) {
        loop1();
        loop2();
        loopBoth();
    }
    long start = System.nanoTime();
    loop1();
    long end = System.nanoTime();
    System.out.println((end - start) / 1000000);
    start = System.nanoTime();
    loop2();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);
    start = System.nanoTime();
    loopBoth();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);
}

public static void loop1() {
    int a = 188, aMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
    }
    System.out.println(aMax);
}

public static void loop2() {
    int b = 144, bMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = b ^ i;
        if (t > bMax) {
            bMax = t;
        }
    }
    System.out.println(bMax);
}

public static void loopBoth() {
    int a = 188, b = 144, aMax = 0, bMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
        int u = b ^ i;
        if (u > bMax) {
            bMax = u;
        }
    }
    System.out.println(aMax);
    System.out.println(bMax);
}
In short, the CPU can execute the instructions in the merged loop in parallel, doubling performance.
It's also possible the second loop is not optimised efficiently: the first loop triggers compilation of the whole method, so the second loop gets compiled without any profiling metrics, which can upset its timing. I would place each loop in a separate method to make sure this is not the case.
The CPU can perform a large number of independent operations in parallel (a pipeline depth of 10 on the Pentium III and 20 on the Xeon). One thing it attempts to do in parallel is branching, using branch prediction, but that only pays off if the code takes the same branch almost every time.
I suspect that with loop unrolling your loop looks more like the following (possibly with even more unrolling in this case):
for (int i = 0; i < 1000000000; i += 2) {
    // this first block runs almost entirely in parallel
    int t1 = a ^ i;
    int t2 = b ^ i;
    int t3 = a ^ (i + 1);
    int t4 = b ^ (i + 1);
    // this block also runs in parallel
    if (t1 > aMax) aMax = t1;
    if (t2 > bMax) bMax = t2;
    if (t3 > aMax) aMax = t3;
    if (t4 > bMax) bMax = t4;
}
It seems to me that in the case of a single loop the JIT may opt to do loop unrolling, and as a result the performance is slightly better.
Did you use -server? If not, you should: the client JIT is neither as predictable nor as good. If you are really interested in what exactly is going on, you can use -XX:+UnlockDiagnosticVMOptions together with -XX:+LogCompilation to check which optimizations are applied in both cases (all the way down to the generated assembly).
Also, from the code you provided I can't tell whether you do a warm-up, whether you run your test once or multiple times in the same JVM, or whether you did several runs (in different JVMs). Do you take the best, the average, or the median time? Do you throw out outliers?
Here is a good link on the subject of writing Java micro-benchmarks: http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html
Edit: One more microbenchmarking tip, beware of on-the-stack replacement: http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good
So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
    int mostLen = 0;
    int mostInt = 0;
    long currTime = System.currentTimeMillis();
    for (int j = 2; j <= 1000000; j++) {
        long i = j;
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        if (len > mostLen) {
            mostLen = len;
            mostInt = j;
        }
    }
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    long currTime = System.currentTimeMillis();
    List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
    for (int j = 2; j <= 1000000; j++) {
        MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
        worker.setBean(new ValueBean(j, 0));
        Future<ValueBean> f = executor.submit(worker);
        list.add(f);
    }
    System.out.println(System.currentTimeMillis() - currTime);
    int mostLen = 0;
    int mostInt = 0;
    for (Future<ValueBean> f : list) {
        final int len = f.get().getLen();
        if (len > mostLen) {
            mostLen = len;
            mostInt = f.get().getNum();
        }
    }
    executor.shutdown();
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

public class MyCallable<T> implements Callable<ValueBean> {
    public ValueBean bean;

    public void setBean(ValueBean bean) {
        this.bean = bean;
    }

    public ValueBean call() throws Exception {
        long i = bean.getNum();
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return new ValueBean(bean.getNum(), len);
    }
}

public class ValueBean {
    int num;
    int len;

    public ValueBean(int num, int len) {
        this.num = num;
        this.len = len;
    }

    public int getNum() {
        return num;
    }

    public int getLen() {
        return len;
    }
}

long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
Unfortunately, the multithreaded version ran 5 times slower than the single-threaded one on 4 processors (cores).
Then I tried a somewhat cruder approach:
static int mostLen = 0;
static int mostInt = 0;

synchronized static void updateIfMore(int len, int intgr) {
    if (len > mostLen) {
        mostLen = len;
        mostInt = intgr;
    }
}

public static void main(String[] args) throws InterruptedException {
    long currTime = System.currentTimeMillis();
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    for (int i = 2; i <= 1000000; i++) {
        final int j = i;
        executor.execute(new Runnable() {
            public void run() {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                updateIfMore(len, j);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
and it ran much faster, but still slower than the single-threaded approach.
I hope this is not because I screwed up the multithreading, but rather because this particular calculation/algorithm is a poor fit for parallel computation. If I make the calculation more processor-intensive by replacing the method next with:
long next(long i) {
    Random r = new Random();
    for (int j = 0; j < 10; j++) {
        r.nextLong();
    }
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
both multithreaded versions execute more than twice as fast as the single-threaded version on a 4-core machine.
So clearly there must be some threshold you can use to determine whether it is worth introducing multithreading, and my question is:
What is the basic rule of thumb that helps decide whether a given calculation is intensive enough to benefit from running in parallel (without spending the effort to actually implement it)?
The key to efficient multithreading is making sure the overhead is not too high. There are no fixed rules, as the numbers depend heavily on your hardware.
Starting and stopping threads has a high cost. You already use an executor service, which reduces that cost considerably because it reuses a pool of worker threads to execute your Runnables. However, each Runnable still comes with some overhead; reducing the number of Runnables and increasing the amount of work each one does improves performance, but you still want enough Runnables for the executor service to distribute them efficiently over the worker threads.
You have chosen to create one Runnable per starting value, so you end up creating 1000000 Runnables. You would probably get much better results if you let each Runnable handle a batch of, say, 1000 start values: then you only need 1000 Runnables, which greatly reduces the overhead.
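A sketch of that batching idea (my own rewrite of the poster's code, with made-up names such as bestInRange): each task handles a range of start values and returns only the best (len, num) pair for its batch, so a million tasks collapse into a thousand:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedCollatz {
    static long next(long i) {
        return i % 2 == 0 ? i / 2 : i * 3 + 1;
    }

    // Scans one batch of start values and returns {bestLen, bestNum}.
    static int[] bestInRange(int from, int to) {
        int bestLen = 0, bestNum = 0;
        for (int j = from; j < to; j++) {
            long i = j;
            int len = 0;
            while ((i = next(i)) != 1) {
                len++;
            }
            if (len > bestLen) {
                bestLen = len;
                bestNum = j;
            }
        }
        return new int[]{bestLen, bestNum};
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        int batch = 1000; // 1000 tasks instead of 1000000
        List<Future<int[]>> futures = new ArrayList<>();
        for (int from = 2; from <= 1000000; from += batch) {
            final int f = from;
            final int t = Math.min(from + batch, 1000001);
            futures.add(executor.submit(() -> bestInRange(f, t)));
        }
        int mostLen = 0, mostInt = 0;
        for (Future<int[]> fut : futures) {
            int[] r = fut.get();
            if (r[0] > mostLen) {
                mostLen = r[0];
                mostInt = r[1];
            }
        }
        executor.shutdown();
        System.out.println("Most len is " + mostLen + " for " + mostInt);
    }
}
```

The reduction per batch also removes almost all contention on the shared maximum, since threads only synchronize once per batch result rather than once per value.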
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other; running a calculation in parallel is sub-optimal when later results depend on earlier ones. The dependence can be strong, in the sense of "I need the first value to compute the second value": then the task is completely serial, and later values cannot be computed without waiting for earlier computations. It can also be weaker, in the sense of "if I had the first value I could compute the second value faster": then the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if you already have the earlier results in hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 two iterations ago; if you had saved the value of len, you could compute it as len(4) = 1 + len(2).
Using an array to store previously computed values of len, plus a little bit of twiddling in the next method, you can complete the task more than 50x faster.
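A sketch of that memoization idea (my own code, not the original; note that this lenOf counts every step, so its values are one higher than the question's len, which skips the final step): cache the chain length for every start value below a limit and reuse it.

```java
public class CollatzMemo {
    static final int LIMIT = 1000001;
    // 0 means "not computed yet"; lenOf(1) is 0 anyway and is recomputed trivially.
    static final int[] cache = new int[LIMIT];

    static long next(long i) {
        return i % 2 == 0 ? i / 2 : i * 3 + 1;
    }

    // Number of next() steps to reach 1; values below LIMIT are cached,
    // larger intermediate values fall through to plain recursion.
    static int lenOf(long n) {
        if (n < LIMIT && cache[(int) n] != 0) {
            return cache[(int) n];
        }
        int len = (n == 1) ? 0 : 1 + lenOf(next(n));
        if (n < LIMIT) {
            cache[(int) n] = len;
        }
        return len;
    }

    public static void main(String[] args) {
        long currTime = System.currentTimeMillis();
        int mostLen = 0, mostInt = 0;
        for (int j = 2; j <= 1000000; j++) {
            int len = lenOf(j);
            if (len > mostLen) {
                mostLen = len;
                mostInt = j;
            }
        }
        System.out.println(System.currentTimeMillis() - currTime);
        System.out.println("Most len is " + mostLen + " for " + mostInt);
    }
}
```

Recursion depth stays bounded by the chain length (a few hundred for this range), so the default stack size is sufficient.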
"Will the performance gain be greater than the cost of context switching and thread creation?"
That is a very OS, language, and hardware, dependent cost; this question has some discussion about the cost in Java, but has some numbers and some pointers to how to calculate the cost.
You also want to have one thread per CPU, or less, for CPU intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate amount of work which a thread can do without interaction with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, overhead is too much and multithreading is of no use. If it is 1 millisecond or more, multithreading should work well. If it is in between, experimental testing required.