In our project we used a static Random instance for one task that generates random numbers. Java 7 introduced a new class for generating random numbers, ThreadLocalRandom.
From the spec:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
and also:
When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.
So I wrote a small test:
public class ThreadLocalRandomTest {
private static final int THREAD_COUNT = 100;
private static final int GENERATED_NUMBER_COUNT = 1000;
private static final int INT_RIGHT_BORDER = 5000;
private static final int EXPERIMENTS_COUNT = 5000;
public static void main(String[] args) throws InterruptedException {
System.out.println("Number of threads: " + THREAD_COUNT);
System.out.println("Length of generated numbers chain for each thread: " + GENERATED_NUMBER_COUNT);
System.out.println("Right border integer: " + INT_RIGHT_BORDER);
System.out.println("Count of experiments: " + EXPERIMENTS_COUNT);
int repeats = 0;
int workingTime = 0;
long startTime = 0;
long endTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForSharedRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for shared Random instance: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
repeats = 0;
workingTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForTheadLocalRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for ThreadLocalRandom: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
}
private static int calculateRepeatsForSharedRandom() throws InterruptedException {
final Random rand = new Random();
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
private static int calculateRepeatsForTheadLocalRandom() throws InterruptedException {
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = ThreadLocalRandom.current().nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
}
I also added a test for a non-shared Random and got the following results:
Number of threads: 100
Length of generated numbers chain for each thread: 100
Right border integer: 5000
Count of experiments: 10000
Average repeats for non-shared Random instance: 8646. Average working time: 13 ms.
Average repeats for shared Random instance: 8646. Average working time: 13 ms.
Average repeats for ThreadLocalRandom: 8646. Average working time: 13 ms.
This seems a little strange to me: I expected at least a speed-up when using ThreadLocalRandom compared to a shared Random instance, but I see no difference at all.
Can someone explain why it works that way? Maybe I haven't done the testing properly. Thank you!
You're not running anything in parallel because you're waiting for each thread to finish immediately after starting it. You need a waiting loop outside the loop that starts the threads:
List<Thread> threads = new ArrayList<Thread>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
threads.add(thread);
thread.start();
}
for (Thread thread: threads) {
thread.join();
}
Your testing code is flawed for one. The bane of benchmarkers everywhere.
thread.start();
thread.join();
Why not save a few lines of code and write
thread.run();
The outcome is the same.
EDIT: If you don't realize the outcome from the above, it means that you're running single threaded tests, there's no multithreading going on.
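To make the comparison meaningful, every thread has to be started before any of them is joined. The following is a minimal sketch of such a harness, not the asker's original test: the class name ParallelRandomSketch and the method drawInParallel are made up for illustration, and each worker only counts its own draws so that no shared HashMap is mutated from several threads. Timings from a quick run like this are only indicative; a proper benchmark harness would be needed for trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.IntUnaryOperator;

// Hypothetical harness: the names are illustrative, not from the original post.
final class ParallelRandomSketch {
    static final int BOUND = 5000;

    /** Starts every thread first, then joins them all; returns the number of in-range draws. */
    static long drawInParallel(int threadCount, int perThread, IntUnaryOperator source) {
        AtomicLong total = new AtomicLong();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                long local = 0; // thread-confined counter, merged once at the end
                for (int j = 0; j < perThread; j++) {
                    int value = source.applyAsInt(BOUND);
                    if (value >= 0 && value < BOUND) {
                        local++;
                    }
                }
                total.addAndGet(local);
            });
            threads.add(t);
            t.start(); // no join here, so the workers can overlap
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        Random shared = new Random();
        long t0 = System.nanoTime();
        drawInParallel(8, 1_000_000, bound -> shared.nextInt(bound));
        long t1 = System.nanoTime();
        // current() must be called inside the worker thread, so it lives in the lambda body.
        drawInParallel(8, 1_000_000, bound -> ThreadLocalRandom.current().nextInt(bound));
        long t2 = System.nanoTime();
        System.out.println("shared Random:     " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("ThreadLocalRandom: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The thread count of 8 and the one million draws per thread are arbitrary choices for the sketch.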
Maybe it would be easier to just have a look at what actually happens. Here is the source for ThreadLocal.get(), which ThreadLocalRandom.current() also calls in the Java 7 implementation.
public T get() {
Thread t = Thread.currentThread();
ThreadLocalMap map = getMap(t);
if (map != null) {
ThreadLocalMap.Entry e = map.getEntry(this);
if (e != null)
return (T)e.value;
}
return setInitialValue();
}
Where ThreadLocalMap is a specialized HashMap-like implementation with optimizations.
So what basically happens is that each Thread carries a map from ThreadLocal to Object (in this case, to the Random), which get() looks up; the value is then either returned or created. As this is nothing 'magical', the timing will be equal to a HashMap lookup plus the one-time creation overhead of the actual Object to be returned. Since a hash lookup (in this optimized case) takes constant time, the cost for a lookup is k, where k is the calculation cost of the hash function.
So you can make some assumptions:
ThreadLocal will be faster than creating the object each time in each Runnable, unless the creation cost is much smaller than k. So looking up Random is a good thing, putting an int inside might not be so smart.
ThreadLocal will be better than using your own HashMap, as such a generic implementation can be assumed to be equal to k or worse.
ThreadLocal will be slower than using any lookup with a cost < k. Example: store everything in an array first, then do myRandoms[threadID].
But then this assumes that you know which threads will be processing your work in the first place, so this isn't a real candidate for ThreadLocal anyways.
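The per-thread lookup described above can be observed directly. This is a small sketch, assuming Java 8 or later for ThreadLocal.withInitial; the class ThreadLocalLookupSketch and its method names are invented for the example.

```java
import java.util.Random;

// Hypothetical demo class; not part of the JDK or of the original answer.
final class ThreadLocalLookupSketch {
    // One Random per thread, created lazily on that thread's first get().
    static final ThreadLocal<Random> LOCAL = ThreadLocal.withInitial(Random::new);

    /** Repeated lookups on one thread return the same cached object. */
    static boolean sameInstanceWithinThread() {
        return LOCAL.get() == LOCAL.get();
    }

    /** A second thread gets its own object, created by its own first lookup. */
    static boolean distinctAcrossThreads() {
        Random mine = LOCAL.get();
        Random[] other = new Random[1];
        Thread t = new Thread(() -> other[0] = LOCAL.get());
        t.start();
        try {
            t.join(); // join makes the write to other[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return other[0] != null && other[0] != mine;
    }

    public static void main(String[] args) {
        System.out.println("same within thread:      " + sameInstanceWithinThread());
        System.out.println("distinct across threads: " + distinctAcrossThreads());
    }
}
```

Only the first get() on each thread pays the creation cost; later calls pay just the lookup.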
So I have a task to calculate Euler's number using multiple threads, using this formula: sum( ((3k)^2 + 1) / ((3k)!) ), for k = 0...infinity.
import java.math.BigDecimal;
import java.math.BigInteger;
import java.io.FileWriter;
import java.io.IOException;
import java.math.RoundingMode;
class ECalculator {
private BigDecimal sum;
private BigDecimal[] series;
private int length;
public ECalculator(int threadCount) {
this.length = threadCount;
this.sum = new BigDecimal(0);
this.series = new BigDecimal[threadCount];
for (int i = 0; i < this.length; i++) {
this.series[i] = BigDecimal.ZERO;
}
}
public synchronized void addToSum(BigDecimal element) {
this.sum = this.sum.add(element);
}
public void addToSeries(int id, BigDecimal element) {
if (id - 1 < length) {
this.series[id - 1] = this.series[id - 1].add(element);
}
}
public synchronized BigDecimal getSum() {
return this.sum;
}
public BigDecimal getSeriesSum() {
BigDecimal result = BigDecimal.ZERO;
for (int i = 0; i < this.length; i++) {
result = result.add(this.series[i]);
}
return result;
}
}
class ERunnable implements Runnable {
private final int id;
private final int threadCount;
private final int threadRemainder;
private final int elements;
private final boolean quietFlag;
private ECalculator eCalc;
public ERunnable(int threadCount, int threadRemainder, int id, int elements, boolean quietFlag, ECalculator eCalc) {
this.id = id;
this.threadCount = threadCount;
this.threadRemainder = threadRemainder;
this.elements = elements;
this.quietFlag = quietFlag;
this.eCalc = eCalc;
}
@Override
public void run() {
if (!quietFlag) {
System.out.println(String.format("Thread-%d started.", this.id));
}
long start = System.currentTimeMillis();
int k = this.threadRemainder;
int iteration = 0;
BigInteger currentFactorial = BigInteger.valueOf(intFactorial(3 * k));
while (iteration < this.elements) {
if (iteration != 0) {
for (int i = 3 * (k - threadCount) + 1; i <= 3 * k; i++) {
currentFactorial = currentFactorial.multiply(BigInteger.valueOf(i));
}
}
this.eCalc.addToSeries(this.id, new BigDecimal(Math.pow(3 * k, 2) + 1).divide(new BigDecimal(currentFactorial), 100, RoundingMode.HALF_UP));
iteration += 1;
k += this.threadCount;
}
long stop = System.currentTimeMillis();
if (!quietFlag) {
System.out.println(String.format("Thread-%d stopped.", this.id));
System.out.println(String.format("Thread %d execution time: %d milliseconds", this.id, stop - start));
}
}
public int intFactorial(int n) {
int result = 1;
for (int i = 1; i <= n; i++) {
result *= i;
}
return result;
}
}
public class TaskRunner {
public static final String DEFAULT_FILE_NAME = "result.txt";
public static void main(String[] args) throws InterruptedException {
int threadCount = 2;
int precision = 10000;
int elementsPerTask = precision / threadCount;
int remainingElements = precision % threadCount;
boolean quietFlag = false;
calculate(threadCount, elementsPerTask, remainingElements, quietFlag, DEFAULT_FILE_NAME);
}
public static void writeResult(String filename, String result) {
try {
FileWriter writer = new FileWriter(filename);
writer.write(result);
writer.close();
} catch (IOException e) {
System.out.println("An error occurred.");
e.printStackTrace();
}
}
public static void calculate(int threadCount, int elementsPerTask, int remainingElements, boolean quietFlag, String outputFile) throws InterruptedException {
long start = System.currentTimeMillis();
Thread[] threads = new Thread[threadCount];
ECalculator eCalc = new ECalculator(threadCount);
for (int i = 0; i < threadCount; i++) {
if (i == 0) {
threads[i] = new Thread(new ERunnable(threadCount, i, i + 1, elementsPerTask + remainingElements, quietFlag, eCalc));
} else {
threads[i] = new Thread(new ERunnable(threadCount, i, i + 1, elementsPerTask, quietFlag, eCalc));
}
threads[i].start();
}
for (int i = 0; i < threadCount; i++) {
threads[i].join();
}
String result = eCalc.getSeriesSum().toString();
if (!quietFlag) {
System.out.println("E = " + result);
}
writeResult(outputFile, result);
long stop = System.currentTimeMillis();
System.out.println("Calculated in: " + (stop - start) + " milliseconds" );
}
}
I stripped out the prints, etc. in the code that have no effect. My problem is that the more threads I use the slower it gets. Currently the fastest run I have is for 1 thread. I am sure the factorial calculation is causing some issues. I tried using a thread pool but still got the same times.
How can I make it so that running it with more threads, up until some point, will speed up the calculation process?
How would one go about calculating these big factorials?
The precision parameter that is passed is the amount of elements in the sum that are used. Can I set the BigDecimal scale to be somehow dependent on that precision so I don't hard code it?
EDIT
I updated the code block to be in 1 file only and runnable without external libs.
EDIT 2
I found out that the factorial code messes up with the time. If I let the threads ramp up to some high precision without calculating factorials the time goes down with increasing threads. Yet I cannot implement the factorial calculating in any way while keeping the time decreasing.
EDIT 3
Adjusting code to address answers.
private static BigDecimal partialCalculator(int start, int threadCount, int id) {
BigDecimal nBD = BigDecimal.valueOf(start);
BigDecimal result = nBD.multiply(nBD).multiply(BigDecimal.valueOf(9)).add(BigDecimal.valueOf(1));
for (int i = start; i > 0; i -= threadCount) {
BigDecimal iBD = BigDecimal.valueOf(i);
BigDecimal iBD1 = BigDecimal.valueOf(i - 1);
BigDecimal iBD3 = BigDecimal.valueOf(3).multiply(iBD);
BigDecimal prevNumerator = iBD1.multiply(iBD1).multiply(BigDecimal.valueOf(9)).add(BigDecimal.valueOf(1));
// 3 * i * (3 * i - 1) * (3 * i - 2);
BigDecimal divisor = iBD3.multiply(iBD3.subtract(BigDecimal.valueOf(1))).multiply(iBD3.subtract(BigDecimal.valueOf(2)));
result = result.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(prevNumerator);
}
return result;
}
public static void main(String[] args) {
int threadCount = 3;
int precision = 6;
ExecutorService executorService = Executors.newFixedThreadPool(threadCount);
ArrayList<Future<BigDecimal> > futures = new ArrayList<Future<BigDecimal> >();
for (int i = 0; i < threadCount; i++) {
int start = precision - i;
System.out.println(start);
final int id = i + 1;
futures.add(executorService.submit(() -> partialCalculator(start, threadCount, id)));
}
BigDecimal result = BigDecimal.ZERO;
try {
for (int i = 0; i < threadCount; i++) {
result = result.add(futures.get(i).get());
}
} catch (Exception e) {
e.printStackTrace();
}
executorService.shutdown();
System.out.println(result);
}
Seems to be working properly for 1 thread but messes up the calculation for multiple.
After a review of the updated code, I've made the following observations:
First of all, the program runs for a fraction of a second. That means that this is a micro-benchmark. Several key features in Java make micro-benchmarks difficult to implement reliably. See "How do I write a correct micro-benchmark in Java?" For example, if the program doesn't run enough repetitions, the "just in time" compiler doesn't have time to kick in and compile it to native code, and you end up benchmarking the interpreter. It seems possible that in your case the JIT compiler takes longer to kick in when there are multiple threads.
As an example, to make your program do more work, I changed the BigDecimal precision from 100 to 10,000 and added a loop around the main method. The execution times were measured as follows:
1 thread:
Calculated in: 2803 milliseconds
Calculated in: 1116 milliseconds
Calculated in: 1040 milliseconds
Calculated in: 1066 milliseconds
Calculated in: 1036 milliseconds
2 threads:
Calculated in: 2354 milliseconds
Calculated in: 856 milliseconds
Calculated in: 624 milliseconds
Calculated in: 659 milliseconds
Calculated in: 664 milliseconds
4 threads:
Calculated in: 1961 milliseconds
Calculated in: 797 milliseconds
Calculated in: 623 milliseconds
Calculated in: 536 milliseconds
Calculated in: 497 milliseconds
The second observation is that there is a significant part of the workload that does not benefit from multiple threads: every thread is computing every factorial. This means the speed-up cannot be linear - as described by Amdahl's law.
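Amdahl's law puts a number on that limit: if a fraction p of the work can be parallelized over n threads, the speed-up is at most 1 / ((1 - p) + p / n). A minimal sketch follows; the class name AmdahlSketch and the value p = 0.5 used in main are illustrative assumptions, not measurements of the program in the question.

```java
// Hypothetical helper illustrating Amdahl's law; p = 0.5 below is an arbitrary example value.
final class AmdahlSketch {
    /** Upper bound on speed-up when parallelFraction of the work is spread over the given threads. */
    static double speedup(double parallelFraction, int threads) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / threads);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 4, 8}) {
            System.out.printf("threads=%d  max speed-up=%.3f%n", n, speedup(0.5, n));
        }
    }
}
```

With half of the work serial, the bound never reaches 2 no matter how many threads are added.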
So how can we get the result without computing factorials? One way is with Horner's method. As an example, consider the simpler series sum(1/k!), which also converges to e, but a little more slowly than yours.
Let's say you want to compute sum(1/k!) up to k = 100. With Horner's method you start from the end and extract common factors:
sum(1/k!, k=0..100) = 1/100! + 1/99! + 1/98! + ... + 1/1! + 1/0!
= ((...(((1/100 + 1)/99 + 1)/98 + 1)/97 + ...)/2 + 1)/1 + 1
See how you start with 1, divide by 100 and add 1, divide by 99 and add 1, divide by 98 and add 1, and so on? That makes a very simple program:
private static BigDecimal serialHornerMethod() {
BigDecimal accumulator = BigDecimal.ONE;
for (int k = 10000; k > 0; k--) {
BigDecimal divisor = new BigDecimal(k);
accumulator = accumulator.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(BigDecimal.ONE);
}
return accumulator;
}
OK, that's a serial method. How do you make it parallel? Here's an example for two threads. First split the series into even and odd terms:
1/100! + 1/99! + 1/98! + 1/97! + ... + 1/1! + 1/0! =
(1/100! + 1/98! + ... + 1/0!) + (1/99! + 1/97! + ... + 1/1!)
Then apply Horner's method to both the even and odd terms:
1/100! + 1/98! + 1/96! + ... + 1/2! + 1/0! =
((...((1/(100*99) + 1)/(98*97) + 1)/(96*95) + ...)/(4*3) + 1)/(2*1) + 1
and:
1/99! + 1/97! + 1/95! + ... + 1/3! + 1/1! =
((...((1/(99*98) + 1)/(97*96) + 1)/(95*94) + ...)/(5*4) + 1)/(3*2) + 1
This is just as easy to implement as the serial method, and you get pretty close to linear speedup going from 1 to 2 threads:
private static BigDecimal partialHornerMethod(int start) {
BigDecimal accumulator = BigDecimal.ONE;
for (int i = start; i > 0; i -= 2) {
int f = i * (i + 1);
BigDecimal divisor = new BigDecimal(f);
accumulator = accumulator.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(BigDecimal.ONE);
}
return accumulator;
}
// Usage:
ExecutorService executorService = Executors.newFixedThreadPool(2);
Future<BigDecimal> submit = executorService.submit(() -> partialHornerMethod(10000));
Future<BigDecimal> submit1 = executorService.submit(() -> partialHornerMethod(9999));
BigDecimal result = submit1.get().add(submit.get());
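The even/odd split extends to more than two threads by grouping the terms of sum(1/k!) by k mod T. The following is a sketch of that idea and not part of the original answer: HornerSplitSketch, residueSum and parallelE are made-up names, and it handles the simpler series 1/k!, not the series from the question. Within the group k = r, r+T, r+2T, ... each Horner step divides by the product of T consecutive integers, and the group's result is divided by r! once at the end.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical generalization of the two-thread example; assumes maxK >= threads - 1.
final class HornerSplitSketch {
    /** Sum of 1/k! over all k <= maxK with k % threads == r, evaluated Horner-style. */
    static BigDecimal residueSum(int r, int threads, int maxK, int scale) {
        int last = maxK - ((maxK - r) % threads); // largest k <= maxK in this residue class
        BigDecimal acc = BigDecimal.ONE;
        for (int k = last; k > r; k -= threads) {
            // Going from term k to term k - threads divides by k*(k-1)*...*(k-threads+1).
            BigInteger divisor = BigInteger.ONE;
            for (int j = 0; j < threads; j++) {
                divisor = divisor.multiply(BigInteger.valueOf(k - j));
            }
            acc = acc.divide(new BigDecimal(divisor), scale, RoundingMode.HALF_EVEN)
                     .add(BigDecimal.ONE);
        }
        // acc now equals r! * (1/r! + 1/(r+threads)! + ...); r < threads, so r! is tiny.
        BigInteger rFactorial = BigInteger.ONE;
        for (int j = 2; j <= r; j++) {
            rFactorial = rFactorial.multiply(BigInteger.valueOf(j));
        }
        return acc.divide(new BigDecimal(rFactorial), scale, RoundingMode.HALF_EVEN);
    }

    /** Approximates e as sum(1/k!, k=0..maxK), one residue class per task. */
    static BigDecimal parallelE(int threads, int maxK, int scale) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<BigDecimal>> parts = new ArrayList<>();
            for (int r = 0; r < threads; r++) {
                final int residue = r;
                parts.add(pool.submit(() -> residueSum(residue, threads, maxK, scale)));
            }
            BigDecimal sum = BigDecimal.ZERO;
            for (Future<BigDecimal> part : parts) {
                sum = sum.add(part.get());
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelE(3, 60, 50));
    }
}
```

Each task touches only its own accumulator, so no synchronization is needed until the partial results are added together.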
There is a lot of contention between the threads: they all compete to get a lock on the ECalculator object after every little bit of computation, because of this method:
public synchronized void addToSum(BigDecimal element) {
this.sum = this.sum.add(element);
}
In general having threads compete for frequent access to a common resource leads to bad performance, because you're asking for the operating system to intervene and tell the program which thread can continue. I haven't tested your code to confirm that this is the issue because it's not self-contained.
To fix this, have the threads accumulate their results separately, and merge results after the threads have finished. That is, create a sum variable in ERunnable, and then change the methods:
// ERunnable.run:
this.sum = this.sum.add(new BigDecimal(Math.pow(3 * k, 2) + 1).divide(new BigDecimal(factorial(3 * k)), 100, RoundingMode.HALF_UP));
// TaskRunner.calculate:
for (int i = 0; i < threadCount; i++) {
threads[i].join();
eCalc.addToSum(/* recover the sum computed by thread */);
}
By the way, it would be easier if you used the higher-level java.util.concurrent API instead of creating thread objects yourself. You could wrap the computation in a Callable, which can return a result.
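A sketch of that pattern, assuming a toy workload (summing the integers 1..n) in place of the series terms: every Callable keeps a private accumulator and the results are merged once per task after it has finished. PartialSumSketch and parallelSum are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example; the summation stands in for the real per-thread computation.
final class PartialSumSketch {
    /** Sums 1..n with one Callable per worker; worker w handles w+1, w+1+workers, ... */
    static long parallelSum(int workers, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int first = w + 1;
                tasks.add(() -> {
                    long local = 0; // no lock needed: only this task touches it
                    for (int k = first; k <= n; k += workers) {
                        local += k;
                    }
                    return local;
                });
            }
            long total = 0;
            // invokeAll blocks until every task is done, then we merge once per task.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(4, 1000));
    }
}
```

Compared with a synchronized addToSum called on every iteration, the shared state is touched only as many times as there are tasks.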
Q2 How would one go about calculating these big factorials?
Usually you don't. Instead, you reformulate the problem so that it does not involve the direct computation of factorials. One technique is Horner's method.
Q3 The precision parameter that is passed is the amount of elements in the sum that are used. Can I set the BigDecimal scale to be somehow dependent on that precision so I don't hard code it?
Sure, why not? You can work out the error bound from the number of elements (it's proportional to the last term in the series) and set the BigDecimal scale to that.
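One way to turn that into code, sketched here for a last term of the form 1/n!: the number of decimal digits needed is roughly log10(n!), which can be summed without ever building the factorial. The name scaleForTerms and the guardDigits parameter are assumptions made for this example; for the series in the question the factorial argument would be 3k, and the polynomial numerator is meant to be absorbed by a few guard digits.

```java
// Hypothetical helper: estimates a BigDecimal scale from the size of the last term 1/n!.
final class ScaleSketch {
    /** Returns ceil(log10(n!)) + guardDigits, computed as a sum of logarithms. */
    static int scaleForTerms(int n, int guardDigits) {
        double log10Factorial = 0.0;
        for (int k = 2; k <= n; k++) {
            log10Factorial += Math.log10(k);
        }
        return (int) Math.ceil(log10Factorial) + guardDigits;
    }

    public static void main(String[] args) {
        System.out.println("scale for a last term of 1/1000!: " + scaleForTerms(1000, 5));
    }
}
```

The returned value can be passed as the scale argument of BigDecimal.divide in place of a hard-coded constant.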
I am trying to get familiar with java multithreaded applications. I tried to think of a simple application that can be parallelized very well. I thought vector addition would be a good application to do so.
However, when running on my Linux server (which has 4 cores) I don't get any speed-up. The time to execute on 4, 2, and 1 threads is about the same.
Here is the code I came up with:
public static void main(String[]args)throws InterruptedException{
final int threads = Integer.parseInt(args[0]);
final int length= Integer.parseInt(args[1]);
final int balk=(length/threads);
Thread[]th = new Thread[threads];
final double[]result =new double[length];
final double[]array1=getRandomArray(length);
final double[]array2=getRandomArray(length);
long startingTime =System.nanoTime();
for(int i=0;i<threads;i++){
final int current=i;
th[i]=new Thread(()->{
for(int k=current*balk;k<(current+1)*balk;k++){
result[k]=array1[k]+array2[k];
}
});
th[i].start();
}
for(int i=0;i<threads;i++){
th[i].join();
}
System.out.println("Time needed: "+(System.nanoTime()-startingTime));
}
length is always a multiple of threads and getRandomArray() creates a random array of doubles between 0 and 1.
Execution Time for 1-Thread: 84579446ns
Execution Time for 2-Thread: 74211325ns
Execution Time for 4-Thread: 89215100ns
length =10000000
Here is the Code for getRandomArray():
private static double[]getRandomArray(int length){
Random random =new Random();
double[]array= new double[length];
for(int i=0;i<length;i++){
array[i]=random.nextDouble();
}
return array;
}
I would appreciate any help.
The difference is observable for the following code. Try it.
public static void main(String[]args)throws InterruptedException{
for(int z = 0; z < 10; z++) {
final int threads = 1;
final int length= 100_000_000;
final int balk=(length/threads);
Thread[]th = new Thread[threads];
final boolean[]result =new boolean[length];
final boolean[]array1=getRandomArray(length);
final boolean[]array2=getRandomArray(length);
long startingTime =System.nanoTime();
for(int i=0;i<threads;i++){
final int current=i;
th[i]=new Thread(()->{
for(int k=current*balk;k<(current+1)*balk;k++){
result[k]=array1[k] | array2[k];
}
});
th[i].start();
}
for(int i=0;i<threads;i++){
th[i].join();
}
System.out.println("Time needed: "+(System.nanoTime()-startingTime)*1.0/1000/1000);
boolean x = false;
for(boolean d : result) {
x |= d;
}
System.out.println(x);
}
}
First things first: you need to warm up your code so that you measure compiled code. The first two iterations take approximately the same time, but the following ones will differ. I also changed double to boolean because my machine doesn't have much memory. This lets me allocate a huge array, and it also makes the work more CPU-intensive.
There is a link in the comments. I suggest you read it.
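The warm-up idea can be packaged as a tiny helper. This is a rough sketch only, with invented names (WarmupSketch, bestOfNanos) and arbitrary repetition counts; it does not replace a real benchmarking harness.

```java
import java.util.Arrays;

// Hypothetical timing helper: runs unmeasured warm-up rounds before the timed ones.
final class WarmupSketch {
    /** Runs task `warmups` times untimed, then returns the fastest of `measured` timed runs in ns. */
    static long bestOfNanos(Runnable task, int warmups, int measured) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // gives the JIT a chance to compile the hot loop
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] a = new double[1_000_000];
        double[] b = new double[1_000_000];
        double[] out = new double[a.length];
        Arrays.fill(a, 1.5);
        Arrays.fill(b, 2.5);
        long best = bestOfNanos(() -> {
            for (int i = 0; i < out.length; i++) {
                out[i] = a[i] + b[i];
            }
        }, 20, 10);
        // Printing a result keeps the loop from being treated as dead code.
        System.out.println("best run: " + best + " ns, out[0] = " + out[0]);
    }
}
```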
If you are trying to see how your cores share the work, you can give every core a very simple task, but keep each thread constantly busy with something that is not shared across threads (basically simulating, for example, a merge sort, where threads work on something complicated and touch shared resources only for a small fraction of the time). Using your code, I did something like this. In this case you should see almost exactly a 2x speed-up with 2 threads and a 4x speed-up with 4 threads.
public static void main(String[]args)throws InterruptedException{
for(int a=0; a<5; a++) {
final int threads = 2;
final int length = 10;
final int balk = (length / threads);
Thread[] th = new Thread[threads];
System.out.println(Runtime.getRuntime().availableProcessors());
final double[] result = new double[length];
final double[] array1 = getRandomArray(length);
final double[] array2 = getRandomArray(length);
long startingTime = System.nanoTime();
for (int i = 0; i < threads; i++) {
final int current = i;
th[i] = new Thread(() -> {
Random random = new Random();
int meaningless = 0;
for (int k = current * balk; k < (current + 1) * balk; k++) {
result[k] = array1[k] + array2[k];
for (int j = 0; j < 10000000; j++) {
meaningless+=random.nextInt(10);
}
}
});
th[i].start();
}
for (int i = 0; i < threads; i++) {
th[i].join();
}
System.out.println("Time needed: " + ((System.nanoTime() - startingTime) * 1.0) / 1000000000 + " s");
}
}
You see, in your code most of the time is spent building the big arrays, and the threads then finish very quickly. Their work is so short that your timing is misleading, because most of the measured time goes into creating the threads. When I ran code that works on the precalculated arrays in a plain loop like this:
long startingTime =System.nanoTime();
for(int k=0; k<length; k++){
result[k]=array1[k]|array2[k];
}
System.out.println("Time needed: "+(System.nanoTime()-startingTime));
It ran twice as fast as your code with 2 threads. I hope you understand what I mean, and that you see my point once the threads are given much more (meaningless) work.
A simple test that demonstrates the problem:
package com.test;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;
public class Main extends RecursiveTask<Long> {
private volatile long start;
private volatile long end;
private volatile int deep;
public Main(long start, long end, int index, int deep) {
this.start = start;
this.end = end;
this.deep = deep;
// System.out.println(deep + "-" + index);
}
@Override
protected Long compute() {
long part = (end - start) / 10;
if (part > 1000 && deep < 10) {
List<RecursiveTask<Long>> subtasks = new ArrayList<RecursiveTask<Long>>();
for (int i = 0; i < 10; i++) {
long subtaskEnd = start + part;
if (i == 9) {
subtaskEnd = end;
}
subtasks.add(new Main(start, subtaskEnd, i, deep + 1));
start = subtaskEnd;
}
//CASE 1: generates 3000+ threads
for (int i = 0; i < 10; i++) {
subtasks.get(i).fork();
}
//CASE 2: generates 4 threads
// invokeAll(subtasks);
//CASE 3: generates 4 threads
// for (int i = 9; i >= 0; i--) {
// subtasks.get(i).fork();
// }
long count = 0;
for (int i = 0; i < 10; i++) {
count += subtasks.get(i).join();
}
return count;
} else {
long startStart = start;
while (start < end) {
start += 1;
}
return start - startStart;
}
}
private static ForkJoinPool executor = new ForkJoinPool();
public static void main(String[] args) throws Exception {
ForkJoinTask<Long> forkJoinTask = executor.submit(new Main(0, Integer.MAX_VALUE / 10, 0, 0));
Long result = forkJoinTask.get();
System.out.println("Final result: " + result);
System.out.println("Number of threads: " + executor.getPoolSize());
}
}
In this sample I create a RecursiveTask that simply counts numbers to put some load on the CPU. It divides the incoming range into 10 parts recursively, and when the size of a part is less than 1000 or the recursion depth is over 10, it starts to count numbers.
There are three cases commented in the compute() method. The only difference is the order in which the subtasks are forked and joined. Depending on that order, the number of threads in the end is different. On my system it creates 3000+ threads for the first case and 4 threads for the second and third cases.
Question is: what's the difference? Do I really need to know internals of this framework to successfully use it?
This is an old problem that I addressed in an article back in 2011, "A Java Fork-Join Calamity". The article points to part II, which shows the fix (?) for this in Java 8 (it substitutes stalls for extra threads).
You really can't do much professionally with this framework. There are other frameworks you can use.
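For comparison with the three cases in the question, here is a sketch of the usual two-way idiom: fork one half, compute the other half on the current thread, then join the forked half. The ForkJoinTask documentation recommends joining in the reverse order of forking (innermost first), which is consistent with CASE 3 behaving better than CASE 1; whether that fully explains the thread counts on a given JDK is not something this sketch establishes. RangeCountSketch and its threshold are invented for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: counts the numbers in [start, end) by recursive halving.
final class RangeCountSketch extends RecursiveTask<Long> {
    private static final long THRESHOLD = 10_000; // arbitrary cut-off for the sketch
    private final long start;
    private final long end;

    RangeCountSketch(long start, long end) {
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long count = 0;
            for (long i = start; i < end; i++) {
                count++;
            }
            return count;
        }
        long mid = start + (end - start) / 2;
        RangeCountSketch left = new RangeCountSketch(start, mid);
        RangeCountSketch right = new RangeCountSketch(mid, end);
        left.fork();                        // make one half available to other workers
        long rightResult = right.compute(); // do the other half on this thread
        return left.join() + rightResult;   // join the most recently forked task
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println("Final result: " + pool.invoke(new RangeCountSketch(0, 100_000_000L)));
        System.out.println("Number of threads: " + pool.getPoolSize());
    }
}
```

Note that the fields are final and never mutated, unlike the volatile start field in the question, so a task can be inspected safely after it has been forked.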
In this code, taken from this thread test code, a thread calls two methods, addToTotal() and countPrimes(), but only the former is marked synchronized.
What prevents interleaving when countPrimes() is being executed? Aren't the variables used by countPrimes(), like i, min, max and count, also shared resources? And what about isPrime(), which is called by countPrimes()?
public class ThreadTest2 {
private static final int START = 3000000;
private static int total;
synchronized private static void addToTotal(int x) {
total = total + x;
System.out.println(total + " primes found so far.");
}
private static class CountPrimesThread extends Thread {
int count = 0;
int min, max;
public CountPrimesThread(int min, int max) {
this.min = min;
this.max = max;
}
public void run() {
count = countPrimes(min,max);
System.out.println("There are " + count +
" primes between " + min + " and " + max);
addToTotal(count);
}
}
private static void countPrimesWithThreads(int numberOfThreads) {
int increment = START/numberOfThreads;
System.out.println("\nCounting primes between " + (START+1) + " and "
+ (2*START) + " using " + numberOfThreads + " threads...\n");
long startTime = System.currentTimeMillis();
CountPrimesThread[] worker = new CountPrimesThread[numberOfThreads];
for (int i = 0; i < numberOfThreads; i++)
worker[i] = new CountPrimesThread(START+i*increment+1, START+(i+1)*increment );
total = 0;
for (int i = 0; i < numberOfThreads; i++)
worker[i].start();
for (int i = 0; i < numberOfThreads; i++) {
while (worker[i].isAlive()) {
try {
worker[i].join();
} catch (InterruptedException e) {
}
}
}
long elapsedTime = System.currentTimeMillis() - startTime;
System.out.println("\nThe number of primes is " + total + ".");
System.out.println("\nTotal elapsed time: " + (elapsedTime/1000.0) + " seconds.\n");
}
public static void main(String[] args) {
int processors = Runtime.getRuntime().availableProcessors();
if (processors == 1)
System.out.println("Your computer has only 1 available processor.\n");
else
System.out.println("Your computer has " + processors + " available processors.\n");
int numberOfThreads = 0;
while (numberOfThreads < 1 || numberOfThreads > 5) {
System.out.print("How many threads do you want to use (from 1 to 5) ? ");
numberOfThreads = TextIO.getlnInt();
if (numberOfThreads < 1 || numberOfThreads > 5)
System.out.println("Please enter 1, 2, 3, 4, or 5 !");
}
countPrimesWithThreads(numberOfThreads);
}
private static int countPrimes(int min, int max) {
int count = 0;
for (int i = min; i <= max; i++)
if (isPrime(i))
count++;
return count;
}
private static boolean isPrime(int x) {
int top = (int)Math.sqrt(x);
for (int i = 2; i <= top; i++)
if ( x % i == 0 )
return false;
return true;
}
}
countPrimes does not need synchronization because it does not access any shared variable (it only works with the arguments and local variables). So there is nothing to synchronize.
On the other hand, the total variable is updated from several threads and the access needs to be synchronized to ensure correctness.
What prevents interleaving when countPrimes() is being executed?
Nothing. We don't need to prevent it (see below). And since we don't need to, preventing interleaving would be a bad thing because it would reduce parallelism.
Aren't the variables used by countPrimes() like i, min, max, count also shared resources?
No. They are local to the current thread; i.e. to the thread whose run() method call is in progress. Nothing else shares them.
And what about isPrime() which is called by countPrimes()?
Same deal. It is only using local variables, so no synchronization is necessary.
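A compact sketch of the same division of labour: the counting methods use only parameters and locals, so they run unsynchronized, and the single shared field is updated through a synchronized method. PrimeCountSketch and countWithThreads are hypothetical names, and unlike the original isPrime this version treats numbers below 2 as non-prime.

```java
// Hypothetical condensed variant of the program in the question.
final class PrimeCountSketch {
    private int total; // the only shared state, guarded by this object's monitor

    private synchronized void addToTotal(int x) {
        total += x;
    }

    private synchronized int getTotal() {
        return total;
    }

    // Uses only its parameter and locals: every call has its own stack frame.
    static boolean isPrime(int x) {
        if (x < 2) {
            return false;
        }
        int top = (int) Math.sqrt(x);
        for (int i = 2; i <= top; i++) {
            if (x % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Same here: min, max, i and count are private to the calling thread.
    static int countPrimes(int min, int max) {
        int count = 0;
        for (int i = min; i <= max; i++) {
            if (isPrime(i)) {
                count++;
            }
        }
        return count;
    }

    static int countWithThreads(int min, int max, int threadCount) {
        PrimeCountSketch shared = new PrimeCountSketch();
        Thread[] workers = new Thread[threadCount];
        long span = (long) max - min + 1;
        for (int t = 0; t < threadCount; t++) {
            final int lo = min + (int) (span * t / threadCount);
            final int hi = min + (int) (span * (t + 1) / threadCount) - 1;
            workers[t] = new Thread(() -> shared.addToTotal(countPrimes(lo, hi)));
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.getTotal();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(3_000_001, 6_000_000, 4));
    }
}
```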
The synchronized keyword simply acquires the monitor for some object. If another thread already has the monitor it will have to wait for that thread to finish before it can acquire it and proceed. Any piece of code that synchronizes on a common object will not be able to run concurrently since only one thread can acquire the monitor on that object at any given time. In the case of methods the monitor used is implicit. For non-static methods it's the instance it was called on, for static methods it's the Class for the type it's called on.
That's one possible reason but it hardly constitutes an accurate indication of when to use the keyword.
To answer the question I would say you use synchronized whenever you don't want two threads concurrently executing a critical section based upon a common monitor. The situations in which you would need this are many and riddled with far too many gotchas and exceptions to explain fully.
You can't prevent access to an entire class with synchronized. You can make every method synchronized, but still that's not quite the same thing. Plus, it only prevents other threads from accessing the critical section when synchronized on the same monitor.
So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
int mostLen = 0;
int mostInt = 0;
long currTime = System.currentTimeMillis();
for(int j=2; j<=1000000; j++) {
long i = j;
int len = 0;
while((i=next(i)) != 1) {
len++;
}
if(len > mostLen) {
mostLen = len;
mostInt = j;
}
}
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if(i%2==0) {
return i/2;
} else {
return i*3+1;
}
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    long currTime = System.currentTimeMillis();
    List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
    for (int j = 2; j <= 1000000; j++) {
        MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
        worker.setBean(new ValueBean(j, 0));
        Future<ValueBean> f = executor.submit(worker);
        list.add(f);
    }
    System.out.println(System.currentTimeMillis() - currTime);
    int mostLen = 0;
    int mostInt = 0;
    for (Future<ValueBean> f : list) {
        final int len = f.get().getLen();
        if (len > mostLen) {
            mostLen = len;
            mostInt = f.get().getNum();
        }
    }
    executor.shutdown();
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
    public ValueBean bean;
    public void setBean(ValueBean bean) {
        this.bean = bean;
    }
    public ValueBean call() throws Exception {
        long i = bean.getNum();
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return new ValueBean(bean.getNum(), len);
    }
}
public class ValueBean {
    int num;
    int len;
    public ValueBean(int num, int len) {
        this.num = num;
        this.len = len;
    }
    public int getNum() {
        return num;
    }
    public int getLen() {
        return len;
    }
}
long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
Unfortunately, the multithreaded version ran 5 times slower than the single-threaded one on 4 processors (cores).
Then I tried a somewhat cruder approach:
static int mostLen = 0;
static int mostInt = 0;
synchronized static void updateIfMore(int len, int intgr) {
    if (len > mostLen) {
        mostLen = len;
        mostInt = intgr;
    }
}
public static void main(String[] args) throws InterruptedException {
    long currTime = System.currentTimeMillis();
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    for (int i = 2; i <= 1000000; i++) {
        final int j = i;
        executor.execute(new Runnable() {
            public void run() {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                updateIfMore(len, j);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
This ran much faster, but it was still slower than the single-threaded approach.
I hope it's not because I screwed up the way I'm doing multithreading, but rather because this particular calculation/algorithm is not a good fit for parallel computation. If I change the calculation to make it more processor-intensive by replacing the method next with:
long next(long i) {
    Random r = new Random();
    for (int j = 0; j < 10; j++) {
        r.nextLong();
    }
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
both multithreaded versions start to execute more than twice as fast as the single-threaded version on a 4-core machine.
So clearly there must be some threshold that you can use to determine whether it is worth introducing multithreading, and my question is:
What is the basic rule that would help decide whether a given calculation is intensive enough to be optimized by running it in parallel (without spending the effort to actually implement it)?
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
Starting and stopping threads has a high cost. Of course you already used the executor service which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one has to do will improve performance, but you still want to have enough runnables for the executor service to efficiently distribute them over the worker threads.
You have chosen to create one Runnable for each starting value, so you end up creating 1,000,000 Runnables. You would probably get much better results if you let each Runnable handle a batch of, say, 1000 start values. That way you need only 1000 Runnables, which greatly reduces the overhead.
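A rough sketch of that batching idea, assuming a batch size of 1000 as suggested above. The class name BatchedSearch and the helpers searchRange and parallelSearch are made up for this example and are not from the question. Each task scans a whole range of start values and returns only its local best, so the pool handles about a thousand tasks instead of a million:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedSearch {
    static final int BATCH_SIZE = 1000; // assumed value, tune by measurement

    static long next(long i) {
        return (i % 2 == 0) ? i / 2 : i * 3 + 1;
    }

    // Scans start values from..to (inclusive) and returns {bestLen, bestStart}.
    static int[] searchRange(int from, int to) {
        int mostLen = 0;
        int mostInt = 0;
        for (int j = from; j <= to; j++) {
            long i = j;
            int len = 0;
            while ((i = next(i)) != 1) {
                len++;
            }
            if (len > mostLen) {
                mostLen = len;
                mostInt = j;
            }
        }
        return new int[] {mostLen, mostInt};
    }

    // Submits one task per batch and merges the per-batch results in order.
    static int[] parallelSearch(int limit, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[]>> futures = new ArrayList<Future<int[]>>();
            for (int from = 2; from <= limit; from += BATCH_SIZE) {
                final int lo = from;
                final int hi = Math.min(limit, from + BATCH_SIZE - 1);
                Callable<int[]> task = new Callable<int[]>() {
                    public int[] call() {
                        return searchRange(lo, hi);
                    }
                };
                futures.add(executor.submit(task));
            }
            int[] best = {0, 0};
            for (Future<int[]> f : futures) {
                int[] r = f.get();
                if (r[0] > best[0]) {
                    best = r;
                }
            }
            return best;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long start = System.currentTimeMillis();
        int[] best = parallelSearch(1000000, threads);
        System.out.println(System.currentTimeMillis() - start);
        System.out.println("Most len is " + best[0] + " for " + best[1]);
    }
}
```

Merging the results on the submitting thread also removes the shared synchronized update from the worker threads. Whether 1000 is the right batch size depends on the hardware, so it is worth trying a few values.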
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if you already have the previous results in hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 just two iterations earlier. If you saved that value of len, you can compute len(4) = 1 + len(2).
Using an array to store previously computed values of len and a bit of bit twiddling in the next method, you can complete the task more than 50x faster.
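One way this could look, as a sketch rather than the exact code behind the figure above. The names MemoizedSearch, search and steps are invented here. The array stores the number of steps needed to reach 1 (which is len + 1 in the question's convention), and because start values are processed in increasing order, every value below j is already cached by the time j is handled:

```java
class MemoizedSearch {
    // Same step function as in the question, written with bit operations.
    static long next(long i) {
        return ((i & 1) == 0) ? (i >> 1) : (3 * i + 1);
    }

    // Returns {bestLen, bestStart} for start values 2..limit,
    // using the question's definition of len.
    static int[] search(int limit) {
        // steps[n] = number of applications of next() needed to reach 1 from n.
        // steps[1] is 0 by default.
        int[] steps = new int[limit + 1];
        int mostLen = 0;
        int mostInt = 0;
        for (int j = 2; j <= limit; j++) {
            long i = j;
            int walked = 0;
            // Walk only until the sequence drops below j; the rest is cached.
            while (i >= j) {
                i = next(i);
                walked++;
            }
            steps[j] = walked + steps[(int) i];
            int len = steps[j] - 1; // the question's len excludes the final step to 1
            if (len > mostLen) {
                mostLen = len;
                mostInt = j;
            }
        }
        return new int[] {mostLen, mostInt};
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        int[] best = search(1000000);
        System.out.println(System.currentTimeMillis() - start);
        System.out.println("Most len is " + best[0] + " for " + best[1]);
    }
}
```

For j == 4 the loop takes one step to reach 2 and then adds the cached steps[2], which is exactly the len(4) = 1 + len(2) shortcut. This version relies on earlier results, so it is harder to split across threads.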
"Will the performance gain be greater than the cost of context switching and thread creation?"
That cost depends heavily on the OS, the language, and the hardware. This question has some discussion of the cost in Java, along with some numbers and some pointers on how to calculate it.
You also want to have one thread per CPU, or fewer, for CPU-intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate the amount of work that a thread can do without interacting with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, the overhead is too high and multithreading is of no use. If it takes 1 millisecond or more, multithreading should work well. If it is in between, experimental testing is required.
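A crude way to get such an estimate, sketched under the assumption that one "piece of work" is the chain computation for a single start value. The class TaskCostEstimate and its methods are hypothetical, and the 1 microsecond and 1 millisecond thresholds are the ones from the rule of thumb above. The timing has no JIT warm-up, so treat the number as an order-of-magnitude guide only:

```java
class TaskCostEstimate {
    static long next(long i) {
        return (i % 2 == 0) ? i / 2 : i * 3 + 1;
    }

    // One unit of work: the chain length for a single start value.
    static int chainLen(int start) {
        long i = start;
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return len;
    }

    // Average wall-clock nanoseconds per unit of work over from..to (inclusive).
    static double averageNanosPerTask(int from, int to) {
        long sink = 0;
        long t0 = System.nanoTime();
        for (int j = from; j <= to; j++) {
            sink += chainLen(j);
        }
        long elapsed = System.nanoTime() - t0;
        if (sink < 0) {
            // Never true here; it only keeps the results in use.
            System.out.println(sink);
        }
        return (double) elapsed / (to - from + 1);
    }

    // Applies the 1 microsecond / 1 millisecond rule of thumb.
    static String classify(double nanosPerTask) {
        if (nanosPerTask <= 1000.0) {
            return "too small: batch the work or stay single-threaded";
        }
        if (nanosPerTask >= 1000000.0) {
            return "large enough: multithreading should work well";
        }
        return "in between: measure both versions";
    }

    public static void main(String[] args) {
        double nanos = averageNanosPerTask(2, 1000000);
        System.out.println(nanos + " ns per task -> " + classify(nanos));
    }
}
```

If a single start value turns out to be far below a microsecond, grouping many start values into one task is the natural way to push the per-task cost toward the range where threads pay off.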