How to calculate Euler's number faster with Java multithreading - java

So I have a task to calculate Euler's number using multiple threads, using this formula: sum( ((3k)^2 + 1) / ((3k)!) ), for k = 0...infinity.
import java.math.BigDecimal;
import java.math.BigInteger;
import java.io.FileWriter;
import java.io.IOException;
import java.math.RoundingMode;
class ECalculator {
private BigDecimal sum;
private BigDecimal[] series;
private int length;
public ECalculator(int threadCount) {
this.length = threadCount;
this.sum = new BigDecimal(0);
this.series = new BigDecimal[threadCount];
for (int i = 0; i < this.length; i++) {
this.series[i] = BigDecimal.ZERO;
}
}
public synchronized void addToSum(BigDecimal element) {
this.sum = this.sum.add(element);
}
public void addToSeries(int id, BigDecimal element) {
if (id - 1 < length) {
this.series[id - 1] = this.series[id - 1].add(element);
}
}
public synchronized BigDecimal getSum() {
return this.sum;
}
public BigDecimal getSeriesSum() {
BigDecimal result = BigDecimal.ZERO;
for (int i = 0; i < this.length; i++) {
result = result.add(this.series[i]);
}
return result;
}
}
class ERunnable implements Runnable {
private final int id;
private final int threadCount;
private final int threadRemainder;
private final int elements;
private final boolean quietFlag;
private ECalculator eCalc;
public ERunnable(int threadCount, int threadRemainder, int id, int elements, boolean quietFlag, ECalculator eCalc) {
this.id = id;
this.threadCount = threadCount;
this.threadRemainder = threadRemainder;
this.elements = elements;
this.quietFlag = quietFlag;
this.eCalc = eCalc;
}
@Override
public void run() {
if (!quietFlag) {
System.out.println(String.format("Thread-%d started.", this.id));
}
long start = System.currentTimeMillis();
int k = this.threadRemainder;
int iteration = 0;
BigInteger currentFactorial = BigInteger.valueOf(intFactorial(3 * k));
while (iteration < this.elements) {
if (iteration != 0) {
for (int i = 3 * (k - threadCount) + 1; i <= 3 * k; i++) {
currentFactorial = currentFactorial.multiply(BigInteger.valueOf(i));
}
}
this.eCalc.addToSeries(this.id, new BigDecimal(Math.pow(3 * k, 2) + 1).divide(new BigDecimal(currentFactorial), 100, RoundingMode.HALF_UP));
iteration += 1;
k += this.threadCount;
}
long stop = System.currentTimeMillis();
if (!quietFlag) {
System.out.println(String.format("Thread-%d stopped.", this.id));
System.out.println(String.format("Thread %d execution time: %d milliseconds", this.id, stop - start));
}
}
public int intFactorial(int n) {
int result = 1;
for (int i = 1; i <= n; i++) {
result *= i;
}
return result;
}
}
public class TaskRunner {
public static final String DEFAULT_FILE_NAME = "result.txt";
public static void main(String[] args) throws InterruptedException {
int threadCount = 2;
int precision = 10000;
int elementsPerTask = precision / threadCount;
int remainingElements = precision % threadCount;
boolean quietFlag = false;
calculate(threadCount, elementsPerTask, remainingElements, quietFlag, DEFAULT_FILE_NAME);
}
public static void writeResult(String filename, String result) {
try {
FileWriter writer = new FileWriter(filename);
writer.write(result);
writer.close();
} catch (IOException e) {
System.out.println("An error occurred.");
e.printStackTrace();
}
}
public static void calculate(int threadCount, int elementsPerTask, int remainingElements, boolean quietFlag, String outputFile) throws InterruptedException {
long start = System.currentTimeMillis();
Thread[] threads = new Thread[threadCount];
ECalculator eCalc = new ECalculator(threadCount);
for (int i = 0; i < threadCount; i++) {
if (i == 0) {
threads[i] = new Thread(new ERunnable(threadCount, i, i + 1, elementsPerTask + remainingElements, quietFlag, eCalc));
} else {
threads[i] = new Thread(new ERunnable(threadCount, i, i + 1, elementsPerTask, quietFlag, eCalc));
}
threads[i].start();
}
for (int i = 0; i < threadCount; i++) {
threads[i].join();
}
String result = eCalc.getSeriesSum().toString();
if (!quietFlag) {
System.out.println("E = " + result);
}
writeResult(outputFile, result);
long stop = System.currentTimeMillis();
System.out.println("Calculated in: " + (stop - start) + " milliseconds" );
}
}
I stripped out the prints, etc. in the code that have no effect. My problem is that the more threads I use the slower it gets. Currently the fastest run I have is for 1 thread. I am sure the factorial calculation is causing some issues. I tried using a thread pool but still got the same times.
How can I make it so that running it with more threads, up until some point, will speed up the calculation process?
How would one go about calculating these big factorials?
The precision parameter that is passed is the number of elements in the sum that are used. Can I make the BigDecimal scale depend on that precision somehow, so I don't hard-code it?
EDIT
I updated the code block to be in 1 file only and runnable without external libs.
EDIT 2
I found out that the factorial code messes up the timing. If I let the threads ramp up to some high precision without calculating factorials, the time goes down with increasing threads. Yet I cannot implement the factorial calculation in any way while keeping the time decreasing.
EDIT 3
Adjusting code to address answers.
private static BigDecimal partialCalculator(int start, int threadCount, int id) {
BigDecimal nBD = BigDecimal.valueOf(start);
BigDecimal result = nBD.multiply(nBD).multiply(BigDecimal.valueOf(9)).add(BigDecimal.valueOf(1));
for (int i = start; i > 0; i -= threadCount) {
BigDecimal iBD = BigDecimal.valueOf(i);
BigDecimal iBD1 = BigDecimal.valueOf(i - 1);
BigDecimal iBD3 = BigDecimal.valueOf(3).multiply(iBD);
BigDecimal prevNumerator = iBD1.multiply(iBD1).multiply(BigDecimal.valueOf(9)).add(BigDecimal.valueOf(1));
// 3 * i * (3 * i - 1) * (3 * i - 2);
BigDecimal divisor = iBD3.multiply(iBD3.subtract(BigDecimal.valueOf(1))).multiply(iBD3.subtract(BigDecimal.valueOf(2)));
result = result.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(prevNumerator);
}
return result;
}
public static void main(String[] args) {
int threadCount = 3;
int precision = 6;
ExecutorService executorService = Executors.newFixedThreadPool(threadCount);
ArrayList<Future<BigDecimal> > futures = new ArrayList<Future<BigDecimal> >();
for (int i = 0; i < threadCount; i++) {
int start = precision - i;
System.out.println(start);
final int id = i + 1;
futures.add(executorService.submit(() -> partialCalculator(start, threadCount, id)));
}
BigDecimal result = BigDecimal.ZERO;
try {
for (int i = 0; i < threadCount; i++) {
result = result.add(futures.get(i).get());
}
} catch (Exception e) {
e.printStackTrace();
}
executorService.shutdown();
System.out.println(result);
}
It seems to work properly for 1 thread but messes up the calculation for multiple threads.

After a review of the updated code, I've made the following observations:
First of all, the program runs for a fraction of a second. That means this is a micro-benchmark. Several key features in Java make micro-benchmarks difficult to implement reliably. See How do I write a correct micro-benchmark in Java? For example, if the program doesn't run enough repetitions, the "just in time" compiler doesn't have time to kick in and compile it to native code, and you end up benchmarking the interpreter. It seems possible that in your case the JIT compiler takes longer to kick in when there are multiple threads.
As an example, to make your program do more work, I changed the BigDecimal precision from 100 to 10,000 and added a loop around the main method. The execution times were measured as follows:
1 thread:
Calculated in: 2803 milliseconds
Calculated in: 1116 milliseconds
Calculated in: 1040 milliseconds
Calculated in: 1066 milliseconds
Calculated in: 1036 milliseconds
2 threads:
Calculated in: 2354 milliseconds
Calculated in: 856 milliseconds
Calculated in: 624 milliseconds
Calculated in: 659 milliseconds
Calculated in: 664 milliseconds
4 threads:
Calculated in: 1961 milliseconds
Calculated in: 797 milliseconds
Calculated in: 623 milliseconds
Calculated in: 536 milliseconds
Calculated in: 497 milliseconds
The second observation is that there is a significant part of the workload that does not benefit from multiple threads: every thread is computing every factorial. This means the speed-up cannot be linear, as described by Amdahl's law.
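To make Amdahl's law concrete, here's a tiny sketch (the parallel fraction p is something you would have to measure for your program; the class name is mine):

```java
public class AmdahlSketch {
    // Amdahl's law: speedup S(n) = 1 / ((1 - p) + p / n),
    // where p is the fraction of the work that parallelizes and n is the thread count.
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // If only 80% of the work parallelizes, 4 threads give ~2.5x, not 4x.
        System.out.println(speedup(0.8, 4));
    }
}
```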
So how can we get the result without computing factorials? One way is with Horner's method. As an example, consider the simpler series sum(1/k!), which also converges to e, but a little more slowly than yours.
Let's say you want to compute sum(1/k!) up to k = 100. With Horner's method you start from the end and extract common factors:
sum(1/k!, k=0..n) = 1/100! + 1/99! + 1/98! + ... + 1/1! + 1/0!
= ((... (((1/100 + 1)/99 + 1)/98 + ...)/2 + 1)/1 + 1
See how you start with 1, divide by 100 and add 1, divide by 99 and add 1, divide by 98 and add 1, and so on? That makes a very simple program:
private static BigDecimal serialHornerMethod() {
BigDecimal accumulator = BigDecimal.ONE;
for (int k = 10000; k > 0; k--) {
BigDecimal divisor = new BigDecimal(k);
accumulator = accumulator.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(BigDecimal.ONE);
}
return accumulator;
}
OK, that's a serial method; how do you make it parallel? Here's an example for two threads. First, split the series into even and odd terms:
1/100! + 1/99! + 1/98! + 1/97! + ... + 1/1! + 1/0! =
(1/100! + 1/98! + ... + 1/0!) + (1/99! + 1/97! + ... + 1/1!)
Then apply Horner's method to both the even and odd terms:
1/100! + 1/98! + 1/96! + ... + 1/2! + 1/0! =
((((1/(100*99) + 1)/(98*97) + 1)/(96*95) + ...)/(2*1) + 1
and:
1/99! + 1/97! + 1/95! + ... + 1/3! + 1/1! =
((((1/(99*98) + 1)/(97*96) + 1)/(95*94) + ...)/(3*2) + 1
This is just as easy to implement as the serial method, and you get pretty close to linear speedup going from 1 to 2 threads:
private static BigDecimal partialHornerMethod(int start) {
BigDecimal accumulator = BigDecimal.ONE;
for (int i = start; i > 0; i -= 2) {
int f = i * (i + 1);
BigDecimal divisor = new BigDecimal(f);
accumulator = accumulator.divide(divisor, 10000, RoundingMode.HALF_EVEN)
.add(BigDecimal.ONE);
}
return accumulator;
}
// Usage:
ExecutorService executorService = Executors.newFixedThreadPool(2);
Future<BigDecimal> submit = executorService.submit(() -> partialHornerMethod(10000));
Future<BigDecimal> submit1 = executorService.submit(() -> partialHornerMethod(9999));
BigDecimal result = submit1.get().add(submit.get());

There is a lot of contention between the threads: they all compete to get a lock on the ECalculator object after every little bit of computation, because of this method:
public synchronized void addToSum(BigDecimal element) {
this.sum = this.sum.add(element);
}
In general, having threads compete for frequent access to a common resource leads to bad performance, because you're asking the operating system to intervene and decide which thread can continue. I haven't tested your code to confirm that this is the issue because it's not self-contained.
To fix this, have the threads accumulate their results separately, and merge results after the threads have finished. That is, create a sum variable in ERunnable, and then change the methods:
// ERunnable.run:
this.sum = this.sum.add(new BigDecimal(Math.pow(3 * k, 2) + 1).divide(new BigDecimal(factorial(3 * k)), 100, RoundingMode.HALF_UP));
// TaskRunner.calculate:
for (int i = 0; i < threadCount; i++) {
threads[i].join();
eCalc.addToSum(/* recover the sum computed by thread */);
}
By the way, this would be easier if you used the higher-level java.util.concurrent API instead of creating Thread objects yourself. You could wrap the computation in a Callable, which can return a result.
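A minimal sketch of that idea (the names and slicing scheme are mine, not from the question): each Callable returns its own partial sum, and the merge happens once, after all futures complete, so there is no synchronized hot spot. For brevity it sums the simpler series sum(1/k!) as a stand-in.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartialSums {
    // Sums 1/k! for k = firstK, firstK+step, firstK+2*step, ... ("terms" of them).
    // The factorial is extended incrementally instead of recomputed from scratch.
    public static BigDecimal sliceSum(int firstK, int step, int terms, int scale) {
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal factorial = BigDecimal.ONE;
        int prev = 0;
        for (int t = 0, k = firstK; t < terms; t++, k += step) {
            for (int i = prev + 1; i <= k; i++) {
                factorial = factorial.multiply(BigDecimal.valueOf(i));
            }
            prev = k;
            sum = sum.add(BigDecimal.ONE.divide(factorial, scale, RoundingMode.HALF_EVEN));
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        final int threads = 2, termsPerThread = 15, scale = 30;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<BigDecimal>> futures = new ArrayList<>();
        for (int id = 0; id < threads; id++) {
            final int firstK = id;
            futures.add(pool.submit(() -> sliceSum(firstK, threads, termsPerThread, scale)));
        }
        BigDecimal e = BigDecimal.ZERO;
        for (Future<BigDecimal> f : futures) {
            e = e.add(f.get());   // merge once, after the work is done
        }
        pool.shutdown();
        System.out.println(e);    // approximates e
    }
}
```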
Q2 How would one go about calculating these big factorials?
Usually you don't. Instead, you reformulate the problem so that it does not involve the direct computation of factorials. One technique is Horner's method.
Q3 The precision parameter that is passed is the amount of elements in the sum that are used. Can I set the BigDecimal scale to be somehow dependent on that precision so I don't hard code it?
Sure, why not? You can work out the error bound from the number of elements (it's proportional to the last term in the series) and set the BigDecimal scale accordingly.
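One way to turn the term count into a scale (a sketch; scaleForTerms is my name): for a sum over k = 0..n-1, the first dropped term is on the order of 1/(3n)!, so roughly log10((3n)!) decimal digits of the result are meaningful.

```java
public class ScaleFromTerms {
    // Roughly how many decimal digits the truncated series can be trusted to,
    // i.e. floor(log10((3n)!)), computed as a sum of logs to avoid overflow.
    public static int scaleForTerms(int n) {
        double log10Factorial = 0.0;
        for (int i = 2; i <= 3 * n; i++) {
            log10Factorial += Math.log10(i);
        }
        return (int) Math.floor(log10Factorial);
    }

    public static void main(String[] args) {
        System.out.println(scaleForTerms(4)); // log10(12!) is about 8.68, so 8
    }
}
```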

Related

Does the JIT Optimizer Optimize Multiplication?

In my computer architecture class I just learned that running an algebraic expression involving multiplication through a multiplication circuit can be more costly than running it through an addition circuit, if the number of required multiplications is less than 3. Ex: 3x. If I'm doing this type of computation a few billion times, does it pay off to write it as x + x + x, or does the JIT optimizer optimize for this?
I wouldn't expect there to be a huge difference writing it one way or the other.
The compiler will probably take care of making all of those equivalent.
You can try each method and measure how long it takes, that could give you a good hint to answer your own question.
Here's some code that does the same calculations 10 million times using different approaches (x + x + x, 3*x, and a bit shift followed by a subtraction).
They all seem to take approximately the same amount of time, as measured by System.nanoTime.
Sample output for one run:
sum : 594599531
mult : 568783654
shift : 564081012
You can also take a look at this question, which talks about how compiler optimizations can probably handle those and more complex cases: Is shifting bits faster than multiplying and dividing in Java? .NET?
Code:
import java.util.Random;
public class TestOptimization {
public static void main(String args[]) {
Random rn = new Random();
long l1 = 0, l2 = 0, l3 = 0;
long nano1 = System.nanoTime();
for (int i = 1; i < 10000000; i++) {
int num = rn.nextInt(100);
l1 += sum(num);
}
long nano2 = System.nanoTime();
for (int i = 1; i < 10000000; i++) {
int num = rn.nextInt(100);
l2 += mult(num);
}
long nano3 = System.nanoTime();
for (int i = 1; i < 10000000; i++) {
int num = rn.nextInt(100);
l3 += shift(num);
}
long nano4 = System.nanoTime();
System.out.println(l1);
System.out.println(l2);
System.out.println(l3);
System.out.println("sum : " + (nano2 - nano1));
System.out.println("mult : " + (nano3 - nano2));
System.out.println("shift : " + (nano4 - nano3));
}
private static long sum(long x) {
return x + x + x;
}
private static long mult(long x) {
return 3 * x;
}
private static long shift(long x) {
return (x << 2) - x;
}
}

ThreadLocalRandom with shared static Random instance performance comparing test

In our project we used a static Random instance for random number generation. With the Java 7 release, the new ThreadLocalRandom class appeared for generating random numbers.
From spec:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
and also:
When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.
So I've made my little test:
public class ThreadLocalRandomTest {
private static final int THREAD_COUNT = 100;
private static final int GENERATED_NUMBER_COUNT = 1000;
private static final int INT_RIGHT_BORDER = 5000;
private static final int EXPERIMENTS_COUNT = 5000;
public static void main(String[] args) throws InterruptedException {
System.out.println("Number of threads: " + THREAD_COUNT);
System.out.println("Length of generated numbers chain for each thread: " + GENERATED_NUMBER_COUNT);
System.out.println("Right border integer: " + INT_RIGHT_BORDER);
System.out.println("Count of experiments: " + EXPERIMENTS_COUNT);
int repeats = 0;
int workingTime = 0;
long startTime = 0;
long endTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForSharedRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for shared Random instance: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
repeats = 0;
workingTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForTheadLocalRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for ThreadLocalRandom: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
}
private static int calculateRepeatsForSharedRandom() throws InterruptedException {
final Random rand = new Random();
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
private static int calculateRepeatsForTheadLocalRandom() throws InterruptedException {
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = ThreadLocalRandom.current().nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
}
I also added a test for a non-shared Random and got the following results:
Number of threads: 100
Length of generated numbers chain for each thread: 100
Right border integer: 5000
Count of experiments: 10000
Average repeats for non-shared Random instance: 8646. Average working time: 13 ms.
Average repeats for shared Random instance: 8646. Average working time: 13 ms.
Average repeats for ThreadLocalRandom: 8646. Average working time: 13 ms.
To me it's a little strange, as I expected at least a speed increase when using ThreadLocalRandom compared to the shared Random instance, but I see no difference at all.
Can someone explain why it works that way? Maybe I haven't done the testing properly. Thank you!
You're not running anything in parallel because you're waiting for each thread to finish immediately after starting it. You need a waiting loop outside the loop that starts the threads:
List<Thread> threads = new ArrayList<Thread>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
@Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
threads.add(thread);
thread.start();
}
for (Thread thread: threads) {
thread.join();
}
For one thing, your testing code is flawed: the bane of benchmarkers everywhere.
thread.start();
thread.join();
why not save LOCs and write
thread.run();
the outcome is the same.
EDIT: If you don't see why the outcome is the same, it means that you're running single-threaded tests; there's no multithreading going on.
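The difference is easy to demonstrate (a minimal sketch; the helper names are mine): run() executes on the calling thread like any ordinary method, while start() spawns a new thread.

```java
public class RunVsStart {
    // run() is a plain method call on the current thread.
    public static String nameSeenByRun() {
        final String[] name = new String[1];
        Thread t = new Thread(() -> name[0] = Thread.currentThread().getName());
        t.run();                  // no new thread involved
        return name[0];
    }

    // start() schedules the Runnable on a freshly created thread.
    public static String nameSeenByStart() throws InterruptedException {
        final String[] name = new String[1];
        Thread t = new Thread(() -> name[0] = Thread.currentThread().getName());
        t.start();                // real concurrency
        t.join();
        return name[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(nameSeenByRun());   // the caller's name, e.g. "main"
        System.out.println(nameSeenByStart()); // a new thread's name, e.g. "Thread-1"
    }
}
```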
Maybe it would be easier to just have a look at what actually happens. Here is the source for ThreadLocal.get(), which is also called by ThreadLocalRandom.current().
public T get() {
Thread t = Thread.currentThread();
ThreadLocalMap map = getMap(t);
if (map != null) {
ThreadLocalMap.Entry e = map.getEntry(this);
if (e != null)
return (T)e.value;
}
return setInitialValue();
}
Where ThreadLocalMap is a specialized HashMap-like implementation with optimizations.
So what basically happens is that ThreadLocal holds a map Thread->Object - or in this case Thread->Random - which is then looked up and either returned or created. As this is nothing 'magical', the timing will be equal to a HashMap lookup plus the initial creation overhead of the actual Object to be returned. Since a HashMap lookup (in this optimized case) runs in constant time, the cost of a lookup is k, where k is the calculation cost of the hash function.
So you can make some assumptions:
ThreadLocal will be faster than creating the object each time in each Runnable, unless the creation cost is much smaller than k. So looking up Random is a good thing, putting an int inside might not be so smart.
ThreadLocal will be better than using your own HashMap, as such a generic implementation can be assumed to be equal to k or worse.
ThreadLocal will be slower than using any lookup with a cost < k. Example: store everything in an array first, then do myRandoms[threadID].
But then this assumes that you know which threads will be processing your work in the first place, so this isn't a real candidate for ThreadLocal anyways.
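For completeness, the "looking up Random" pattern discussed above can be sketched like this (a pre-ThreadLocalRandom style; the class name is mine): each thread lazily gets its own Random on first access, and repeated lookups on the same thread return the same instance.

```java
import java.util.Random;

public class PerThreadRandom {
    // One Random per thread, created lazily on each thread's first get().
    public static final ThreadLocal<Random> RANDOM = ThreadLocal.withInitial(Random::new);

    public static void main(String[] args) throws InterruptedException {
        Random onMain = RANDOM.get();
        // Same thread -> same instance.
        System.out.println(onMain == RANDOM.get());   // true
        // Another thread -> its own instance.
        final Random[] onOther = new Random[1];
        Thread t = new Thread(() -> onOther[0] = RANDOM.get());
        t.start();
        t.join();
        System.out.println(onMain == onOther[0]);     // false
    }
}
```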

ForkJoinPool creating thousands of threads

A simple test that demonstrates the problem:
package com.test;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;
public class Main extends RecursiveTask<Long> {
private volatile long start;
private volatile long end;
private volatile int deep;
public Main(long start, long end, int index, int deep) {
this.start = start;
this.end = end;
this.deep = deep;
// System.out.println(deep + "-" + index);
}
@Override
protected Long compute() {
long part = (end - start) / 10;
if (part > 1000 && deep < 10) {
List<RecursiveTask<Long>> subtasks = new ArrayList<RecursiveTask<Long>>();
for (int i = 0; i < 10; i++) {
long subtaskEnd = start + part;
if (i == 9) {
subtaskEnd = end;
}
subtasks.add(new Main(start, subtaskEnd, i, deep + 1));
start = subtaskEnd;
}
//CASE 1: generates 3000+ threads
for (int i = 0; i < 10; i++) {
subtasks.get(i).fork();
}
//CASE 2: generates 4 threads
// invokeAll(subtasks);
//CASE 3: generates 4 threads
// for (int i = 9; i >= 0; i--) {
// subtasks.get(i).fork();
// }
long count = 0;
for (int i = 0; i < 10; i++) {
count += subtasks.get(i).join();
}
return count;
} else {
long startStart = start;
while (start < end) {
start += 1;
}
return start - startStart;
}
}
private static ForkJoinPool executor = new ForkJoinPool();
public static void main(String[] args) throws Exception {
ForkJoinTask<Long> forkJoinTask = executor.submit(new Main(0, Integer.MAX_VALUE / 10, 0, 0));
Long result = forkJoinTask.get();
System.out.println("Final result: " + result);
System.out.println("Number of threads: " + executor.getPoolSize());
}
}
In this sample I create a RecursiveTask that simply counts numbers to create some load on the CPU. It divides the incoming range into 10 parts recursively, and when the size of a part is less than 1000 or the recursion depth is over 10, it starts to count numbers.
There are 3 cases commented in the compute() method. The difference is only in the order of forking subtasks. Depending on the order in which I fork subtasks, the number of threads at the end is different. On my system it creates 3000+ threads for the first case and 4 threads for the second and third cases.
The question is: what's the difference? Do I really need to know the internals of this framework to use it successfully?
This is an old problem that I addressed in an article back in 2011, A Java Fork-Join Calamity. The article points to part II, which shows the fix (of sorts) for this in Java 8: it substitutes stalls for the extra threads.
You really can't do much professionally with this framework. There are other frameworks you can use.

Is there any "threshold" justifying multithreaded computation?

So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
int mostLen = 0;
int mostInt = 0;
long currTime = System.currentTimeMillis();
for(int j=2; j<=1000000; j++) {
long i = j;
int len = 0;
while((i=next(i)) != 1) {
len++;
}
if(len > mostLen) {
mostLen = len;
mostInt = j;
}
}
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if(i%2==0) {
return i/2;
} else {
return i*3+1;
}
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
long currTime = System.currentTimeMillis();
List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
for (int j = 2; j <= 1000000; j++) {
MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
worker.setBean(new ValueBean(j, 0));
Future<ValueBean> f = executor.submit(worker);
list.add(f);
}
System.out.println(System.currentTimeMillis() - currTime);
int mostLen = 0;
int mostInt = 0;
for (Future<ValueBean> f : list) {
final int len = f.get().getLen();
if (len > mostLen) {
mostLen = len;
mostInt = f.get().getNum();
}
}
executor.shutdown();
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
public ValueBean bean;
public void setBean(ValueBean bean) {
this.bean = bean;
}
public ValueBean call() throws Exception {
long i = bean.getNum();
int len = 0;
while ((i = next(i)) != 1) {
len++;
}
return new ValueBean(bean.getNum(), len);
}
}
public class ValueBean {
int num;
int len;
public ValueBean(int num, int len) {
this.num = num;
this.len = len;
}
public int getNum() {
return num;
}
public int getLen() {
return len;
}
}
long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
Unfortunately, the multithreaded version worked 5 times slower than the single-threaded on 4 processors (cores).
Then I tried a bit more crude approach:
static int mostLen = 0;
static int mostInt = 0;
synchronized static void updateIfMore(int len, int intgr) {
if (len > mostLen) {
mostLen = len;
mostInt = intgr;
}
}
public static void main(String[] args) throws InterruptedException {
long currTime = System.currentTimeMillis();
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
for (int i = 2; i <= 1000000; i++) {
final int j = i;
executor.execute(new Runnable() {
public void run() {
long l = j;
int len = 0;
while ((l = next(l)) != 1) {
len++;
}
updateIfMore(len, j);
}
});
}
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
and it worked much faster, but still it was slower than the single thread approach.
I hope it's not because I screwed up the way I'm doing multithreading, but rather that this particular calculation/algorithm is not a good fit for parallel computation. If I change the calculation to make it more processor-intensive by replacing the method next with:
long next(long i) {
Random r = new Random();
for(int j=0; j<10; j++) {
r.nextLong();
}
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
both multithreaded versions start to execute more than twice as fast as the single-threaded version on a 4-core machine.
So clearly there must be some threshold you can use to determine whether it is worth introducing multithreading, and my question is:
What is the basic rule that would help decide if a given calculation is intensive enough to be optimized by running it in parallel (without spending the effort to actually implement it)?
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
Starting and stopping threads has a high cost. Of course you already used the executor service, which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However, each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one does will improve performance, but you still want enough runnables for the executor service to distribute them efficiently over the worker threads.
You have chosen to create one runnable for each starting value, so you end up creating 1000000 runnables. You would probably get much better results if you let each Runnable handle a batch of, say, 1000 start values. That means you only need 1000 runnables, greatly reducing the overhead.
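A sketch of that batching idea (the class name, batch size, and ranges are mine): each task scans a contiguous block of starting values and returns its local best, so submission overhead is paid once per batch rather than once per value, and the merge happens after the futures complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedSearch {
    static long next(long i) {
        return i % 2 == 0 ? i / 2 : i * 3 + 1;
    }

    // Number of next() steps needed for j to reach 1.
    public static int chainLength(long j) {
        int len = 0;
        for (long i = j; i != 1; i = next(i)) len++;
        return len;
    }

    // Longest chain for starting values in [from, to); returns {length, startValue}.
    public static long[] longest(int from, int to, int batch, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<long[]>> futures = new ArrayList<>();
        for (int lo = from; lo < to; lo += batch) {
            final int start = lo, end = Math.min(lo + batch, to);
            futures.add(pool.submit(() -> {
                long bestLen = -1, bestJ = -1;   // local best: no shared state, no locks
                for (int j = start; j < end; j++) {
                    int len = chainLength(j);
                    if (len > bestLen) { bestLen = len; bestJ = j; }
                }
                return new long[]{bestLen, bestJ};
            }));
        }
        long[] best = {-1, -1};
        for (Future<long[]> f : futures) {       // merge once, in submission order
            long[] r = f.get();
            if (r[0] > best[0]) best = r;
        }
        pool.shutdown();
        return best;
    }

    public static void main(String[] args) throws Exception {
        long[] best = longest(2, 100000, 1000, Runtime.getRuntime().availableProcessors());
        System.out.println("Most len is " + best[0] + " for " + best[1]);
    }
}
```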
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if you already have the previous results in hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 two iterations ago; if you saved the value of len, you could compute it as len(4) = 1 + len(2).
Using an array to store previously computed values of len, and a little bit of twiddling in the next method, you can complete the task >50x faster.
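That memoization can be sketched like this (bounds and names are mine; note this counts every next() step, which is one more than the question's while loop counts):

```java
public class MemoCollatz {
    static final int LIMIT = 1000000;
    // cache[j] holds the chain length for j; 0 means "not computed yet"
    // (j == 1, whose true length is 0, is handled explicitly).
    static final int[] cache = new int[LIMIT + 1];

    // Number of steps i -> next(i) until reaching 1, reusing cached suffixes.
    // Intermediate values can exceed LIMIT, so only in-range values are cached.
    public static int chainLength(long i) {
        if (i == 1) return 0;
        if (i <= LIMIT && cache[(int) i] != 0) return cache[(int) i];
        long next = i % 2 == 0 ? i / 2 : i * 3 + 1;
        int len = 1 + chainLength(next);
        if (i <= LIMIT) cache[(int) i] = len;
        return len;
    }

    public static void main(String[] args) {
        int mostLen = 0, mostInt = 0;
        for (int j = 2; j <= LIMIT; j++) {
            int len = chainLength(j);
            if (len > mostLen) { mostLen = len; mostInt = j; }
        }
        System.out.println("Most len is " + mostLen + " for " + mostInt);
    }
}
```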
"Will the performance gain be greater than the cost of context switching and thread creation?"
That cost is very OS-, language-, and hardware-dependent; this question has some discussion of the cost in Java, with some numbers and some pointers on how to calculate it.
You also want to have one thread per CPU, or less, for CPU intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate the amount of work a thread can do without interacting with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, the overhead is too high and multithreading is of no use. If it takes 1 millisecond or more, multithreading should work well. In between, experimental testing is required.
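A rough way to apply that rule of thumb (a sketch; the names are mine, and a real measurement should use a proper harness such as JMH to avoid the JIT pitfalls discussed in the first answer):

```java
public class WorkUnitTimer {
    // Average wall-clock nanoseconds per execution of one unit of work.
    public static long nanosPerUnit(Runnable unit, int reps) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            unit.run();
        }
        return (System.nanoTime() - start) / reps;
    }

    public static void main(String[] args) {
        long ns = nanosPerUnit(() -> Math.sqrt(42.0), 1_000_000);
        // Under ~1000 ns per unit, per-unit task submission likely costs more than it saves.
        System.out.println(ns + " ns per unit");
    }
}
```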

What's the best way to perform DFS on a very large tree?

Here's the situation:
The application world consists of hundreds of thousands of states.
Given a state, I can work out a set of 3 or 4 other reachable states. A simple recursion can build a tree of states that gets very large very fast.
I need to perform a DFS to a specific depth in this tree from the root state, to search for the subtree which contains the 'minimal' state (calculating the value of the node is irrelevant to the question).
Using a single thread to perform the DFS works, but is very slow. Covering 15 levels down can take a good few minutes, and I need to improve this atrocious performance. Trying to assign a thread to each subtree created too many threads and caused an OutOfMemoryError. Using a ThreadPoolExecutor wasn't much better.
My question: What's the most efficient way to traverse this large tree?
I don't believe navigating the tree is your problem, since your tree has only about 36 million nodes. It is more likely that whatever you are doing with each node is expensive.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class Main {
    public static final int TOP_LEVELS = 2;

    enum BuySell {}

    private static final AtomicLong called = new AtomicLong();

    public static void main(String... args) throws InterruptedException {
        int maxLevels = 15;
        long start = System.nanoTime();
        method(maxLevels);
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to navigate %,d levels called %,d times%n",
                time / 1e9, maxLevels, called.longValue());
    }

    public static void method(int maxLevels) throws InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            int result = method(service, 0, maxLevels - 1, new int[maxLevels]).call();
        } catch (Exception e) {
            e.printStackTrace();
        }
        service.shutdown();
        service.awaitTermination(10, TimeUnit.MINUTES);
    }

    // single threaded: process the highest levels of the tree.
    private static Callable<Integer> method(final ExecutorService service, final int level, final int maxLevel, final int[] options) {
        int choices = level % 2 == 0 ? 3 : 4;
        final List<Callable<Integer>> callables = new ArrayList<Callable<Integer>>(choices);
        for (int i = 0; i < choices; i++) {
            options[level] = i;
            Callable<Integer> callable = level < TOP_LEVELS ?
                    method(service, level + 1, maxLevel, options) :
                    method1(service, level + 1, maxLevel, options);
            callables.add(callable);
        }
        return new Callable<Integer>() {
            @Override
            public Integer call() throws Exception {
                Integer min = Integer.MAX_VALUE;
                for (Callable<Integer> result : callables) {
                    Integer num = result.call();
                    if (min > num)
                        min = num;
                }
                return min;
            }
        };
    }

    // at this level, process the branches in separate threads.
    private static Callable<Integer> method1(final ExecutorService service, final int level, final int maxLevel, final int[] options) {
        int choices = level % 2 == 0 ? 3 : 4;
        final List<Future<Integer>> futures = new ArrayList<Future<Integer>>(choices);
        for (int i = 0; i < choices; i++) {
            options[level] = i;
            final int[] optionsCopy = options.clone();
            Future<Integer> future = service.submit(new Callable<Integer>() {
                @Override
                public Integer call() {
                    return method2(level + 1, maxLevel, optionsCopy);
                }
            });
            futures.add(future);
        }
        return new Callable<Integer>() {
            @Override
            public Integer call() throws Exception {
                Integer min = Integer.MAX_VALUE;
                for (Future<Integer> result : futures) {
                    Integer num = result.get();
                    if (min > num)
                        min = num;
                }
                return min;
            }
        };
    }

    // at these levels, each task processes its whole subtree in its own thread.
    private static int method2(int level, int maxLevel, int[] options) {
        if (level == maxLevel) {
            return process(options);
        }
        int choices = level % 2 == 0 ? 3 : 4;
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < choices; i++) {
            options[level] = i;
            int n = method2(level + 1, maxLevel, options);
            if (min > n)
                min = n;
        }
        return min;
    }

    private static int process(final int[] options) {
        int min = options[0];
        for (int i : options)
            if (min > i)
                min = i;
        called.incrementAndGet();
        return min;
    }
}
prints
Took 1.273 second to navigate 15 levels called 35,831,808 times
I suggest you limit the number of threads and use separate threads only for the highest levels of the tree. How many cores do you have? Once you have enough threads to keep every core busy, creating more threads just adds overhead.
Java has a built-in Stack which is thread-safe; however, I would just use an ArrayList, which is more efficient because it is unsynchronized.
You will definitely have to use an iterative method. The simplest way is a stack-based DFS, with pseudocode similar to this:
STACK.push(root)
while (STACK.nonempty)
    current = STACK.pop
    if (current.done) continue
    // ... do something with node ...
    current.done = true
    FOREACH (neighbor n of current)
        if (! n.done)
            STACK.push(n)
The time complexity of this is O(n + m), where n and m denote the number of nodes and edges in your graph. Since you have a tree, this is O(n), and it should easily handle n > 1,000,000.
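The pseudocode above can be turned into runnable Java. Here an ArrayDeque plays the role of STACK (it is unsynchronized and faster than java.util.Stack), and a HashSet replaces the per-node done flag; the integer states and the neighbors function are placeholder assumptions standing in for the real application states:

```java
import java.util.*;

public class IterativeDfs {
    // Adjacency list as a stand-in for "reachable states" (assumption:
    // states are ints; the real application would derive them on the fly).
    static List<Integer> neighbors(Map<Integer, List<Integer>> graph, int node) {
        return graph.getOrDefault(node, Collections.emptyList());
    }

    static List<Integer> dfs(Map<Integer, List<Integer>> graph, int root) {
        List<Integer> visitOrder = new ArrayList<>();
        Set<Integer> done = new HashSet<>();        // replaces the .done flag
        Deque<Integer> stack = new ArrayDeque<>();  // unsynchronized stack
        stack.push(root);
        while (!stack.isEmpty()) {
            int current = stack.pop();
            if (!done.add(current)) continue;  // skip already-processed nodes
            visitOrder.add(current);           // "... do something with node ..."
            for (int n : neighbors(graph, current))
                if (!done.contains(n))
                    stack.push(n);
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> tree = Map.of(
                1, List.of(2, 3),
                2, List.of(4, 5));
        System.out.println(dfs(tree, 1));
    }
}
```

For a pure tree the done set is not strictly necessary (no node is reached twice), but keeping it makes the code safe if the state graph turns out to contain cycles.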
