Stopping all threads when one of them is finished - java

The goal is creating search method that returns the index of the needle found first over all search threads. And when one of them is finished, I need to stop all threads.
The logic is: there are 4 threads. First thread checks first %25 of the haystack, second thread checks %25-%50 of the haystack and so on.
I should stop as soon as one of them printed the text but I always get 4 output because all 4 of them finds needle in the haystack. However, I want only one output.
Example Output: (indexes below)
I found, it is: 622
I found, it is: 4072
I found, it is: 7519
I found, it is: 7264
Here is the SearcherThreat class the extends Thread
public class SearcherThread extends Thread {
// PROPERTIES
private int needle;
private int[] haystack;
private int start, end;
// CONSTRUCTOR
public SearcherThread(int needle, int[] haystack, int start, int end) {
this.needle = needle;
this.haystack = haystack;
this.start = start;
this.end = end;
}
#Override
public void run() {
for (int i = start; i < end && !isInterrupted(); ++i) {
if (haystack[i] == needle) {
System.out.println("I found, it is: " + i);
for (SearcherThread searcher : InterruptTest.searchers) {
searcher.interrupt();
}
}
}
}
}
And this is the class that contains main and threads
import java.util.ArrayList;
public class InterruptTest {
public static ArrayList<SearcherThread> searchers = new ArrayList<SearcherThread>();
public static void main(String[] args) throws InterruptedException {
int itemCount = 10000;
int[] haystack = new int[itemCount];
int domainSize = 1000;
for (int i = 0; i < itemCount; ++i)
haystack[i] = (int) (Math.random() * domainSize);
int needle = 10;
int numThreads = 4;
int numItemsPerThread = haystack.length / numThreads;
int extraItems = haystack.length - numItemsPerThread * numThreads;
for (int i = 0, start = 0; i < numThreads; ++i) {
int numItems = (i < extraItems) ? (numItemsPerThread + 1) : numItemsPerThread;
searchers.add(new SearcherThread(needle, haystack, start, start + numItems));
start += numItems;
}
for (SearcherThread searcher : searchers)
searcher.start();
}
}

I got this output:
[stephen#blackbox tmp]$ java InterruptTest
I found, it is: 855
I found, it is: 3051
[stephen#blackbox tmp]$ java InterruptTest
I found, it is: 2875
I found, it is: 5008
I found, it is: 1081
I found, it is: 8527
[stephen#blackbox tmp]$ java InterruptTest
I found, it is: 2653
I found, it is: 5377
I found, it is: 1092
[stephen#blackbox tmp]$ java InterruptTest
I found, it is: 255
I found, it is: 9095
I found, it is: 6983
I found, it is: 3777
As you can see, the number of threads that is completing varies from one run to the next.
What we have here is a race. What is probably happening is that one thread completes and interrupts the other threads before they have been started. So they don't see the interrupt. The javadoc says:
"Interrupting a thread that is not alive need not have any effect."
Another possibility is that the interrupts are not propagating fast enough. Note that the javadoc does not say that an interrupt() is instantly visible to the interrupted thread.
I can't think of a solution to this that doesn't negate the benefits of multi-threading. On the other hand, in a real-world use-case:
You should be using a thread pool ... because thread creation is relatively expensive.
The threads should be doing more work.
If you measured the actual speedup you were getting in your current test, it is possibly negative.
In summary, in a more realistic test you should see the interrupts working most of the time. And that should be good enough. (It should not matter that occasionally threads are not interrupted fast enough to stop them finding secondary results.)

This is necro but this could help someone else out. You can use Java's built in Executors. With Java 8 it has
newCachedThreadPool()
newFixedThreadPool(int nThreads)
newSingleThreadExecutor()
etc
Here's an example of how you could write around your classes. I'm using the newFixedThreadPool(int nThreads) to match what you were doing in your code
import java.util.concurrent.Callable;
public class SearcherThread implements Callable<Object> {
// PROPERTIES
private int needle;
private int[] haystack;
private int start, end;
// CONSTRUCTOR
public SearcherThread(int needle, int[] haystack, int start, int end) {
this.needle = needle;
this.haystack = haystack;
this.start = start;
this.end = end;
}
#Override
public Object call() throws Exception {
for (int i = start; i < end ; ++i) {
if (haystack[i] == needle) {
return i;
}
}
Thread.sleep(5000);
return null;
}
}
And the Test class
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class InterruptTest {
public static ArrayList<SearcherThread> searchers = new ArrayList<SearcherThread>();
public static void main(String[] args) throws InterruptedException, ExecutionException {
int itemCount = 10000;
int[] haystack = new int[itemCount];
int domainSize = 1000;
for (int i = 0; i < itemCount; ++i)
haystack[i] = (int) (Math.random() * domainSize);
int needle = 10;
int numThreads = 4;
int numItemsPerThread = haystack.length / numThreads;
int extraItems = haystack.length - numItemsPerThread * numThreads;
for (int i = 0, start = 0; i < numThreads; ++i) {
int numItems = (i < extraItems) ? (numItemsPerThread + 1) : numItemsPerThread;
searchers.add(new SearcherThread(needle, haystack, start, start + numItems));
start += numItems;
}
//Preferred to use Executors.newWorkStealingPool() instead
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
executor.shutdown();//shutdown when done
Object result = executor.invokeAny(searchers);//<------- right here lookie
System.out.println("I found, it is: "+result);
}
}
ExecutorService.invokeAny(List<Callable>) will run all the callable threads and get value of the first finished thread.

Related

ThreadLocalRandom with shared static Random instance performance comparing test

In our project for one task we used static Random instance for random numbers generation goal. After Java 7 release new ThreadLocalRandom class appeared for generating random numbers.
From spec:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
and also:
When all usages are of this form, it is never possible to accidently share a ThreadLocalRandom across multiple threads.
So I've made my little test:
public class ThreadLocalRandomTest {
private static final int THREAD_COUNT = 100;
private static final int GENERATED_NUMBER_COUNT = 1000;
private static final int INT_RIGHT_BORDER = 5000;
private static final int EXPERIMENTS_COUNT = 5000;
public static void main(String[] args) throws InterruptedException {
System.out.println("Number of threads: " + THREAD_COUNT);
System.out.println("Length of generated numbers chain for each thread: " + GENERATED_NUMBER_COUNT);
System.out.println("Right border integer: " + INT_RIGHT_BORDER);
System.out.println("Count of experiments: " + EXPERIMENTS_COUNT);
int repeats = 0;
int workingTime = 0;
long startTime = 0;
long endTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForSharedRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for shared Random instance: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
repeats = 0;
workingTime = 0;
for (int i = 0; i < EXPERIMENTS_COUNT; i++) {
startTime = System.currentTimeMillis();
repeats += calculateRepeatsForTheadLocalRandom();
endTime = System.currentTimeMillis();
workingTime += endTime - startTime;
}
System.out.println("Average repeats for ThreadLocalRandom: " + repeats / EXPERIMENTS_COUNT
+ ". Average working time: " + workingTime / EXPERIMENTS_COUNT + " ms.");
}
private static int calculateRepeatsForSharedRandom() throws InterruptedException {
final Random rand = new Random();
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
#Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
private static int calculateRepeatsForTheadLocalRandom() throws InterruptedException {
final Map<Integer, Integer> counts = new HashMap<>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
#Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = ThreadLocalRandom.current().nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
thread.start();
thread.join();
}
int repeats = 0;
for (Integer value : counts.values()) {
if (value > 1) {
repeats += value;
}
}
return repeats;
}
}
I've also added test for non-shared Random and got next results:
Number of threads: 100
Length of generated numbers chain for each thread: 100
Right border integer: 5000
Count of experiments: 10000
Average repeats for non-shared Random instance: 8646. Average working time: 13 ms.
Average repeats for shared Random instance: 8646. Average working time: 13 ms.
Average repeats for ThreadLocalRandom: 8646. Average working time: 13 ms.
To me it's little strange, as I expected at least speed increasing when using ThreadLocalRandom comparing to shared Random instance, but see no difference at all.
Can someone explain why it works that way, maybe I haven't done testing properly. Thank you!
You're not running anything in parallel because you're waiting for each thread to finish immediately after starting it. You need a waiting loop outside the loop that starts the threads:
List<Thread> threads = new ArrayList<Thread>();
for (int i = 0; i < THREAD_COUNT; i++) {
Thread thread = new Thread() {
#Override
public void run() {
for (int j = 0; j < GENERATED_NUMBER_COUNT; j++) {
int random = rand.nextInt(INT_RIGHT_BORDER);
if (!counts.containsKey(random)) {
counts.put(random, 0);
}
counts.put(random, counts.get(random) + 1);
}
}
};
threads.add(thread);
thread.start();
}
for (Thread thread: threads) {
thread.join();
}
Your testing code is flawed for one. The bane of benchmarkers everywhere.
thread.start();
thread.join();
why not save LOCs and write
thread.run();
the outcome is the same.
EDIT: If you don't realize the outcome from the above, it means that you're running single threaded tests, there's no multithreading going on.
Maybe it would be easier to just have a look at what actually happens. Here is the source for ThreadLocal.get() which is also called for the ThreadLocalRandom.current().
public T get() {
Thread t = Thread.currentThread();
ThreadLocalMap map = getMap(t);
if (map != null) {
ThreadLocalMap.Entry e = map.getEntry(this);
if (e != null)
return (T)e.value;
}
return setInitialValue();
}
Where ThreadLocalMap is a specialized HashMap-like implementation with optimizations.
So what basically happens is that ThreadLocal holds a map Thread->Object - or in this case Thread->Random - which is then looked up and either returned or created. As this is nothing 'magical', the timing will be equal to a HashMap-lookup + the initial creation overhead of the actual Object to be returned. Since a HashMap lookup (in this optimized case) is linear, the cost for a lookup is k, where k is the calculation cost of the hash function.
So you can make some assumptions:
ThreadLocal will be faster than creating the object each time in each Runnable, unless the creation cost is much smaller than k. So looking up Random is a good thing, putting an int inside might not be so smart.
ThreadLocal will be better than using your own HashMap, as such a generic implementation can be assumed to be equal to k or worse.
ThreadLocal will be slower than using any lookup with a cost < k. Example: store everything in an array first, then do myRandoms[threadID].
But then this assumes that you know which threads will be processing your work in the first place, so this isn't a real candidate for ThreadLocal anyways.

ForkJoinPool creating thouthands of threads

A simple test that demonstrates the problem:
package com.test;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;
public class Main extends RecursiveTask<Long> {
private volatile long start;
private volatile long end;
private volatile int deep;
public Main(long start, long end, int index, int deep) {
this.start = start;
this.end = end;
this.deep = deep;
// System.out.println(deep + "-" + index);
}
#Override
protected Long compute() {
long part = (end - start) / 10;
if (part > 1000 && deep < 10) {
List<RecursiveTask<Long>> subtasks = new ArrayList<RecursiveTask<Long>>();
for (int i = 0; i < 10; i++) {
long subtaskEnd = start + part;
if (i == 9) {
subtaskEnd = end;
}
subtasks.add(new Main(start, subtaskEnd, i, deep + 1));
start = subtaskEnd;
}
//CASE 1: generates 3000+ threads
for (int i = 0; i < 10; i++) {
subtasks.get(i).fork();
}
//CASE 2: generates 4 threads
// invokeAll(subtasks);
//CASE 3: generates 4 threads
// for (int i = 9; i >= 0; i--) {
// subtasks.get(i).fork();
// }
long count = 0;
for (int i = 0; i < 10; i++) {
count += subtasks.get(i).join();
}
return count;
} else {
long startStart = start;
while (start < end) {
start += 1;
}
return start - startStart;
}
}
private static ForkJoinPool executor = new ForkJoinPool();
public static void main(String[] args) throws Exception {
ForkJoinTask<Long> forkJoinTask = executor.submit(new Main(0, Integer.MAX_VALUE / 10, 0, 0));
Long result = forkJoinTask.get();
System.out.println("Final result: " + result);
System.out.println("Number of threads: " + executor.getPoolSize());
}
}
In this sample I create RecursiveTask that simply counts numbers to create some load on CPU. It devides the incoming range in 10 parts recursivly and when the size of the part is less than 1000 or recursion "deepness" is over 10 it starts to count numbers.
There is 3 cases commented in compute() method. Difference is only in the order of forking subtasks. Depending on the order in which I fork subtasks the number of threads in the end is different. On my system it creates 3000+ threads for the first case and 4 threads for the second and third case.
Question is: what's the difference? Do I really need to know internals of this framework to successfully use it?
This is an old problem that I addressed in an article back in 2011, A Java Fork-Join Calamity The article points to part II which shows the fix? for this in Java8 (substitutes stalls instead of extra threads.)
You really can't do much professionally with this framework. There are other frameworks you can use.

[Java]Arrays and Collections test, unexpected outcome?

There are probably hundreds of questions about Java Collections vs. arrays, but this is something I really didn't expect.
I am developing a server for my game, and to communicate between the client and server you need to send packets (obviously), so I did some tests which Collection (or array) I could use best to handle them, HashMap, ArrayList and a PacketHandler array. And the outcome is very unexpected to me, because the ArrayList wins.
The packet handling structure is just like dictionary usage (index to PacketHandler), and because an array is the most primitive form of dictionary use I thought that would easily perform better than an ArrayList. Could someone explain me why this is?
My test
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
void handle();
}
/**
* A dummy packet handler.
*/
private class DummyPacketHandler implements PacketHandler {
#Override
public void handle() { }
}
public Main() {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 255; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
// array
long time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
handlers[r.nextInt(255)].handle();
System.out.println((System.currentTimeMillis() - time));
// hashmap
time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
m.get(new Integer(r.nextInt(255))).handle();
System.out.println((System.currentTimeMillis() - time));
// arraylist
time = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++)
list.get(r.nextInt(255)).handle();
System.out.println((System.currentTimeMillis() - time));
}
public static void main(String[] args) {
new Main();
}
}
I think the problem is quite solved, thanks everybody
The shorter answer is that ArrayList is slightly more optimised the first time, but is still slower in the long run.
How and when the JVM optimise the code before its completely warmed up isn't always obvious and can change between version and based on your command line options.
What is really interesting is what you get when you repeat the test. The reason that makes a difference here is that the code is compiled in stages in the background as you want to have tests where the code is already as fast as it will get right from the start.
There are a few things you can do to make your benchmark more reproducaeable.
generate your random numbers in advance, they are not part of your test but can slow you down.
place each loop in a separate method. The first loop triggers the whole method to be compiled for better or worse.
repeat the test 5 to 10 times and ignore the first one.
Using System.nanoTime() instead of currentTimeMillis() It may not make any difference here but is a good habit to get into.
Use autoboxing where you can so it uses the integer cache or Integer.valueOf(n) which does the same thing. new Integer(n) will always create an object.
make sure your inner loop does something otherwise its quite likely the JIT will optimise it away to nothing. ;)
.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
void handle();
}
/**
* A dummy packet handler.
*/
static class DummyPacketHandler implements PacketHandler {
#Override
public void handle() {
}
}
public static void main(String[] args) {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 256; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
int runs = 10000000;
int[] handlerToUse = new int[runs];
for (int i = 0; i < runs; i++)
handlerToUse[i] = r.nextInt(256);
for (int i = 0; i < 5; i++) {
testArray(handlers, runs, handlerToUse);
testHashMap(m, runs, handlerToUse);
testArrayList(list, runs, handlerToUse);
System.out.println();
}
}
private static void testArray(PacketHandler[] handlers, int runs, int[] handlerToUse) {
// array
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
handlers[handlerToUse[i]].handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
private static void testHashMap(HashMap<Integer, PacketHandler> m, int runs, int[] handlerToUse) {
// hashmap
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
m.get(handlerToUse[i]).handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
private static void testArrayList(ArrayList<PacketHandler> list, int runs, int[] handlerToUse) {
// arraylist
long time = System.nanoTime();
for (int i = 0; i < runs; i++)
list.get(handlerToUse[i]).handle();
System.out.print((System.nanoTime() - time)/1e6+" ");
}
}
prints for array HashMap ArrayList
24.62537 263.185092 24.19565
28.997305 206.956117 23.437585
19.422327 224.894738 21.191718
14.154433 194.014725 16.927638
13.897081 163.383876 16.678818
After the code warms up, the array is marginally faster.
There are at least a few problems with your benchmark:
you run your tests directly in main, meaning that when you main method gets compiled, the JIT compiler has not had time to optimise all the code because it has not run it yet
the map method creates a new integer each time, which is not fair: use m.get(r.nextInt(255)).handle(); to allow the Integer cache to be used
you need to run your test several times before you can draw conclusions
you are not using the result of what you do in your loops and the JIT is therefore allowed to simply ignore them
monitor GC as it might always run at the same time and bias the results of one of your loop and add a System.gc() call between each loop.
But before doing all that, read this post ;-)
After tweaking your code a bit, I get these results:
Array: 116
Map: 139
List: 117
So array and list are close to identical once compiled and map is slightly slower.
Code:
public class Main {
/**
* Packet handler interface.
*/
private interface PacketHandler {
int handle();
}
/**
* A dummy packet handler.
*/
private class DummyPacketHandler implements PacketHandler {
#Override
public int handle() {
return 123;
}
}
public Main() {
Random r = new Random();
PacketHandler[] handlers = new PacketHandler[256];
HashMap<Integer, PacketHandler> m = new HashMap<Integer, PacketHandler>();
ArrayList<PacketHandler> list = new ArrayList<PacketHandler>();
// packet handler initialization
for (int i = 0; i < 255; i++) {
DummyPacketHandler p = new DummyPacketHandler();
handlers[i] = p;
m.put(new Integer(i), p);
list.add(p);
}
long sum = 0;
runArray(handlers, r, 20000);
runMap(m, r, 20000);
runList(list, r, 20000);
// array
long time = System.nanoTime();
sum += runArray(handlers, r, 10000000);
System.out.println("Array: " + (System.nanoTime() - time) / 1000000);
// hashmap
time = System.nanoTime();
sum += runMap(m, r, 10000000);
System.out.println("Map: " + (System.nanoTime() - time) / 1000000);
// arraylist
time = System.nanoTime();
sum += runList(list, r, 10000000);
System.out.println("List: " + (System.nanoTime() - time) / 1000000);
System.out.println(sum);
}
public static void main(String[] args) {
new Main();
}
private long runArray(PacketHandler[] handlers, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += handlers[r.nextInt(255)].handle();
return sum;
}
private long runMap(HashMap<Integer, PacketHandler> m, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += m.get(new Integer(r.nextInt(255))).handle();
return sum;
}
private long runList(List<PacketHandler> list, Random r, int loops) {
long sum = 0;
for (int i = 0; i < loops; i++)
sum += list.get(r.nextInt(255)).handle();
return sum;
}
}

Is there any "threshold" justifying multithreaded computation?

So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
int mostLen = 0;
int mostInt = 0;
long currTime = System.currentTimeMillis();
for(int j=2; j<=1000000; j++) {
long i = j;
int len = 0;
while((i=next(i)) != 1) {
len++;
}
if(len > mostLen) {
mostLen = len;
mostInt = j;
}
}
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if(i%2==0) {
return i/2;
} else {
return i*3+1;
}
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
long currTime = System.currentTimeMillis();
List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
for (int j = 2; j <= 1000000; j++) {
MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
worker.setBean(new ValueBean(j, 0));
Future<ValueBean> f = executor.submit(worker);
list.add(f);
}
System.out.println(System.currentTimeMillis() - currTime);
int mostLen = 0;
int mostInt = 0;
for (Future<ValueBean> f : list) {
final int len = f.get().getLen();
if (len > mostLen) {
mostLen = len;
mostInt = f.get().getNum();
}
}
executor.shutdown();
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
public ValueBean bean;
public void setBean(ValueBean bean) {
this.bean = bean;
}
public ValueBean call() throws Exception {
long i = bean.getNum();
int len = 0;
while ((i = next(i)) != 1) {
len++;
}
return new ValueBean(bean.getNum(), len);
}
}
public class ValueBean {
int num;
int len;
public ValueBean(int num, int len) {
this.num = num;
this.len = len;
}
public int getNum() {
return num;
}
public int getLen() {
return len;
}
}
long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
Unfortunately, the multithreaded version worked 5 times slower than the single-threaded on 4 processors (cores).
Then I tried a bit more crude approach:
static int mostLen = 0;
static int mostInt = 0;
synchronized static void updateIfMore(int len, int intgr) {
if (len > mostLen) {
mostLen = len;
mostInt = intgr;
}
}
public static void main(String[] args) throws InterruptedException {
long currTime = System.currentTimeMillis();
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
for (int i = 2; i <= 1000000; i++) {
final int j = i;
executor.execute(new Runnable() {
public void run() {
long l = j;
int len = 0;
while ((l = next(l)) != 1) {
len++;
}
updateIfMore(len, j);
}
});
}
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
and it worked much faster, but still it was slower than the single thread approach.
I hope it's not because I screwed up the way I'm doing multithreading, but rather this particular calculation/algorithm is not a good fit for parallel computation. If I change calculation to make it more processor intensive by replacing method next with:
long next(long i) {
Random r = new Random();
for(int j=0; j<10; j++) {
r.nextLong();
}
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
both multithreaded versions start to execute more than twice as fast than the singlethreaded version on a 4 core machine.
So clearly there must be some threshold that you can use to determine if it is worth to introduce multithreading and my question is:
What is the basic rule that would help decide if a given calculation is intensive enough to be optimized by running it in parallel (without spending effort to actually implement it?)
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
Starting and stopping threads has a high cost. Of course you already used the executor service which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one has to do will improve performance, but you still want to have enough runnables for the executor service to efficiently distribute them over the worker threads.
You have choosen to create one runnable for each starting value so you end up creating 1000000 runnables. You would probably be getting much better results of you let each Runnable do a batch of say 1000 start values. Which means you only need 1000 runnables greatly reducing the overhead.
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading because some of the later values can be computed faster if you have the previous results already in hand. Take, for example j == 4. Once through the inner loop produces i == 2, but you just computed the result for j == 2 two iterations ago, if you saved the value of len you can compute it as len(4) = 1 + len(2).
Using an array to store previously computed values of len and a little bit twiddling in the next method, you can complete the task >50x faster.
"Will the performance gain be greater than the cost of context switching and thread creation?"
That is a very OS, language, and hardware, dependent cost; this question has some discussion about the cost in Java, but has some numbers and some pointers to how to calculate the cost.
You also want to have one thread per CPU, or less, for CPU intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate amount of work which a thread can do without interaction with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, overhead is too much and multithreading is of no use. If it is 1 millisecond or more, multithreading should work well. If it is in between, experimental testing required.

What's the best way to perform DFS on a very large tree?

Here's the situation:
The application world consists of hundreds of thousands of states.
Given a state, I can work out a set of 3 or 4 other reachable states. A simple recursion can build a tree of states that gets very large very fast.
I need to perform a DFS to a specific depth in this tree from the root state, to search for the subtree which contains the 'minimal' state (calculating the value of the node is irrelevant to the question).
Using a single thread to perform the DFS works, but is very slow. Covering 15 levels down can take a few good minutes, and I need to improve this atrocious performance. Trying to assign a thread to each subtree created too many threads and caused an OutOfMemoryError. Using a ThreadPoolExecutor wasn't much better.
My question: What's the most efficient way to traverse this large tree?
I don't believe navigating the tree is your problem as your tree has about 36 million nodes. Instead is it more likely what you are doing with each node is expensive.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;
public class Main {
public static final int TOP_LEVELS = 2;
enum BuySell {}
private static final AtomicLong called = new AtomicLong();
public static void main(String... args) throws InterruptedException {
int maxLevels = 15;
long start = System.nanoTime();
method(maxLevels);
long time = System.nanoTime() - start;
System.out.printf("Took %.3f second to navigate %,d levels called %,d times%n", time / 1e9, maxLevels, called.longValue());
}
public static void method(int maxLevels) throws InterruptedException {
ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
try {
int result = method(service, 0, maxLevels - 1, new int[maxLevels]).call();
} catch (Exception e) {
e.printStackTrace();
}
service.shutdown();
service.awaitTermination(10, TimeUnit.MINUTES);
}
// single threaded process the highest levels of the tree.
private static Callable<Integer> method(final ExecutorService service, final int level, final int maxLevel, final int[] options) {
int choices = level % 2 == 0 ? 3 : 4;
final List<Callable<Integer>> callables = new ArrayList<Callable<Integer>>(choices);
for (int i = 0; i < choices; i++) {
options[level] = i;
Callable<Integer> callable = level < TOP_LEVELS ?
method(service, level + 1, maxLevel, options) :
method1(service, level + 1, maxLevel, options);
callables.add(callable);
}
return new Callable<Integer>() {
#Override
public Integer call() throws Exception {
Integer min = Integer.MAX_VALUE;
for (Callable<Integer> result : callables) {
Integer num = result.call();
if (min > num)
min = num;
}
return min;
}
};
}
// at this level, process the branches in separate threads.
private static Callable<Integer> method1(final ExecutorService service, final int level, final int maxLevel, final int[] options) {
int choices = level % 2 == 0 ? 3 : 4;
final List<Future<Integer>> futures = new ArrayList<Future<Integer>>(choices);
for (int i = 0; i < choices; i++) {
options[level] = i;
final int[] optionsCopy = options.clone();
Future<Integer> future = service.submit(new Callable<Integer>() {
#Override
public Integer call() {
return method2(level + 1, maxLevel, optionsCopy);
}
});
futures.add(future);
}
return new Callable<Integer>() {
#Override
public Integer call() throws Exception {
Integer min = Integer.MAX_VALUE;
for (Future<Integer> result : futures) {
Integer num = result.get();
if (min > num)
min = num;
}
return min;
}
};
}
// at these levels each task processes in its own thread.
private static int method2(int level, int maxLevel, int[] options) {
if (level == maxLevel) {
return process(options);
}
int choices = level % 2 == 0 ? 3 : 4;
int min = Integer.MAX_VALUE;
for (int i = 0; i < choices; i++) {
options[level] = i;
int n = method2(level + 1, maxLevel, options);
if (min > n)
min = n;
}
return min;
}
private static int process(final int[] options) {
int min = options[0];
for (int i : options)
if (min > i)
min = i;
called.incrementAndGet();
return min;
}
}
prints
Took 1.273 second to navigate 15 levels called 35,831,808 times
I suggest you limit the number of threads and only use separate threads for the highest levels of the tree. How many cores do you have? Once you have enough threads to keep every core busy, you don't need to create more threads as this just adds overhead.
Java has a built in Stack which thread safe, however I would just use ArrayList which is more efficient.
You will definitely have to use an iterative method. Simplest way is a stack based DFS with a pseudo code similar to this:
STACK.push(root)
while (STACK.nonempty)
current = STACK.pop
if (current.done) continue
// ... do something with node ...
current.done = true
FOREACH (neighbor n of current)
if (! n.done )
STACK.push(n)
The time complexity of this is O(n+m) where n (m) denotes the number of nodes (edges) in your graph. Since you have a tree this is O(n) and should work quickly for n>1.000.000 easily...

Categories

Resources