Just wondering if anyone would be able to take a look at this code for implementing the quicksort algorithm and answer me a few questions, please :-)
public class Run
{
    /***************************************************************************
     * Quicksort code from Sedgewick 7.1, 7.2.
     **************************************************************************/
    public static void quicksort(double[] a)
    {
        //shuffle(a); // to guard against worst-case
        quicksort(a, 0, a.length - 1, 0);
    }

    static void quicksort(final double[] a, final int left, final int right, final int tdepth)
    {
        if (right <= left)
            return;
        final int i = partition(a, left, right);
        if ((tdepth < 4) && ((i - left) > 1000))
        {
            final Thread t = new Thread()
            {
                public void run()
                {
                    quicksort(a, left, i - 1, tdepth + 1);
                }
            };
            t.start();
            quicksort(a, i + 1, right, tdepth + 1);
            try
            {
                t.join();
            }
            catch (InterruptedException e)
            {
                throw new RuntimeException("Cancelled", e);
            }
        } else
        {
            quicksort(a, left, i - 1, tdepth);
            quicksort(a, i + 1, right, tdepth);
        }
    }

    // partition a[left] to a[right], assumes left < right
    private static int partition(double[] a, int left, int right)
    {
        int i = left - 1;
        int j = right;
        while (true)
        {
            while (less(a[++i], a[right]))   // find item on left to swap
                ;                            // a[right] acts as sentinel
            while (less(a[right], a[--j]))   // find item on right to swap
                if (j == left)
                    break;                   // don't go out-of-bounds
            if (i >= j)
                break;                       // check if pointers cross
            exch(a, i, j);                   // swap two elements into place
        }
        exch(a, i, right);                   // swap with partition element
        return i;
    }

    // is x < y ?
    private static boolean less(double x, double y)
    {
        return (x < y);
    }

    // exchange a[i] and a[j]
    private static void exch(double[] a, int i, int j)
    {
        double swap = a[i];
        a[i] = a[j];
        a[j] = swap;
    }

    // shuffle the array a[]
    private static void shuffle(double[] a)
    {
        int N = a.length;
        for (int i = 0; i < N; i++)
        {
            int r = i + (int) (Math.random() * (N - i)); // between i and N-1
            exch(a, i, r);
        }
    }

    // test client
    public static void main(String[] args)
    {
        int N = 5000000; // Integer.parseInt(args[0]);

        // generate N random real numbers between 0 and 1
        long start = System.currentTimeMillis();
        double[] a = new double[N];
        for (int i = 0; i < N; i++)
            a[i] = Math.random();
        long stop = System.currentTimeMillis();
        double elapsed = (stop - start) / 1000.0;
        System.out.println("Generating input: " + elapsed + " seconds");

        // sort them
        start = System.currentTimeMillis();
        quicksort(a);
        stop = System.currentTimeMillis();
        elapsed = (stop - start) / 1000.0;
        System.out.println("Quicksort: " + elapsed + " seconds");
    }
}
My questions are:
What is the purpose of the variable tdepth?
Is this considered a "proper" implementation of a parallel quicksort? I ask becuase it doesn't use implements Runnable or extends Thread...
If it doesn't already, is it possible to modify this code to use multiple threads? By passing in the number of threads you want to use as a parameter, for example...?
Many thanks,
Brian
1. It's used to keep track of recursion depth. This is checked to decide whether to run in parallel. Notice how when the function runs in parallel it passes tdepth + 1 (which becomes tdepth in the called quicksort's parameters). This is a basic way of avoiding too many parallel threads.
2. Yes, it's definitely using another thread. The code:
new Thread()
{
    public void run()
    {
        quicksort(a, left, i - 1, tdepth + 1);
    }
};
creates an anonymous inner class (which extends Thread), which is then started.
Apparently, tdepth is used to avoid creating too many threads
It uses an anonymous class, which implicitly extends Thread
It does that already (see point 1.)
tdepth is there so that there's an upper bound on the number of threads created. Note that every time the method calls itself recursively in a new thread, tdepth is incremented by one. This way, only the first four levels of recursion will create new threads, presumably to prevent overloading the OS with many threads for little benefit.
This code launches its own threads in the definition of the quicksort method, so it will use parallel processing. One might argue that it could do with some kind of thread management and that e.g. some kind of Executor might be better, but it is definitely parallel. See the call to new Thread() ... followed by start(). Incidentally, the call to t.join() will cause the current thread to wait for the thread t to finish, in case you weren't aware of that.
This code already uses multiple threads, but you can tweak how many it spawns via the comparison on tdepth; increasing or decreasing that value determines how many levels of recursion create new threads. You could completely rewrite the code to use executors and thread pools, or perhaps perform ternary recursion instead of binary, but in the sense you asked, no: there's no simple way to pass in an exact number of threads.
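To make that last point concrete: since each parallel level of recursion can roughly double the number of running threads, a depth cap and a thread count are two views of the same limit. A rough sketch of the conversion (my own addition, not part of the code above; depthFor is a hypothetical helper):

public final class DepthLimit {
    // depthFor(8) == 3, so a check like "tdepth < 3" allows roughly 8 concurrent sorts
    public static int depthFor(int desiredThreads) {
        int depth = 0;
        while ((1 << (depth + 1)) <= desiredThreads)
            depth++;
        return depth;
    }
}

The recursive quicksort would then compare tdepth against this value (passed down as an extra parameter or stored in a field) instead of the hard-coded 4, and you could feed it Runtime.getRuntime().availableProcessors() to match the machine.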
I did actually write a (correctly) multi-threaded QuickSort in Java, so maybe I can help a bit...
Question here for anyone interested:
Multithreaded quicksort or mergesort
What is the purpose of the variable tdepth?
As others have commented, it serves to determine whether or not to create new threads.
Is this considered a "proper" implementation of a parallel quicksort? I ask because it doesn't use implements Runnable or extends Thread...
I don't think it's that proper, for several reasons. First, you should make it CPU dependent: there's no point in spawning 16 threads on a CPU that has just one core; a single-threaded QuickSort will outperform the multi-threaded one on a single-core machine. On a 16-core machine, sure, fire up to 16 threads.
Runtime.getRuntime().availableProcessors()
The second reason I really don't like it is that it uses last-century, low-level, Java-specific threading details: I prefer to stay away from .join() and use higher-level constructs (see fork/join in the other question, or something like CountDownLatches, etc.). The problem with low-level things like Java's thread join is that they carry little useful meaning: they are 100% Java specific and can be replaced by higher-level threading facilities whose concepts are portable across languages.
Also, don't comment out the shuffle at the beginning. Ever. I've seen datasets where QuickSort degrades quadratically if you remove that shuffle, and it's just an O(n) shuffle; it won't slow down your sort :)
If it doesn't already, is it possible to modify this code to use multiple threads? By passing in the number of threads you want to use as a parameter, for example...?
I'd try to write and/or reuse an implementation using higher-level concurrency facilities. See the advice in the question I asked here some time ago.
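To give a rough idea of what "higher-level" could look like here, below is a minimal, untested sketch built on Java 7's fork/join framework; the class name, the THRESHOLD value, and the simple Lomuto-style partition are mine, not the original code:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelQuicksort extends RecursiveAction {
    private static final int THRESHOLD = 1000;   // below this, just sort sequentially
    private final double[] a;
    private final int left, right;

    ParallelQuicksort(double[] a, int left, int right) {
        this.a = a;
        this.left = left;
        this.right = right;
    }

    protected void compute() {
        if (right - left < THRESHOLD) {
            java.util.Arrays.sort(a, left, right + 1);   // sequential fallback
            return;
        }
        int i = partition(a, left, right);
        // invokeAll forks one half, runs the other, then waits for both
        invokeAll(new ParallelQuicksort(a, left, i - 1),
                  new ParallelQuicksort(a, i + 1, right));
    }

    // simple Lomuto partition, just for the sketch
    private static int partition(double[] a, int left, int right) {
        double pivot = a[right];
        int i = left - 1;
        for (int j = left; j < right; j++) {
            if (a[j] < pivot) {
                i++;
                double t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }
        double t = a[i + 1]; a[i + 1] = a[right]; a[right] = t;
        return i + 1;
    }

    public static void sort(double[] a) {
        new ForkJoinPool().invoke(new ParallelQuicksort(a, 0, a.length - 1));
    }
}

The pool sizes itself to the number of cores by default, and its work-stealing scheduler handles the "how many threads" question for you.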
Related
I have a program that applies median filtering to elements in an array. I'm using divide-and-conquer with Java's Fork/Join Framework to accomplish this. Here's my compute() override:
public FilterObject(int lo, int hi, boolean filterType)
{
    this.lo = lo;
    this.hi = hi;
    this.filterType = filterType;
}

public void compute()
{
    if (hi - lo <= SEQUENTIAL_THRESHOLD)
    {
        seqFilter();
    }
    else
    {
        FilterObject left = new FilterObject(lo, (hi + lo) / 2, filterType);
        FilterObject right = new FilterObject((hi + lo) / 2, hi, filterType);
        left.fork();
        right.compute();
        left.join();
    }
}
seqFilter() is a static method that does the actual filtering once a subarray is "small enough".
Is my join() in the right place? I have a timer running in my Main class and the recorded times I'm getting seem way too fast. I'm calling pool.invoke(filt) from my Main class, where filt is my FilterObject object and pool is my ForkJoinPool. My timer stops immediately after this call. Is it possible that the main logic is continuing on before the parallel processes have completed? If it is, where can I put join() to stop this from happening?
EDIT: Additional question - do I even need the join() in this case? The program forks, but it's not like I actually 'join' my forked objects together again because the recursive split index references just get and set from the same arrays.
Since fork() is called within compute(), how come another degree of parallelism isn't added every time compute() runs? Is there a boolean flag, perhaps?
EDIT:
overriding the method compute() of the class RecursiveTask:
(pseudocode)
if (array.length < 100)
    do it sequentially
else {
    divide array in 2 (leftArray, rightArray);
    leftArray.fork();
    int righta = rightArray.compute();
    int lefta = (Integer) leftArray.join();
    return righta + lefta;
}
So basically this is the compute() method, which gets called recursively; when fork() happens, it makes it possible to process that task in parallel on another core. However, being recursive, fork() should be called every time the method calls itself. In reality that doesn't seem to happen (it would make no sense). Is it due to a boolean flag that says fork has already been activated?
Thanks in advance.
Look at the API
class Fibonacci extends RecursiveTask<Integer> {
    final int n;

    Fibonacci(int n) { this.n = n; }

    Integer compute() {
        if (n <= 1)
            return n;
        Fibonacci f1 = new Fibonacci(n - 1);
        f1.fork();
        Fibonacci f2 = new Fibonacci(n - 2);
        return f2.compute() + f1.join();
    }
}
Each time compute() is called, it places another computation on another thread's work queue via fork(). compute() keeps forking until n is small enough that there is nothing left to split. At that point the current thread computes the "right" side itself (f2.compute()), while f1.join() waits for the forked "left" side to finish.
Whenever join() is invoked, it may actually make the joining thread execute other pending tasks (lower on the binary tree) instead of just blocking, which gives you the parallelism you want.
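For completeness, here is roughly how a task like the Fibonacci example above gets kicked off; this is just a usage sketch, assuming the Fibonacci class from the API snippet is in the same package:

import java.util.concurrent.ForkJoinPool;

public class FibDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();        // defaults to one worker per core
        int result = pool.invoke(new Fibonacci(10));   // blocks until the whole task tree is done
        System.out.println("fib(10) = " + result);     // prints 55
    }
}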
I've got a little bit of work that is easily parallelizable, and I want to use Java threads to split up the work across my four core machine. It's a genetic algorithm applied to the traveling salesman problem. It doesn't sound easily parallelizable, but the first loop is very easily so. The second part where I talk about the actual evolution may or may not be, but I want to know if I'm getting slow down because of the way I'm implementing threading, or if its the algorithm itself.
Also, if anyone has better ideas on how I should be implementing what I'm trying to do, that would be very much appreciated.
In main(), I have this:
final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(numThreads*numIter);
ThreadPoolExecutor tpool = new ThreadPoolExecutor(numThreads, numThreads, 10, TimeUnit.SECONDS, queue);
barrier = new CyclicBarrier(numThreads);
k.init(tpool);
I have a loop that is done inside of init() and looks like this:
for (int i = 0; i < numCities; i++) {
    x[i] = rand.nextInt(width);
    y[i] = rand.nextInt(height);
}
That I changed to this:
int errorCities = 0, stepCities = 0;
stepCities = numCities / numThreads;
errorCities = numCities - stepCities * numThreads;

// Split up work, assign to threads
for (int i = 1; i <= numThreads; i++) {
    int startCities = (i - 1) * stepCities;
    int endCities = startCities + stepCities;
    // This is a bit messy...
    if (i <= numThreads) endCities += errorCities;
    tpool.execute(new citySetupThread(startCities, endCities));
}
And here is citySetupThread() class:
public class citySetupThread implements Runnable {
    int start, end;

    public citySetupThread(int s, int e) {
        start = s;
        end = e;
    }

    public void run() {
        for (int j = start; j < end; j++) {
            x[j] = ThreadLocalRandom.current().nextInt(0, width);
            y[j] = ThreadLocalRandom.current().nextInt(0, height);
        }
        try {
            barrier.await();
        } catch (InterruptedException ie) {
            return;
        } catch (BrokenBarrierException bbe) {
            return;
        }
    }
}
The above code is run once in the program, so it was sort of a test case for my threading constructs (this is my first experience with Java threads). I implemented the same sort of thing in a real critical section, specifically the evolution part of the genetic algorithm, whose class is as follows:
public class evolveThread implements Runnable {
    int start, end;

    public evolveThread(int s, int e) {
        start = s;
        end = e;
    }

    public void run() {
        // Get midpoint
        int n = population.length / 2, m;
        for (m = start; m > end; m--) {
            int i, j;
            i = ThreadLocalRandom.current().nextInt(0, n);
            do {
                j = ThreadLocalRandom.current().nextInt(0, n);
            } while (i == j);
            population[m].crossover(population[i], population[j]);
            population[m].mutate(numCities);
        }
        try {
            barrier.await();
        } catch (InterruptedException ie) {
            return;
        } catch (BrokenBarrierException bbe) {
            return;
        }
    }
}
Which exists in a function evolve() that is called in init() like so:
for (int p = 0; p < numIter; p++) evolve(p, tpool);
Yes I know that's not terribly good design, but for other reasons I'm stuck with it. Inside of evolve is the relevant parts, shown here:
// Threaded inner loop
int startEvolve = popSize - 1,
    endEvolve = (popSize - 1) - (popSize - 1) / numThreads;

// Split up work, assign to threads
for (int i = 0; i < numThreads; i++) {
    endEvolve = (popSize - 1) - (popSize - 1) * (i + 1) / numThreads + 1;
    tpool.execute(new evolveThread(startEvolve, endEvolve));
    startEvolve = endEvolve;
}

// Wait for our comrades
try {
    barrier.await();
} catch (InterruptedException ie) {
    return;
} catch (BrokenBarrierException bbe) {
    return;
}

population[1].crossover(population[0], population[1]);
population[1].mutate(numCities);
population[0].mutate(numCities);

// Pick out the strongest
Arrays.sort(population, population[0]);
current = population[0];
generation++;
What I really want to know is this:
What role does the "queue" have? Am I right to create a queue for as many jobs as I think will be executed for all threads in the pool? If the size isn't sufficiently large, I get RejectedExecutionExceptions. I just decided to do numThreads*numIter because that's how many jobs there would be (for the actual evolution method that I mentioned earlier). It's weird though... I shouldn't have to do this if the barrier.await()s were working, which leads me to...
Am I using the barrier.await() correctly? Currently I have it in two places: inside the run() method for the Runnable object, and after the for loop that executes all the jobs. I would've thought only one would be required, but I get errors if I remove one or the other.
I'm suspicious of contention for the threads, as that is the only thing I can glean from the absurd slowdown (which does scale with the input parameters). I want to know if it is anything to do with how I'm implementing the thread pool and barriers. If not, then I'll have to look inside the crossover() and mutate() methods, I suppose.
First, I think you may have a bug in how you intended to use the CyclicBarrier. Currently you are initializing it with the number of executor threads as the number of parties. You have an additional party, however: the main thread. So I think you need to do:
barrier = new CyclicBarrier(numThreads + 1);
I think this should work, but personally I find it an odd use of the barrier.
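For what it's worth, the "numThreads + 1 parties" arrangement might look something like this; the names here are illustrative, not taken from your code:

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BarrierSketch {
    public static void main(String[] args) throws InterruptedException, BrokenBarrierException {
        final int numThreads = 4;
        final CyclicBarrier barrier = new CyclicBarrier(numThreads + 1);   // workers + main thread
        ExecutorService tpool = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) {
            tpool.execute(new Runnable() {
                public void run() {
                    // ... do one slice of the work ...
                    try {
                        barrier.await();              // this worker is done
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (BrokenBarrierException e) {
                        // give up on this round
                    }
                }
            });
        }
        barrier.await();    // main thread blocks here until all workers have arrived
        tpool.shutdown();
    }
}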
When using a worker-queue thread-pool model I find it easier to use a Semaphore or Java's Future model.
For a semaphore:
class MyRunnable implements Runnable {
    private final Semaphore sem;

    public MyRunnable(Semaphore sem) {
        this.sem = sem;
    }

    public void run() {
        // do work
        // signal complete
        sem.release();
    }
}
Then in your main thread:
Semaphore sem = new Semaphore(0);
for (int i = 0; i < numJobs; ++i) {
    threadPool.execute(new MyRunnable(sem));
}
sem.acquire(numJobs);
It's really doing the same thing as the barrier, but I find it easier to think about the worker tasks "signaling" that they are done instead of "syncing up" with the main thread again.
For example, if you look at the example code in the CyclicBarrier JavaDoc, the call to barrier.await() is inside the loop inside the worker. So it is really syncing up multiple long-running worker threads, and the main thread is not participating in the barrier. Calling barrier.await() at the end of the worker, outside the loop, is more like signaling completion.
As you increase the number of tasks, you increase the overhead that each task adds. This means you want to minimise the number of tasks, i.e. use about the same number as the number of CPUs you have. For some workloads, using double the number of CPUs can be better when the work is not evenly distributed.
BTW: you don't need a barrier in each task; you can wait for each task's Future to complete by calling get() on it.
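A minimal sketch of that Future-based approach (variable names are mine; the work itself is elided):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSketch {
    public static void main(String[] args) throws Exception {
        int numThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService tpool = Executors.newFixedThreadPool(numThreads);
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int i = 0; i < numThreads; i++) {
            futures.add(tpool.submit(new Runnable() {
                public void run() {
                    // ... one slice of the work ...
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get();    // blocks until that particular task has completed
        }
        tpool.shutdown();
    }
}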
Issue
When benchmarking a simple QuickSort implementation in Java, I faced unexpected humps in the n vs time graphics I was plotting:
I know HotSpot will attempt to compile code to native after it sees that certain methods are being heavily used, so I ran the JVM with -XX:+PrintCompilation. After repeated trials, it seems to be compiling the algorithm's methods always in the same way:
# iteration 6 -> sorting.QuickSort::swap (15 bytes)
# iteration 7 -> sorting.QuickSort::partition (66 bytes)
# iteration 7 -> sorting.QuickSort::quickSort (29 bytes)
I am repeating the above graphic with this added info, just to make things a bit clearer:
At this point, we must all be asking ourselves : why are we still getting those ugly humps AFTER the code is compiled? Maybe it has something to do with the algorithm itself? It sure could be, and luckily for us there's a quick way to sort that out, with -XX:CompileThreshold=0:
Bummer! It really must be something the JVM is doing in the background. But what?
I theorized that although code is being compiled, it may take a while until the compiled code actually starts to be used. Maybe adding a couple of Thread.sleep()s here and there could help us a bit sorting this issue out?
Ouch! The green colored function is the QuickSort code run with a 1000 ms interval between each run (details in the appendix), while the blue colored function is our old one (just for comparison).
At first, giving time to HotSpot only seems to make matters worse! Maybe it only seems worse because of some other factor, such as caching issues?
Disclaimer : I am running 1000 trials for each point of the shown graphics, and using System.nanoTime() to measure the results.
EDIT
Some of you may at this stage wonder how the use of sleep() might distort the results. I ran the Red Plot (no native compilation) again, now with the sleeps in-between:
Scary!
Appendix
Here I present the QuickSort code I am using, just in case:
public class QuickSort {

    public <T extends Comparable<T>> void sort(int[] table) {
        quickSort(table, 0, table.length - 1);
    }

    private static <T extends Comparable<T>> void quickSort(int[] table,
            int first, int last) {
        if (first < last) { // There is data to be sorted.
            // Partition the table.
            int pivotIndex = partition(table, first, last);
            // Sort the left half.
            quickSort(table, first, pivotIndex - 1);
            // Sort the right half.
            quickSort(table, pivotIndex + 1, last);
        }
    }

    /**
     * @author http://en.wikipedia.org/wiki/Quick_Sort
     */
    private static <T extends Comparable<T>> int partition(int[] table,
            int first, int last) {
        int pivotIndex = (first + last) / 2;
        int pivotValue = table[pivotIndex];
        swap(table, pivotIndex, last);
        int storeIndex = first;
        for (int i = first; i < last; i++) {
            if (table[i] - (pivotValue) <= 0) {
                swap(table, i, storeIndex);
                storeIndex++;
            }
        }
        swap(table, storeIndex, last);
        return storeIndex;
    }

    private static <T> void swap(int[] a, int i, int j) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}
as well the code I am using to run my benchmarks:
public static void main(String[] args) throws InterruptedException, IOException {
    QuickSort quickSort = new QuickSort();
    int TRIALS = 1000;

    File file = new File(Long.toString(System.currentTimeMillis()));
    System.out.println("Saving # \"" + file.getAbsolutePath() + "\"");

    for (int x = 0; x < 30; ++x) {
        // if (x > 4 && x < 17)
        //     Thread.sleep(1000);

        int[] values = new int[x];

        long start = System.nanoTime();
        for (int i = 0; i < TRIALS; ++i)
            quickSort.sort(values);
        double duration = (System.nanoTime() - start) / TRIALS;

        String line = x + "\t" + duration;
        System.out.println(line);
        FileUtils.writeStringToFile(file, line + "\r\n", true);
    }
}
Well, it seems that I sorted the issue out on my own.
I was right about the idea that compiled code could take a while to kick in. The problem was a flaw in the way I actually implemented my benchmarking code:
if (x > 4 && x < 17)
    Thread.sleep(1000);
in here I assumed that as the only "affected" area would be between 4 and 17, I could go on and just do a sleep over those values. This is simply not so. The following plot may be enlightening:
Here I am comparing the original no-compilation function (red) to another no-compilation function, but with sleeps in between. As you can see, they run at different orders of magnitude, meaning that mixing results from code with and without sleeps yields unsound results, as I was guilty of doing.
The original question remains unanswered, though. What causes the humps to occur even after compilation has taken place? Let's try to find that out, putting a 1 s sleep at ALL points taken:
That yields the expected result. The odd humps were happening because the native code still hadn't kicked in.
Comparing a sleep-50ms function with a sleep-1000ms function yields, yet again, the expected result:
(the gray one seems to still show a bit of delay)
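For anyone reproducing this, the usual way to avoid this kind of artifact is to warm the JIT up with untimed runs before measuring anything at all. A rough sketch, reusing the QuickSort class from the appendix (the iteration counts are guesses, not tuned values):

public class WarmupDemo {
    public static void main(String[] args) {
        QuickSort quickSort = new QuickSort();   // the class from the appendix above
        final int TRIALS = 1000;

        // Untimed warm-up so HotSpot can compile sort/partition/swap first.
        int[] warmup = new int[30];
        for (int i = 0; i < 20000; i++)
            quickSort.sort(warmup);

        // Only now start timing.
        int[] values = new int[30];
        long start = System.nanoTime();
        for (int i = 0; i < TRIALS; i++)
            quickSort.sort(values);
        System.out.println("ns per sort: " + (System.nanoTime() - start) / TRIALS);
    }
}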
There are certain algorithms whose running time can decrease significantly when one divides up a task and gets each part done in parallel. One of these algorithms is merge sort, where a list is divided into ever smaller parts and then recombined in sorted order. I decided to do an experiment to test whether or not I could increase the speed of this sort by using multiple threads. I am running the following functions in Java on a quad-core Dell with Windows Vista.
One function (the control case) is simply recursive:
// x is an array of N elements in random order
public int[] mergeSort(int[] x) {
    if (x.length == 1)
        return x;

    // Dividing the array in half
    int[] a = new int[x.length / 2];
    int[] b = new int[x.length / 2 + ((x.length % 2 == 1) ? 1 : 0)];
    for (int i = 0; i < x.length / 2; i++)
        a[i] = x[i];
    for (int i = 0; i < x.length / 2 + ((x.length % 2 == 1) ? 1 : 0); i++)
        b[i] = x[i + x.length / 2];

    // Sending them off to continue being divided
    mergeSort(a);
    mergeSort(b);

    // Recombining the two arrays
    int ia = 0, ib = 0, i = 0;
    while (ia != a.length || ib != b.length) {
        if (ia == a.length) {
            x[i] = b[ib];
            ib++;
        }
        else if (ib == b.length) {
            x[i] = a[ia];
            ia++;
        }
        else if (a[ia] < b[ib]) {
            x[i] = a[ia];
            ia++;
        }
        else {
            x[i] = b[ib];
            ib++;
        }
        i++;
    }
    return x;
}
The other is in the run() method of a class that extends Thread, and recursively creates two new threads each time it is called:
public class Merger extends Thread
{
    int[] x;
    boolean finished;

    public Merger(int[] x)
    {
        this.x = x;
    }

    public void run()
    {
        if (x.length == 1) {
            finished = true;
            return;
        }

        // Divide the array in half
        int[] a = new int[x.length / 2];
        int[] b = new int[x.length / 2 + ((x.length % 2 == 1) ? 1 : 0)];
        for (int i = 0; i < x.length / 2; i++)
            a[i] = x[i];
        for (int i = 0; i < x.length / 2 + ((x.length % 2 == 1) ? 1 : 0); i++)
            b[i] = x[i + x.length / 2];

        // Begin two threads to continue to divide the array
        Merger ma = new Merger(a);
        ma.run();
        Merger mb = new Merger(b);
        mb.run();

        // Wait for the two other threads to finish
        while (!ma.finished || !mb.finished) ;

        // Recombine the two arrays
        int ia = 0, ib = 0, i = 0;
        while (ia != a.length || ib != b.length) {
            if (ia == a.length) {
                x[i] = b[ib];
                ib++;
            }
            else if (ib == b.length) {
                x[i] = a[ia];
                ia++;
            }
            else if (a[ia] < b[ib]) {
                x[i] = a[ia];
                ia++;
            }
            else {
                x[i] = b[ib];
                ib++;
            }
            i++;
        }
        finished = true;
    }
}
It turns out that the function that does not use multithreading actually runs faster. Why? Do the operating system and the Java virtual machine not "communicate" effectively enough to place the different threads on different cores? Or am I missing something obvious?
The problem is not multi-threading: I've written a correctly multi-threaded QuickSort in Java and it beats the default Java sort. I did this after witnessing a gigantic dataset being processed with only one core of a 16-core machine working.
One of your issue (a huge one) is that you're busy looping:
// Wait for the two other threads to finish
while(!ma.finished || !mb.finished) ;
This is a HUGE no-no: it is called busy looping and it is destroying your performance.
(Another issue is that your code is not spawning any new threads, as it has already been pointed out to you)
You need to use another way to synchronize: an example would be to use a CountDownLatch.
Another thing: there's no need to spawn two new threads when you divide the workload: spawn only one new thread, and do the other half in the current thread.
Also, you probably don't want to create more threads than there are cores available.
See my question here (asking for a good Open Source multithreaded mergesort/quicksort/whatever). The one I'm using is proprietary, I can't paste it.
Multithreaded quicksort or mergesort
I haven't implemented merge sort, but I have implemented QuickSort, and I can tell you that there's no array copying going on.
What I do is this:
pick a pivot
exchange values as needed
have we reached the thread limit? (depending on the number of cores)
yes: sort first part in this thread
no: spawn a new thread
sort second part in current thread
wait for first part to finish if it's not done yet (using a CountDownLatch).
The code spawning a new thread and creating the CountDownLatch may look like this:
final CountDownLatch cdl = new CountDownLatch( 1 );
final Thread t = new Thread( new Runnable() {
    public void run() {
        quicksort( a, i + 1, r );
        cdl.countDown();
    }
} );
t.start();
The advantage of using synchronization facilities like the CountDownLatch is that it is very efficient and that you're not wasting time dealing with low-level Java synchronization idiosyncrasies.
In your case, the "split" may look like this (untested, it is just to give an idea):
if ( threads.getAndIncrement() < 4 ) {
    final CountDownLatch innerLatch = new CountDownLatch( 1 );
    final Thread t = new Merger( innerLatch, b );
    t.start();
    mergeSort( a );
    while ( innerLatch.getCount() > 0 ) {
        try {
            innerLatch.await( 1000, TimeUnit.SECONDS );
        } catch ( InterruptedException e ) {
            // Up to you to decide what to do here
        }
    }
} else {
    mergeSort( a );
    mergeSort( b );
}
(don't forget to "countdown" the latch when each merge is done)
Where you'd replace the number of threads (up to 4 here) by the number of available cores. You may use the following (once, say, to initialize some static variable at the beginning of your program; the number of cores is unlikely to change, unless you're on a machine allowing CPU hot-swapping like some Sun systems allow):
Runtime.getRuntime().availableProcessors()
As others said, this code isn't going to work because it starts no new threads. You need to call the start() method instead of the run() method to create new threads. It also has concurrency errors: the checks on the finished variable are not thread safe.
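Concretely, the fix for those two points might look something like this inside run(), reusing the Merger class and the a/b halves from the question (a sketch, not tested):

// Actually start new threads for each half...
Merger ma = new Merger(a);
Merger mb = new Merger(b);
ma.start();                 // start() spawns a thread; calling run() does not
mb.start();

// ...then join() them instead of busy-waiting on the non-volatile 'finished' flag
try {
    ma.join();
    mb.join();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}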
Concurrent programming can be pretty difficult if you do not understand the basics. You might read the book Java Concurrency in Practice by Brian Goetz. It explains the basics and explains constructs (such as Latch, etc) to ease building concurrent programs.
The overhead cost of synchronization may be comparatively large and prevent many optimizations.
Furthermore you are creating way too many threads.
The other is in the 'run' function of a class that extends thread, and recursively creates two new threads each time it is called.
You would be better off with a fixed number of threads, say 4 on a quad core. This could be realized with a thread pool (tutorial), and the pattern would be "bag of tasks". But perhaps it would be better yet to initially divide the task into four equally large tasks and do single-threaded sorting on those tasks. That would utilize the caches a lot better.
Instead of having a busy loop waiting for the threads to finish (stealing CPU cycles), you should have a look at Thread.join().
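A rough sketch of that fixed-pool idea (my own names; the final k-way merge is left out):

import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedSort {
    public static void sort(final int[] x) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService tpool = Executors.newFixedThreadPool(cores);
        int chunk = (x.length + cores - 1) / cores;
        for (int c = 0; c < cores; c++) {
            final int from = c * chunk;
            final int to = Math.min(x.length, from + chunk);
            if (from >= to)
                continue;                       // nothing left for this worker
            tpool.execute(new Runnable() {
                public void run() {
                    Arrays.sort(x, from, to);   // single-threaded sort of one chunk
                }
            });
        }
        tpool.shutdown();
        tpool.awaitTermination(1, TimeUnit.HOURS);   // wait for all chunks to finish
        // ... finally merge the sorted chunks on the calling thread (k-way merge) ...
    }
}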
How many elements are in the array you have to sort? If there are too few elements, the cost of synchronization and CPU switching will outweigh the time you save by dividing the job up for parallelism.
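To illustrate that trade-off, a tiny sketch of the usual cutoff (the threshold value is a guess; it should be tuned by measurement):

public class ThresholdSketch {
    static final int PARALLEL_THRESHOLD = 10000;

    static void sort(int[] a, int lo, int hi) {
        if (hi - lo < PARALLEL_THRESHOLD) {
            java.util.Arrays.sort(a, lo, hi);   // small range: thread overhead isn't worth it
            return;
        }
        // ... otherwise split the range and sort the halves in parallel ...
    }
}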