I have a program that applies median filtering to elements in an array. I'm using divide-and-conquer with Java's Fork/Join Framework to accomplish this. Here's my compute() override:
public FilterObject(int lo, int hi, boolean filterType)
{
this.lo = lo;
this.hi = hi;
this.filterType = filterType;
}
public void compute()
{
if (hi - lo <= SEQUENTIAL_THRESHOLD)
{
seqFilter();
}
else
{
FilterObject left = new FilterObject(lo, (hi + lo) / 2, filterType);
FilterObject right = new FilterObject((hi + lo) / 2, hi, filterType);
left.fork();
right.compute();
left.join();
}
}
seqFilter() is a static method that does the actual filtering once a subarray is "small enough".
Is my join() in the right place? I have a timer running in my Main class and the recorded times I'm getting seem way too fast. I'm calling pool.invoke(filt) from my Main class, where filt is my FilterObject object and pool is my ForkJoinPool. My timer stops immediately after this call. Is it possible that the main logic is continuing on before the parallel processes have completed? If it is, where can I put join() to stop this happening?
EDIT: Additional question - do I even need the join() in this case? The program forks, but it's not like I actually 'join' my forked objects together again because the recursive split index references just get and set from the same arrays.
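One way to check this is a small experiment. The sketch below uses a hypothetical IncrementTask in place of FilterObject (the class name, the threshold value and the runAndCount helper are assumptions, not part of the original program). ForkJoinPool.invoke() returns only once the task passed to it has completed, and left.join() is what makes each compute() wait for the half it forked, so every element should already be updated when invoke() returns:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical stand-in for FilterObject: adds 1 to every element of a shared array.
class IncrementTask extends RecursiveAction {
    static final int SEQUENTIAL_THRESHOLD = 1_000; // assumed value
    private final int[] data;
    private final int lo, hi;

    IncrementTask(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {
            for (int i = lo; i < hi; i++) data[i]++;
            return;
        }
        int mid = lo + (hi - lo) / 2;
        IncrementTask left = new IncrementTask(data, lo, mid);
        IncrementTask right = new IncrementTask(data, mid, hi);
        left.fork();     // schedule the left half asynchronously
        right.compute(); // do the right half in the current thread
        left.join();     // do not return until the left half is done
    }

    // Runs the task on an array of n zeros and counts the elements that were updated.
    static int runAndCount(int n) {
        int[] data = new int[n];
        new ForkJoinPool().invoke(new IncrementTask(data, 0, n)); // blocks until done
        int done = 0;
        for (int v : data) if (v == 1) done++;
        return done;
    }

    public static void main(String[] args) {
        System.out.println(runAndCount(1_000_000) + " elements updated");
    }
}
```

If left.join() is removed, compute() may return while the forked half is still running, so the count could come up short and a timer stopped after invoke() could read too low.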
So I have this multithreaded program that generates two random walkers. Each walker is a separate thread since I need them to move simultaneously, and each walker randomly moves in any of the 4 directions. The first problem is that I think StdDraw is not thread safe, so without a lock around my entire function it tends to draw random squares at random points for no reason and the whole thing becomes pretty glitchy. When I put a lock around my function, one thread becomes slower than the other since it sometimes has to wait for the lock, so the threads are not simultaneous anymore. Is there a solution to this? The other problem is that I want it to break out of the loop when the two walkers intersect, but for some reason the two threads don't know about the position of the other: one thinks that the position of the other is always at (0,0). Thanks!
import java.awt.Color;
public class WalkerThread implements Runnable {
String name;
static Integer lock = new Integer(1000);
int num;
static int steps = 0, steps2 = 0;
static int x = 0, y = 0;
static int x2 = -1, y2 = -2;
public WalkerThread(String s, int n) {
this.name = s;
this.num = n;
}
@Override
public void run() {
int N = 10;
StdDraw.create(600, 600);
StdDraw.setScale(-N, -N, +N, +N);
StdDraw.clear(Color.gray);
do {
synchronized (lock) {
if (num == 1) {
StdDraw.go(x, y);
StdDraw.setColor(Color.white);
StdDraw.spot(0.9, 0.9);
double r = Math.random();
if (r < 0.25)
x--;
else if (r < 0.50)
x++;
else if (r < 0.75)
y--;
else if (r < 1.00)
y++;
steps++;
StdDraw.setColor(Color.blue);
StdDraw.go(x, y);
StdDraw.spot(0.9, 0.9);
StdDraw.pause(40);
}
if (num == 2) {
StdDraw.go(x2, y2);
StdDraw.setColor(Color.yellow);
StdDraw.spot(0.9, 0.9);
double r2 = Math.random();
if (r2 < 0.25)
x2--;
else if (r2 < 0.50)
x2++;
else if (r2 < 0.75)
y2--;
else if (r2 < 1.00)
y2++;
steps2++;
StdDraw.setColor(Color.green);
StdDraw.go(x2, y2);
StdDraw.spot(0.9, 0.9);
StdDraw.pause(40);
}
}// lock
/*String pict = steps + ".png";
StdDraw.save(pict);*/
//if (posX == posX2 && posY == posY2) break;
} while ((Math.abs(x) < N && Math.abs(y) < N) && (Math.abs(x2) < N && Math.abs(y2) < N));
System.out.printf("Total steps of %s is %d and %d \n", name, steps, steps2);
}
}
//MAIN
public class Walkers{
public static void main(String[] args) {
Thread t1 = new Thread(new WalkerThread("one", 1));
Thread t2 = new Thread(new WalkerThread("two", 2));
t1.start();
t2.start();
}
}
Avoid Math.random() when going multi-threaded - create an r = new Random() in your Walker constructor, and use it as r.nextDouble().
Instead of the big if, take the differences between both branches (just a couple of colors) and pass them in through the constructor. Also, you don't need to keep x and x2 separate: if x and y are instance fields rather than static ones, each walker object (and so each thread) has its own private x, invisible to the other walker. Your code could end up roughly half the size.
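As a rough sketch of that refactoring (the Walker class, its constructor parameters and the seed argument are hypothetical, and the colors are plain strings standing in for the java.awt.Color values so the example runs without StdDraw):

```java
import java.util.Random;

// Hypothetical refactoring of WalkerThread: everything that differs between the
// two walkers is passed to the constructor and kept in instance fields.
class Walker {
    final String name;
    final String trailColor; // stand-in for the java.awt.Color used in the question
    final String headColor;
    private final Random rnd; // one generator per walker instead of Math.random()
    int x, y, steps;

    Walker(String name, int startX, int startY,
           String trailColor, String headColor, long seed) {
        this.name = name;
        this.x = startX;
        this.y = startY;
        this.trailColor = trailColor;
        this.headColor = headColor;
        this.rnd = new Random(seed);
    }

    // Moves one cell left, right, down or up with equal probability.
    void step() {
        double r = rnd.nextDouble();
        if (r < 0.25) x--;
        else if (r < 0.50) x++;
        else if (r < 0.75) y--;
        else y++;
        steps++;
    }

    public static void main(String[] args) {
        Walker one = new Walker("one", 0, 0, "white", "blue", 1L);
        Walker two = new Walker("two", -1, -2, "yellow", "green", 2L);
        for (int i = 0; i < 10; i++) {
            one.step();
            two.step();
        }
        System.out.printf("%s at (%d,%d), %s at (%d,%d)%n",
                one.name, one.x, one.y, two.name, two.x, two.y);
    }
}
```

The drawing calls are left out on purpose; they would use trailColor and headColor where the original code hard-codes white/blue and yellow/green.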
As far as synchronization goes, you have two problems. The first problem is that StdDraw is built on Swing (it runs in a JFrame, for example), which is not thread-safe. In particular, all drawing must happen in something called the event thread. This means that you should place all the drawing code within something like
SwingUtilities.invokeLater(new Runnable() {
@Override
public void run() {
synchronized (lock) {
// ... your calls to StdDraw here ...
}
}
});
However, this opens a big can of worms. First, the drawing code needs to access your data, which you will therefore want to prevent from changing at the same time. You can protect it with yet more synchronized (lock) { ... }, but that will mean that only one thread will be executing in any given moment. That's not what multithreading is for.
The simpler answer is, taking a peek at Elyasin's answer, to forget about parallel execution (it is really not needed here), and embrace turn-taking:
boolean turn = false; // declared once, before the loop, so it survives between rounds
// ... current init code here
do {
    if (turn) {
        // ... current code for num==1
    } else {
        // ... current code for num==2
    }
    turn = !turn; // reverse turn for next round
} while (/* ... */);
No threads, no locks, no synchronization, and it should work smoothly and without artifacts.
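A self-contained sketch of the turn-taking idea, with the drawing left out. The class name, the simulate method and its parameters are assumptions made for illustration, and it uses an int index rather than a boolean so both walkers can share the same code path:

```java
import java.util.Random;

class TurnTakingWalk {
    // Alternates between two walkers on one thread. Returns the number of moves
    // made when the walkers met or one reached the border, or maxMoves otherwise.
    static int simulate(long seed, int n, int maxMoves) {
        Random rnd = new Random(seed);
        int[] x = {0, -1}; // start positions taken from the question
        int[] y = {0, -2};
        int turn = 0;      // index of the walker whose turn it is
        for (int moves = 1; moves <= maxMoves; moves++) {
            switch (rnd.nextInt(4)) {
                case 0:  x[turn]--; break;
                case 1:  x[turn]++; break;
                case 2:  y[turn]--; break;
                default: y[turn]++; break;
            }
            // The StdDraw calls for walker `turn` would go here.
            boolean met = x[0] == x[1] && y[0] == y[1];
            boolean atBorder = Math.abs(x[turn]) >= n || Math.abs(y[turn]) >= n;
            if (met || atBorder) return moves;
            turn = 1 - turn; // hand over to the other walker
        }
        return maxMoves;
    }

    public static void main(String[] args) {
        System.out.println("stopped after " + simulate(1L, 10, 100_000) + " moves");
    }
}
```

Because both positions live on one thread, the intersection test sees up-to-date coordinates without any locking.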
So I have this multithreaded program that generates 2 random walkers,
each walker is a separate thread since I need them to move
simultaneously. Each walker randomly moves in any of the 4 directions.
You clearly state that you want two random walkers, any of the four directions is chosen randomly by any of the two walkers. So we stick to this requirement.
The first problem is that I think stdDraw is not thread safe and
therefore without having a lock around my entire function it tends to
draw random squares at random points for no reason and the whole thing
becomes pretty glitchy. When I put a lock around my function then one
thread becomes slower than the other one, since it sometimes has to wait
for the lock. So the threads are not simultaneous anymore. Is there a
solution to this?
Thread safety and randomness are not really correlated here. As clarified above, you want the walkers to be random. This has nothing to do with thread safety in the first place. Simply put: thread safety means that if several threads share a data structure/address space, then access to it is guaranteed to be free of race conditions.
Not sure what you mean with random squares at random points for no reason. A lock is usually used to grant permissions to execute, or to grant access to one or more shared resources. Not sure why you use a lock here, I don't see a shared resource and I don't see why you use the lock to control thread execution one at a time if you don't want this in the first place.
The two random walkers are independent and the only shared resource I see is the 2D plane.
If you want the two walkers to execute simultaneously/concurrently then you should not use a lock the way you did I think.
I am not even sure if thread safety really is an issue here, maybe you don't need thread safety?
The other problem I have is I want it to break out of the loop when
the two walkers intersect, but for some reason the two threads don't
know about the positions of each other. One thinks that the position of
the other one is always at (0,0).
Oh, now that is a good follow up question. Maybe there is a shared resource then? Will it have to be thread safe then?
That is the 2D plane: who would know whether the two walkers intersect or not? (I did not look into StdDraw, to be honest, but you should be able to find out.) Find a way to get the coordinates of both random walkers from StdDraw and check for intersection. If that is not possible, then use a shared resource, i.e. a data structure that holds the coordinates of both the 1st and the 2nd random walker.
You would not need to care much about thread safety, because one random walker would only read (and not write) the values/coordinates of the other random walker.
Try that out and let us know.
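A minimal sketch of such a shared structure, assuming a hypothetical SharedPositions class that both threads hold a reference to. One caveat: even when each walker only writes its own coordinates, the Java memory model does not guarantee that the other thread sees the update unless the data is published safely, so this sketch stores each position as an immutable pair in an AtomicReferenceArray:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical shared resource: slot 0 belongs to walker one, slot 1 to walker two.
class SharedPositions {
    private final AtomicReferenceArray<int[]> pos = new AtomicReferenceArray<>(2);

    SharedPositions() {
        pos.set(0, new int[] {0, 0});   // start positions taken from the question
        pos.set(1, new int[] {-1, -2});
    }

    // Called only by walker `id`; publishes x and y together as one snapshot.
    void move(int id, int x, int y) {
        pos.set(id, new int[] {x, y});
    }

    // May be called by either walker; each snapshot is never modified after set().
    boolean intersect() {
        int[] a = pos.get(0);
        int[] b = pos.get(1);
        return a[0] == b[0] && a[1] == b[1];
    }

    public static void main(String[] args) {
        SharedPositions shared = new SharedPositions();
        System.out.println(shared.intersect()); // walkers start on different cells
        shared.move(1, 0, 0);
        System.out.println(shared.intersect()); // walker two moved onto walker one
    }
}
```

Each walker thread would call move() after every step and test intersect() in its loop condition.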
Since fork() is called inside compute(), why doesn't every recursive call to compute() add yet another degree of parallelism? Is there a boolean flag, perhaps?
EDIT:
overriding the method compute() of the class RecursiveTask:
(pseudocode)
if (array.length < 100)
    do it
else
    divide array by 2;
    leftArray.fork();
    int righta = rightArray.compute();
    int lefta = (Integer) leftArray.join();
    return righta + lefta;
So basically this is the compute() method, which gets called recursively; when fork() happens, the task can be processed in parallel on another core. However, since compute() is recursive, fork() is called every time compute() is called recursively. In reality that does not seem to create a new thread each time (it would make no sense). Is that due to a boolean flag that says fork has already been activated?
Thanks in advance.
Look at the API
class Fibonacci extends RecursiveTask<Integer> {
final int n;
Fibonacci(int n) { this.n = n; }
protected Integer compute() {
if (n <= 1)
return n;
Fibonacci f1 = new Fibonacci(n - 1);
f1.fork();
Fibonacci f2 = new Fibonacci(n - 2);
return f2.compute() + f1.join();
}
}
Each time compute() is called, fork() places another task on the current worker thread's queue, where it can be run by that thread or taken by an idle one; no new thread is created per fork, and there is no flag. compute() keeps forking until n <= 1, where the recursion bottoms out. Each compute() call then finishes the "right" side (f2) itself, while f1.join() waits for the "left" side to finish.
Whenever join() is invoked, the joining thread may itself execute pending lower-level tasks (lower in the binary tree) instead of just blocking, which gives you the parallelism you want.
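To connect this with the pseudocode in the question, here is a runnable sketch of the array-sum version. SumTask, its field names and the static sum helper are assumptions made for illustration; the threshold of 100 mirrors the pseudocode:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 100; // same cut-off as the pseudocode
    private final int[] array;
    private final int lo, hi; // sums array[lo] .. array[hi - 1]

    SumTask(int[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo < THRESHOLD) { // small enough: "do it" sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(array, lo, mid);
        SumTask right = new SumTask(array, mid, hi);
        left.fork();                      // queue the left half; returns immediately
        long rightSum = right.compute();  // this thread works on the right half
        long leftSum = left.join();       // wait for (or help run) the left half
        return leftSum + rightSum;
    }

    static long sum(int[] array) {
        return ForkJoinPool.commonPool().invoke(new SumTask(array, 0, array.length));
    }

    public static void main(String[] args) {
        int[] data = new int[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data));
    }
}
```

The number of threads is fixed by the pool, not by how many times fork() is called; forking only creates more tasks for those threads to pick up.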
Issue
When benchmarking a simple QuickSort implementation in Java, I faced unexpected humps in the n vs time graphics I was plotting:
I know HotSpot will attempt to compile code to native after it seems certain methods are being heavily used, so I ran the JVM with -XX:+PrintCompilation. After repeated trials, it seems to be compiling the algorithm's methods always in the same way:
# iteration 6 -> sorting.QuickSort::swap (15 bytes)
# iteration 7 -> sorting.QuickSort::partition (66 bytes)
# iteration 7 -> sorting.QuickSort::quickSort (29 bytes)
I am repeating the above graphic with this added info, just to make things a bit clearer:
At this point, we must all be asking ourselves : why are we still getting those ugly humps AFTER the code is compiled? Maybe it has something to do with the algorithm itself? It sure could be, and luckily for us there's a quick way to sort that out, with -XX:CompileThreshold=0:
Bummer! It really must be something the JVM is doing in the background. But what?
I theorized that although code is being compiled, it may take a while until the compiled code actually starts to be used. Maybe adding a couple of Thread.sleep()s here and there could help us a bit sorting this issue out?
Ouch! The green function is the QuickSort code run with a 1000 ms interval between each run (details in the appendix), while the blue function is our old one (just for comparison).
At first, giving HotSpot time only seems to make matters worse! Maybe it only seems worse because of some other factor, such as caching issues?
Disclaimer : I am running 1000 trials for each point of the shown graphics, and using System.nanoTime() to measure the results.
EDIT
Some of you may at this stage wonder how the use of sleep() might distort the results. I ran the Red Plot (no native compilation) again, now with the sleeps in-between:
Scary!
Appendix
Here I present the QuickSort code I am using, just in case:
public class QuickSort {
public <T extends Comparable<T>> void sort(int[] table) {
quickSort(table, 0, table.length - 1);
}
private static <T extends Comparable<T>> void quickSort(int[] table,
int first, int last) {
if (first < last) { // There is data to be sorted.
// Partition the table.
int pivotIndex = partition(table, first, last);
// Sort the left half.
quickSort(table, first, pivotIndex - 1);
// Sort the right half.
quickSort(table, pivotIndex + 1, last);
}
}
/**
* @author http://en.wikipedia.org/wiki/Quick_Sort
*/
private static <T extends Comparable<T>> int partition(int[] table,
int first, int last) {
int pivotIndex = (first + last) / 2;
int pivotValue = table[pivotIndex];
swap(table, pivotIndex, last);
int storeIndex = first;
for (int i = first; i < last; i++) {
if (table[i]-(pivotValue) <= 0) {
swap(table, i, storeIndex);
storeIndex++;
}
}
swap(table, storeIndex, last);
return storeIndex;
}
private static <T> void swap(int[] a, int i, int j) {
int h = a[i];
a[i] = a[j];
a[j] = h;
}
}
as well the code I am using to run my benchmarks:
public static void main(String[] args) throws InterruptedException, IOException {
QuickSort quickSort = new QuickSort();
int TRIALS = 1000;
File file = new File(Long.toString(System.currentTimeMillis()));
System.out.println("Saving # \"" + file.getAbsolutePath() + "\"");
for (int x = 0; x < 30; ++x) {
// if (x > 4 && x < 17)
// Thread.sleep(1000);
int[] values = new int[x];
long start = System.nanoTime();
for (int i = 0; i < TRIALS; ++i)
quickSort.sort(values);
double duration = (System.nanoTime() - start) / TRIALS;
String line = x + "\t" + duration;
System.out.println(line);
FileUtils.writeStringToFile(file, line + "\r\n", true);
}
}
Well, it seems that I sorted the issue out on my own.
I was right about the idea that compiled code could take a while to kick in. The problem was a flaw in the way I actually implemented my benchmarking code:
if (x > 4 && x < 17)
Thread.sleep(1000);
Here I assumed that, since the only "affected" area would be between 4 and 17, I could just sleep over those values. This is simply not so. The following plot may be enlightening:
Here I am comparing the original no-compilation function (red) to another no-compilation function, but separated by sleeps in between. As you can see, they differ by orders of magnitude, meaning that mixing results of code with and without sleeps will yield unsound results, as I was guilty of doing.
The original question remains unanswered, though. What causes the humps to occur even after compilation took place? Let's try to find out by putting a 1 s sleep at ALL points taken:
That yields the expected result. The odd humps were happening because the native code still hadn't kicked in.
Comparing a sleep 50ms with a sleep 1000ms function yields yet again, the expected result:
(the gray one seems to still show a bit of delay)
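An alternative to sleeping is an explicit, untimed warm-up phase before each measurement. The sketch below is only an illustration of that idea: WarmupBench and averageNanos are hypothetical names, the run counts are arbitrary, and Arrays.sort stands in for the QuickSort class from the appendix. Since HotSpot compiles in the background, a warm-up makes compiled code likely but does not guarantee it; a dedicated harness such as JMH handles this more carefully.

```java
import java.util.Arrays;
import java.util.Random;

class WarmupBench {
    // Runs `task` warmupRuns times untimed, then returns the average
    // nanoseconds per call over `trials` timed calls.
    static double averageNanos(Runnable task, int warmupRuns, int trials) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // untimed: gives HotSpot a chance to compile the hot methods
        }
        long start = System.nanoTime();
        for (int i = 0; i < trials; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) trials; // floating-point division
    }

    public static void main(String[] args) {
        for (int n = 0; n < 30; n++) {
            final int[] template = new Random(n).ints(n, 0, 1000).toArray();
            double nanos = averageNanos(() -> {
                int[] copy = template.clone(); // sort a fresh unsorted copy each time
                Arrays.sort(copy);
            }, 20_000, 1_000);
            System.out.println(n + "\t" + nanos);
        }
    }
}
```

Note that the timed region here includes the clone() call; for arrays this small that cost is comparable to the sort itself, so the numbers are only useful for comparing runs against each other.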
Just wondering if anyone would be able to take a look at this code for implementing the quicksort algorithm and answer me a few questions, please :-)
public class Run
{
/***************************************************************************
* Quicksort code from Sedgewick 7.1, 7.2.
**************************************************************************/
public static void quicksort(double[] a)
{
//shuffle(a); // to guard against worst-case
quicksort(a, 0, a.length - 1, 0);
}
static void quicksort(final double[] a, final int left, final int right, final int tdepth)
{
if (right <= left)
return;
final int i = partition(a, left, right);
if ((tdepth < 4) && ((i - left) > 1000))
{
final Thread t = new Thread()
{
public void run()
{
quicksort(a, left, i - 1, tdepth + 1);
}
};
t.start();
quicksort(a, i + 1, right, tdepth + 1);
try
{
t.join();
}
catch (InterruptedException e)
{
throw new RuntimeException("Cancelled", e);
}
} else
{
quicksort(a, left, i - 1, tdepth);
quicksort(a, i + 1, right, tdepth);
}
}
// partition a[left] to a[right], assumes left < right
private static int partition(double[] a, int left, int right)
{
int i = left - 1;
int j = right;
while (true)
{
while (less(a[++i], a[right]))
// find item on left to swap
; // a[right] acts as sentinel
while (less(a[right], a[--j]))
// find item on right to swap
if (j == left)
break; // don't go out-of-bounds
if (i >= j)
break; // check if pointers cross
exch(a, i, j); // swap two elements into place
}
exch(a, i, right); // swap with partition element
return i;
}
// is x < y ?
private static boolean less(double x, double y)
{
return (x < y);
}
// exchange a[i] and a[j]
private static void exch(double[] a, int i, int j)
{
double swap = a[i];
a[i] = a[j];
a[j] = swap;
}
// shuffle the array a[]
private static void shuffle(double[] a)
{
int N = a.length;
for (int i = 0; i < N; i++)
{
int r = i + (int) (Math.random() * (N - i)); // between i and N-1
exch(a, i, r);
}
}
// test client
public static void main(String[] args)
{
int N = 5000000; // Integer.parseInt(args[0]);
// generate N random real numbers between 0 and 1
long start = System.currentTimeMillis();
double[] a = new double[N];
for (int i = 0; i < N; i++)
a[i] = Math.random();
long stop = System.currentTimeMillis();
double elapsed = (stop - start) / 1000.0;
System.out.println("Generating input: " + elapsed + " seconds");
// sort them
start = System.currentTimeMillis();
quicksort(a);
stop = System.currentTimeMillis();
elapsed = (stop - start) / 1000.0;
System.out.println("Quicksort: " + elapsed + " seconds");
}
}
My questions are:
What is the purpose of the variable tdepth?
Is this considered a "proper" implementation of a parallel quicksort? I ask because it doesn't use implements Runnable or extends Thread...
If it doesn't already, is it possible to modify this code to use multiple threads? By passing in the number of threads you want to use as a parameter, for example...?
Many thanks,
Brian
1. It's used to keep track of recursion depth. This is checked to decide whether to run in parallel. Notice how when the function runs in parallel it passes tdepth + 1 (which becomes tdepth in the called quicksort's parameters). This is a basic way of avoiding too many parallel threads.
2. Yes, it's definitely using another thread. The code:
new Thread()
{
public void run()
{
quicksort(a, left, i - 1, tdepth + 1);
}
};
creates an anonymous inner class (which extends Thread), which is then started.
1. Apparently, tdepth is used to avoid creating too many threads.
2. It uses an anonymous class, which implicitly extends Thread.
3. It does that already (see point 1).
tdepth is there so that there's an upper bound on the number of threads created. Note that every time the method calls itself recursively in the threaded branch (where one of the two calls runs in a new thread), tdepth is incremented by one. This way, only the first four levels of recursion will create new threads, presumably to prevent overloading the OS with many threads for little benefit.
This code launches its own threads in the definition of the quicksort method, so it will use parallel processing. One might argue that it could do with some kind of thread management and that e.g. some kind of Executor might be better, but it is definitely parallel. See the call to new Thread() ... followed by start(). Incidentally, the call to t.join() will cause the current thread to wait for the thread t to finish, in case you weren't aware of that.
This code already uses multiple threads, but you can tweak how many it spawns via the comparison on tdepth; increasing or decreasing the limit determines how many levels of recursion create threads. You could completely rewrite the code to use executors and thread pools, or perhaps to perform ternary recursion instead of binary, but I suspect that, in the sense you asked, no: there's no simple way to set the exact number of threads.
I actually wrote a (correctly) multi-threaded QuickSort in Java, so maybe I can help a bit...
Question here for anyone interested:
Multithreaded quicksort or mergesort
What is the purpose of the variable
tdepth?
As others have commented, it serves to determine whether to create new threads or not.
Is this considered a "proper"
implementation of a parallel
quicksort? I ask because it doesn't
use implements Runnable or extends
Thread...
I don't think it's that proper, for several reasons. First, you should make it CPU-dependent. There's no point in spawning 16 threads on a CPU that has just one core: a single-threaded QuickSort will outperform the multi-threaded one on a single-core machine. On a 16-core machine, sure, fire up to 16 threads.
Runtime.getRuntime().availableProcessors()
Then the second reason I really don't like it is that it relies on last-century, low-level Java threading idiosyncrasies: I prefer to stay away from .join() and use higher-level things (see fork/join in the other question, or something like CountDownLatch, etc.). The problem with low-level things like Java's thread "join" is that it carries no useful meaning: it is 100% Java-specific and can be replaced by higher-level threading facilities whose concepts are portable across languages.
Then don't comment out the shuffle at the beginning. Ever. I've seen datasets where QuickSort degrades quadratically if you remove that shuffle. And it's just an O(n) shuffle, which won't slow down your sort :)
If it doesn't already, is it possible
to modify this code to use multiple
threads? By passing in the number of
threads you want to use as a
parameter, for example...?
I'd try to write and/or reuse an implementation using higher-level concurrency facilities. See the advice in the question I asked here some time ago.
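As one possible shape for such a rewrite, here is a sketch built on the Fork/Join framework, with the number of worker threads passed in as a parameter. It is not the implementation referred to above: ParallelQuicksort, its threshold of 1000 (borrowed from the question's size check) and the simple middle-pivot partition are all assumptions, and the partition is not tuned for inputs with many duplicate values.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelQuicksort extends RecursiveAction {
    private static final int THRESHOLD = 1000; // below this, sort sequentially
    private final double[] a;
    private final int left, right; // inclusive bounds

    ParallelQuicksort(double[] a, int left, int right) {
        this.a = a;
        this.left = left;
        this.right = right;
    }

    @Override
    protected void compute() {
        if (right <= left) return;
        if (right - left < THRESHOLD) {
            Arrays.sort(a, left, right + 1); // toIndex is exclusive
            return;
        }
        int i = partition(a, left, right);
        // Run both halves as pool tasks and wait for both to finish.
        invokeAll(new ParallelQuicksort(a, left, i - 1),
                  new ParallelQuicksort(a, i + 1, right));
    }

    // Lomuto partition using the middle element as pivot.
    private static int partition(double[] a, int left, int right) {
        swap(a, left + (right - left) / 2, right);
        double pivot = a[right];
        int store = left;
        for (int k = left; k < right; k++) {
            if (a[k] < pivot) swap(a, k, store++);
        }
        swap(a, store, right);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    // `parallelism` is the number of worker threads the pool may use.
    static void sort(double[] a, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new ParallelQuicksort(a, 0, a.length - 1));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] a = new double[5_000_000];
        for (int i = 0; i < a.length; i++) a[i] = Math.random();
        long start = System.currentTimeMillis();
        sort(a, Runtime.getRuntime().availableProcessors());
        System.out.println("Quicksort: " + (System.currentTimeMillis() - start) / 1000.0 + " seconds");
    }
}
```

The pool owns the threads, so there is no tdepth bookkeeping and no explicit Thread.join().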
There are certain algorithms whose running time can decrease significantly when one divides up a task and gets each part done in parallel. One of these algorithms is merge sort, where a list is divided into progressively smaller parts and then recombined in sorted order. I decided to do an experiment to test whether or not I could increase the speed of this sort by using multiple threads. I am running the following functions in Java on a quad-core Dell with Windows Vista.
One function (the control case) is simply recursive:
// x is an array of N elements in random order
public int[] mergeSort(int[] x) {
if (x.length == 1)
return x;
// Dividing the array in half
int[] a = new int[x.length/2];
int[] b = new int[x.length/2+((x.length%2 == 1)?1:0)];
for(int i = 0; i < x.length/2; i++)
a[i] = x[i];
for(int i = 0; i < x.length/2+((x.length%2 == 1)?1:0); i++)
b[i] = x[i+x.length/2];
// Sending them off to continue being divided
mergeSort(a);
mergeSort(b);
// Recombining the two arrays
int ia = 0, ib = 0, i = 0;
while(ia != a.length || ib != b.length) {
if (ia == a.length) {
x[i] = b[ib];
ib++;
}
else if (ib == b.length) {
x[i] = a[ia];
ia++;
}
else if (a[ia] < b[ib]) {
x[i] = a[ia];
ia++;
}
else {
x[i] = b[ib];
ib++;
}
i++;
}
return x;
}
The other is in the 'run' function of a class that extends thread, and recursively creates two new threads each time it is called:
public class Merger extends Thread
{
int[] x;
boolean finished;
public Merger(int[] x)
{
this.x = x;
}
public void run()
{
if (x.length == 1) {
finished = true;
return;
}
// Divide the array in half
int[] a = new int[x.length/2];
int[] b = new int[x.length/2+((x.length%2 == 1)?1:0)];
for(int i = 0; i < x.length/2; i++)
a[i] = x[i];
for(int i = 0; i < x.length/2+((x.length%2 == 1)?1:0); i++)
b[i] = x[i+x.length/2];
// Begin two threads to continue to divide the array
Merger ma = new Merger(a);
ma.run();
Merger mb = new Merger(b);
mb.run();
// Wait for the two other threads to finish
while(!ma.finished || !mb.finished) ;
// Recombine the two arrays
int ia = 0, ib = 0, i = 0;
while(ia != a.length || ib != b.length) {
if (ia == a.length) {
x[i] = b[ib];
ib++;
}
else if (ib == b.length) {
x[i] = a[ia];
ia++;
}
else if (a[ia] < b[ib]) {
x[i] = a[ia];
ia++;
}
else {
x[i] = b[ib];
ib++;
}
i++;
}
finished = true;
}
}
It turns out that the function that does not use multithreading actually runs faster. Why? Do the operating system and the Java virtual machine not "communicate" effectively enough to place the different threads on different cores? Or am I missing something obvious?
The problem is not multi-threading: I've written a correctly multi-threaded QuickSort in Java and it beats the default Java sort. I did this after witnessing a gigantic dataset being processed with only one core of a 16-core machine working.
One of your issues (a huge one) is that you're busy looping:
// Wait for the two other threads to finish
while(!ma.finished || !mb.finished) ;
This is a HUGE no-no: it is called busy looping and you're destroying performance.
(Another issue is that your code is not spawning any new threads, as has already been pointed out to you.)
You need to use another way to synchronize: an example would be to use a CountDownLatch.
Another thing: there's no need to spawn two new threads when you divide the workload: spawn only one new thread, and do the other half in the current thread.
Also, you probably don't want to create more threads than there are cores available.
See my question here (asking for a good Open Source multithreaded mergesort/quicksort/whatever). The one I'm using is proprietary, I can't paste it.
Multithreaded quicksort or mergesort
I haven't implemented Mergesort but QuickSort and I can tell you that there's no array copying going on.
What I do is this:
- pick a pivot
- exchange values as needed
- have we reached the thread limit? (depending on the number of cores)
    - yes: sort first part in this thread
    - no: spawn a new thread
- sort second part in current thread
- wait for first part to finish if it's not done yet (using a CountDownLatch).
The code spawning a new thread and creating the CountDownLatch may look like this:
final CountDownLatch cdl = new CountDownLatch( 1 );
final Thread t = new Thread( new Runnable() {
    public void run() {
        quicksort( a, i+1, r );
        cdl.countDown(); // signal that this part is sorted
    }
} );
t.start();
The advantage of using synchronization facilities like CountDownLatch is that they are very efficient and that you're not wasting time dealing with low-level Java synchronization idiosyncrasies.
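Put together, the steps listed above could look roughly like the sketch below. This is not the proprietary implementation mentioned earlier: LatchQuicksort, the 1000-element cut-off, the compare-and-set thread counter and the middle-pivot partition are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LatchQuicksort {
    private static final int MAX_THREADS = Runtime.getRuntime().availableProcessors();
    private static final int MIN_PARALLEL_SIZE = 1_000; // assumed cut-off for spawning
    private static final AtomicInteger threads = new AtomicInteger(1); // the caller counts as one

    static void quicksort(int[] a, int l, int r) {
        if (l >= r) return;
        int i = partition(a, l, r);
        if (r - l >= MIN_PARALLEL_SIZE && tryReserveThread()) {
            CountDownLatch cdl = new CountDownLatch(1);
            new Thread(() -> {
                try {
                    quicksort(a, l, i - 1); // first part in the new thread
                } finally {
                    threads.decrementAndGet();
                    cdl.countDown();
                }
            }).start();
            quicksort(a, i + 1, r);         // second part in the current thread
            awaitUninterruptibly(cdl);      // blocks without spinning
        } else {
            quicksort(a, l, i - 1);
            quicksort(a, i + 1, r);
        }
    }

    // Reserves a thread slot unless the limit has been reached.
    private static boolean tryReserveThread() {
        while (true) {
            int n = threads.get();
            if (n >= MAX_THREADS) return false;
            if (threads.compareAndSet(n, n + 1)) return true;
        }
    }

    private static void awaitUninterruptibly(CountDownLatch latch) {
        boolean interrupted = false;
        while (true) {
            try {
                latch.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true; // remember it and keep waiting
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }

    // Lomuto partition using the middle element as pivot.
    private static int partition(int[] a, int l, int r) {
        swap(a, l + (r - l) / 2, r);
        int pivot = a[r], store = l;
        for (int k = l; k < r; k++) {
            if (a[k] < pivot) swap(a, k, store++);
        }
        swap(a, store, r);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

A call such as LatchQuicksort.quicksort(data, 0, data.length - 1) sorts in place; on a single-core machine the thread limit is 1, so it simply runs sequentially.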
In your case, the "split" may look like this (untested, it is just to give an idea):
if ( threads.getAndIncrement() < 4 ) {
final CountDownLatch innerLatch = new CountDownLatch( 1 );
final Thread t = new Merger( innerLatch, b );
t.start();
mergeSort( a );
while ( innerLatch.getCount() > 0 ) {
try {
innerLatch.await( 1000, TimeUnit.SECONDS );
} catch ( InterruptedException e ) {
// Up to you to decide what to do here
}
}
} else {
mergeSort( a );
mergeSort( b );
}
(don't forget to "countdown" the latch when each merge is done)
Here you'd replace the number of threads (up to 4 in this example) with the number of available cores. You may use the following (once, say to initialize some static variable at the beginning of your program: the number of cores is unlikely to change [unless you're on a machine that allows CPU hot-swapping, as some Sun systems do]):
Runtime.getRuntime().availableProcessors()
As others said; This code isn't going to work because it starts no new threads. You need to call the start() method instead of the run() method to create new threads. It also has concurrency errors: the checks on the finished variable are not thread safe.
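The difference between the two calls is easy to observe. In this small sketch (StartVsRun and executedBy are hypothetical names), the Thread's body records which thread actually executed it:

```java
class StartVsRun {
    // Returns the name of the thread that executed the Thread's run() body.
    static String executedBy(boolean useStart) {
        final String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = Thread.currentThread().getName(), "worker");
        if (useStart) {
            t.start(); // creates a new thread of execution
            try {
                t.join(); // waits for it; also makes the write to seen[0] visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        } else {
            t.run(); // an ordinary method call on the caller's own thread
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("start(): " + executedBy(true));
        System.out.println("run():   " + executedBy(false));
    }
}
```

With start() the body runs on the thread named "worker"; with run() it runs on whichever thread made the call, which is why the Merger class in the question never leaves its original thread.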
Concurrent programming can be pretty difficult if you do not understand the basics. You might read the book Java Concurrency in Practice by Brian Goetz. It explains the basics and explains constructs (such as Latch, etc) to ease building concurrent programs.
The overhead cost of synchronization may be comparatively large and prevent many optimizations.
Furthermore you are creating way too many threads.
The other is in the 'run' function of a class that extends thread, and recursively creates two new threads each time it is called.
You would be better off with a fixed number of threads, say 4 on a quad core. This could be realized with a thread pool, and the pattern would be "bag of tasks". But perhaps it would be better yet to initially divide the work into four equally large tasks and do "single-threaded" sorting on each of them. This would then utilize the caches a lot better.
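A sketch of that second suggestion, assuming a hypothetical ChunkedSort class: the input is cut into `parts` roughly equal chunks, each chunk is sorted single-threaded by Arrays.sort on a fixed thread pool, and the sorted chunks are then merged on the calling thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedSort {
    // Returns a sorted copy of x, using `parts` pool threads (e.g. 4 on a quad core).
    static int[] sort(int[] x, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                final int from = (int) ((long) x.length * p / parts);
                final int to = (int) ((long) x.length * (p + 1) / parts);
                futures.add(pool.submit(() -> {
                    int[] chunk = Arrays.copyOfRange(x, from, to);
                    Arrays.sort(chunk); // plain single-threaded sort of one chunk
                    return chunk;
                }));
            }
            int[] result = new int[0];
            for (Future<int[]> f : futures) {
                result = merge(result, f.get()); // get() blocks until that chunk is done
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Standard two-way merge of two sorted arrays.
    private static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int ia = 0, ib = 0, i = 0;
        while (ia < a.length && ib < b.length) {
            out[i++] = a[ia] <= b[ib] ? a[ia++] : b[ib++];
        }
        while (ia < a.length) out[i++] = a[ia++];
        while (ib < b.length) out[i++] = b[ib++];
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = sort(new int[] {5, 3, 9, 1, 7, 2, 8, 6, 4}, 4);
        System.out.println(Arrays.toString(sorted));
    }
}
```

Future.get() blocks without spinning, so no busy loop or finished flag is needed, and exactly `parts` threads are created no matter how large the array is.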
Instead of having a "busy-loop" waiting for the threads to finish (stealing cpu-cycles) you should have a look at Thread.join().
How many elements are in the array you have to sort? If there are too few, the time spent on synchronization and context switching will exceed the time you save by dividing the job for parallel execution.