ForkJoinPool vs Plain Recursion - java

After reading about ForkJoinPool, I tried an experiment to test how fast ForkJoinPool actually is compared to plain recursion.
I counted the number of files in a folder recursively, and to my surprise, plain recursion performed way better than ForkJoinPool.
Here's my code.
Recursive Task
class DirectoryTask extends RecursiveTask<Long> {

    private final Directory directory;

    DirectoryTask(Directory directory) {
        this.directory = directory;
    }

    @Override
    protected Long compute() {
        List<RecursiveTask<Long>> forks = new ArrayList<>();
        List<Directory> directories = directory.getDirectories();
        for (Directory child : directories) {
            DirectoryTask directoryTask = new DirectoryTask(child);
            forks.add(directoryTask);
            directoryTask.fork();
        }
        Long count = directory.getDocumentCount();
        for (RecursiveTask<Long> task : forks) {
            count += task.join();
        }
        return count;
    }
}
Plain Recursion
private static Long getFileCount(Directory directory) {
    Long recursiveCount = 0L;
    List<Directory> directories = directory.getDirectories();
    if (null != directories) {
        for (Directory d : directories) {
            recursiveCount += getFileCount(d);
        }
    }
    return recursiveCount + directory.getDocumentCount();
}
Directory Object
class Directory {

    private final List<Directory> directories;
    private final Long documentCount;

    private Directory(List<Directory> directories, Long documentCount) {
        this.directories = directories;
        this.documentCount = documentCount;
    }

    List<Directory> getDirectories() {
        return directories;
    }

    Long getDocumentCount() {
        return documentCount;
    }

    static Directory fromFolder(File file) {
        if (!file.isDirectory()) {
            throw new IllegalArgumentException("Only directories are allowed");
        }
        List<Directory> children = new ArrayList<>();
        Long documentCount = 0L;
        String[] files = file.list();
        if (null != files) {
            for (String path : files) {
                File f = new File(file.getPath() + File.separator + path);
                if (f.isHidden()) {
                    continue;
                }
                if (f.isDirectory()) {
                    children.add(Directory.fromFolder(f));
                } else {
                    documentCount++;
                }
            }
        }
        return new Directory(children, documentCount);
    }
}
Results
Plain Recursion: 3ms
ForkJoinPool: 25ms
Where's the mistake here?
I am just trying to understand whether there is a particular threshold below which plain recursion is faster than a ForkJoinPool.

Nothing in life comes for free. If you had to move one beer crate from your car to your apartment, what is quicker: carrying it there by hand, or first going to the shed to get the wheelbarrow and using that to move the one crate?
Creating thread objects is a "native" operation that goes down into the underlying operating system to acquire resources there. That can be a rather costly operation.
Meaning: just throwing "more threads" at a problem doesn't automatically speed things up. On the contrary. When your task is mainly CPU-bound, there may be little to gain from doing things in parallel. When you are doing lots of IO, having multiple threads lets you spend less time waiting overall, thus improving your throughput.
In other words: fork/join requires considerable activity before it does the real job. Using it for computations that only take a few ms is simply overkill. Thus you would reserve fork/join for operations that work on larger data sets.
For further reading, you might look at parallel streams. Those use the fork/join framework under the covers; and, unsurprisingly, it is just as much a misconception to expect an arbitrary parallelStream() to be "faster" than an ordinary stream.
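To illustrate that point, here is a rough, self-contained sketch (the class name and the tiny workload are just for illustration): on a task this small, the parallel stream typically pays more in splitting, scheduling and joining than it saves. The printed timings are only indicative; a proper comparison would use a benchmark harness such as JMH.
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class ParallelOverheadDemo {
    public static void main(String[] args) {
        // A tiny workload: summing a short list of numbers.
        List<Long> numbers = LongStream.rangeClosed(1, 1_000)
                .boxed()
                .collect(Collectors.toList());

        long t0 = System.nanoTime();
        long serial = numbers.stream().mapToLong(Long::longValue).sum();
        long t1 = System.nanoTime();
        long parallel = numbers.parallelStream().mapToLong(Long::longValue).sum();
        long t2 = System.nanoTime();

        // On a workload this small the parallel version usually loses, because
        // the fork/join bookkeeping costs more than the work itself.
        System.out.printf("serial:   %d (%,d ns)%n", serial, t1 - t0);
        System.out.printf("parallel: %d (%,d ns)%n", parallel, t2 - t1);
    }
}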

There are multiple aspects to this:
Is there a difference between serial (e.g. plain recursion) and parallel (e.g. forkjoin) solutions to the same problem?
What is the scope for parallelizing file system access?
What are the traps for measuring performance?
The answer to #1: yes, there is a difference. Parallelism is not worthwhile for a problem that is too small. With a parallel solution, you need to account for the overheads of:
creating and managing threads
passing information from the parent thread to the child threads
returning results from the child threads to the parent thread
synchronized access to shared data structures,
waiting for the slowest / last finishing child thread to finish.
How these play out in practice depends on a variety of things, including the size of the problem and the opportunities for parallelism. (A common mitigation for small problems is to fork only above some size threshold and fall back to plain recursion below it; see the sketch below.)
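To make that threshold idea concrete, here is a minimal sketch (not code from the question) of how the DirectoryTask above could fall back to plain recursion for small directories. It assumes the question's Directory class with getDirectories() and getDocumentCount() getters, and SEQUENTIAL_THRESHOLD is an illustrative number, not a tuned value.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveTask;

class ThresholdDirectoryTask extends RecursiveTask<Long> {
    // Illustrative cut-off: below this many children, forking is not worth it.
    private static final int SEQUENTIAL_THRESHOLD = 16;

    private final Directory directory;

    ThresholdDirectoryTask(Directory directory) {
        this.directory = directory;
    }

    @Override
    protected Long compute() {
        List<Directory> children = directory.getDirectories();
        if (children.size() < SEQUENTIAL_THRESHOLD) {
            return countSequentially(directory);   // cheap case: no forking at all
        }
        List<ThresholdDirectoryTask> forks = new ArrayList<>();
        for (Directory child : children) {
            ThresholdDirectoryTask task = new ThresholdDirectoryTask(child);
            forks.add(task);
            task.fork();
        }
        long count = directory.getDocumentCount();
        for (ThresholdDirectoryTask task : forks) {
            count += task.join();
        }
        return count;
    }

    // Plain recursion, exactly like the non-fork/join version in the question.
    private static long countSequentially(Directory directory) {
        long count = directory.getDocumentCount();
        for (Directory child : directory.getDirectories()) {
            count += countSequentially(child);
        }
        return count;
    }
}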
The answer to #2 is (probably) not as much as you would think. A typical file system is stored on a disk drive that has physical characteristics such as disk rotation and head seeking. These typically become the bottleneck, and unless you have a high-end storage system, there is not much scope for parallelism.
The answer to #3 is that there are lots of traps, and those traps can result in very misleading (i.e. invalid) performance results if you don't allow for them. One of the biggest traps is that JVMs take time to "warm up": load classes, do JIT compilation, resize the heap, and so on.
Another trap that applies to benchmarks that do file system I/O is that a typical OS will do things like caching disk blocks and file / directory metadata. So the second time you access a file or directory it is likely to be faster than the first time.
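As a rough way to allow for warm-up and caching, you can repeat the same measurement several times in one JVM run and only trust the later iterations (or, better, use a harness such as JMH). The snippet below is only a sketch; it assumes root is a Directory built with Directory.fromFolder(...) as in the question.
// Rough mitigation for warm-up and cache effects: run the same measurement
// several times and only look at the later iterations.
for (int i = 0; i < 10; i++) {
    long start = System.nanoTime();
    long count = getFileCount(root);            // the plain-recursion version
    long elapsed = System.nanoTime() - start;
    System.out.printf("run %d: count=%d, %.2f ms%n", i, count, elapsed / 1_000_000.0);
}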
Having said this, if you have a well-designed, high performance file system (e.g. inodes held on SSDs) and a well designed application, and enough cores, it is possible to achieve extraordinary rates of file system scanning through parallelism. (For example, checking the modification timestamps on half a billion files in under 12 hours ....)


How is LongAccumulator implemented, so that it is more efficient?

I understand that Java 8 introduced new synchronization tools such as LongAccumulator (in the atomic package).
The documentation says that LongAccumulator is more efficient when the variable is updated frequently from several threads.
I wonder how it is implemented to be more efficient?
That's a very good question, because it touches on a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:
class Accumulator {
    private final AtomicLong value = new AtomicLong(0);

    public void accumulate(long value) {
        this.value.addAndGet(value);
    }

    public long get() {
        return this.value.get();
    }
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
You have to look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically organized into cache lines of 64 bytes. If a core executes an atomic increment on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive if it doesn't already have exclusive access, due to the required coordination with all the other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };

    public void accumulate(long value) {
        int index = getMagicValue(); // placeholder: maps the calling thread/core to an index
        this.values[index % values.length].addAndGet(value);
    }

    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}
At first glance, this class seems more expensive due to the additional operations. However, it might be several times faster than the first class, because there is a higher probability that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things:
The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose, and only use the indexes 0, 8, 16 and 24.
The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache-line transfers; if there are too many counters, you waste space in the L1 caches.
The method getMagicValue should return a value with an affinity to the core id.
To sum up, LongAccumulator is more efficient for some use cases because it uses redundant memory for frequent write operations in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to assemble a consistent result.
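As a usage illustration (this shows how it is used, not how it is implemented internally), here is a small sketch of a LongAccumulator being updated from many threads; the parallel stream and the numbers are only there to generate contention.
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.LongStream;

public class LongAccumulatorDemo {
    public static void main(String[] args) {
        // Identity 0 and Long::sum make this behave like a striped counter.
        LongAccumulator sum = new LongAccumulator(Long::sum, 0L);

        // Many threads updating concurrently: each thread tends to hit its own
        // internal cell, so cache lines do not bounce between cores as much.
        LongStream.rangeClosed(1, 1_000_000).parallel().forEach(sum::accumulate);

        // get() folds all the cells together, which is the slightly more
        // expensive read operation mentioned above.
        System.out.println(sum.get()); // 500000500000
    }
}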
Judging by the source here:
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.

Multicore Java Program with Native Code

I am using a native C++ library inside a Java program. The Java program is written to make use of many-core systems, but it does not scale: the best speed is with around 6 cores, i.e., adding more cores slows it down. My tests show that the call to the native code itself causes the problem, so I want to make sure that different threads access different instances of the native library, and therefore remove any hidden (memory) dependency between the parallel tasks.
In other words, instead of the static block
static {
    System.loadLibrary("theNativeLib");
}
I want multiple instances of the library to be loaded dynamically, one for each thread. The main question is whether that is possible at all, and then how to do it!
Notes:
- I have implementations in Java 7 fork/join as well as Scala/Akka, so any help on either platform is appreciated.
- The parallel tasks are completely independent. In fact, each task may create a couple of new tasks and then terminate; there are no further dependencies!
Here is the test program in fork/join style, in which processNatively is basically a bunch of native calls:
class Repeater extends RecursiveTask<Long> {
    final int n;
    final processor mol;

    public Repeater(final int m, final processor o) {
        n = m;
        mol = o;
    }

    @Override
    protected Long compute() {
        processNatively(mol);
        final List<RecursiveTask<Long>> tasks = new ArrayList<>();
        for (int i = n; i < 9; i++) {
            tasks.add(new Repeater(n + 1, mol));
        }
        long count = 1;
        for (final RecursiveTask<Long> task : invokeAll(tasks)) {
            count += task.join();
        }
        return count;
    }
}
private final static ForkJoinPool forkJoinPool = new ForkJoinPool();

public void repeat(processor mol)
{
    final long middle = System.currentTimeMillis();
    final long count = forkJoinPool.invoke(new Repeater(0, mol));
    System.out.println("Count is " + count);
    final long after = System.currentTimeMillis();
    System.out.println("Time elapsed: " + (after - middle));
}
Putting it differently:
If I have N threads that use a native library, what happens if each of them calls System.loadLibrary("theNativeLib"); dynamically, instead of calling it once in a static block? Will they share the library anyway? If yes, how can I fool the JVM into seeing it as N different libraries loaded independently? (The value of N is not known statically.)
The javadoc for System.loadLibrary states that it's the same as calling Runtime.getRuntime().loadLibrary(name). The javadoc for this loadLibrary (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#loadLibrary(java.lang.String) ) states that "If this method is called more than once with the same library name, the second and subsequent calls are ignored.", so it seems you can't load the same library more than once. In terms of fooling the JVM into thinking there are multiple instances, I can't help you there.
You need to ensure you don't have a bottleneck on any shared resources. E.g. if you have 6 hyper-threaded cores, you may find that 12 threads is optimal, or you might find that 6 threads is optimal (so that each thread has a dedicated core).
If you have a heavy floating point routine, it is likely that hyper-threading will be slower rather than faster.
If you are using all the cache, trying to use more can slow your system down. If you are at the limit of CPU-to-main-memory bandwidth, attempting to use more bandwidth can slow your machine down.
But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages?
There is only one instance; you cannot load a DLL more than once. If you want a different data set for each thread, you need to construct it outside the library and pass it in, so that each thread works on different data.
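Since the library itself can only be loaded once, the usual workaround is exactly that: keep per-thread working data on the Java side and pass it into every native call. The sketch below is purely illustrative; processNativelyWithBuffer and the long[] scratch buffer are hypothetical names, not part of the asker's library.
class PerThreadNativeState {
    // Each worker thread gets its own private scratch buffer.
    private static final ThreadLocal<long[]> SCRATCH =
            ThreadLocal.withInitial(() -> new long[1024]);

    static long process(processor mol) {
        long[] scratch = SCRATCH.get();          // never shared between threads
        // Hypothetical native entry point that takes the caller's private
        // buffer instead of relying on global state inside the library.
        return processNativelyWithBuffer(mol, scratch);
    }

    private static native long processNativelyWithBuffer(processor mol, long[] scratch);
}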

Lock-free guard for synchronized acquire/release

I have a shared tempfile resource that is divided into chunks of 4K (or some such value). Each 4K in the file is represented by an index starting from zero. For this shared resource, I track the 4K chunk indices in use and always return the lowest indexed 4K chunk not in use, or -1 if all are in use.
This ResourceSet class for the indices has public acquire and release methods, both of which use a synchronized lock held for roughly as long as it takes to generate 4 random numbers (expensive, CPU-wise).
Therefore as you can see from the code that follows, I use an AtomicInteger "counting semaphore" to prevent a large number of threads from entering the critical section at the same time on acquire(), returning -1 (not available right now) if there are too many threads.
Currently, I am using a constant of 100 for the tight CAS loop to try to increment the atomic integer in acquire, and a constant of 10 for the maximum number of threads to then allow into the critical section, which is long enough to create contention. My question is, what should these constants be for a moderate to highly loaded servlet engine that has several threads trying to get access to these 4K chunks?
public class ResourceSet {
    // ??? what should this be
    // maximum number of attempts to try to increment with CAS on acquire
    private static final int CAS_MAX_ATTEMPTS = 50;

    // ??? what should this be
    // maximum number of threads contending for lock before returning -1 on acquire
    private static final int CONTENTION_MAX = 10;

    private AtomicInteger latch = new AtomicInteger(0);

    // ... member variables to track free resources

    private boolean acquireLatchForAcquire()
    {
        for (int i = 0; i < CAS_MAX_ATTEMPTS; i++) {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet"); // this would mean more threads than can exist on any system, so it's a bug!
            if (!latch.compareAndSet(val, val + 1))
                continue;
            if (val < 0 || val >= CONTENTION_MAX) {
                latch.decrementAndGet();
                // added to fix BUG that a comment pointed out, thanks!
                return false;
            }
            return true; // successfully entered below the contention limit
        }
        return false;
    }

    private void acquireLatchForRelease()
    {
        do {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet"); // this would mean more threads than can exist on any system, so it's a bug!
            if (latch.compareAndSet(val, val + 1))
                return;
        } while (true);
    }

    public ResourceSet(int totalResources)
    {
        // ... initialize
    }

    public int acquire(ResourceTracker owned)
    {
        if (!acquireLatchForAcquire())
            return -1;
        try {
            synchronized (this) {
                // ... algorithm to compute minimum free resource or return -1 if all in use
                return resourceindex;
            }
        } finally {
            latch.decrementAndGet();
        }
    }

    public boolean release(ResourceIter iter)
    {
        acquireLatchForRelease();
        try {
            synchronized (this) {
                // ... iterate and release all resources
            }
        } finally {
            latch.decrementAndGet();
        }
    }
}
Writing a good and performant spinlock is actually pretty complicated and requires a good understanding of memory barriers. Merely picking a constant is not going to cut it and will definitely not be portable. Google's gperftools has an example that you can look at, but it is probably far more complicated than what you'd need.
If you really want to reduce contention on the lock, you might want to consider using a more fine-grained and optimistic scheme. A simple one could be to divide your chunks into n groups and associate a lock with each group (also called striping). This will help reduce contention and increase throughput, but it won't help reduce latency. You could also associate an AtomicBoolean with each chunk and CAS to acquire it (retry in case of failure). Do be careful when dealing with lock-free algorithms, because they tend to be tricky to get right. If you do get it right, it could considerably reduce the latency of acquiring a chunk.
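A minimal sketch of the striping idea (illustrative names and stripe count, not a drop-in replacement): each group of chunk indices gets its own lock, so threads working on different groups do not contend with each other. Note that the "always return the lowest free index" policy would then need to be rethought per stripe.
import java.util.concurrent.locks.ReentrantLock;

class StripedChunkLocks {
    private static final int STRIPES = 16;               // illustrative number
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];

    StripedChunkLocks() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    // Chunks are spread over the stripes by index; two chunks in different
    // stripes can be acquired/released concurrently without contention.
    ReentrantLock lockFor(int chunkIndex) {
        return locks[chunkIndex % STRIPES];
    }
}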
Note that it's difficult to propose a more fine-grained approach without knowing what your chunk selection algorithm looks like. I also assume that you really do have a performance problem (it's been profiled and everything).
While I'm at it, your spinlock implementation is flawed. You should never spin directly on a CAS because you're spamming memory barriers. This will be incredibly slow with any serious amount of contention (related to the thundering-herd problem). A minimum would be to first check the variable for availability before your CAS (a simple if on a plain, no-barrier read will do). Even better would be to not have all your threads spinning on the same value. This avoids the associated cache line ping-ponging between your cores.
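For illustration, a minimal sketch of that "check before CAS" (test-and-test-and-set) pattern; the class is made up for the sketch, and Thread.onSpinWait() is an optional Java 9+ hint.
import java.util.concurrent.atomic.AtomicBoolean;

class TtasSpinLock {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    void lock() {
        while (true) {
            // Cheap read first: only attempt the CAS once the flag looks free,
            // so waiting threads mostly generate read traffic, not write traffic.
            if (!busy.get() && busy.compareAndSet(false, true)) {
                return;                           // acquired
            }
            Thread.onSpinWait();                  // CPU hint (Java 9+); optional
        }
    }

    void unlock() {
        busy.set(false);
    }
}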
Note that I don't know what type of memory barriers are associated with atomic ops in Java so my above suggestions might not be optimal or correct.
Finally, The Art Of Multiprocessor Programming is a fun book to read to get better acquainted with all the non-sense I've been spewing in this answer.
I'm not sure it's necessary to forge your own lock class for this scenario, since the JDK provides ReentrantLock, which also leverages CAS instructions during lock acquisition. Its performance should be pretty good compared with your personal lock class.
You can use Semaphore's tryAcquire method if you want your threads to balk when no resource is available.
I for one would simply substitute your synchronized keyword with a ReentrantLock and use the tryLock() method on it. If you want to let your threads wait a bit, you can use tryLock(timeout) on the same class. Which one to choose, and what value to use for the timeout, needs to be determined by way of a performance test.
Creating an explicit gate as you seem to be doing seems unnecessary to me. I'm not saying that it can never help, but IMO it's more likely to actually hurt performance, and it's an added complication for sure. So unless you have a performance issue here (based on a test you did) and you found that this kind of gating helps, I'd recommend going with the simplest implementation.
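For illustration, a sketch of what the tryLock() replacement for acquire() might look like; the elided parts are the same as in the question's code, and tryLock(timeout, unit) could be used instead if threads should wait briefly.
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the suggested simplification: let ReentrantLock do the gating.
private final ReentrantLock resourceLock = new ReentrantLock();

public int acquire(ResourceTracker owned) {
    if (!resourceLock.tryLock()) {
        return -1;                        // contended right now, caller backs off
    }
    try {
        // ... same algorithm: compute the minimum free resource or return -1 if all in use
        return resourceindex;
    } finally {
        resourceLock.unlock();
    }
}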

In Java what is the performance of AtomicInteger compareAndSet() versus synchronized keyword?

I was implementing a FIFO queue of request instances (preallocated request objects for speed) and started by using the "synchronized" keyword on the add method. The method was quite short (check if there is room in the fixed-size buffer, then add the value to the array). Using VisualVM it appeared the thread was blocking more often than I liked ("monitor" to be precise). So I converted the code to use AtomicInteger values for things such as keeping track of the current size, and then used compareAndSet() in while loops (as AtomicInteger does internally for methods such as incrementAndGet()). The code now looks quite a bit longer.
What I was wondering is: what is the performance overhead of using synchronized and shorter code versus the longer code without the synchronized keyword (which should never block on a lock)?
Here is the old get method with the synchronized keyword:
public synchronized Request get()
{
    if (head == tail)
    {
        return null;
    }
    Request r = requests[head];
    head = (head + 1) % requests.length;
    return r;
}
Here is the new get method without the synchronized keyword:
public Request get()
{
    while (true)
    {
        int current = size.get();
        if (current <= 0)
        {
            return null;
        }
        if (size.compareAndSet(current, current - 1))
        {
            break;
        }
    }
    while (true)
    {
        int current = head.get();
        int nextHead = (current + 1) % requests.length;
        if (head.compareAndSet(current, nextHead))
        {
            return requests[current];
        }
    }
}
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc), even though the code is shorter.
Thanks!
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc)
Yes, in the common case you are right. Java Concurrency in Practice discusses this in section 15.3.2:
[...] at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks. This is because a lock reacts to contention by suspending threads, reducing CPU usage and synchronization traffic on the shared memory bus. (This is similar to how blocking producers in a producer-consumer design reduces the load on consumers and thereby lets them catch up.) On the other hand, with atomic variables, contention management is pushed back to the calling class. Like most CAS-based algorithms, AtomicPseudoRandom reacts to contention by trying again immediately, which is usually the right approach but in a high-contention environment just creates more contention.
Before we condemn AtomicPseudoRandom as poorly written or atomic variables as a poor choice compared to locks, we should realize that the level of contention in Figure 15.1 is unrealistically high: no real program does nothing but contend for a lock or atomic variable. In practice, atomics tend to scale better than locks because atomics deal more effectively with typical contention levels.
The performance reversal between locks and atomics at differing levels of contention illustrates the strengths and weaknesses of each. With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance. (CAS-based algorithms also outperform lock-based ones on single-CPU systems, since a CAS always succeeds on a single-CPU system except in the unlikely case that a thread is preempted in the middle of the read-modify-write operation.)
(On the figures referred to by the text, Figure 15.1 shows that the performance of AtomicInteger and ReentrantLock is more or less equal when contention is high, while Figure 15.2 shows that under moderate contention the former outperforms the latter by a factor of 2-3.)
Update: on nonblocking algorithms
As others have noted, nonblocking algorithms, although potentially faster, are more complex and thus more difficult to get right. A hint from section 15.4 of JCiP:
Good nonblocking algorithms are known for many common data structures, including stacks, queues, priority queues, and hash tables, though designing new ones is a task best left to experts.
Nonblocking algorithms are considerably more complicated than their lock-based equivalents. The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency. In linked collection classes such as queues, you can sometimes get away with expressing state transformations as changes to individual links and using an AtomicReference to represent each link that must be updated atomically.
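A classic illustration of that idea is the Treiber stack, where the whole state transition is a single compareAndSet on the head reference. This is a sketch along the lines of the JCiP example, not code from the question.
import java.util.concurrent.atomic.AtomicReference;

class ConcurrentStack<E> {
    private static final class Node<E> {
        final E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    private final AtomicReference<Node<E>> head = new AtomicReference<>();

    public void push(E item) {
        Node<E> newHead = new Node<>(item);
        Node<E> oldHead;
        do {
            oldHead = head.get();
            newHead.next = oldHead;
        } while (!head.compareAndSet(oldHead, newHead));   // retry on contention
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;                                // empty stack
            }
            newHead = oldHead.next;
        } while (!head.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }
}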
I wonder if the JVM already does a few spins before really suspending the thread. It anticipates that well-written critical sections, like yours, are very short and complete almost immediately. Therefore it could optimistically busy-wait for, I don't know, dozens of loops before giving up and suspending the thread. If that's the case, it should behave the same as your 2nd version.
What a profiler shows might be very different from what's really happening in a JVM running at full speed, with all kinds of crazy optimizations. It's better to measure and compare throughput without the profiler.
Before doing this kind of synchronization optimizations, you really need a profiler to tell you that it's absolutely necessary.
Yes, synchronized may under some conditions be slower than an atomic operation, but compare your original and replacement methods. The former is really clear and easy to maintain; the latter is definitely more complex. Because of this there may be very subtle concurrency bugs that you will not find during initial testing. I already see one problem: size and head can get out of sync, because although each of these operations is atomic, the combination is not, and sometimes this may lead to an inconsistent state.
So, my advice:
Start simple
Profile
If performance is good enough, leave simple implementation as is
If you need performance improvement, then start to get clever (possibly using more specialized lock at first), and TEST, TEST, TEST
Here's code for a busy wait lock.
import java.util.concurrent.atomic.AtomicBoolean;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BusyWaitLock
{
    private static final boolean LOCK_VALUE = true;
    private static final boolean UNLOCK_VALUE = false;

    private final static Logger log = LoggerFactory.getLogger(BusyWaitLock.class);

    /**
     * @author Rod Moten
     */
    public class BusyWaitLockException extends RuntimeException
    {
        private static final long serialVersionUID = 1L;

        /**
         * @param message
         */
        public BusyWaitLockException(String message)
        {
            super(message);
        }
    }

    private AtomicBoolean lock = new AtomicBoolean(UNLOCK_VALUE);
    private final long maximumWaitTime;

    /**
     * Create a busy wait lock that uses the default wait time of two minutes.
     */
    public BusyWaitLock()
    {
        this(1000 * 60 * 2); // default is two minutes
    }

    /**
     * Create a busy wait lock that uses the given value as the maximum wait time.
     * @param maximumWaitTime - a positive value that represents the maximum number of milliseconds that a thread will busy wait.
     */
    public BusyWaitLock(long maximumWaitTime)
    {
        if (maximumWaitTime < 1)
            throw new IllegalArgumentException("Max wait time of " + maximumWaitTime + " is too low. It must be at least 1 millisecond.");
        this.maximumWaitTime = maximumWaitTime;
    }

    public void lock()
    {
        long startTime = System.currentTimeMillis();
        long lastLogTime = 0;
        int logMessageCount = 0;
        // Spin until the CAS succeeds in flipping the lock from unlocked to locked.
        while (!lock.compareAndSet(UNLOCK_VALUE, LOCK_VALUE)) {
            long waitTime = System.currentTimeMillis() - startTime;
            if (waitTime - lastLogTime > 5000) {
                log.debug("Waiting for lock. Log message # {}", logMessageCount++);
                lastLogTime = waitTime;
            }
            if (waitTime > maximumWaitTime) {
                log.warn("Wait time of {} exceeded maximum wait time of {}", waitTime, maximumWaitTime);
                throw new BusyWaitLockException("Exceeded maximum wait time of " + maximumWaitTime + " ms.");
            }
        }
    }

    public void unlock()
    {
        lock.set(UNLOCK_VALUE);
    }
}

Traversing a Binary Tree with multiple threads

So I'm working on a speed contest in Java. I have (number of processors) threads doing work, and they all need to add to a binary tree. Originally I just used a synchronized add method, but I wanted to make it so threads could follow each other through the tree (each thread only has the lock on the object it's accessing). Unfortunately, even for a very large file (48,000 lines), my new binary tree is slower than the old one. I assume this is because I'm getting and releasing a lock every time I move in the tree. Is this the best way to do this or is there a better way?
Each node has a ReentrantLock named lock, and getLock() and releaseLock() just call lock.lock() and lock.unlock();
My code:
public void add(String sortedWord, String word) {
    synchronized (this) {
        if (head == null) {
            head = new TreeNode(sortedWord, word);
            return;
        }
        head.getLock();
    }
    TreeNode current = head, previous = null;
    while (current != null) {
        // If this is an anagram of another word in the list..
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
            current.releaseLock();
            return;
        }
        // New word is less than current word
        else if (current.compareTo(sortedWord) > 0) {
            previous = current;
            current = current.getLeft();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
        // New word greater than current word
        else {
            previous = current;
            current = current.getRight();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
    }
    if (previous.compareTo(sortedWord) > 0) {
        previous.setLeft(sortedWord, word);
    }
    else {
        previous.setRight(sortedWord, word);
    }
    previous.releaseLock();
}
EDIT: Just to clarify, my code is structured like this: The main thread reads input from a file and adds the words to a queue, each worker thread pull words from the queue and does some work (including sorting them and adding them to the binary tree).
Another thing: there is definitely no place for a binary tree in performance-critical code. The caching behaviour will kill all performance. It should have a much larger fan-out (one cache line). [edit] With a binary tree you access too much non-contiguous memory. Take a look at the material on Judy trees.
And you probably want to start with a radix of at least one character before starting the tree.
And do the compare on an int key instead of a string first.
And perhaps look at tries
And getting rid of all the threads and synchronization. Just try to make the problem memory access bound
[edit]
I would do this a bit differently. I would use a thread for each first character of the string, and give each its own BTree (or perhaps a trie). I'd put a non-blocking work queue in each thread and fill them based on the first character of the string. You can get even more performance by presorting the add queue and doing a merge sort into the BTree. In the BTree, I'd use int keys representing the first 4 characters, only referring to the strings in the leaf pages.
In a speed contest, you hope to be memory access bound, and therefore have no use for threads. If not, you're still doing too much processing per string.
I would actually start by looking at the use of compare() and equals() and see if something can be improved there. You might wrap your String object in another class with a different compare() method, optimized for your use case. For instance, consider using hashCode() instead of equals(). The hash code is cached, so future calls will be that much faster.
Consider interning the strings. I don't know if the VM will accept that many strings, but it's worth checking out.
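For illustration, one way to act on the "wrap your String" suggestion is a small key class that caches the hash and compares it before falling back to equals(); the class and method names here are made up for the sketch.
// Sketch: compare the (cached) hash first, so the relatively expensive
// String.equals only runs when the hashes already match.
final class SortedWordKey {
    private final String sortedWord;
    private final int hash;

    SortedWordKey(String sortedWord) {
        this.sortedWord = sortedWord;
        this.hash = sortedWord.hashCode();
    }

    boolean matches(SortedWordKey other) {
        return hash == other.hash && sortedWord.equals(other.sortedWord);
    }

    int compareTo(SortedWordKey other) {
        return sortedWord.compareTo(other.sortedWord);
    }
}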
(this was going to be a comment to an answer but got too wordy).
When reading the nodes you need to get a read-lock for each node as you reach it. If you read-lock the whole tree then you gain nothing.
Once you reach the node you want to modify, you release the read lock for that node and try to acquire the write lock. Code would be something like:
TreeNode current; // add a ReentrantReadWriteLock to each node

// enter the current node:
current.getLock().readLock().lock();
if (isTheRightPlace(current)) {
    current.getLock().readLock().unlock();
    current.getLock().writeLock().lock(); // NB: getLock returns a ReentrantReadWriteLock
    // do stuff then release lock
    current.getLock().writeLock().unlock();
} else {
    current.getLock().readLock().unlock();
}
You may try using an upgradeable read/write lock (maybe it's called an upgradeable shared lock or the like; I do not know what Java provides): use a single RWLock for the whole tree. Before traversing the B-Tree, you acquire the read (shared) lock and you release it when done (one acquire and one release in the add method, not more).
At the point where you have to modify the B-Tree, you acquire the write (exclusive) lock (or "upgrade" from read to write lock), insert the node and downgrade to read (shared) lock.
With this technique the synchronization for checking and inserting the head node can also be removed!
It should look somehow like this:
public void add(String sortedWord, String word) {
    // pseudocode: Java's ReentrantReadWriteLock has no upgrade()/downgrade(), see below
    lock.read();
    if (head == null) {
        lock.upgrade();
        head = new TreeNode(sortedWord, word);
        lock.downgrade();
        lock.unlock();
        return;
    }
    TreeNode current = head, previous = null;
    while (current != null) {
        if (current.getSortedWord().equals(sortedWord)) {
            lock.upgrade();
            current.add(word);
            lock.downgrade();
            lock.unlock();
            return;
        }
        // .. more tree traversal, do not touch the lock here ..
        // ...
    }
    if (previous.compareTo(sortedWord) > 0) {
        lock.upgrade();
        previous.setLeft(sortedWord, word);
        lock.downgrade();
    }
    else {
        lock.upgrade();
        previous.setRight(sortedWord, word);
        lock.downgrade();
    }
    lock.unlock();
}
Unfortunately, after some googling I could not find a suitable "upgradeable" rwlock for Java. ReentrantReadWriteLock is not upgradeable; however, instead of upgrading you can unlock the read lock, then lock the write lock, and (very important) re-check the condition that led to these lines (e.g. if (current.getSortedWord().equals(sortedWord)) {...}). This is important because another thread may have changed things between the read unlock and the write lock.
For details, check this question and its answers.
In the end the traversal of the tree runs in parallel; only when a target node is found does the thread acquire an exclusive lock (and other threads will block only for the duration of the insertion).
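For illustration, here is a rough sketch of that "release read, take write, re-check" dance with a single ReentrantReadWriteLock guarding the whole tree. It assumes the question's TreeNode and that the caller holds the read lock from the traversal when this method is called; the class and method names are made up for the sketch.
import java.util.concurrent.locks.ReentrantReadWriteLock;

class AnagramTree {
    // One lock for the whole tree, as described above.
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    // Called once the traversal (under the read lock) has found the node to modify.
    void addWordAt(TreeNode current, String sortedWord, String word) {
        rwLock.readLock().unlock();      // give up the shared lock first
        rwLock.writeLock().lock();       // then take the exclusive lock
        try {
            // Re-check: another thread may have changed the tree in the gap
            // between the read unlock and the write lock.
            if (current.getSortedWord().equals(sortedWord)) {
                current.add(word);
            }
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}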
Locking and unlocking is overhead, and the more you do it, the slower your program will be.
On the other hand, decomposing a task and running portions in parallel will make your program complete more quickly.
Where the "break-even" point lies is highly-dependent on the amount of contention for a particular lock in your program, and the system architecture on which the program is run. If there is little contention (as there appears to be in this program) and many processors, this might be a good approach. However, as the number of threads decreases, the overhead will dominate and a concurrent program will be slower. You have to profile your program on the target platform to determine this.
Another option to consider is a non-locking approach using immutable structures. Rather than modifying a list, for example, you could append the old (linked) list to a new node, then use a compareAndSet operation on an AtomicReference to ensure that you won the data race to set the words collection in the current tree node. If not, try again. You could use AtomicReferences for the left and right children in your tree nodes too. Whether this is faster or not would, again, have to be tested on your target platform.
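A rough sketch of that AtomicReference-per-child idea (illustrative, not a complete tree): losing the compareAndSet race simply means another thread attached a node first, and the caller keeps traversing.
import java.util.concurrent.atomic.AtomicReference;

class CasTreeNode {
    final String sortedWord;
    final AtomicReference<CasTreeNode> left = new AtomicReference<>();
    final AtomicReference<CasTreeNode> right = new AtomicReference<>();

    CasTreeNode(String sortedWord) {
        this.sortedWord = sortedWord;
    }

    // Returns true if this call attached the new node; false means another
    // thread got there first and the caller should continue descending.
    boolean trySetLeft(CasTreeNode child) {
        return left.compareAndSet(null, child);
    }

    boolean trySetRight(CasTreeNode child) {
        return right.compareAndSet(null, child);
    }
}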
Considering one data item per line, 48k lines isn't all that much, and you can only make wild guesses as to how your operating system and the virtual machine are going to handle your file I/O to make it as fast as possible.
Trying to use a producer/consumer paradigm can be problematic here, as you have to balance the overhead of locks against the actual amount of I/O carefully. You might get better performance if you just improve the way you do the file I/O (consider something like mmap()).
I would say that doing it this way is not the way to go, even without taking the synchronization performance issues into account.
The fact that this implementation is slower than the original fully synchronized version may be a problem, but a bigger problem is that the locking in this implementation is not at all robust.
Imagine, for example, that you pass null in for sortedWord; this will result in a NullPointerException being thrown, which means the current thread ends up holding onto the lock, leaving your data structure in an inconsistent state. On the other hand, if you just synchronize this method, you don't have to worry about such things. Considering the synchronized version is faster as well, it's an easy choice to make.
You seem to have implemented a Binary Search Tree, not a B-Tree.
Anyway, have you considered using a ConcurrentSkipListMap? This is an ordered data structure (introduced in Java 6), which should have good concurrency.
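For illustration, replacing the hand-rolled tree with a ConcurrentSkipListMap could look roughly like this (it assumes Java 8 for computeIfAbsent; the class and method names are made up for the sketch).
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: map sorted word -> anagram group; all operations are thread-safe
// and the map stays ordered by the sorted word, like the binary search tree.
class AnagramIndex {
    private final Map<String, Queue<String>> groups = new ConcurrentSkipListMap<>();

    void add(String sortedWord, String word) {
        groups.computeIfAbsent(sortedWord, k -> new ConcurrentLinkedQueue<>())
              .add(word);
    }
}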
I've got a dumb question: since you're reading and modifying a file, you're going to be totally limited by how fast the read/write head can move around and the disk can rotate. So what good is it to use threads and processors? The disc can't do two things at once.
Or is this all in RAM?
ADDED: OK, it's not clear to me how much parallelism can help you here (some, maybe), but regardless, I would suggest squeezing every cycle you can out of each thread. For example, I wonder if innocent-looking code like those calls to "get" and "compare" methods is taking more of a percentage of the time than you might expect. If it is, you might be able to do each of them once rather than 2 or 3 times - that sort of thing.
