So I'm working on a speed contest in Java. I have (number of processors) threads doing work, and they all need to add to a binary tree. Originally I just used a synchronized add method, but I wanted to make it so threads could follow each other through the tree (each thread only has the lock on the object it's accessing). Unfortunately, even for a very large file (48,000 lines), my new binary tree is slower than the old one. I assume this is because I'm getting and releasing a lock every time I move in the tree. Is this the best way to do this or is there a better way?
Each node has a ReentrantLock named lock, and getLock() and releaseLock() just call lock.lock() and lock.unlock();
My code:
public void add(String sortedWord, String word) {
    synchronized (this) {
        if (head == null) {
            head = new TreeNode(sortedWord, word);
            return;
        }
        head.getLock();
    }
    TreeNode current = head, previous = null;
    while (current != null) {
        // If this is an anagram of another word in the list..
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
            current.releaseLock();
            return;
        }
        // New word is less than current word
        else if (current.compareTo(sortedWord) > 0) {
            previous = current;
            current = current.getLeft();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
        // New word greater than current word
        else {
            previous = current;
            current = current.getRight();
            if (current != null) {
                current.getLock();
                previous.releaseLock();
            }
        }
    }
    if (previous.compareTo(sortedWord) > 0) {
        previous.setLeft(sortedWord, word);
    } else {
        previous.setRight(sortedWord, word);
    }
    previous.releaseLock();
}
EDIT: Just to clarify, my code is structured like this: the main thread reads input from a file and adds the words to a queue; each worker thread pulls words from the queue and does some work (including sorting them and adding them to the binary tree).
Another thing: there is definitely no place for a binary tree in performance-critical code. The caching behaviour will kill all performance. It should have a much larger fan-out (one cache line). [edit] With a binary tree you access too much non-contiguous memory. Take a look at the material on Judy trees.
And you probably want to start with a radix of at least one character before starting the tree.
And do the compare on an int key instead of a string first.
And perhaps look at tries.
And get rid of all the threads and synchronization. Just try to make the problem memory-access bound.
[edit]
I would do this a bit differently. I would use a thread for each first character of the string, and give each its own BTree (or perhaps a Trie). I'd put a non-blocking work queue in front of each thread and fill the queues based on the first character of the string (sketched below). You can get even more performance by presorting the add queue and doing a merge sort into the BTree. In the BTree, I'd use int keys representing the first 4 characters, only referring to the strings in the leaf pages.
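A rough sketch of just the routing part, assuming lower-case ASCII words (Router, dispatch and bucketFor are made-up names; the per-thread trees and the drain loops are omitted):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class Router {
    // one non-blocking queue per first letter; each worker thread drains one bucket
    private final Queue<String>[] buckets;

    @SuppressWarnings("unchecked")
    Router() {
        buckets = new Queue[26];
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new ConcurrentLinkedQueue<String>();
        }
    }

    void dispatch(String word) {
        buckets[word.charAt(0) - 'a'].offer(word); // assumes lower-case ASCII input
    }

    Queue<String> bucketFor(int workerIndex) {
        return buckets[workerIndex];
    }
}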
In a speed contest, you hope to be memory access bound, and therefore have no use for threads. If not, you're still doing too much processing per string.
I would actually start by looking at the use of compare() and equals() and seeing if something can be improved there. You might wrap your String object in another class with a different compare() method, optimized for your use case. For instance, consider using hashCode() instead of equals(): the hash code is cached, so future calls will be that much faster.
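For example, something like this made-up SortedKey wrapper, which computes the hash eagerly and uses it as a cheap pre-check in equals():

final class SortedKey implements Comparable<SortedKey> {
    final String word;
    private final int hash; // computed once up front, unlike String's lazy caching

    SortedKey(String word) {
        this.word = word;
        this.hash = word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SortedKey)) return false;
        SortedKey other = (SortedKey) o;
        // cheap int comparison first; fall back to full equals only on a hash match
        return hash == other.hash && word.equals(other.word);
    }

    @Override
    public int hashCode() { return hash; }

    public int compareTo(SortedKey other) { return word.compareTo(other.word); }
}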
Consider interning the strings. I don't know if the VM will accept that many strings, but it's worth checking out.
(this was going to be a comment to an answer but got too wordy).
When reading the nodes you need to get a read-lock for each node as you reach it. If you read-lock the whole tree then you gain nothing.
Once you reach the node you want to modify, you release the read lock for that node and try to acquire the write lock. Code would be something like:
TreeNode current; // add a ReentrantReadWriteLock to each node

// enter the current node:
current.getLock().readLock().lock();
if (isTheRightPlace(current)) {
    current.getLock().readLock().unlock();
    current.getLock().writeLock().lock();
    // re-check the condition here: another thread may have changed the node
    // between the read unlock and the write lock
    // do stuff then release lock
    current.getLock().writeLock().unlock();
} else {
    current.getLock().readLock().unlock();
}
You may try using an upgradeable read/write lock (maybe it's called an upgradeable shared lock or the like; I do not know what Java provides): use a single RWLock for the whole tree. Before traversing the B-Tree, you acquire the read (shared) lock, and you release it when done (one acquire and one release in the add method, not more).
At the point where you have to modify the B-Tree, you acquire the write (exclusive) lock (or "upgrade" from read to write lock), insert the node and downgrade to read (shared) lock.
With this technique the synchronization for checking and inserting the head node can also be removed!
It should look somehow like this:
public void add(String sortedWord, String word) {
    lock.read();
    if (head == null) {
        lock.upgrade();
        head = new TreeNode(sortedWord, word);
        lock.downgrade();
        lock.unlock();
        return;
    }
    TreeNode current = head, previous = null;
    while (current != null) {
        if (current.getSortedWord().equals(sortedWord)) {
            lock.upgrade();
            current.add(word);
            lock.downgrade();
            lock.unlock();
            return;
        }
        // ... more tree traversal, do not touch the lock here ...
    }
    if (previous.compareTo(sortedWord) > 0) {
        lock.upgrade();
        previous.setLeft(sortedWord, word);
        lock.downgrade();
    } else {
        lock.upgrade();
        previous.setRight(sortedWord, word);
        lock.downgrade();
    }
    lock.unlock();
}
Unfortunately, after some googling I could not find a suitable "upgradeable" RWLock for Java. ReentrantReadWriteLock is not upgradeable; however, instead of upgrading you can unlock the read lock, then lock the write lock, and (very important) re-check the condition that led to these lines (e.g. if (current.getSortedWord().equals(sortedWord)) { ... }). This is important because another thread may have changed things between the read unlock and the write lock.
For details, check this question and its answers.
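For illustration, the critical section with a plain ReentrantReadWriteLock and the re-check could look roughly like this (a sketch reusing current, sortedWord and word from the code above, with one lock for the whole tree):

import java.util.concurrent.locks.ReentrantReadWriteLock;

final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // one lock for the whole tree

lock.readLock().lock();
// ... traverse the tree under the read lock ...
if (current.getSortedWord().equals(sortedWord)) {
    lock.readLock().unlock();  // ReentrantReadWriteLock cannot upgrade, so drop the read lock
    lock.writeLock().lock();
    try {
        // re-check: another thread may have changed the tree between unlock and lock
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
        }
    } finally {
        lock.writeLock().unlock();
    }
    return;
}
// ... continue the traversal; on exit, release the read lock ...
lock.readLock().unlock();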
In the end, the traversal of the B-tree will run in parallel. Only when a target node is found does the thread acquire an exclusive lock (and other threads will block only for the duration of the insertion).
Locking and unlocking is overhead, and the more you do it, the slower your program will be.
On the other hand, decomposing a task and running portions in parallel will make your program complete more quickly.
Where the "break-even" point lies is highly-dependent on the amount of contention for a particular lock in your program, and the system architecture on which the program is run. If there is little contention (as there appears to be in this program) and many processors, this might be a good approach. However, as the number of threads decreases, the overhead will dominate and a concurrent program will be slower. You have to profile your program on the target platform to determine this.
Another option to consider is a non-locking approach using immutable structures. Rather than modifying a list, for example, you could append the old (linked) list to a new node, and then, with a compareAndSet operation on an AtomicReference, ensure that you won the data race to set the words collection in the current tree node. If not, try again. You could use AtomicReferences for the left and right children in your tree nodes too. Whether this is faster or not would, again, have to be tested on your target platform.
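As a sketch of that idea (WordList and addWord are invented names; this would live inside a tree node):

import java.util.concurrent.atomic.AtomicReference;

final class WordList {
    final String word;
    final WordList next; // immutable: the old list is reused as the tail
    WordList(String word, WordList next) {
        this.word = word;
        this.next = next;
    }
}

// inside the tree node:
final AtomicReference<WordList> words = new AtomicReference<WordList>();

void addWord(String word) {
    while (true) {
        WordList old = words.get();
        WordList updated = new WordList(word, old); // new node pointing at the old list
        if (words.compareAndSet(old, updated)) {
            return; // we won the data race
        }
        // another thread won; loop and retry against the new head
    }
}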
Considering one dataset per line, 48k lines isn't all that much, and you can only make wild guesses as to how your operating system and the virtual machine will mangle your file IO to make it as fast as possible.
Trying to use a producer/consumer paradigm can be problematic here, because you have to carefully balance the overhead of locks against the actual amount of IO. You might get better performance if you just improve the way you do the file IO (consider something like mmap()).
I would say that doing it this way is not the way to go, even without taking the synchronization performance issues into account.
The fact that this implementation is slower than the original fully synchronized version may be a problem, but a bigger problem is that the locking in this implementation is not at all robust.
Imagine, for example, that you pass null in for sortedWord; this will result in a NullPointerException being thrown, which means you end up holding the lock in the current thread and leaving your data structure in an inconsistent state. On the other hand, if you just synchronize this method, you don't have to worry about such things. Given that the synchronized version is faster as well, it's an easy choice to make.
You seem to have implemented a Binary Search Tree, not a B-Tree.
Anyway, have you considered using a ConcurrentSkipListMap? This is an ordered data structure (introduced in Java 6), and it should give you good concurrency.
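For the anagram use case, that might look something like this (a sketch; mapping each sorted word to a concurrent bucket of its anagrams is my assumption about the data layout):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

final ConcurrentNavigableMap<String, Queue<String>> anagrams =
        new ConcurrentSkipListMap<String, Queue<String>>();

void add(String sortedWord, String word) {
    Queue<String> bucket = anagrams.get(sortedWord);
    if (bucket == null) {
        Queue<String> fresh = new ConcurrentLinkedQueue<String>();
        Queue<String> raced = anagrams.putIfAbsent(sortedWord, fresh);
        bucket = (raced != null) ? raced : fresh; // another thread may have won the race
    }
    bucket.add(word);
}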
I've got a dumb question: since you're reading and modifying a file, you're going to be totally limited by how fast the read/write head can move around and how fast the disk can rotate. So what good are threads and processors? The disk can't do two things at once.
Or is this all in RAM?
ADDED: OK, it's not clear to me how much parallelism can help you here (some, maybe), but regardless, what I would suggest is squeezing every cycle out of each thread that you can. For example, I wonder if innocent-looking sleeper code like those calls to "get" and "compare" methods is taking more of a percentage of time than you might expect. If it is, you might be able to do each of them once rather than 2 or 3 times - that sort of thing.
Related
CopyOnWriteArrayList almost has the behavior I want, and if unnecessary copies were removed it would be exactly what I am looking for. In particular, it could act exactly like ArrayList for adds made to the end of the list - i.e., there is no reason to actually make a new copy every single time, which is wasteful. It could just virtually restrict the end of the list to capture the snapshot for the readers, and update the end after the new items are added.
This enhancement seems like it would be worth having since for many applications the most common type of addition would be to the end of the ArrayList - which is even a reason for choosing to use an ArrayList to begin with.
There also would be no extra overhead, since it would only skip the copy when appending; and although it would still have to check whether a resize is necessary, ArrayList has to do this anyway.
Is there any alternative implementation or data structure that has this behavior without the unnecessary copies for additions at the end (i.e., thread-safe and optimized to allow frequent reads with writes only being additions at the end of the list)?
How can I submit a change request asking that copies be eliminated for additions to the end of a CopyOnWriteArrayList (unless a resize is necessary)?
I'd really liked to see this changed with the core Java libraries rather than maintaining and using my own custom code.
Sounds like you're looking for a BlockingQueue, and in particular an ArrayBlockingQueue.
You may also want a ConcurrentLinkedQueue, which uses a lock-free (non-blocking) algorithm and may therefore be faster in many circumstances. It's only a Queue (not a Deque), and thus you can only insert at the tail and remove at the head, but it sounds like that might be good for your use case. But in exchange for the lock-free algorithm, it has to use a linked list rather than an array internally, and that means more memory (including more garbage when you pop items) and worse memory locality. The lock-free algorithm also relies on a compare-and-set (CAS) loop, which means that while it's faster in the "normal" case, it can actually be slower under high contention, as each thread needs to try its CAS several times before it wins and can move forward.
My guess is that the reason lists don't get as much love in java.util.concurrent is that a list is an inherently racy data structure in most use cases other than iteration. For instance, something like if (!list.isEmpty()) { return list.get(0); } is racy unless it's surrounded by a synchronized block, in which case you don't need an inherently thread-safe structure. What you really need is a "list-type" interface that only allows operations at the ends -- and that's exactly what Queue and Deque are.
To answer your questions:
I'm not aware of an alternative implementation that is a fully functional list.
If your idea is truly viable, I can think of a number of ways to proceed:
You can submit "requests for enhancement" (RFE) through the Java Bugs Database. However, in this case I doubt that you will get a positive response. (Certainly, not a timely one!)
You could create an RFE issue on Guava or Apache Commons issues tracker. This might be more fruitful, though it depends on convincing them ...
You could submit a patch to the OpenJDK team with an implementation of your idea. I can't say what the result might be ...
You could submit a patch (as above) to Guava or Apache Commons via their respective issues trackers. This is the approach that is most likely to succeed, though it still depends on convincing "them" that it is technically sound, and "a good thing".
You could just put the code for your proposed alternative implementation on Github, and see what happens.
However, all of this presupposes that your idea is actually going to work. Based on the scant information you have provided, I'm doubtful. I suspect that there may be issues with incomplete encapsulation, concurrency and/or not implementing the List abstraction fully / correctly.
I suggest that you put your code on Github so that other people can take a good hard look at it.
there is no reason to actually make a new copy every single time which is so wasteful.
This is how it works: it builds a new array and then replaces the previous one in a single atomic update of the internal array reference. It is a key part of the thread-safety design that you always get a new array, even if all you do is replace one entry.
thread-safe and optimized to allow frequent reads with writes only being additions at the end of the list
This is heavily optimised for reads; any other solution will be faster for writes but slower for reads, and you have to decide which one you really want.
You can build a custom data structure which is the best of both worlds, but it is no longer a generic solution, which is what CopyOnWriteArrayList and ArrayDeque provide.
How can I submit a change request asking that copies be eliminated for additions to the end of a CopyOnWriteArrayList (unless a resize is necessary)?
You can do this through the bugs database, but what you propose is a fundamental change in how the data structure works. I suggest proposing a new/different data structure which works the way you want. In the meantime, I suggest implementing it yourself as a working example, as you will get what you want faster.
I would start with an AtomicReferenceArray, as this can be used to perform the low-level actions you need. The only problem is that it is not resizable, so you would need to determine the maximum size you would ever need.
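A minimal sketch of that direction, assuming a fixed capacity and append-only writes (AppendOnlyList is a made-up name):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

final class AppendOnlyList<T> {
    private final AtomicReferenceArray<T> slots; // fixed capacity: the array is never replaced
    private final AtomicInteger size = new AtomicInteger(0);

    AppendOnlyList(int capacity) {
        slots = new AtomicReferenceArray<T>(capacity);
    }

    int add(T element) {
        int index = size.getAndIncrement(); // claim a slot; throws past capacity
        slots.set(index, element);          // publish via a volatile write
        return index;
    }

    T get(int index) {
        T value = slots.get(index);         // volatile read
        if (value == null) {
            // a writer claimed this slot but has not published the element yet
            throw new IndexOutOfBoundsException("not yet published: " + index);
        }
        return value;
    }

    int size() { return size.get(); }
}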
CopyOnWriteArrayList has a performance drawback because it copies the underlying array of the list on write operations. The array copying makes the write operations slow. CopyOnWriteArrayList may be advantageous for a List with a high read rate and a low write rate.
Eventually I started coding my own implementation using java.util.concurrent.locks.ReadWriteLock. I did my implementation simply by maintaining an object-level ReadWriteLock instance, acquiring the read lock in the read operations and the write lock in the write operations. The code looks like this:
public class ConcurrentList<T> implements List<T>
{
    private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
    private final List<T> list;

    public ConcurrentList(List<T> list)
    {
        this.list = list;
    }

    public boolean remove(Object o)
    {
        readWriteLock.writeLock().lock();
        boolean ret;
        try
        {
            ret = list.remove(o);
        }
        finally
        {
            readWriteLock.writeLock().unlock();
        }
        return ret;
    }

    public boolean add(T t)
    {
        readWriteLock.writeLock().lock();
        boolean ret;
        try
        {
            ret = list.add(t);
        }
        finally
        {
            readWriteLock.writeLock().unlock();
        }
        return ret;
    }

    public void clear()
    {
        readWriteLock.writeLock().lock();
        try
        {
            list.clear();
        }
        finally
        {
            readWriteLock.writeLock().unlock();
        }
    }

    public int size()
    {
        readWriteLock.readLock().lock();
        try
        {
            return list.size();
        }
        finally
        {
            readWriteLock.readLock().unlock();
        }
    }

    public boolean contains(Object o)
    {
        readWriteLock.readLock().lock();
        try
        {
            return list.contains(o);
        }
        finally
        {
            readWriteLock.readLock().unlock();
        }
    }

    public T get(int index)
    {
        readWriteLock.readLock().lock();
        try
        {
            return list.get(index);
        }
        finally
        {
            readWriteLock.readLock().unlock();
        }
    }

    // etc.
}
The performance improvement observed was notable.
Total time taken for 5000 reads + 5000 writes (read:write ratio 1:1) by 10 threads was:
ArrayList - 16450 ns (not thread safe)
ConcurrentList - 20999 ns
Vector - 35696 ns
CopyOnWriteArrayList - 197032 ns
Please follow this link for more info about the test case used to obtain the above results.
However, in order to avoid ConcurrentModificationException when using the Iterator, I just create a copy of the current List and return the iterator of that. This means this list does not return an Iterator which can modify the original List. For me, this is OK for the moment.
public Iterator<T> iterator()
{
    readWriteLock.readLock().lock();
    try
    {
        return new ArrayList<T>(list).iterator();
    }
    finally
    {
        readWriteLock.readLock().unlock();
    }
}
After some googling I found out that CopyOnWriteArrayList has a similar implementation, as it does not return an Iterator which can modify the original List. The Javadoc says:
The returned iterator provides a snapshot of the state of the list when the iterator was constructed. No synchronization is needed while traversing the iterator. The iterator does NOT support the remove method.
I have a shared tempfile resource that is divided into chunks of 4K (or some such value). Each 4K in the file is represented by an index starting from zero. For this shared resource, I track the 4K chunk indices in use and always return the lowest indexed 4K chunk not in use, or -1 if all are in use.
This ResourceSet class for the indices has public acquire and release methods, both of which use a synchronized lock whose hold time is roughly that of generating 4 random numbers (expensive, CPU-wise).
Therefore as you can see from the code that follows, I use an AtomicInteger "counting semaphore" to prevent a large number of threads from entering the critical section at the same time on acquire(), returning -1 (not available right now) if there are too many threads.
Currently, I am using a constant of 100 for the tight CAS loop to try to increment the atomic integer in acquire, and a constant of 10 for the maximum number of threads to then allow into the critical section, which is long enough to create contention. My question is, what should these constants be for a moderate to highly loaded servlet engine that has several threads trying to get access to these 4K chunks?
public class ResourceSet {
    // ??? what should this be
    // maximum number of attempts to try to increment with CAS on acquire
    private static final int CAS_MAX_ATTEMPTS = 50;

    // ??? what should this be
    // maximum number of threads contending for lock before returning -1 on acquire
    private static final int CONTENTION_MAX = 10;

    private AtomicInteger latch = new AtomicInteger(0);

    ... member variables to track free resources

    private boolean acquireLatchForAcquire()
    {
        for (int i = 0; i < CAS_MAX_ATTEMPTS; i++) {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet"); // this would mean more threads than can exist on any system, so it's a bug!
            if (!latch.compareAndSet(val, val + 1))
                continue;
            if (val < 0 || val >= CONTENTION_MAX) {
                latch.decrementAndGet();
                // added to fix BUG that a comment pointed out, thanks!
                return false;
            }
            return true; // CAS succeeded within the contention limit
        }
        return false;
    }

    private void acquireLatchForRelease()
    {
        do {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet"); // this would mean more threads than can exist on any system, so it's a bug!
            if (latch.compareAndSet(val, val + 1))
                return;
        } while (true);
    }

    public ResourceSet(int totalResources)
    {
        ... initialize
    }

    public int acquire(ResourceTracker owned)
    {
        if (!acquireLatchForAcquire())
            return -1;
        try {
            synchronized (this) {
                ... algorithm to compute minimum free resource or return -1 if all in use
                return resourceindex;
            }
        } finally {
            latch.decrementAndGet();
        }
    }

    public boolean release(ResourceIter iter)
    {
        acquireLatchForRelease();
        try {
            synchronized (this) {
                ... iterate and release all resources
            }
        } finally {
            latch.decrementAndGet();
        }
    }
}
Writing a good and performant spinlock is actually pretty complicated and requires a good understanding of memory barriers. Merely picking a constant is not going to cut it and will definitely not be portable. Google's gperftools has an example that you can look at, but it is probably way more complicated than what you'd need.
If you really want to reduce contention on the lock, you might want to consider using a more fine-grained and optimistic scheme. A simple one could be to divide your chunks into n groups and associate a lock with each group (also called lock striping; see the sketch below). This will help reduce contention and increase throughput, but it won't help reduce latency. You could also associate an AtomicBoolean with each chunk and CAS to acquire it (retrying in case of failure). Do be careful when dealing with lock-free algorithms, because they tend to be tricky to get right. If you do get it right, it could considerably reduce the latency of acquiring a chunk.
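The striping part could be as simple as this (StripedChunkLocks and the modulo mapping are just one illustrative choice):

final class StripedChunkLocks {
    private final Object[] stripes;

    StripedChunkLocks(int stripeCount) {
        stripes = new Object[stripeCount];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new Object();
        }
    }

    // map a chunk index to the lock guarding its group
    Object stripeFor(int chunkIndex) {
        return stripes[chunkIndex % stripes.length];
    }
}

// usage: threads working on chunks in different stripes do not contend
// synchronized (locks.stripeFor(chunk)) { ... acquire or release that chunk ... }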
Note that it's difficult to propose a more fine-grained approach without knowing what your chunk selection algorithm looks like. I also assume that you really do have a performance problem (it's been profiled and everything).
While I'm at it: your spinlock implementation is flawed. You should never spin directly on a CAS, because you're spamming memory barriers. This will be incredibly slow with any serious amount of contention (related to the thundering-herd problem). At a minimum, first check the variable for availability before your CAS (a simple if on a plain read will do). Even better would be to not have all your threads spinning on the same value. This should avoid the associated cache line ping-ponging between your cores.
Note that I don't know what type of memory barriers are associated with atomic ops in Java so my above suggestions might not be optimal or correct.
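With that caveat, a sketch of the check-before-CAS (test-and-test-and-set) idea in Java might look like this (SpinGate is a made-up name):

import java.util.concurrent.atomic.AtomicBoolean;

final class SpinGate {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        for (;;) {
            while (locked.get()) {
                // spin on a read; no CAS, so no write traffic on the cache line
            }
            if (locked.compareAndSet(false, true)) {
                return; // acquired
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}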
Finally, The Art Of Multiprocessor Programming is a fun book to read to get better acquainted with all the non-sense I've been spewing in this answer.
I'm not sure it's necessary to forge your own Lock class for this scenario. The JDK provides ReentrantLock, which also leverages CAS instructions during lock acquisition. Its performance should be pretty good compared with a hand-rolled Lock class.
You can use Semaphore's tryAcquire method if you want your threads to balk when no resource is available.
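For instance, a Semaphore could replace the hand-rolled latch as the admission gate; this is only a sketch, and it keeps the placeholders from the question's code:

import java.util.concurrent.Semaphore;

private final Semaphore gate = new Semaphore(CONTENTION_MAX);

public int acquire(ResourceTracker owned) {
    if (!gate.tryAcquire()) {
        return -1; // balk: too many threads contending right now
    }
    try {
        synchronized (this) {
            // ... compute the minimum free resource, or -1 if all are in use ...
            return resourceindex;
        }
    } finally {
        gate.release();
    }
}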
I for one would simply substitute your synchronized keyword with a ReentrantLock and use the tryLock() method on it. If you want to let your threads wait a bit, you can use tryLock(timeout) on the same class. Which one to choose, and what value to use for the timeout, needs to be determined by a performance test.
Creating an explicit gate, as you seem to be doing, seems unnecessary to me. I'm not saying that it can never help, but IMO it's more likely to actually hurt performance, and it's an added complication for sure. So unless you have a performance issue around here (based on a test you did) and you found that this kind of gating helps, I'd recommend going with the simplest implementation.
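For illustration, acquire() with a tryLock-based balk might look roughly like this (a sketch keeping the placeholders from the question's code):

import java.util.concurrent.locks.ReentrantLock;

private final ReentrantLock lock = new ReentrantLock();

public int acquire(ResourceTracker owned) {
    // balk immediately; the timed variant lock.tryLock(timeout, unit) waits a bit instead
    if (!lock.tryLock()) {
        return -1;
    }
    try {
        // ... compute the minimum free resource, or -1 if all are in use ...
        return resourceindex;
    } finally {
        lock.unlock();
    }
}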
I've got a gigantic Trove map and a method that I need to call very often from multiple threads. Most of the time this method shall return true. The threads are doing heavy number crunching, and I noticed that there was some contention due to the following method (it's just an example; my actual code is a bit different):
synchronized boolean containsSpecial() {
    return troveMap.contains(key);
}
Note that it's an "append only" map: once a key is added, is stays in there forever (which is important for what comes next I think).
I noticed that by changing the above to:
boolean containsSpecial() {
    if (troveMap.contains(key)) {
        // most of the time (>90%) we shall pass here, dodging lock-acquisition
        return true;
    }
    synchronized (this) {
        return troveMap.contains(key);
    }
}
I get a 20% speedup on my number crunching (verified over lots of runs, with long running times, etc.).
Does this optimization look correct (knowing that once a key is there it shall stay there forever)?
What is the name for this technique?
EDIT
The code that updates the map is called far less often than the containsSpecial() method and looks like this (I've synchronized the entire method):
synchronized void addSpecialKeyValue(key, value) {
    ....
}
This code is not correct.
Trove doesn't handle concurrent use itself; it's like java.util.HashMap in that regard. So, like HashMap, even seemingly innocent, read-only methods like containsKey() could throw a runtime exception or, worse, enter an infinite loop if another thread modifies the map concurrently. I don't know the internals of Trove, but with HashMap, rehashing when the load factor is exceeded, or removing entries can cause failures in other threads that are only reading.
If the operation takes a significant amount of time compared to lock management, using a read-write lock to eliminate the serialization bottleneck will improve performance greatly. In the class documentation for ReentrantReadWriteLock, there are "Sample usages"; you can use the second example, for RWDictionary, as a guide.
In this case, the map operations may be so fast that the locking overhead dominates. If that's the case, you'll need to profile on the target system to see whether a synchronized block or a read-write lock is faster.
Either way, the important point is that you can't safely remove all synchronization, or you'll have consistency and visibility problems.
It's called wrong locking ;-) Actually, it is some variant of the double-checked locking approach. And the original version of that approach is just plain wrong in Java.
Java threads are allowed to keep private copies of variables in their local memory (think: core-local cache of a multi-core machine). Any Java implementation is allowed to never write changes back into the global memory unless some synchronization happens.
So it is entirely possible that one of your threads has a local memory in which troveMap.contains(key) evaluates to true. Therefore it never synchronizes, and it never gets the updated memory.
Additionally, what happens when contains() sees an inconsistent memory of the troveMap data structure?
Look up the Java memory model for the details. Or have a look at this book: Java Concurrency in Practice.
This looks unsafe to me. Specifically, the unsynchronized calls will be able to see partial updates, either due to memory visibility (a previous put not getting fully published, since you haven't told the JMM it needs to be) or due to a plain old race. Imagine if TroveMap.contains has some internal variable that it assumes won't change during the course of contains. This code lets that invariant break.
Regarding the memory visibility, the problem with that isn't false negatives (you use the synchronized double-check for that), but that trove's invariants may be violated. For instance, if they have a counter, and they require that counter == someInternalArray.length at all times, the lack of synchronization may be violating that.
My first thought was to make troveMap's reference volatile, and to re-write the reference every time you add to the map:
synchronized (this) {
    troveMap.put(key, value);
    troveMap = troveMap; // volatile write of the reference publishes the map's new state
}
That way, you're setting up a memory barrier such that anyone who reads the troveMap will be guaranteed to see everything that had happened to it before its most recent assignment -- that is, its latest state. This solves the memory issues, but it doesn't solve the race conditions.
Depending on how quickly your data changes, maybe a Bloom filter could help? Or some other structure that's more optimized for certain fast paths?
Under the conditions you describe, it's easy to imagine a map implementation for which you can get false negatives by failing to synchronize. The only way I can imagine obtaining false positives is an implementation in which key insertions are non-atomic and a partial key insertion happens to look like another key you are testing for.
You don't say what kind of map you have implemented, but the stock map implementations store keys by assigning references. According to the Java Language Specification:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.
If your map implementation uses object references as keys, then I don't see how you can get in trouble.
EDIT
The above was written in ignorance of Trove itself. After a little research, I found the following post by Rob Eden (one of the developers of Trove) on whether Trove maps are concurrent:
Trove does not modify the internal structure on retrievals. However, this is an implementation detail not a guarantee so I can't say that it won't change in future versions.
So it seems like this approach will work for now but may not be safe at all in a future version. It may be best to use one of Trove's synchronized map classes, despite the penalty.
I think you would be better off with a ConcurrentHashMap, which doesn't need explicit locking and allows concurrent reads:
boolean containsSpecial() {
    // containsKey, not contains: ConcurrentHashMap's contains(Object) tests values
    return troveMap.containsKey(key);
}

void addSpecialKeyValue(key, value) {
    troveMap.putIfAbsent(key, value);
}
Another option is using a ReadWriteLock, which allows concurrent reads but no concurrent writes:
ReadWriteLock rwlock = new ReentrantReadWriteLock();

boolean containsSpecial() {
    rwlock.readLock().lock();
    try {
        return troveMap.contains(key);
    } finally {
        rwlock.readLock().unlock(); // Lock has unlock(), not release()
    }
}

void addSpecialKeyValue(key, value) {
    rwlock.writeLock().lock();
    try {
        //...
        troveMap.put(key, value);
    } finally {
        rwlock.writeLock().unlock();
    }
}
Why reinvent the wheel?
Simply use ConcurrentHashMap.putIfAbsent
Currently, in a multithreaded environment, we are using a LinkedList to hold data. Sometimes in the logs we get a NoSuchElementException while polling the LinkedList. Please help me understand the performance impact if we move from the LinkedList to the ConcurrentLinkedQueue implementation.
Thanks,
Sachin
When you get a NoSuchElementException, this may be because you are not synchronizing properly.
For example: you're checking with it.hasNext() whether an element is in the list and afterwards trying to fetch it with it.next(). This may fail when the element has been removed in between, and that can also happen when you use synchronized versions of the Collection API.
So your problem cannot really be solved by moving to ConcurrentLinkedQueue. You may not get an exception, but you have to be prepared for null being returned even when you checked beforehand that the queue is not empty. (This is still the same error; only the manifestation differs.) This holds as long as YOUR code does not put the emptiness check and the element retrieval in the SAME synchronized scope.
There is a good chance that you would trade the NoSuchElementException for a new NullPointerException.
This may not be an answer directly addressing your question about performance, but citing NoSuchElementException in LinkedList as a reason to move to ConcurrentLinkedQueue sounds a bit strange.
Edit
Some pseudo-code for broken implementations:
// list is a LinkedList
if (!list.isEmpty()) {
    ... list.getFirst()
}
Some pseudo-code for proper sync:
// list is a LinkedList
synchronized (list) {
    if (!list.isEmpty()) {
        ... list.getFirst()
    }
}
Some code for "broken" sync (does not work as intended).
This maybe the result of directly switching from LinkedList to CLQ in the hope of getting rid of synchronization on your own.
// queue is an instance of CLQ
if (!queue.isEmpty()) { // Does not really make sense, because ...
    ... queue.poll() // May return null! Good chance for an NPE here!
}
Some proper code:
// queue is an instance of CLQ
element = queue.poll();
if (element != null) {
    ...
}
or
// queue is an instance of CLQ
synchronized (queue) {
    if (!queue.isEmpty()) {
        ... queue.poll() // is not null
    }
}
ConcurrentLinkedQueue [is] an unbounded, thread-safe, FIFO-ordered queue. It uses a linked structure, similar to those we saw in Section 13.2.2 as the basis for skip lists, and in Section 13.1.1 for hash table overflow chaining. We noticed there that one of the main attractions of linked structures is that the insertion and removal operations implemented by pointer rearrangements perform in constant time. This makes them especially useful as queue implementations, where these operations are always required on cells at the ends of the structure, that is, cells that do not need to be located using the slow sequential search of linked structures.
ConcurrentLinkedQueue uses a CAS-based wait-free algorithm, that is, one that guarantees that any thread can always complete its current operation, regardless of the state of other threads accessing the queue. It executes queue insertion and removal operations in constant time, but requires linear time to execute size. This is because the algorithm, which relies on co-operation between threads for insertion and removal, does not keep track of the queue size and has to iterate over the queue to calculate it when it is required.
From Java Generics and Collections, ch. 14.2.
Note that ConcurrentLinkedQueue does not implement the List interface, so it suffices as a replacement for LinkedList only if the latter was used purely as a queue. In that case, ConcurrentLinkedQueue is obviously the better choice. There should be no big performance issue unless its size is frequently queried. But, as a disclaimer, you can only be sure about performance by measuring it in your own concrete environment and program.
I have a multithreaded app writing and reading a ConcurrentLinkedQueue, which is conceptually used to back entries in a list/table. I originally used a ConcurrentHashMap for this, which worked well. A new requirement required tracking the order entries came in, so they could be removed in oldest first order, depending on some conditions. ConcurrentLinkedQueue appeared to be a good choice, and functionally it works well.
A configurable amount of entries are held in memory, and when a new entry is offered when the limit is reached, the queue is searched in oldest-first order for one that can be removed. Certain entries are not to be removed by the system and wait for client interaction.
What appears to be happening is that I have an entry at the front of the queue that occurred, say, 100K entries ago. The queue appears to hold only the configured number of entries (size() == 100), but when profiling, I found ~100K ConcurrentLinkedQueue$Node objects in memory. This appears to be by design: just glancing at the source for ConcurrentLinkedQueue, a remove merely nulls the reference to the stored object but leaves the linked node in place for iteration.
Finally, my question: is there a "better" lazy way to handle a collection of this nature? I love the speed of ConcurrentLinkedQueue; I just can't afford the unbounded leak that appears to be possible in this case. If not, it seems I'd have to create a second structure to track order, which may have the same issues, plus a synchronization concern.
What is actually happening here is that the remove method merely prepares the node so that a later polling thread will null out the linked reference.
The ConcurrentLinkedQueue is a non-blocking, thread-safe Queue implementation. However, when you poll a Node from the Queue, it is a two-step process: first you null the value, then you null the reference. A CAS operation is a single atomic function that cannot resolve a poll in one step.
What happens when you poll is that the first thread that succeeds gets the value of the node and nulls it out; that thread will then try to null the reference. It is possible another thread will then come in and try to poll from the queue. To ensure this Queue keeps its non-blocking property (that is, the failure of one thread will not lead to the failure of another thread), that new incoming thread checks whether the value is null; if it is, the thread nulls the reference and tries again to poll().
So what you see happening here is that the removing thread simply leaves it to any new polling thread to null the reference. Achieving a non-blocking remove function is, I would think, nearly impossible, because it would require three atomic steps: nulling the value, nulling the reference to said node, and finally setting the new reference from that node's parent to its successor.
To answer your last question: there is unfortunately no better way to implement remove and maintain the non-blocking state of the queue, at least at this point. Once processors start coming out with 2- and 3-way CAS, that will become possible.
The queue's main semantics are add/poll. If you use poll() on the ConcurrentLinkedQueue, it will be cleaned up as it should. Based on your description, poll() should give you the oldest entry. Why not use it instead of remove()?
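A sketch of poll()-based eviction (Entry and isRemovable() are made-up names); note that re-offering retained entries moves them to the tail, so strict oldest-first order is not preserved:

import java.util.Queue;

Entry evictOldestRemovable(Queue<Entry> queue) {
    int passes = queue.size(); // bounded pass; size() is O(n) on CLQ, so cache it
    for (int i = 0; i < passes; i++) {
        Entry e = queue.poll(); // actually unlinks the head node, unlike remove(Object)
        if (e == null) {
            return null; // queue drained concurrently
        }
        if (e.isRemovable()) {
            return e; // evict this one
        }
        queue.offer(e); // keep it, at the cost of moving it to the tail
    }
    return null; // nothing evictable right now
}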
Looking at the source code for 1.6.0_29, it seems that CLQ's iterator was modified to try to remove nodes with null items. Instead of:
p = p.getNext();
The code is now:
Node<E> next = succ(p);
if (pred != null && next != null)
    pred.casNext(p, next);
p = next;
This was added as part of the fix for bug: http://bugs.sun.com/view_bug.do?bug_id=6785442
Indeed when I try the following I get an OOME with the old version but not with the new one:
Queue<Integer> queue = new ConcurrentLinkedQueue<Integer>();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 100000; j++)
    {
        queue.add(j);
    }
    boolean start = true;
    for (Iterator<Integer> iter = queue.iterator(); iter.hasNext(); )
    {
        iter.next();
        if (!start)
            iter.remove();
        start = false;
    }
    System.out.println(i);
}