Performance comparison between compare-and-swap and blocking algorithm

Performance comparison between compare-and-swap and blocking algorithm - java

I have a ConcurrentLinkedQueue that I use as the underlying datastructure. On every put call, I add a unique incremented value to the list. I have both the synchronized and compare-and-swap versions of this method. When I have few threads (e.g., 5) and doing 10 million puts in all, I see that synchronized version works much better. When I have many threads (e.g., 2000) and do the same number of puts in total, I see that CAS works much better. Why does CAS underperform in comparison to blocking algorithm with fewer threads?
// AtomicReference<Foo> latestValue that is initialized
public void put(Double value) {
Foo currentValue;
while (true) {
currentValue = latestValue.get();
Foo newValue = new Foo(value);
if (latestValue.compareAndSet(currentValue, newValue)) {
historyList.add(newValue);
return;
}
}
}
Statistics
NON-BLOCKING
Threads 2000
Puts per thread 10000
Put time average 208493309
BLOCKING
Threads 2000
Puts per thread 10000
Put time average 2370823534
NON-BLOCKING
Threads 2
Puts per thread 10000000
Put time average 13117487385
BLOCKING
Threads 2
Puts per thread 10000000
Put time average 4201127857

TL;DR because in uncontended case JVM will optimize synchronized and replace it with CAS lock.
In your CAS case you got overhead: you are trying to do some computation even if your CAS will fail. Of course it's nothing in comparison to real mutex acquiring, what usually happens when you are using synchronized.
But JVM isn't stupid and when it can see that lock you are currently acquiring is uncontented, it just replaces real mutex with CAS lock (or even with simple store in case of biased locking).
So for two threads in case of synchronized you are measuring just a CAS, but in case of your own CAS implementation you're also measuring time for allocating Foo instance, for compareAndSet and for get().
For 2000 threads JVM doesn't perform CAS-optimization, so your implementation outperforms mutex acquiring as expected.

Related

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads my throughput is much lower (75% lower) then single-threaded for obvious reasons (e.g. concurrent CAS).
Use case:
A counter variable which
is updated by multiple threads concurrently
has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
So I tried to replace AtomicLong with a LongAdder, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).
However I'm not sure I correctly understand the way LongAdder works.
The JavaDoc says:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
and for sum()
Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.
What is meant by fine-grained synchronization control ...
From looking at this so question and the source of AtomicLong and Striped64, I think I understand that if the update on an AtomicLong is blocked because of a CAS instruction issued by another thread, the update is stored thread-local and accumulated later to get some eventual consistency. So without further synchronization and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:
private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)
If this is possible, AtomicLong is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".
And then with my write/read^n pattern I probably can't do better then AtomicLong?

LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.
In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:
the incrementAndGet in LongAdder is not atomic
... but I don't see incrementAndGet at all in LongAdder. Where are you looking?
Your idea of:
usage in a thread will consist of a w rite with an immediate read afterwards
Requirement is that each read
from the counter (immediately after the writing) gets a unique
incremented value. It is not required that each retrieved counter
value is increasing in the same order as the different
threads(writers) increment the value.
Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.
So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().
Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transition the object though a specific series of values, and where there is no guarantee on what specific value a thread will see, all the values should come from the true set of transition values.
For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respetively. The final value will always be 4, and only two possible transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3 and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all, they may all see 0 or 4.
LongCounter doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a not-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2" for example.
Now that's a bit of a subtle concept and the Javadoc doesn't get it across well. They go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions) since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongCounter.sum() have pretty much the same guarantees.
Something Useful
OK, so I'll give you something that might be useful. You can still implement what you want for efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order they were read.
Re-purpose the LongAdder Idea
You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID2, and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.
Now way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value you could use the same code except with the top or bottom N bits3 set uniquely per counter.
Now the write can return the prior value, like getAndIncrement and it will be unique because the fixed bits keep it unique among all counters in that object.
Thread-local Counters
A very fast and simple way is to use a unique value per thread, and a thread-local counter. When the thread local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24-bits for the ID, and the top 40-bits for the local counter1. This should be very fast, and more importantly essentially zero contention.
The downside is that the values of the counters won't have any specific relationship among threads (although they may still be strictly increasing within a thread). For example, a thread which has recently requested a counter value may get a much smaller one than a long existing value. You haven't described how you'll use these so I don't know if it is a problem.
Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).
A different solution, if you want the numbers to be "generally increasing with time" across threads, and know that every thread requests counter values reasonably frequently, is to use a single global counter, which threads request a local "allocation" of a number of IDs, from which it will then allocate individual IDs in a thread-local manner. For example, threads may request 10 IDs, so that three threads will be allocated the range 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted and which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-local.
The example above will keep the IDs roughly in increasing order over time, and each threads IDs will be strictly increasing as well. It doesn't offer any strict guarantees though: a thread that is allocated the range 0-9, could very well sleep for hours after using 0, and then use "1" when the counters on other threads are much higher. It would reduce contention by a factor of 10.
There are a variety of other approaches you could use and mostof them trade-off contention reduction versus the "accuracy" of the counter assignment versus real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to realtime (assuming the OS is synchronizing the counters).
1 The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits to the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).
2 The optimal is to use the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.
3 Where N is lg2(number of counters).

There's a subtle difference between the two implementations.
An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.
A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.
LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case, I wrote the following which confirmed that around 1 in 10 sums were be repeated (which renders LongAdder unusable for your use case).
public static void main (String[] args) throws Exception
{
LongAdder adder = new LongAdder();
ExecutorService executor = Executors.newFixedThreadPool(10);
Map<Long, Integer> count = new ConcurrentHashMap<>();
for (int i = 0; i < 10; i++)
{
executor.execute(() -> {
for (int j = 0; j < 1000000; j++)
{
adder.add(1);
count.merge(adder.longValue(), 1, Integer::sum);
}
});
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.HOURS);
count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
}

Java - Synchronized methods causes program to slow down massively

I'm trying to learn about threads and synchronization. I made this test program:
public class Test {
static List<Thread> al = new ArrayList<>();
public static void main(String[] args) throws IOException, InterruptedException {
long startTime = System.currentTimeMillis();
al.add(new Thread(() -> fib1(47)));
al.add(new Thread(() -> fib2(47)));
for (Thread t : al)
t.start();
for (Thread t: al)
t.join();
long totalTime = System.currentTimeMillis() - startTime;
System.out.println(totalTime);
}
public static synchronized int fib1(int x) {
return x <= 2 ? 1 : fib1(x-2) + fib1(x-1);
}
public static synchronized int fib2(int x) {
return x <= 2 ? 1 : fib2(x-2) + fib2(x-1);
}
}
This program takes around 273 seconds to finish, but if I remove both of the synchronized it runs in 7 seconds instead. What causes this massive difference?
EDIT:
I'm aware that I'm using a terribly slow algorithm for calculating fibonacci numbers. And I'm also aware that the threads don't share resources and thus the methods don't need to be synchronized. However, this is just a test program where I'm trying to figure out how synchronized works and I choose a slow algorithm on purpose so I could measure time taken in milliseconds.

Your program does not get stuck - it's just terribly slow.
This is due to two reasons:
1. Algorithm Complexity
As others and youself have mentioned, the way you compute the Fibonacci numbers is really slow because it computes the same values over and over again. Using a smaller input will bring down the runtime to a reasonable value. But this is not what your question is about.
2. Synchronized
This slows down your program in 2 ways:
First of all, making the methods synchronized is not necessary since they do not modify anything outside of the method itself. In fact it prevents both threads from running at the same time as the methods are static therefore preventing two thread from being in either of them at the same time.
So your code is effectively using only one thread, not two.
Also synchronized adds a significant overhead to the methods since it requires acquiring a lock when entering the method - or at least checking whether the current thread already possesses the lock.
These operations are quite expensive and they have to be done every single time one of the methods is entered. Since - due to the recursion - this happens a lot, it has an extreme impact on the program performance.
Interestingly the performance is much better when you run it with just a single thread - even with the methods being synchronized.
The reason is the runtime optimizations done by the JVM.
If you are using just one thread, the JVM can optimize the synchronized checks away since there cannot be a conflict. This reduces the runtime significantly - but not exactly to the value that it would have without synchronized due to starting with 'cold code' and some remaining runtime checks.
When running with 2 threads on the other hand, the JVM cannot do this optimization, therefore leaving the expensive synchronized operations that cause the code to be so terribly slow.
Btw: fib1 and fib2 are identical, delete one of them

When you put static synchronized on a method that means that, in order for a thread to execute that method, it first has to acquire the lock for the class (which here is Test). The two static fib methods use the same lock. One thread gets the lock, executes the fib method, and releases the lock, then the other thread gets to execute the method. Which thread gets the lock first is up to the OS.
It was already mentioned the locks are re-entrant and there's no problem with calling a synchronized method recursively. The thread holds the lock from the time it first calls the fib method, that call doesn't complete until all the recursive calls have completed, so the method runs to completion before the thread releases the lock.
The main thread isn't doing anything but waiting, and only one of the threads calling a fib method can run at a time. It does make sense that removing the synchronized modifier would speed up things, without locking the two threads can run concurrently, possibly using different processors.
The methods do not modify any shared state so there's no reason to synchronize them. Even if they did need to be synchronized there would still be no reason to have two separate fib methods here, because in any case invoking either the fib1 or fib2 method requires acquiring the same lock.
Using synchronized without static means that the object instance, not the class, is used as the lock. The reason that all the synchronized methods use the same lock is that the point is to protect shared state, an object might have various methods that modify the object's internal state, and to protect that state from concurrent modifications no more than one thread should be executing any one of these methods at a time.

Your program is not deadlocked, and it also isn't appreciably slower because of unnecessary synchronization. Your program appears "stuck" because of the branching factor of your recursive function.
Branching Factor of Recursion
When N >= 4, you recurse twice. In other words, on average, your recursion has a branching factor of two, meaning if you are computing the N-th Fibonacci number recursively, you will call your function about 2^N times. 2^47 is a HUGE number (like, in the hundreds of trillions). As others have suggested, you can cut this number WAY down by saving intermediate results and returning them instead of recomputing them.
More on synchronization
Acquiring locks is expensive. However, in Java, if a thread has a lock and re-enters the same synchronized block that it already owns the lock for, it doesn't have to reacquire the lock. Since each thread already owns the respective lock for each function they enter, they only have to acquire one lock apiece for the duration of your program. The cost of acquiring one lock is weensy compared to recursing hundreds of trillions of times :)

#MartinS is correct that synchronized is not necessary here because you have no shared state. That is, there is no data that you are trying to prevent being accessed concurrently by multiple threads.
However, you are slowing your program down by the addition of the synchronized call. My guess is that without synchronized, you should see two cores spinning at 100% for however long it takes to compute this method. When you add synchronized, whichever thread grabs the lock first gets to spin at 100%. The other one sits there waiting for the lock. When the first thread finishes, the second one gets to go.
You can test this by timing your program (start with smaller values to keep it to a reasonable time). The program should run in approximately half the time without synchronized than it does with.

When the fib1 (or fib2) method recurs, it doesn't release the lock. More over, it acquires the lock again (it is faster than initial locking).
Good news is that synchronized methods in Java are reentrant.
You are better not to synchronize the recursion itself.
Split your recursive methods into two:
one recursive non-synchronized method (it should be private as it is not thread-safe);
one public synchronized method without recursion per se, which calls the second method.
Try to measure such code, you should get 14 seconds, because both threads synchronize on the same lock Test.class.

The issue you see is because a static synchronized method synchronizes on the Class. So your two Threads spend an extraordinary amount of time fighting over the single lock on Test.class.
For the purposes of this learning exercise, the best way to speed it up would be to create two explicit lock objects. In Test, add
static final Object LOCK1 = new Object();
static final Object LOCK2 = new Object();
and then, in fib1() and fib2(), use a synchronized block on those two objects. e.g.
public static int fib1(int x) {
synchronized(LOCK1) {
return x <= 2 ? 1 : fib1(x-2) + fib1(x-1);
}
}
public static int fib2(int x) {
synchronized(LOCK2) {
return x <= 2 ? 1 : fib2(x-2) + fib2(x-1);
}
}
Now the first thread only needs to grab LOCK1, with no contention, and the second thread only grabs LOCK2, again, with no contention. (So long as you only have those two threads) This should run only slightly slower than the completely unsynchronized code.

Is this incrementAndGet thread-safe? It seems to pull object from eh cache

This servlet seems to fetch an object from ehCache, from an Element which has this object: http://code.google.com/p/adwhirl/source/browse/src/obj/HitObject.java?repo=servers-mobile
It then goes on to increment the counter which is an atomic long:
http://code.google.com/p/adwhirl/source/browse/src/servlet/MetricsServlet.java?repo=servers-mobile#174
//Atomically record the hit
if(i_hitType == AdWhirlUtil.HITTYPE.IMPRESSION.ordinal()) {
ho.impressions.incrementAndGet();
}
else {
ho.clicks.incrementAndGet();
}
This doesn't seem thread-safe to me as multiple threads could be fetching from the cache and if both increment at the same time you might loose a click/impression count.
Do you agree that this is not thread-safe?

AtomicLong and AtomicInteger use a CAS internally -- compare and set (or compare-and-swap). The idea is that you tell the CAS two things: the value you expect the long/int to have, and the value you want to update it to. If the long/int has the value you say it should have, the CAS will atomically make the update and return true; otherwise, it won't make the update, and it'll return false. Many modern chips support CAS very efficiently at the machine-code level; if the JVM is running in an environment that doesn't have a CAS, it can use mutexes (what Java calls synchronization) to implement the CAS. Regardless, once you have a CAS, you can safely implement an atomic increment via this logic (in pseudocode):
long incrementAndGet(atomicLong, byIncrement)
do
oldValue = atomicLong.get() // 1
newValue = oldValue + byIncrement
while ! atomicLong.cas(oldValue, newValue) // 2
return newValue
If another thread has come in and does its own increment between lines // 1 and // 2, the CAS will fail and the loop will try again. Otherwise, the CAS will succeed.
There's a gamble in this kind of approach: if there's low contention, a CAS is faster than a synchronized block isn't as likely to cause a thread context switch. But if there's a lot of contention, some threads are going to have to go through multiple loop iterations per increment, which obviously amounts to wasted work. Generally speaking, the incrementAndGet is going to be faster under most common loads.

The increment is thread safe since AtomicInteger and family guarantee that. But there is a problem with the insertion and fetching from the cache, where two (or more) HitObject could be created and inserted. That would cause potentially losing some hits on the first time this HitObject is accessed. As #denis.solonenko has pointed, there is already a TODO in the code to fix this.
However I'd like to point out that this code only suffers from concurrency on first accessing a given HitObject. Once you have the HitObject in the cache (and there are no more threads creating or inserting the HitObject) then this code is perfectly thread-safe. So this is only a very limited concurrency problem, and probably that's the reason they have not yet fixed it.

Synchronized code performs faster than unsynchronized one

I came out with this stunning result which i absolutely do not know the reason for:
I have two methods which are shortened to:
private static final ConcurrentHashMap<Double,Boolean> mapBoolean =
new ConcurrentHashMap<Double, Boolean>();
private static final ConcurrentHashMap<Double,LinkedBlockingQueue<Runnable>> map
= new ConcurrentHashMap<Double, LinkedBlockingQueue<Runnable>>();
protected static <T> Future<T> execute(final Double id, Callable<T> call){
// where id is the ID number of each thread
synchronized(id)
{
mapBoolean.get();// then do something with the result
map.get();//the do somethign with the result
}
}
protected static <T> Future<T> executeLoosely(final Double id, Callable<T> call){
mapBoolean.get();// then do something with the result
map.get();//the do somethign with the result
}
}
On profiling with over 500 threads, and each thread calling each of the above methods 400 times each, I found out that execute(..) performs atleast 500 times better than executeLoosely(..) which is weird because executeLoosely is not synchronized and hence more threads can process the code simultaneously.
Any reasons??

The overhead of using 500 threads on a machine which I assume doesn't have 500 cores, using tasks which takes about 100-1000x as long as a lookup on a Map to execute code which the JVM could detect doesn't do anything, is likely to produce a random outcome. ;)
Another problem you could have is that a test which faster being performed with one thread can benefit from using synchronized because it biases access to one thread. i.e. it turns your multi-threaded test back into a single threaded one which is the fastest in the first place.
You should compare the timings you get with a single thread doing a loop. If this is faster (which I believe it would be) then its not a useful multi-threaded test.
My guess is that you are running the synchronized code after the unsynchronised code. i.e. after the JVM has warmed up a little. Swap the order you perform these tests and run them many times and you will get different results.

In the non synchronized scenario :
1) wait to acquire lock on a segment of the map, lock, perform operation on the map, unlock, wait to acquire lock on a segment of the other map, lock, perform operation on the other map, unlock.
The segment level locking will be performed only in cases of concurrent write to the segment which doesn't look to be the case in your example.
In the synchronized scenario :
1) wait to lock, perform both the operations, unlock.
The time taken for context switching can have an impact? How many cores does the machine running the test have?
How are the maps structured, same sort of keys?

In Java what is the performance of AtomicInteger compareAndSet() versus synchronized keyword?

I was implementing a FIFO queue of requests instances (preallocated request objects for speed) and started with using the "synchronized" keyword on the add method. The method was quite short (check if room in fixed size buffer, then add value to array). Using visualVM it appeared the thread was blocking more often than I liked ("monitor" to be precise). So I converted the code over to use AtomicInteger values for things such as keeping track of the current size, then using compareAndSet() in while loops (as AtomicInteger does internally for methods such as incrementAndGet()). The code now looks quite a bit longer.
What I was wondering is what is the performance overhead of using synchronized and shorter code versus longer code without the synchronized keyword (so should never block on a lock).
Here is the old get method with the synchronized keyword:
public synchronized Request get()
{
if (head == tail)
{
return null;
}
Request r = requests[head];
head = (head + 1) % requests.length;
return r;
}
Here is the new get method without the synchronized keyword:
public Request get()
{
while (true)
{
int current = size.get();
if (current <= 0)
{
return null;
}
if (size.compareAndSet(current, current - 1))
{
break;
}
}
while (true)
{
int current = head.get();
int nextHead = (current + 1) % requests.length;
if (head.compareAndSet(current, nextHead))
{
return requests[current];
}
}
}
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc), even though the code is shorter.
Thanks!

My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc)
Yes, in the common case you are right. Java Concurrency in Practice discusses this in section 15.3.2:
[...] at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks. This is because a lock reacts to contention by suspending threads, reducing CPU usage and synchronization traffic on the shared memory bus. (This is similar to how blocking producers in a producer-consumer design reduces the load on consumers and thereby lets them catch up.) On the other hand, with atomic variables, contention management is pushed back to the calling class. Like most CAS-based algorithms, AtomicPseudoRandom reacts to contention by trying again immediately, which is usually the right approach but in a high-contention environment just creates more contention.
Before we condemn AtomicPseudoRandom as poorly written or atomic variables as a poor choice compared to locks, we should realize that the level of contention in Figure 15.1 is unrealistically high: no real program does nothing but contend for a lock or atomic variable. In practice, atomics tend to scale better than locks because atomics deal more effectively with typical contention levels.
The performance reversal between locks and atomics at differing levels of contention illustrates the strengths and weaknesses of each. With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance. (CAS-based algorithms also outperform lock-based ones on single-CPU systems, since a CAS always succeeds on a single-CPU system except in the unlikely case that a thread is preempted in the middle of the read-modify-write operation.)
(On the figures referred to by the text, Figure 15.1 shows that the performance of AtomicInteger and ReentrantLock is more or less equal when contention is high, while Figure 15.2 shows that under moderate contention the former outperforms the latter by a factor of 2-3.)
Update: on nonblocking algorithms
As others have noted, nonblocking algorithms, although potentially faster, are more complex, thus more difficult to get right. A hint from section 15.4 of JCiA:
Good nonblocking algorithms are known for many common data structures, including stacks, queues, priority queues, and hash tables, though designing new ones is a task best left to experts.
Nonblocking algorithms are considerably more complicated than their lock-based equivalents. The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency. In linked collection classes such as queues, you can sometimes get away with expressing state transformations as changes to individual links and using an AtomicReference to represent each link that must be updated atomically.

I wonder if jvm already does a few spin before really suspending the thread. It anticipate that well written critical sections, like yours, are very short and complete almost immediately. Therefore it should optimistically busy-wait for, I don't know, dozens of loops, before giving up and suspending the thread. If that's the case, it should behave the same as your 2nd version.
what a profiler shows might be very different from what's realy happending in a jvm at full speed, with all kinds of crazy optimizations. it's better to measure and compare throughputs without profiler.

Before doing this kind of synchronization optimizations, you really need a profiler to tell you that it's absolutely necessary.
Yes, synchronized under some conditions may be slower than atomic operation, but compare your original and replacement methods. The former is really clear and easy to maintain, the latter, well it's definitely more complex. Because of this there may be very subtle concurrency bugs, that you will not find during initial testing. I already see one problem, size and head can really get out of sync, because, though each of these operations is atomic, the combination is not, and sometimes this may lead to an inconsistent state.
So, my advise:
Start simple
Profile
If performance is good enough, leave simple implementation as is
If you need performance improvement, then start to get clever (possibly using more specialized lock at first), and TEST, TEST, TEST

Here's code for a busy wait lock.
public class BusyWaitLock
{
private static final boolean LOCK_VALUE = true;
private static final boolean UNLOCK_VALUE = false;
private final static Logger log = LoggerFactory.getLogger(BusyWaitLock.class);
/**
* #author Rod Moten
*
*/
public class BusyWaitLockException extends RuntimeException
{
/**
*
*/
private static final long serialVersionUID = 1L;
/**
* #param message
*/
public BusyWaitLockException(String message)
{
super(message);
}
}
private AtomicBoolean lock = new AtomicBoolean(UNLOCK_VALUE);
private final long maximumWaitTime ;
/**
* Create a busy wait lock with that uses the default wait time of two minutes.
*/
public BusyWaitLock()
{
this(1000 * 60 * 2); // default is two minutes)
}
/**
* Create a busy wait lock with that uses the given value as the maximum wait time.
* #param maximumWaitTime - a positive value that represents the maximum number of milliseconds that a thread will busy wait.
*/
public BusyWaitLock(long maximumWaitTime)
{
if (maximumWaitTime < 1)
throw new IllegalArgumentException (" Max wait time of " + maximumWaitTime + " is too low. It must be at least 1 millisecond.");
this.maximumWaitTime = maximumWaitTime;
}
/**
*
*/
public void lock ()
{
long startTime = System.currentTimeMillis();
long lastLogTime = startTime;
int logMessageCount = 0;
while (lock.compareAndSet(UNLOCK_VALUE, LOCK_VALUE)) {
long waitTime = System.currentTimeMillis() - startTime;
if (waitTime - lastLogTime > 5000) {
log.debug("Waiting for lock. Log message # {}", logMessageCount++);
lastLogTime = waitTime;
}
if (waitTime > maximumWaitTime) {
log.warn("Wait time of {} exceed maximum wait time of {}", waitTime, maximumWaitTime);
throw new BusyWaitLockException ("Exceeded maximum wait time of " + maximumWaitTime + " ms.");
}
}
}
public void unlock ()
{
lock.set(UNLOCK_VALUE);
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.