Consider this code:
public Object doGet() {
    return getResource();
}

private Object getResource() {
    synchronized (lock) {
        if (cachedResourceIsStale()) {
            downloadNewVersionOfResource();
        }
    }
    return resource;
}
Assuming that doGet will be executed concurrently, and a lot at that, and that downloading the new version of the resource takes a while, are there more efficient ways to do the synchronization in getResource? I know about read/write locks but I don't think they can be applied here.
Why synchronize at all? If the cache goes stale, all threads that access the resource while it is still being refreshed by the first one will execute their own refresh. Among the other problems this causes, it's hardly efficient.
As BalusC mentions in the comments, I'm currently facing this issue in a servlet, but I'm happy with generic answers because who knows in what situation I'll run into it again.
Assumptions
efficient means that doGet() should complete as quickly as possible
cachedResourceIsStale() takes no time at all
downloadNewVersionOfResource() takes a little time
Answer
Synchronizing reduces network load, because only one thread fetches the resource when it expires. Also, it will not unduly delay processing of other threads: since the VM contains no current snapshot the threads could return, they have to block anyway, and there is no reason an additional concurrent downloadNewVersionOfResource() would complete any faster (I'd expect the opposite, due to network bandwidth contention).
So synchronizing is good, and optimal in both bandwidth consumption and response time (the CPU overhead of synchronization is vanishingly small compared to I/O waits) - assuming that a current version of the resource might not be available when doGet() is invoked. If your server always had a current version of the resource, it could send it back without delay; you might achieve that by having a background thread download the new version just before the old one expires.
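For illustration, that background refresh might look roughly like this (a sketch only; RESOURCE_TTL_MS is a placeholder for the resource's known time-to-live, and the sketch reuses the same lock as getResource()):

ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();
refresher.scheduleAtFixedRate(() -> {
    synchronized (lock) {
        if (cachedResourceIsStale()) {   // refresh proactively, off the request path
            downloadNewVersionOfResource();
        }
    }
}, 0, RESOURCE_TTL_MS, TimeUnit.MILLISECONDS);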
PS
You haven't shown any error handling. You'll have to decide whether to propagate exceptions thrown by downloadNewVersionOfResource() to your callers or continue to serve the old version of the resource.
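For example, the serve-the-old-version option could look like this (just a sketch; the logger and the exception type caught are assumptions):

private Object getResource() {
    synchronized (lock) {
        if (cachedResourceIsStale()) {
            try {
                downloadNewVersionOfResource();
            } catch (RuntimeException e) {
                // keep serving the previous version rather than failing the request
                log.warn("Refresh failed, serving stale resource", e);
            }
        }
    }
    return resource;
}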
Edit
So? Let's assume you have 100 connection workers, the check whether the resource is stale takes one microsecond, the resource is not stale, and serving it takes one second. Then, on average, 100 * 10^-6 / 1 = 0.0001 threads are trying to get the lock. Barely any contention at all. And the overhead of acquiring an uncontended lock is on the order of 10^-8 seconds. There is no point optimizing things that already take microseconds when the network will cause delays of milliseconds. And if you don't believe me, do a microbenchmark for synchronization. It is true that frequent, needless synchronization adds significant overhead, and that the synchronized collection classes were deprecated for that reason. But that's because those methods do very little work per invocation, so the relative overhead of synchronization was a lot bigger. I just did a small microbenchmark for the following code:
synchronized (lock) {
    c++;
}
On my notebook, this takes 50 nanoseconds (5 * 10^-8 seconds) averaged over 10 million executions on Sun's HotSpot VM. That is about 20 times as long as the naked increment operation, so if one does a lot of increments, synchronizing each of them will slow the program by an order of magnitude. If, however, that method did blocking I/O, waiting, say, 1 ms, adding those same 50 nanoseconds would reduce throughput by 0.005%. Surely you have better opportunities for performance tuning :-)
This is why you should always measure before starting to optimize. It prevents you from investing hours of your time to save a couple of nanoseconds of processor time.
You might be able to reduce lock contention (and thus improve throughput) by using "lock striping" -- essentially, splitting one lock into several, each lock protecting a particular group of users.
The tricky part is figuring out how to assign users to groups. The simplest case is when you can assign a request from any user to any group. If your data model requires that requests from one user must be processed sequentially, you must introduce some mapping between user requests and groups. Here's a sample implementation of StripedLock:
import java.util.concurrent.locks.ReentrantLock;

/**
 * Striped lock holder, containing an array of {@link java.util.concurrent.locks.ReentrantLock}
 * on which lock/unlock operations are performed. The purpose of this is to decrease lock contention.
 * <p>When a client requests a lock, it passes an integer argument from which the target lock is
 * derived as follows: the index of the lock in the array equals <code>id & (locks.length - 1)</code>.
 * Since <code>locks.length</code> is a power of 2, <code>locks.length - 1</code> is a string of '1' bits,
 * which means that all lower bits of the argument are taken into account.
 * <p>The number of locks it can hold is bounded: it can be any of {2, 4, 8, 16, 32, 64}.
 */
public class StripedLock {
    private final ReentrantLock[] locks;

    /**
     * Default constructor, creates 16 locks.
     */
    public StripedLock() {
        this(4);
    }

    /**
     * Creates the array of locks; the size of the array may be any of {2, 4, 8, 16, 32, 64}.
     * @param storagePower the size of the array will be equal to <code>Math.pow(2, storagePower)</code>
     */
    public StripedLock(int storagePower) {
        if (storagePower < 1 || storagePower > 6)
            throw new IllegalArgumentException("storage power must be in [1..6]");

        int lockSize = (int) Math.pow(2, storagePower);
        locks = new ReentrantLock[lockSize];
        for (int i = 0; i < locks.length; i++)
            locks[i] = new ReentrantLock();
    }

    /**
     * Locks the lock associated with the given id.
     * @param id value from which the lock is derived
     */
    public void lock(int id) {
        getLock(id).lock();
    }

    /**
     * Unlocks the lock associated with the given id.
     * @param id value from which the lock is derived
     */
    public void unlock(int id) {
        getLock(id).unlock();
    }

    /**
     * Maps an integer id to a lock from the locks array.
     * @param id argument
     * @return lock derived from the id
     */
    private ReentrantLock getLock(int id) {
        return locks[id & (locks.length - 1)];
    }
}
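A possible usage sketch (handleRequest and userId are illustrative names; any reasonably even mapping from requests to ids works):

StripedLock stripedLock = new StripedLock(4); // 16 stripes

void handleRequest(int userId) {
    stripedLock.lock(userId);
    try {
        // work on state belonging to this user (or this group of users)
    } finally {
        stripedLock.unlock(userId);
    }
}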
Related
Let's say that in Java I have a method doSomethingAsync(input) that schedules some work via an executor service and returns a CompletableFuture<FooBar>. Let's say that I have a billion (or whatever huge number of) distinct inputs. And let's say I chain the CompletableFuture<FooBar> instances together using thenCombine(), but I don't keep a reference to the previous CompletableFuture<FooBar> instance. Something like this:
CompletableFuture<FooBar> future = doSomethingAsync(0);
for (int i = 1; i < 1_000_000_000; i++) {
future = future.thenCombine(doSomethingAsync(i), (foo, bar) -> bar);
}
future.join();
The interesting thing is that I can then do future.join() to wait until they all finish. And I can set a bound (e.g. 100) on the queue of the executor service inside doSomethingAsync() so that submission blocks when there are too many unfinished tasks in play. That would provide some back-pressure so that I don't run out of memory with all 1,000,000,000 tasks being submitted to the executor service at the same time.
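For illustration, the kind of blocking bound I have in mind inside doSomethingAsync() might be wired up like this (a sketch; the pool size, queue capacity and rejection handler are placeholders, not part of my actual code):

ThreadPoolExecutor executor = new ThreadPoolExecutor(
        4, 4, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(100),            // bounded work queue
        (task, pool) -> {
            try {
                pool.getQueue().put(task);        // block the submitter instead of rejecting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException(e);
            }
        });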
At the end of the process, the logic only has a reference to a single CompletableFuture<> - the final one representing the outcome of the final submission to doSomethingAsync(). But there were one billion of them chained together. Here's the big question: will all those 1,000,000,000 CompletableFuture<> instances stick around in memory until the last one is finished because they were chained using thenCombine(), or will the initial CompletableFuture<> instances be garbage collected after they are completed and the executor service submits the "next" task via thenCombine()?
There's nothing in the public documentation about whether a CompletableFuture can be garbage collected after completion, assuming no other "external" strong references to it exist. However, if you look at the source code then you'll find this comment1:
/*
[...]
* Without precautions, CompletableFutures would be prone to
* garbage accumulation as chains of Completions build up, each
* pointing back to its sources. So we null out fields as soon as
* possible. The screening checks needed anyway harmlessly ignore
* null arguments that may have been obtained during races with
* threads nulling out fields. We also try to unlink non-isLive
* (fired or cancelled) Completions from stacks that might
* otherwise never be popped: Method cleanStack always unlinks non
* isLive completions from the head of stack; others may
* occasionally remain if racing with other cancellations or
* removals.
[...]
*/
And if you look through the implementation, you'll see lines like this one:
src = null; dep = null; fn = null;
So, at least as currently implemented, it looks like a CompletableFuture becomes eligible for garbage collection after it completes, assuming you don't maintain a separate strong reference to it yourself, regardless of the subsequent chain.
1. That link is to the source code tagged jdk-20+8. But it looks like that comment (and associated improvements) was added as part of this commit from September 2014. Perhaps as part of JDK-8056249, which was "fixed" for version 9, but looks to have been backported to Java 8
I'm relatively new to multi-threading, and I am trying to use three different threads in a game I'm creating. One thread performs the back-end updating, another is used for drawing, and the third loads and/or generates new chunks (and will eventually save them when I don't need them). I had the draw and update threads working just fine, but when I added the third thread into the mix, I started to get ConcurrentModificationExceptions. They occur inside my for-each loops, in which I am looping through an ArrayList of chunk objects.
I have tried to lock down when each thread is able to access and modify the chunk ArrayList using a Phaser as follows:
private volatile ArrayList<Chunk> chunks = new ArrayList<Chunk>();
private volatile int chunksStability = 0; //+'ive = # threads accessing, -'ive = # threads editing
private volatile Object chunkStabilityCountLock = new Object();
private volatile Phaser chunkStabilityPhaser = new Phaser() {
    protected boolean onAdvance(int phase, int registeredParties) {
        synchronized (chunkStabilityCountLock) {
            if (registeredParties == 0) {
                chunksStability = 0;
            } else {
                chunksStability = Math.max(Math.min(chunksStability * -1, 1), -1);
            }
        }
        return false;
    }
};

//...

/**
 * Prevents other threads from editing <b>World.chunks</b>.
 * Calling this will freeze the thread if another thread has called <b>World.destabalizeChunks()</b>
 * without calling <b>World.stabalizeChunks()</b>
 */
public void lockEditChunks()
{
    chunkStabilityPhaser.register();
    if (this.chunkStabilityPhaser.getUnarrivedParties() > 1 && this.chunksStability < 0) //number threads currently editing > 0
    {
        this.chunkStabilityPhaser.arriveAndAwaitAdvance(); //wait until threads editing finish
    }
    synchronized (chunkStabilityCountLock)
    {
        ++this.chunksStability;
    }
}

public void unlockEditChunks()
{
    chunkStabilityPhaser.arriveAndDeregister();
}

/**
 * Prevents other threads requiring stability of <b>World.chunks</b> from continuing
 * Calling this will freeze the thread if another thread has called <b>World.lockEditChunks()</b>
 * without calling <b>World.unlockEditChunks()</b>
 */
public void destabalizeChunks()
{
    chunkStabilityPhaser.register();
    if (this.chunkStabilityPhaser.getUnarrivedParties() > 1 && this.chunksStability > 0) //number threads currently editing > 0
    {
        this.chunkStabilityPhaser.arriveAndAwaitAdvance(); //wait until threads editing finish
    }
    synchronized (chunkStabilityCountLock)
    {
        --this.chunksStability;
    }
}

public void stabalizeChunks()
{
    chunkStabilityPhaser.arriveAndDeregister();
}
However, I still haven't had any success. I'm wondering if the reason I am getting a ConcurrentModificationException is that I might be making modifications to the actual Chunk objects. Would this count as modification and result in a ConcurrentModificationException? I do know that I am not performing the modification within the same thread, since the exception is not consistently thrown, leading me to believe that the error only occurs when one thread (I don't know which) reaches a specific point in its execution while another is iterating through the chunks ArrayList.
I know the simple solution would be to stop using the for-each loops and instead perform the loop manually as follows:
for (int i = 0; i < chunks.size(); ++i)
{
    Chunk c = chunks.get(i);
}
However, I am concerned that this will result in occasional twitchy behaviour on screen when chunk objects are shifted around in the ArrayList. I don't want to synchronize access to it entirely across all threads because that would hinder performance, and this may turn out to be a fairly large project, requiring maximum efficiency where possible. Additionally, I don't have any reason to prevent two threads from modifying the chunk ArrayList if they don't use an iterator or require its stability, nor do I have any reason to prevent two threads from iterating through the list simultaneously when nothing is modifying it.
More complete copies of relevant files:
World.java
Chunk.java
WorldBuilder.java
ChunkLoader.java
Ideally, you should make your code so fast that you can load chunks in between frames. You should be able to design this so the pauses take no more than a couple of milliseconds and everything still runs smoothly. That way your users get quickly loaded chunks and you do not have to deal with multithreaded code and chase race conditions.
If it turns out you absolutely have to use threads, keep the mutable state shared between them to a minimum. Ideally you would have two queues, one with load requests and one with loaded chunks, and those two queues should be the only way the threads communicate. Once an object is sent to another thread, the originating thread should no longer use it in any way. This way you can avoid race conditions without adding synchronization.
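For illustration, the two-queue design might look roughly like this (a sketch; ChunkRequest, loadOrGenerate and running are assumed placeholders):

BlockingQueue<ChunkRequest> loadRequests = new LinkedBlockingQueue<>();
BlockingQueue<Chunk> loadedChunks = new LinkedBlockingQueue<>();

// Loader thread: only ever touches the two queues.
while (running) {
    ChunkRequest request = loadRequests.take();  // blocks until work arrives
    loadedChunks.add(loadOrGenerate(request));   // hand the finished chunk back
}

// Game/update thread, once per frame: drain results without blocking.
Chunk chunk;
while ((chunk = loadedChunks.poll()) != null) {
    chunks.add(chunk);   // only this thread ever mutates the chunk list
}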
To more directly answer your question: ConcurrentModificationException occurs only if you modify the collection. Modifying elements stored inside it doesn't affect the list itself.
I highly suspect something is wrong with your synchronization code. It looks needlessly complicated. In its current form only one thread can access chunks at a time; the others have to wait for their turn. A Phaser is definitely unnecessary in this case. This is a job for a simple synchronized block or, at worst, a read-write lock.
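If you do end up sharing the list, a read-write-lock version might look like this (a minimal sketch; the method names are illustrative):

private final List<Chunk> chunks = new ArrayList<>();
private final ReadWriteLock chunksLock = new ReentrantReadWriteLock();

void drawChunks() {
    chunksLock.readLock().lock();    // many readers may iterate at the same time
    try {
        for (Chunk c : chunks) {
            // render c
        }
    } finally {
        chunksLock.readLock().unlock();
    }
}

void addChunk(Chunk c) {
    chunksLock.writeLock().lock();   // writers get exclusive access
    try {
        chunks.add(c);
    } finally {
        chunksLock.writeLock().unlock();
    }
}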
Consider the following situation: We are using a Java 8 parallel stream to perform a parallel forEach loop, e.g.,
IntStream.range(0,20).parallel().forEach(i -> { /* work done here */})
The number of parallel threads is controlled by the system property "java.util.concurrent.ForkJoinPool.common.parallelism" and is usually equal to the number of processors.
Now assume that we would like to limit the number of parallel executions for a specific piece of work - e.g. because that part is memory intensive and a memory constraint implies a limit on parallel executions.
An obvious and elegant way to limit parallel executions is to use a Semaphore (suggested here); e.g., the following piece of code limits the number of parallel executions to 5:
final Semaphore concurrentExecutions = new Semaphore(5);
IntStream.range(0, 20).parallel().forEach(i -> {
    concurrentExecutions.acquireUninterruptibly();
    try {
        /* WORK DONE HERE */
    } finally {
        concurrentExecutions.release();
    }
});
This works just fine!
However: Using any other parallel stream inside the worker (at /* WORK DONE HERE */) may result in a deadlock.
For me this is an unexpected behavior.
Explanation: Since Java streams use a ForkJoinPool, the inner forEach forks, and the join appears to wait forever. However, this behavior is still unexpected. Note that parallel streams even work if you set "java.util.concurrent.ForkJoinPool.common.parallelism" to 1.
Note also that it may not be transparent to the caller that there is an inner parallel forEach.
Question: Is this behavior in accordance with the Java 8 specification (in that case it would imply that the use of Semaphores inside parallel streams workers is forbidden) or is this a bug?
For convenience: below is a complete test case. Any combination of the two booleans works, except "true, true", which results in the deadlock.
Clarification: To make the point clear, let me stress one aspect: The deadlock does not occur at the acquire of the semaphore. Note that the code consists of
acquire semaphore
run some code
release semaphore
and the deadlock occurs at step 2 if that piece of code uses ANOTHER parallel stream. Then the deadlock occurs inside that OTHER stream. As a consequence, it appears that it is not allowed to use nested parallel streams and blocking operations (like a semaphore) together!
Note that it is documented that parallel streams use a ForkJoinPool and that ForkJoinPool and Semaphore belong to the same package - java.util.concurrent (so one would expect that they interoperate nicely).
/*
 * (c) Copyright Christian P. Fries, Germany. All rights reserved. Contact: email@christian-fries.de.
 *
 * Created on 03.05.2014
 */
package net.finmath.experiments.concurrency;

import java.util.concurrent.Semaphore;
import java.util.stream.IntStream;

/**
 * This is a test of Java 8 parallel streams.
 *
 * The idea behind this code is that the Semaphore concurrentExecutions
 * should limit the parallel executions of the outer forEach (which is an
 * <code>IntStream.range(0,numberOfTasks).parallel().forEach</code>); for example,
 * the parallel executions of the outer forEach should be limited due to a
 * memory constraint.
 *
 * Inside the execution block of the outer forEach we use another parallel stream
 * to create an inner forEach. The number of concurrent
 * executions of the inner forEach is not limited by us (it is however limited by the
 * system property "java.util.concurrent.ForkJoinPool.common.parallelism").
 *
 * Problem: If the semaphore is used AND the inner forEach is active, then
 * the execution will be DEADLOCKED.
 *
 * Note: A practical application is the implementation of the parallel
 * LevenbergMarquardt optimizer in
 * {@link http://finmath.net/java/finmath-lib/apidocs/net/finmath/optimizer/LevenbergMarquardt.html}
 * In one application the number of tasks in the outer and inner loop is very large (>1000)
 * and due to memory limitations the outer loop should be limited to a small (5) number
 * of concurrent executions.
 *
 * @author Christian Fries
 */
public class ForkJoinPoolTest {

    public static void main(String[] args) {

        // Any combination of the booleans works, except (true,true)
        final boolean isUseSemaphore = true;
        final boolean isUseInnerStream = true;

        final int numberOfTasksInOuterLoop = 20;   // In real applications this can be a large number (e.g. > 1000).
        final int numberOfTasksInInnerLoop = 100;  // In real applications this can be a large number (e.g. > 1000).
        final int concurrentExecusionsLimitInOuterLoop = 5;
        final int concurrentExecutionsLimitForStreams = 10;

        final Semaphore concurrentExecutions = new Semaphore(concurrentExecusionsLimitInOuterLoop);

        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", Integer.toString(concurrentExecutionsLimitForStreams));
        System.out.println("java.util.concurrent.ForkJoinPool.common.parallelism = " + System.getProperty("java.util.concurrent.ForkJoinPool.common.parallelism"));

        IntStream.range(0, numberOfTasksInOuterLoop).parallel().forEach(i -> {
            if (isUseSemaphore) {
                concurrentExecutions.acquireUninterruptibly();
            }
            try {
                System.out.println(i + "\t" + concurrentExecutions.availablePermits() + "\t" + Thread.currentThread());
                if (isUseInnerStream) {
                    runCodeWhichUsesParallelStream(numberOfTasksInInnerLoop);
                } else {
                    try {
                        Thread.sleep(10 * numberOfTasksInInnerLoop);
                    } catch (Exception e) {
                    }
                }
            } finally {
                if (isUseSemaphore) {
                    concurrentExecutions.release();
                }
            }
        });

        System.out.println("D O N E");
    }

    /**
     * Runs code in a parallel forEach using streams.
     *
     * @param numberOfTasksInInnerLoop Number of tasks to execute.
     */
    private static void runCodeWhichUsesParallelStream(int numberOfTasksInInnerLoop) {
        IntStream.range(0, numberOfTasksInInnerLoop).parallel().forEach(j -> {
            try {
                Thread.sleep(10);
            } catch (Exception e) {
            }
        });
    }
}
Any time you decompose a problem into tasks, where those tasks could be blocked on other tasks, and try to execute them in a finite thread pool, you are at risk of pool-induced deadlock. See Java Concurrency in Practice 8.1.
This is unquestionably a bug -- in your code. You're filling up the FJ pool with tasks that are going to block waiting for the results of other tasks in the same pool. Sometimes you get lucky and things manage to not deadlock (just like not all lock-ordering errors result in deadlock all the time), but fundamentally you're skating on some very thin ice here.
I ran your test in a profiler (VisualVM) and I agree: threads are waiting on the semaphore and on awaitJoin() in the F/J pool.
This framework has serious problems where join() is concerned. I’ve been writing a critique about this framework for four years now. The basic join problem starts here.
awaitJoin() has similar problems. You can peruse the code yourself. When the framework gets to the bottom of the work deque, it issues a wait(). What it all comes down to is that this framework has no way of doing a context switch.
There is a way of getting this framework to create compensation threads for the threads that are stalled. You need to implement the ForkJoinPool.ManagedBlocker interface. How you can do this, I have no idea. You’re running a basic API with streams. You’re not implementing the Streams API and writing your own code.
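For what it's worth, a ManagedBlocker wrapping the semaphore acquire might look roughly like this (purely an illustrative sketch; the Streams API does not do this for you):

static void acquireManaged(Semaphore semaphore) {
    try {
        ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
            @Override
            public boolean block() throws InterruptedException {
                semaphore.acquire();            // blocking call the pool is told about
                return true;                    // no further blocking needed
            }

            @Override
            public boolean isReleasable() {
                return semaphore.tryAcquire();  // fast path: try without blocking
            }
        });
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}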
I stick to my comment, above: Once you turn over the parallelism to the API you relinquish your ability to control the inner workings of that parallel mechanism. There is no bug with the API (other than it is using a faulty framework for parallel operations.) The problem is that semaphores or any other method for controlling parallelism within the API are hazardous ideas.
After a bit of investigation of the source code of ForkJoinPool and ForkJoinTask, I believe I have found an answer:
It is a bug (in my opinion), and the bug is in doInvoke() of ForkJoinTask. The problem is actually related to the nesting of the two loops and presumably not to the use of the Semaphore; however, one needs the Semaphore (or something blocking in the outer loop) to make the problem become apparent and result in a deadlock (but I can imagine there are other issues implied by this bug - see Nested Java 8 parallel forEach loop perform poor. Is this behavior expected?).
The implementation of the doInvoke() method currently looks as follows:
/**
 * Implementation for invoke, quietlyInvoke.
 *
 * @return status upon completion
 */
private int doInvoke() {
    int s; Thread t; ForkJoinWorkerThread wt;
    return (s = doExec()) < 0 ? s :
        ((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) ?
        (wt = (ForkJoinWorkerThread)t).pool.awaitJoin(wt.workQueue, this) :
        externalAwaitDone();
}
(and maybe also in doJoin which looks similar). In the line
((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) ?
it is tested whether Thread.currentThread() is an instance of ForkJoinWorkerThread. The reason for this test is to check whether the ForkJoinTask is running on a worker thread of the pool or on the main thread. I believe that this line is OK for a non-nested parallel for, where it allows distinguishing whether the current task runs on the main thread or on a pool worker. However, for tasks of the inner loop this test is problematic: let us call the thread that runs the parallel().forEach the creator thread. For the outer loop the creator thread is the main thread and it is not an instanceof ForkJoinWorkerThread. However, for inner loops running from a ForkJoinWorkerThread the creator thread is an instanceof ForkJoinWorkerThread too. Hence, in this situation, the test ((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) IS ALWAYS TRUE!
Hence, we always call pool.awaitJoin(wt.workQueue, this).
Now, note that we call awaitJoin on the FULL workQueue of that thread (I believe this is an additional flaw). It appears as if we are not only joining the inner loop's tasks, but also the task(s) of the outer loop, and we JOIN ALL THOSE tasks. Unfortunately, the outer task contains that Semaphore.
To prove that the bug is related to this, we may check a very simple workaround: I create a t = new Thread() which runs the inner loop, then perform t.start(); t.join();. Note that this will not introduce any additional parallelism (I am immediately joining). However, it will change the result of the instanceof ForkJoinWorkerThread test for the creator thread (note that the inner tasks will still be submitted to the common pool).
If that wrapper thread is created, the problem no longer occurs - at least in my current test situation.
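For reference, the wrapper inside the outer forEach looks roughly like this (a sketch based on the description above):

Thread t = new Thread(() -> runCodeWhichUsesParallelStream(numberOfTasksInInnerLoop));
t.start();
try {
    t.join();   // join immediately, so no extra parallelism is introduced
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}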
I posted a full demo at
http://svn.finmath.net/finmath%20experiments/trunk/src/net/finmath/experiments/concurrency/ForkJoinPoolTest.java
In this test code the combination
final boolean isUseSemaphore = true;
final boolean isUseInnerStream = true;
final boolean isWrappedInnerLoopThread = false;
will result in a deadlock, while the combination
final boolean isUseSemaphore = true;
final boolean isUseInnerStream = true;
final boolean isWrappedInnerLoopThread = true;
(and actually all other combinations) will not.
Update: Since many are pointing out that the use of the Semaphore is dangerous, I tried to create a demo of the problem without the Semaphore. Now there is no longer a deadlock, but an - in my opinion - unexpected performance issue. I created a new post for that at Nested Java 8 parallel forEach loop perform poor. Is this behavior expected?. The demo code is here:
http://svn.finmath.net/finmath%20experiments/trunk/src/net/finmath/experiments/concurrency/NestedParallelForEachTest.java
I understand that the new Java (8) has introduced new synchronization tools such as LongAccumulator (under the atomic package).
In the documentation it says that a LongAccumulator is more efficient when the variable is updated frequently from several threads.
I wonder how is it implemented to be more efficient?
That's a very good question, because it touches on a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:
class Accumulator {
    private final AtomicLong value = new AtomicLong(0);

    public void accumulate(long value) {
        this.value.addAndGet(value);
    }

    public long get() {
        return this.value.get();
    }
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, then the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically structured into cache lines of 64 octets. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive if it does not already have exclusive access, due to the required coordination with all other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };

    public void accumulate(long value) {
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }

    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}
At first glance, this class seems to be more expensive due to the additional operations. However, it can be several times faster than the first class, because there is a higher probability that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things:
The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose, and only use the indexes 0, 8, 16 and 24.
The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache switches. If there are too many counters, you waste space in the L1 caches.
The method getMagicValue should return a value with an affinity to the core id (a sketch combining these points follows after this list).
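Putting those points together, a rough sketch might look like this (this is not LongAccumulator's actual implementation; the thread-id hash is only a crude stand-in for getMagicValue()):

class StripedCounter {
    private static final int STRIDE = 8;     // 8 longs = 64 bytes, roughly one cache line
    private static final int COUNTERS = 4;   // number of stripes
    private final AtomicLongArray values = new AtomicLongArray(STRIDE * COUNTERS);

    public void accumulate(long delta) {
        // cheap substitute for core affinity: hash the current thread onto a stripe
        int index = (int) (Thread.currentThread().getId() % COUNTERS) * STRIDE;
        values.addAndGet(index, delta);
    }

    public long get() {
        long result = 0;
        for (int i = 0; i < COUNTERS; i++) {
            result += values.get(i * STRIDE);
        }
        return result;
    }
}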
To sum up, LongAccumulator is more efficient for some use cases because it uses redundant memory for frequently used write operations, in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to create a consistent result.
by this
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.
I was implementing a FIFO queue of request instances (preallocated request objects for speed) and started by using the synchronized keyword on the add method. The method was quite short (check if there is room in the fixed-size buffer, then add the value to the array). Using VisualVM it appeared the thread was blocking more often than I liked ("monitor" to be precise). So I converted the code to use AtomicInteger values for things such as keeping track of the current size, and then used compareAndSet() in while loops (as AtomicInteger does internally for methods such as incrementAndGet()). The code now looks quite a bit longer.
What I was wondering is: what is the performance overhead of using synchronized with shorter code versus longer code without the synchronized keyword (which should never block on a lock)?
Here is the old get method with the synchronized keyword:
public synchronized Request get()
{
    if (head == tail)
    {
        return null;
    }
    Request r = requests[head];
    head = (head + 1) % requests.length;
    return r;
}
Here is the new get method without the synchronized keyword:
public Request get()
{
    while (true)
    {
        int current = size.get();
        if (current <= 0)
        {
            return null;
        }
        if (size.compareAndSet(current, current - 1))
        {
            break;
        }
    }
    while (true)
    {
        int current = head.get();
        int nextHead = (current + 1) % requests.length;
        if (head.compareAndSet(current, nextHead))
        {
            return requests[current];
        }
    }
}
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc), even though the code is shorter.
Thanks!
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc)
Yes, in the common case you are right. Java Concurrency in Practice discusses this in section 15.3.2:
[...] at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks. This is because a lock reacts to contention by suspending threads, reducing CPU usage and synchronization traffic on the shared memory bus. (This is similar to how blocking producers in a producer-consumer design reduces the load on consumers and thereby lets them catch up.) On the other hand, with atomic variables, contention management is pushed back to the calling class. Like most CAS-based algorithms, AtomicPseudoRandom reacts to contention by trying again immediately, which is usually the right approach but in a high-contention environment just creates more contention.
Before we condemn AtomicPseudoRandom as poorly written or atomic variables as a poor choice compared to locks, we should realize that the level of contention in Figure 15.1 is unrealistically high: no real program does nothing but contend for a lock or atomic variable. In practice, atomics tend to scale better than locks because atomics deal more effectively with typical contention levels.
The performance reversal between locks and atomics at differing levels of contention illustrates the strengths and weaknesses of each. With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance. (CAS-based algorithms also outperform lock-based ones on single-CPU systems, since a CAS always succeeds on a single-CPU system except in the unlikely case that a thread is preempted in the middle of the read-modify-write operation.)
(On the figures referred to by the text, Figure 15.1 shows that the performance of AtomicInteger and ReentrantLock is more or less equal when contention is high, while Figure 15.2 shows that under moderate contention the former outperforms the latter by a factor of 2-3.)
Update: on nonblocking algorithms
As others have noted, nonblocking algorithms, although potentially faster, are more complex and thus more difficult to get right. A hint from section 15.4 of JCiP:
Good nonblocking algorithms are known for many common data structures, including stacks, queues, priority queues, and hash tables, though designing new ones is a task best left to experts.
Nonblocking algorithms are considerably more complicated than their lock-based equivalents. The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency. In linked collection classes such as queues, you can sometimes get away with expressing state transformations as changes to individual links and using an AtomicReference to represent each link that must be updated atomically.
I wonder if the JVM already does a few spins before really suspending the thread. It would anticipate that well-written critical sections, like yours, are very short and complete almost immediately. Therefore it should optimistically busy-wait for, I don't know, a few dozen loops, before giving up and suspending the thread. If that's the case, it should behave the same as your second version.
What a profiler shows might be very different from what's really happening in a JVM running at full speed, with all kinds of crazy optimizations. It's better to measure and compare throughput without the profiler.
Before doing this kind of synchronization optimizations, you really need a profiler to tell you that it's absolutely necessary.
Yes, synchronized can under some conditions be slower than an atomic operation, but compare your original and replacement methods. The former is really clear and easy to maintain; the latter is definitely more complex. Because of this there may be very subtle concurrency bugs that you will not find during initial testing. I already see one problem: size and head can get out of sync, because, though each of these operations is atomic, the combination is not, and sometimes this may lead to an inconsistent state.
So, my advice:
Start simple
Profile
If performance is good enough, leave simple implementation as is
If you need a performance improvement, then start to get clever (possibly using a more specialized lock at first), and TEST, TEST, TEST
Here's code for a busy wait lock.
import java.util.concurrent.atomic.AtomicBoolean;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BusyWaitLock
{
    private static final boolean LOCK_VALUE = true;
    private static final boolean UNLOCK_VALUE = false;

    private static final Logger log = LoggerFactory.getLogger(BusyWaitLock.class);

    /**
     * @author Rod Moten
     */
    public class BusyWaitLockException extends RuntimeException
    {
        private static final long serialVersionUID = 1L;

        /**
         * @param message
         */
        public BusyWaitLockException(String message)
        {
            super(message);
        }
    }

    private final AtomicBoolean lock = new AtomicBoolean(UNLOCK_VALUE);
    private final long maximumWaitTime;

    /**
     * Create a busy wait lock that uses the default maximum wait time of two minutes.
     */
    public BusyWaitLock()
    {
        this(1000 * 60 * 2); // default is two minutes
    }

    /**
     * Create a busy wait lock that uses the given value as the maximum wait time.
     * @param maximumWaitTime - a positive value that represents the maximum number of milliseconds that a thread will busy wait.
     */
    public BusyWaitLock(long maximumWaitTime)
    {
        if (maximumWaitTime < 1)
            throw new IllegalArgumentException("Max wait time of " + maximumWaitTime + " is too low. It must be at least 1 millisecond.");
        this.maximumWaitTime = maximumWaitTime;
    }

    /**
     * Spins until the lock is acquired or the maximum wait time is exceeded.
     */
    public void lock()
    {
        long startTime = System.currentTimeMillis();
        long lastLogTime = 0;
        int logMessageCount = 0;
        // keep spinning until the CAS from UNLOCK_VALUE to LOCK_VALUE succeeds
        while (!lock.compareAndSet(UNLOCK_VALUE, LOCK_VALUE)) {
            long waitTime = System.currentTimeMillis() - startTime;
            if (waitTime - lastLogTime > 5000) {
                log.debug("Waiting for lock. Log message # {}", logMessageCount++);
                lastLogTime = waitTime;
            }
            if (waitTime > maximumWaitTime) {
                log.warn("Wait time of {} exceeded maximum wait time of {}", waitTime, maximumWaitTime);
                throw new BusyWaitLockException("Exceeded maximum wait time of " + maximumWaitTime + " ms.");
            }
        }
    }

    public void unlock()
    {
        lock.set(UNLOCK_VALUE);
    }
}
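Typical usage might look like this (a sketch; the 5-second limit is just an example):

BusyWaitLock lock = new BusyWaitLock(5_000); // give up after 5 seconds of spinning

lock.lock();
try {
    // very short critical section - busy waiting only makes sense if this is brief
} finally {
    lock.unlock();
}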