The Java AtomicInteger class has a method -
boolean weakCompareAndSet(int expect, int update)
Its documentation says:
May fail spuriously.
What does 'failing spuriously' here mean?
spuriously: for no apparent reason
According to the atomic package javadoc:
The atomic classes also support method weakCompareAndSet, which has limited applicability.
On some platforms, the weak version may be more efficient than compareAndSet in the normal case, but differs in that any given invocation of the weakCompareAndSet method may return false spuriously (that is, for no apparent reason).
A false return means only that the operation may be retried if desired, relying on the guarantee that repeated invocation when the variable holds expectedValue and no other thread is also attempting to set the variable will eventually succeed.
(Such spurious failures may for example be due to memory contention effects that are unrelated to whether the expected and current values are equal.)
Additionally weakCompareAndSet does not provide ordering guarantees that are usually needed for synchronization control.
According to this thread, it is not so much because of "hardware/OS", but because of the underlying algorithm used by weakCompareAndSet:
weakCompareAndSet atomically sets the value to the given updated value if the current value == the expected value. May fail spuriously.
Unlike compareAndSet(), and other operations on an AtomicX, the weakCompareAndSet() operation does not create any happens-before orderings.
Thus, just because a thread sees an update to an AtomicX caused by a weakCompareAndSet doesn't mean it is properly synchronized with operations that occurred before the weakCompareAndSet().
You probably don't want to use this method, but instead should just use compareAndSet, as there are few cases where weakCompareAndSet is faster than compareAndSet, and there are a number of cases in which trying to optimize your code by using weakCompareAndSet rather than compareAndSet will introduce subtle and hard-to-reproduce synchronization errors into your code.
Note regarding happens-before orderings:
The Java Memory Model (JMM) defines the conditions under which a thread reading a variable is guaranteed to see the results of a write in another thread.
The JMM defines an ordering on the operations of a program called happens-before.
Happens-before orderings across threads are only created by synchronizing on a common lock or accessing a common volatile variable.
In the absence of a happens-before ordering, the Java platform has great latitude to delay or change the order in which writes in one thread become visible to reads of that same variable in another.
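As a minimal, hedged sketch of such a happens-before edge (the class and field names are invented for illustration): the writer stores into a plain field and then writes a volatile flag, and a reader that observes the flag is guaranteed to see the earlier plain write.

class Publisher {
    int data;                   // plain (non-volatile) field
    volatile boolean ready;     // the volatile field creates the happens-before edge

    void writer() {
        data = 42;              // plain write
        ready = true;           // volatile write publishes everything written before it
    }

    void reader() {
        if (ready) {            // volatile read
            // Guaranteed to observe data == 42, because this volatile read of 'ready'
            // happens-after the volatile write that set it.
            System.out.println(data);
        }
    }
}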
It means it might return false (and will not set the new value) even if it currently contains the expected value.
In other words, the method may do nothing and return false for no apparent reason...
There are CPU architectures where this may have a performance advantage over a strong compareAndSet().
A bit more concrete detail on why something like this might happen.
Some architectures (like newer ARMs) implement CAS operations using a Load Linked (LL)/Store Conditional (SC) set of instructions. The LL instruction loads the value in a memory location and 'remembers' the address somewhere. The SC instruction stores a value into that memory location if the value at the remembered address has not been modified. It's possible for the hardware to believe that the location has been modified even if it apparently hasn't for a number of possible reasons (and the reasons might vary by CPU architecture):
the location may have been written with the same value
the resolution of the addresses watched might not be exactly the one memory location of interest (think cache lines). A write to another location that's 'close-by' may cause the hardware to flag the address in question as 'dirty'
a number of other events may cause the CPU to lose the saved state of the LL instruction, such as context switches, cache flushes, or page table changes.
A good use-case for weakCompareAndSet is performance counters - no need for ordering, a high rate of updates (so ordering hurts on weakly ordered systems), but counts will not be dropped under high loads (tightly contended perf-counters can drop 99% of all counts, essentially leaving the counters' values relative to un-contended counters random).
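A hedged sketch of the retry idiom for such a counter, using only the standard AtomicInteger API (on newer JDKs weakCompareAndSetPlain is the closest equivalent):

import java.util.concurrent.atomic.AtomicInteger;

class RelaxedCounter {
    private final AtomicInteger count = new AtomicInteger();

    void increment() {
        int current;
        do {
            current = count.get();
            // A spurious false simply means we retry; the javadoc guarantees
            // eventual success when the value still holds what we expect and
            // no other thread is racing on it.
        } while (!count.weakCompareAndSet(current, current + 1));
    }

    int get() {
        return count.get();
    }
}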
Related
Marking a variable as volatile in Java ensures that every thread sees the value that was last written to it instead of some stale value. I was wondering how this is actually achieved. Does the JVM emit special instructions that flush the CPU caches or something?
From what I understand it always appears as if the cache has been flushed after write, and always appears as if reads are conducted straight from memory on read. The effect is that a Thread will always see the results of writes from another Thread and (according to the Java Memory Model) never a cached value. The actual implementation and CPU instructions will vary from one architecture to another however.
It doesn't guarantee correctness if you increment the variable from more than one thread, or check its value and then act on it, since there is no actual synchronization. You can generally only guarantee correct execution if there is just one thread writing to the variable and all the others are reading.
Also note that a 64-bit non-volatile variable can be read/written as two 32-bit halves, so 32-bit variables are atomic on write but 64-bit ones aren't. One half can be written before the other - so the value read could be neither the old nor the new value.
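A hedged illustration of that word-tearing point (per JLS 17.7, declaring the long volatile makes its individual reads and writes atomic):

class TearingExample {
    long plainValue;            // a read may observe two mismatched 32-bit halves
    volatile long safeValue;    // volatile long reads/writes are guaranteed atomic

    void writer() {
        plainValue = 0xFFFFFFFF00000000L; // a concurrent reader could see a half-written value
        safeValue  = 0xFFFFFFFF00000000L; // readers see the old or the new value, never a mix
    }
}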
This is quite a helpful page from my bookmarks:
http://www.cs.umd.edu/~pugh/java/memoryModel/
Exactly what happens is processor-specific. Generally there are some form of memory barrier instructions. Flushing the entire cache would obviously be very expensive - there are cache coherency protocols in the hardware.
Also important is that certain optimisations are not made across field accesses. The compiler is important when considering multithreading; don't just think about the hardware.
My understanding is that the JSR-133 cookbook is a well-quoted guide to how to implement the Java memory model using a series of memory barriers (or at least the visibility guarantees).
It is also my understanding, based on the description of the different types of barriers, that StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensures fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
I was looking at the table of specific barriers required for different program-order interleavings of volatile/regular stores and loads.
From my intuition this table seems incomplete. For example, the Java memory model guarantees visibility, on the acquire action of a monitor, of all actions performed before its release in another thread, even if the values being updated are non-volatile. In the table in the link above, it seems as if the only actions that flush CPU buffers and propagate changes/allow new changes to be observed are a volatile store or MonitorExit followed by a volatile load or MonitorEnter. I don't see how the barriers could guarantee visibility in my example above, when those operations (according to the table) only use LoadStore and StoreStore, which, from my understanding, are only concerned with re-ordering within a thread and cannot enforce the happens-before guarantee (across threads, that is).
Where have I gone wrong in my understanding here? Or does this implementation only enforce happens-before and not the synchronization guarantees or extra actions on acquiring/releasing monitors?
Thanks
StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensures fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
This may be true for x86 architectures, but you shouldn't be thinking on that level of abstraction. It may be the case that maintaining cache coherence is costly for the processors to execute.
Take mobile devices, for example: one important goal is to reduce the amount of battery power programs consume. In that case, the processor may not participate in cache coherence, and StoreLoad loses this property.
I don't see how the barriers could guarantee visibility in my example above, when those operations (according to the table) only use LoadStore and StoreStore, which, from my understanding, are only concerned with re-ordering within a thread and cannot enforce the happens-before guarantee (across threads, that is).
Let's just consider a volatile field. How would a volatile load and store look? Well, Aleksey Shipilëv has a great write-up on this, but I will take a piece of it.
A volatile store and then a subsequent load would look like:
<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store
[StoreLoad] // Case (a): Guard after volatile stores
...
[StoreLoad] // Case (b): Guard before volatile loads
int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>
So, <other ops> can be non-volatile writes, but as you see those writes are committed to memory prior to the volatile store. Then, when we are ready to read, the LoadLoad and LoadStore barriers force a wait until the volatile store succeeds.
Lastly, the StoreLoad before and after ensures that the volatile load and store cannot be reordered if they immediately precede one another.
The barriers in the document are abstract concepts that more-or-less map to different things on different CPUs. But they are only guidelines. The rules that the JVMs actually have to follow are those in JLS Chapter 17.
Barriers as a concept are also "global" in the sense that they order all prior and following instructions.
For example, the Java memory model guarantees visibility, on the acquire action of a monitor, of all actions performed before its release in another thread, even if the values being updated are non-volatile.
Acquiring a monitor is the monitor-enter in the cookbook, which only needs to be visible to other threads that contend on the lock. The monitor-exit is the release action, which will prevent loads and stores prior to it from moving below it. You can see this in the cookbook tables where the first operation is a normal load/store, and the second is a volatile store or monitor-exit.
On CPUs with Total Store Order, the store buffers, where available, have no impact on correctness; only on performance.
In any case, it's up to the JVM to use instructions that provide the atomicity and visibility semantics that the JLS demands. And that's the key take-away: If you write Java code, you code against the abstract machine defined in the JLS. You would only dive into the implementation details of the concrete machine, if coding only to the abstract machine doesn't give you the performance you need. You don't need to go there for correctness.
I'm not sure where you got that StoreLoad barriers are the only type that enforce some particular behavior. All of the barriers, abstractly, enforce exactly what they are defined to enforce. For example, LoadLoad prevents any prior load from reordering with any subsequent load.
There may be architecture specific descriptions of how a particular barrier is enforced: for example, on x86 all the barriers other than StoreLoad are no-ops since the chip architecture enforces the other orderings automatically, and StoreLoad is usually implemented as a store buffer flush. Still, all the barriers have their abstract definition which is architecture-independent and the cookbook is defined in terms of that, along with a mapping of the conceptual barriers to actual ISA-specific implementations.
In particular, even if a barrier is "no-op" on a particular platform, it means that the ordering is preserved and hence all the happens-before and other synchronization requirements are satisfied.
From this question: AtomicInteger lazySet vs. set, and from this link: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/package-summary.html
I could gather following points
lazySet could be faster than set
lazySet uses only a StoreStore barrier (writes before it are honored, but not contended writes that have yet to happen)
I could find one use-case where it could be applied, from the documentation :
Use lazySet when you want to null out a pointer to aid GC.
Are there any other practical use-cases for lazySet ?
Caffeine uses lazy or relaxed writes in many of its data structures.
When nulling out a field (e.g. ConcurrentLinkedStack)
When writing to volatile fields before publishing (e.g. SingleConsumerQueue)
When publish is safely delayable (e.g. BoundedBuffer)
When races are benign (e.g. cache expiration timestamps)
When inside a lock (e.g. BoundedLocalCache)
ConcurrentLinkedQueue uses relaxed writes prior to publishing a node and may lazily set a node's next field (prior to publishing or to indicate a stale traversal).
You may also enjoy reading the Linux Kernel Memory Barriers paper.
TL;DR How to use .lazySet()? With care, if at all.
The main problem here is that AtomicXXX.lazySet() is a low-level performance optimization and is outside the current JLS. You can't prove the correctness of your concurrent code with JMM tools if you are using lazySet().
Why is it much faster than a volatile write?
The main difference between set and lazySet is the absence of the StoreLoad barrier.
JSR-133 Cookbook for Compiler Writers:
StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind.
Moreover, on most popular x86-based hardware StoreLoad is the only explicit barrier (the others are just no-ops and cost nothing), so with lazySet you eliminate all (explicit) memory barriers.
Guarantees of lazySet
From the point of view of the JLS, there aren't any.
But actually you can reason about lazySet as a delayed write which cannot be reordered with any previous write and will happen eventually. "Eventually" means within finite time, provided your process makes any progress (e.g. any synchronization action occurs; in addition, the size of the processor's store buffer is finite). If the written value becomes visible to another thread, you can be sure that all previous writes are visible as well (although you cannot formally prove it). So you can treat it as a delayed happens-before relationship (but, of course, it's not even close to its strict and formal definition).
Usage
The most practical usage (apart from nulling out references) is making writes far cheaper in the context of progress. The simplest example is using lazySet() instead of set() within a synchronized block (though in this case there is no great performance impact). Or you can use it instead of writes in single-producer cases, where there is no contention on the write.
The Disruptor developers use lazySet exactly for this purpose in their lock-free implementation. Again, it's very hard to argue about the correctness of such code, but it's a good trick to be aware of.
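A hedged sketch of the null-out use case from the javadoc (the class and method names are invented for illustration): the single consumer clears a slot lazily, since the cleared slot only needs to become visible eventually, while the ordering of prior writes is still preserved.

import java.util.concurrent.atomic.AtomicReferenceArray;

class SingleConsumerSlots<E> {
    private final AtomicReferenceArray<E> slots = new AtomicReferenceArray<>(64);

    // Called only by the single consumer thread.
    E take(int index) {
        E value = slots.get(index);
        // Null out the slot to aid GC; a lazy (ordered) write is enough here
        // and avoids the StoreLoad fence that a full set() would pay for.
        slots.lazySet(index, null);
        return value;
    }
}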
I would think many uses of AtomicBoolean would benefit from using lazySet(), because they are often used as flags to indicate whether something is complete or not, or whether an outer loop should finish.
This is because in this case the value is initially one value and it eventually becomes another value and then stays there. Obviously this argument applied to almost any atomic that is used in that way.
public void test() {
    final AtomicBoolean finished = new AtomicBoolean(false);
    new Thread(new Runnable() {
        @Override
        public void run() {
            while (!finished.get()) {
                // A long process.
                if (wereAllDone()) { // placeholder for the real completion check
                    finished.lazySet(true);
                }
            }
        }
    }).start();
}
If I have a variable from which multiple threads read and only one thread writes, do I need to have a lock around that variable? Would it crash if one thread tries to read and the other thread tries to write at the same time?
The concurrency concern is not crashing, but what version of the data you're seeing.
If the shared variable is written atomically, it's still possible for one (reader) thread to read a stale value when you thought your (writer) thread had updated the variable. You can use the volatile keyword to prevent your reader threads from reading a stale value in this situation.
If the write operation is not atomic (for example if it's a composite object of some kind and you're writing bits of it at a time, while the other threads could theoretically be reading it), then your concern would also be that some reader threads could see the variable in an inconsistent state. You'd prevent this by locking access to the variable while it was being written (slow) or by making sure you write atomically.
Writes to some types of fields are atomic but without a happens-before relationship that ensures correct memory ordering (unless you use volatile); see this page for details.
The simple answer is yes, you need synchronization.
If you ever write to a field and read it from anywhere else without some form of synchronization, your program can see inconsistent state and is likely wrong. Your program will not crash, but it can see either the old value, the new value, or (in the case of longs and doubles) half-old and half-new data.
When I say "some form of synchronization" though, I more precisely mean something that creates a "happens-before" relationship (aka memory barrier) between the write and read locations. synchronized or the java.util.concurrent.locks classes are the most obvious way to create such a thing, but all of the concurrent collections typically also provide similar guarantees (check the javadoc to be sure). For example, doing a put and take on a concurrent queue will create a happens-before relationship.
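A hedged sketch of that hand-off (the task type and field here are made up): a plain field written before put() is visible to the thread that take()s the element, because the queue establishes the happens-before edge.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Handoff {
    static class Task {
        int payload; // plain field, no volatile needed
    }

    private final BlockingQueue<Task> queue = new ArrayBlockingQueue<>(16);

    void producer() throws InterruptedException {
        Task t = new Task();
        t.payload = 42;   // written before the put...
        queue.put(t);     // ...so the put/take pair publishes it
    }

    void consumer() throws InterruptedException {
        Task t = queue.take();
        System.out.println(t.payload); // guaranteed to see 42
    }
}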
Marking a field as volatile prevents you from seeing torn values (long/double tearing) and guarantees that all threads will "see" a write. But volatile field writes/reads cannot be combined with other operations into larger atomic units. The Atomic classes handle common combined operations like compare-and-set or read-and-increment. Synchronization or other java.util.concurrent synchronizers (CyclicBarrier, etc.) or locks should be used for larger areas of exclusivity.
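For instance, a volatile counter incremented from several threads can lose updates because ++ is a separate read and write, while AtomicInteger performs the combined operation atomically (a minimal sketch):

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int unsafeCount;                          // visible to all threads, but ++ is not atomic
    final AtomicInteger safeCount = new AtomicInteger();

    void increment() {
        unsafeCount++;                // two threads can read the same value and both write value + 1
        safeCount.incrementAndGet();  // read-and-increment performed as one atomic operation
    }
}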
Departing from the simple yes, there are cases that are more "no, if you really know what you're doing". Two examples:
1) The special case of a field that is final and written ONLY during construction. One example of that is when you populate a pre-computed cache (think of a Map where keys are well-known values and values are pre-computed derived values). If you build that into a field before the constructor completes, the field is final, and you never write to it later, the end of the constructor performs the "final field freeze" and subsequent reads DO NOT need to synchronize.
2) The case of the "racy single check" pattern which is covered in Effective Java. The canonical example is java.lang.String.hashCode(). String has a hash field that is lazily computed the first time you call hashCode() and cached into the local field, which is NOT synchronized. Basically, multiple threads may race to compute this value and set it over other threads' results, but because it is guarded by a well-known sentinel (0) and always computes the identical value (so we don't care which thread "wins" or whether multiple do), this actually is guaranteed to be ok.
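A hedged sketch of that racy single-check idiom, mirroring the String.hashCode() approach (the class and field names are illustrative):

class CachedHash {
    private final byte[] payload;
    private int hash; // 0 is the "not yet computed" sentinel; no volatile, no lock

    CachedHash(byte[] payload) {
        this.payload = payload.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;          // racy read: may see 0 even if another thread already computed it
        if (h == 0) {
            for (byte b : payload) {
                h = 31 * h + b;
            }
            hash = h;          // benign race: every thread computes the identical value
        }
        return h;
    }
}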
A longer reference (written by me): http://refcardz.dzone.com/refcardz/core-java-concurrency
Be aware that volatile does not make compound operations (such as incrementing) atomic. Also note that a non-volatile long or double can be read in an inconsistent state where 32 bits are the old value and 32 bits are the new value (declaring it volatile guarantees atomic reads and writes). Also, volatile arrays do not make the array entries volatile. Using classes from java.util.concurrent is strongly recommended.
I know that writing to a volatile variable flushes it from the caches of all the CPUs; however, I want to know if reads of a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, even on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. It demonstrates that the semantics of volatile mean the value can change with each loop iteration; thus the JVM cannot optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, so there is no actual "read" occurring.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy-handed locking.
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.