The article "Atomic*.lazySet is a performance win for single writers," goes over how lazySet is a weak volatile write (in the sense that it acts as a store-store and not a store-load fence). But I don't understand how leveraging semi-volatile writes improves concurrent queue performance. How exactly does it offer extra low latency as claimed by Menta-queue?
I already read up on it's implementation and it's claims on the stack overflow question: "How is lazySet in Java's Atomic* classes implemented" and "Atomic Integer's lazySet vs set."
The problem with a volatile write on x86 is that it issues full memory barrier which results in a stall until the store buffer is drained. Meanwhile lazySet on x86 is a simple store. It does not require all previous stores waiting in the store buffer to be flushed, thus allowing writing thread to proceed at full speed.
This is described a bit in Martin Thompson's article.
Related
I am trying to Understand performance of volatile variable in JAVA.
I see https://brooker.co.za/blog/2012/09/10/volatile.html and it seems volatile reads are slow when there is a writer involved. I have not seen any more arguments or benchmarks mentioning the same.
How would AtomicReference lazySet affect volatile variable reads
A regular read and volatile read from hardware x86 perspective are equally cheap. Volatile read requires acquire semantics which is provided by the tso memory model of x86. So both regular load and volatile load have acquire semantics and are equally cheap. On software level there is difference since volatile read prohibits many compiler optimizations.
A lazy set will not change the performance of the reader; just the performance of the writer. On X86 volatile write is a sequential consistent write; so a [StoreLoad] is needed and this requiring stopping any loads from being executed until the store buffer is drained. A lazySet aka orderedSet placed the store on the store buffer and then continues. So it won't stall the CPU. This is purely a writer concern; not a reader. So a reader will not go any faster or slower.
In your case: first determine if it is actual a problem. In most cases many other issues are playing and optimizing on this level makes code complex and introduces bugs. If it truly is a problem, I would be more focused on contention on the cache line than the overhead of reading/writing to a volatile variable.
I have researching the cost of volatile writes in Java in x86 hardware.
I'm planning on using the Unsafe's putLongVolatile method on a shared memory location. Looking into the implementation, putLongVolatile get's translated to Unsafe_SetLongVolatile in Link and subsequently into a AtomicWrite followed by an fence Link
In short, every volatile write is converted to an atomic write followed by a full fence(mfence or locked add instruction in x86).
Questions:
1) Why a fence() is required for x86 ? Isn't a simple compiler barrier sufficient because of store-store ordering ? A full fence seems awfully expensive.
2) Is putLong instead of putLongVolatile of Unsafe a better alternative? Would it work well in multi-threading case?
Answer to question 1:
Without the full fence you do not have sequential consistency which is required for the JMM.
So X86 provides TSO. So the following barriers you get for free [LoadLoad][LoadStore][StoreStore]. The only one missing is the [StoreLoad].
A load has acquire semantics
r1=X
[LoadLoad]
[LoadStore]
A store has release semantics
[LoadStore]
[StoreStore]
Y=r2
If you would do a store followed by a load you end up with this:
[LoadStore]
[StoreStore]
Y=r2
r1=X
[LoadLoad]
[LoadStore]
The issue is that the load and store can still be reordered and hence it isn't sequential consistent; and this is mandatory for the Java Memory model. They only way to prevent this is with a [StoreLoad]. And the most logical place would be to add it to the write since normally reads are more frequent than writes.
And this can be accomplished by an MFENCE or a lock addl %(RSP),0
Answer to question 2:
The problem with a putLong is that not only the CPU can reorder instructions, also the compiler could change to code in such a way that it leads to instruction reordering.
Example: if you would be doing a putLong in a loop, the compiler could decide to pull the write out of the loop and the value will not become visible to other threads. If you want to have a low overhead single writer performance counter, you might want to have a look at putLongRelease/putLongOrdered(oldname). This will prevent the compiler from doing the above trick. And the release semantics on the X86 you get for free.
But it is very difficult to give a one fits all solution to your second question because it depends on what your goal is.
My understanding, is that the JSR-133 cookbook is a well quoted guide of how to implement the Java memory model using a series of memory barriers, (or at least the visibility guarantees).
It is also my understanding based on the description of the different types of barriers, that StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensure fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
I was looking at the table of specific barriers required for different program order inter-leavings of volatile/regular stores/loads and what memory barriers would be required.
From my intuition this table seems incomplete. For example, the Java memory model guarantees visibility on the acquire action of a monitor to all actions performed before it's release in another thread, even if the values being updated are non volatile. In the table in the link above, it seems as if the only actions that flush CPU buffers and propagate changes/allow new changes to be observed are a Volatile Store or MonitorExit followed by a Volatile Load or MonitorEnter. I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore which from my understanding are only concerned with re-ordering in a thread and cannot enforce the happens before guarantee (across threads that is).
Where have I gone wrong in my understanding here? Or does this implementation only enforce happens before and not the synchronization guarantees or extra actions on acquiring/releasing monitors.
Thanks
StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensure fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
This may be true for x86 architectures, but you shouldn't be thinking on that level of abstraction. It may be the case that cache coherence can be costly for the processors to be executing.
Take mobile devices for example, one important goal is to reduce the amount of battery use programs consume. In that case, they may not participate in cache coherence and StoreLoad loses this feature.
I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore which from my understanding are only concerned with re-ordering in a thread and cannot enforce the happens before guarantee (across threads that is).
Let's just consider a volatile field. How would a volatile load and store look? Well, Aleksey Shipilëv has a great write up on this, but I will take a piece of it.
A volatile store and then subsequent load would look like:
<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store
[StoreLoad] // Case (a): Guard after volatile stores
...
[StoreLoad] // Case (b): Guard before volatile loads
int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>
So, <other ops> can be non-volatile writes, but as you see those writes are committed to memory prior to the volatile store. Then when we are ready to read the LoadLoad LoadStore will force a wait until the volatile store succeeds.
Lastly, the StoreLoad before and after ensures the volatile load and store cannot be reordered if the immediately precede one another.
The barriers in the document are abstract concepts that more-or-less map to different things on different CPUs. But they are only guidelines. The rules that the JVMs actually have to follow are those in JLS Chapter 17.
Barriers as a concept are also "global" in the sense that they order all prior and following instructions.
For example, the Java memory model guarantees visibility on the acquire action of a monitor to all actions performed before it's release in another thread, even if the values being updated are non volatile.
Acquiring a monitor is the monitor-enter in the cookbook, which only needs to be visible to other threads that contend on the lock. The monitor-exit is the release action, which will prevent loads and stores prior to it from moving bellow it. You can see this in the cookbook tables where the first operation is a normal load/store, and the second is a volatile-store or monitor-exit.
On CPUs with Total Store Order, the store buffers, where available, have no impact on correctness; only on performance.
In any case, it's up to the JVM to use instructions that provide the atomicity and visibility semantics that the JLS demands. And that's the key take-away: If you write Java code, you code against the abstract machine defined in the JLS. You would only dive into the implementation details of the concrete machine, if coding only to the abstract machine doesn't give you the performance you need. You don't need to go there for correctness.
I'm not sure where you got that StoreLoad barriers are the only type that enforce some particular behavior. All of the barriers, abstractly, enforce exactly what they are defined to enforce. For example, LoadLoad prevents any prior load from reordering with any subsequent load.
There may be architecture specific descriptions of how a particular barrier is enforced: for example, on x86 all the barriers other than StoreLoad are no-ops since the chip architecture enforces the other orderings automatically, and StoreLoad is usually implemented as a store buffer flush. Still, all the barriers have their abstract definition which is architecture-independent and the cookbook is defined in terms of that, along with a mapping of the conceptual barriers to actual ISA-specific implementations.
In particular, even if a barrier is "no-op" on a particular platform, it means that the ordering is preserved and hence all the happens-before and other synchronization requirements are satisfied.
There is something that bugs me with the Java memory model (if i even understand everything correctly). If there are two threads A and B, there are no guarantees that B will ever see a value written by A, unless both A and B synchronize on the same monitor.
For any system architecture that guarantees cache coherency between threads, there is no problem. But if the architecture does not support cache coherency in hardware, this essentially means that whenever a thread enters a monitor, all memory changes made before must be commited to main memory, and the cache must be invalidated. And it needs to be the entire data cache, not just a few lines, since the monitor has no information which variables in memory it guards.
But that would surely impact performance of any application that needs to synchronize frequently (especially things like job queues with short running jobs). So can Java work reasonably well on architectures without hardware cache-coherency? If not, why doesn't the memory model make stronger guarantees about visibility? Wouldn't it be more efficient if the language would require information what is guarded by a monitor?
As i see it the memory model gives us the worst of both worlds, the absolute need to synchronize, even if cache coherency is guaranteed in hardware, and on the other hand bad performance on incoherent architectures (full cache flushes). So shouldn't it be more strict (require information what is guarded by a monitor) or more lose and restrict potential platforms to cache-coherent architectures?
As it is now, it doesn't make too much sense to me. Can somebody clear up why this specific memory model was choosen?
EDIT: My use of strict and lose was a bad choice in retrospect. I used "strict" for the case where less guarantees are made and "lose" for the opposite. To avoid confusion, its probably better to speak in terms of stronger or weaker guarantees.
the absolute need to synchronize, even
if cache coherency is guaranteed in
hardware
Yes, but then you only have to reason against the Java Memory Model, not against a particular hardware architecture that your program happens to run on. Plus, it's not only about the hardware, the compiler and JIT themselves might reorder the instructions causing visibility issue. Synchronization constructs in Java addresses visibility & atomicity consistently at all possible levels of code transformation (e.g. compiler/JIT/CPU/cache).
and on the other hand bad performance
on incoherent architectures (full
cache flushes)
Maybe I misunderstood s/t, but with incoherent architectures, you have to synchronize critical sections anyway. Otherwise, you'll run into all sort of race conditions due to the reordering. I don't see why the Java Memory Model makes the matter any worse.
shouldn't it be more strict (require
information what is guarded by a
monitor)
I don't think it's possible to tell the CPU to flush any particular part of the cache at all. The best the compiler can do is emitting memory fences and let the CPU decides which parts of the cache need flushing - it's still more coarse-grained than what you're looking for I suppose. Even if more fine-grained control is possible, I think it would make concurrent programming even more difficult (it's difficult enough already).
AFAIK, the Java 5 MM (just like the .NET CLR MM) is more "strict" than memory models of common architectures like x86 and IA64. Therefore, it makes the reasoning about it relatively simpler. Yet, it obviously shouldn't offer s/t closer to sequential consistency because that would hurt performance significantly as fewer compiler/JIT/CPU/cache optimizations could be applied.
Existing architectures guarantee cache coherency, but they do not guarantee sequential consistency - the two things are different. Since seq. consistency is not guaranteed, some reorderings are allowed by the hardware and you need critical sections to limit them. Critical sections make sure that what one thread writes becomes visible to another (i.e., they prevent data races), and they also prevent the classical race conditions (if two threads increment the same variable, you need that for each thread the read of the current value and the write of the new value are indivisible).
Moreover, the execution model isn't as expensive as you describe. On most existing architectures, which are cache-coherent but not sequentially consistent, when you release a lock you must flush pending writes to memory, and when you acquire one you might need to do something to make sure future reads will not read stale values - mostly that means just preventing that reads are moved too early, since the cache is kept coherent; but reads must still not be moved.
Finally, you seem to think that Java's Memory Model (JMM) is peculiar, while the foundations are nowadays fairly state-of-the-art, and similar to Ada, POSIX locks (depending on the interpretation of the standard), and the C/C++ memory model. You might want to read the JSR-133 cookbook which explains how the JMM is implemented on existing architectures: http://g.oswego.edu/dl/jmm/cookbook.html.
The answer would be that most multiprocessors are cache-coherent, including big NUMA systems, which almost? always are ccNUMA.
I think you are somewhat confused as to how cache coherency is acomplished in practice. First, caches may be coherent/incoherent with respect to several other things on the system:
Devices
(Memory modified by) DMA
Data caches vs instruction caches
Caches on other cores/processors (the one this question is about)
...
Something has to be made to maintain coherency. When working with devices and DMA, on architectures with incoherent caches with respect to DMA/devices, you would either bypass the cache (and possibly the write buffer), or invalidate/flush the cache around operations involving DMA/devices.
Similarly, when dynamically generating code, you may need to flush the instruction cache.
When it comes to CPU caches, coherency is achieved using some coherency protocol, such as MESI, MOESI, ... These protocols define messages to be sent between caches in response to certain events (e.g: invalidate-requests to other caches when a non-exclusive cacheline is modified, ...).
While this is sufficient to maintain (eventual) coherency, it doesn't guarantee ordering, or that changes are immediately visible to other CPUs. Then, there are also write buffers, which delay writes.
So, each CPU architecture provides ordering guarantees (e.g. accesses before an aligned store cannot be reordered after the store) and/or provide instructions (memory barriers/fences) to request such guarantees. In the end, entering/exiting a monitor doesn't entail flushing the cache, but may entail draining the write buffer, and/or stall waiting for reads to end.
the caches that JVM has access to are really just CPU registers. since there aren't many of them, flushing them upon monitor exit isn't a big deal.
EDIT: (in general) the memory caches are not under the control of JVM, JVM cannot choose to read/write/flush these caches, so forget about them in this discussion
imagine each CPU has 1,000,000 registers. JVM happily exploits them to do crazy fast computations - until it bumps into monitor enter/exit, and has to flush 1,000,000 registers to the next cache layer.
if we live in that world, either Java must be smart enough to analyze what objects aren't shared (majority of objects aren't), or it must ask programmers to do that.
java memory model is a simplified programming model that allows average programmers make OK multithreading algorithms. by 'simplified' I mean there might be 12 people in the entire world who really read chapter 17 of JLS and actually understood it.
I know that writing to a volatile variable flushes it from the memory of all the cpus, however I want to know if reads to a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues volatile reads can be a lot slower (also for x86) than non-volatile reads on x86.
Test 1 is a parallel read and write to a non-volatile variable. There
is no visibility mechanism and the results of the reads are
potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However worth noting that a contended volatile can be very slow.
Test 3 is a read to a volatile in a tight loop. Demonstrated is that the semantics of what it means to be volatile indicate that the value can change with each loop iteration. Thus the JVM can not optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, thus there is no actual "read" occurring.
Credit to Marc Booker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also only single-core).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guaranty consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to insure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.