I am trying to understand the internals of Java volatile and its semantics, and its translation to the underlying architecture and its instructions. If we consider the following blogs and resources:
fences generated for volatile, What gets generated for read/write of volatile and Stack overflow question on fences
Here is what I gather:
A volatile read inserts LoadStore/LoadLoad barriers after it (an LFENCE instruction on x86).
It prevents the reordering of loads with subsequent writes/loads.
It is supposed to guarantee loading of the global state that was modified by other threads, i.e. after the LFENCE the state modifications done by other threads are visible to the current thread on its CPU.
What I am struggling to understand is this: Java does not emit LFENCE on x86,
i.e. a read of a volatile does not cause an LFENCE. I know that the memory ordering of x86 prevents reordering of loads with loads/stores, so the second bullet point is taken care of. However, I would assume that in order for the state to be visible to this thread, an LFENCE instruction should be issued to guarantee that all load buffers are drained before the next instruction after the fence is executed (as per the Intel manual). I understand there is a cache coherence protocol on x86, but shouldn't a volatile read still drain any loads in the buffers?
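For concreteness, here is a minimal sketch of the kind of pattern I have in mind (class and field names are purely illustrative); the question is what, at the hardware level, makes the reader's volatile load see the writer's plain store:

    // Writer publishes plain data behind a volatile flag; reader spins on the flag.
    class VolatileVisibility {
        static int data = 0;                   // plain field
        static volatile boolean ready = false; // volatile flag

        public static void main(String[] args) {
            Thread writer = new Thread(() -> {
                data = 42;      // plain store
                ready = true;   // volatile store (release)
            });
            Thread reader = new Thread(() -> {
                while (!ready) { }             // volatile load (acquire) in a spin loop
                System.out.println(data);      // guaranteed to print 42 once ready is seen
            });
            writer.start();
            reader.start();
        }
    }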
On x86, the buffers are pinned to the cache line. If the cache line is lost, the value in the buffer isn't used. So there's no need to fence or drain the buffers; the value they contain must be current because another core can't modify the data without first invalidating the cache line.
x86 provides TSO. So, at the hardware level, you get the following barriers for free: [LoadLoad][LoadStore][StoreStore]. The only one missing is [StoreLoad].
A load has acquire semantics
r1=X
[LoadLoad]
[LoadStore]
A store has release semantics
[LoadStore]
[StoreStore]
Y=r2
If you would do a store followed by a load you end up with this:
[LoadStore]
[StoreStore]
Y=r2
r1=X
[LoadLoad]
[LoadStore]
The issue is that the load and store can still be reordered, and hence it isn't sequentially consistent; and this is mandatory for the Java Memory Model. The only way to prevent this is with a [StoreLoad].
[LoadStore]
[StoreStore]
Y=r2
[StoreLoad]
r1=X
[LoadLoad]
[LoadStore]
And the most logical place would be to add it to the write since normally reads are more frequent than writes. So the write would become:
[LoadStore]
[StoreStore]
Y=r2
[StoreLoad]
Because x86 provides TSO, the following fences can be no-ops:
[LoadLoad][LoadStore][StoreStore]
So the only relevant one is [StoreLoad], and this can be accomplished with an MFENCE or a lock addl $0, (%rsp).
The LFENCE and SFENCE are not relevant in this situation; they are for weakly ordered loads and stores (e.g. those of SSE).
What the [StoreLoad] does on x86 is stop executing loads until the store buffer has been drained. This makes sure that the load becomes globally visible (so reads from memory/cache) AFTER the store has become globally visible (has left the store buffer and entered the L1d).
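To make the store-then-load case concrete, here is a rough Java sketch of the Dekker-style litmus test described above (class and variable names are illustrative, not taken from any of the linked material). With volatile fields, the JMM forbids the outcome where both threads read 0, which is exactly the ordering the [StoreLoad] barrier provides on x86; with plain fields, that outcome is allowed.

    // Store-then-load litmus test: each thread stores to one variable, then loads the other.
    class StoreLoadLitmus {
        static volatile int X = 0, Y = 0;

        public static void main(String[] args) throws InterruptedException {
            int[] r = new int[2];
            Thread t1 = new Thread(() -> { X = 1; r[0] = Y; });  // store X, then load Y
            Thread t2 = new Thread(() -> { Y = 1; r[1] = X; });  // store Y, then load X
            t1.start(); t2.start();
            t1.join();  t2.join();
            // With volatile (sequentially consistent) accesses, at least one thread
            // must observe the other's store: r[0] == 0 && r[1] == 0 cannot happen.
            System.out.println("r1=" + r[0] + ", r2=" + r[1]);
        }
    }

Of course a single run proves nothing either way; a harness such as jcstress is the proper way to hunt for the forbidden outcome.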
What does acquire and release Memory Ordering mean in VarHandle?
I can see the description below, but I am not clear on what it means.
acquire:
Ensures that subsequent loads and stores are not reordered before this access; compatible with C/C++11 memory_order_acquire ordering.
release:
Ensures that prior loads and stores are not reordered after this access; compatible with C/C++11 memory_order_release ordering.
Basically, modern CPUs can do out-of-order execution of instructions to increase processing speed. However, for synchronisation purposes you might want to limit the reordering capabilities of the processor. That's what memory barriers are for.
For the purpose of further explanations:
Load - reads something from memory
Store - writes something into memory
In JVM you'll find those 4 types of memory barriers:
#LoadLoad - all loads before that barrier need to happen before loads after this barrier
#LoadStore - all loads before that barrier need to happen before stores after this barrier
#StoreLoad - all stores before that barrier need to happen before loads after this barrier
#StoreStore - all stores before that barrier need to happen before stores after this barrier
Acquire memory barrier can be seen as a combination of #LoadLoad and #LoadStore barriers, while release barrier as #LoadStore and #StoreStore.
For further explanations have a look at this article: https://preshing.com/20120913/acquire-and-release-semantics/
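As a rough illustration of how this surfaces in the VarHandle API (the class and field names below are made up for the example; the VarHandle methods themselves are the real java.lang.invoke API), a release store is used to publish a plain write, and an acquire load is used to observe it:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class AcquireReleaseExample {
        int payload;   // plain field, published by the release store below
        int flag;      // plain field, accessed via VarHandle with acquire/release modes

        static final VarHandle FLAG;
        static {
            try {
                FLAG = MethodHandles.lookup()
                        .findVarHandle(AcquireReleaseExample.class, "flag", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void writer() {
            payload = 42;              // plain store
            FLAG.setRelease(this, 1);  // release store: the payload write cannot move below it
        }

        void reader() {
            if ((int) FLAG.getAcquire(this) == 1) {  // acquire load: later accesses cannot move above it
                System.out.println(payload);         // sees 42 if the flag was observed as 1
            }
        }
    }

On x86, both the setRelease and the getAcquire compile to plain moves (TSO already provides those orderings), but they still prevent the JIT from reordering the surrounding accesses.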
In a Java application, if accesses to an object's state happen on the same thread (in simplest case, in a single-threaded application), there is no need to synchronize to enforce visibility/consistency of changes, as per happens-before relationship specification:
"Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y)."
But modern architectures are multi-core, so a Java thread can potentially execute on any of them at any given time (unless this is not true and Java threads are pinned to specific cores?). So if that is the case, if a thread writes to variable x, caches that in the L1 cache or a CPU register, and then starts running on another core that previously accessed x and also cached it in a register, that value is inconsistent... Is there some mechanism (implicit memory barrier) when a thread is taken off a CPU?
can potentially execute on any of them at any given time
Tasks don't just spontaneously migrate between cores. These things have to happen:
the task is pre-empted on the core it was previously running on (marking it as waiting to run in the kernel's global task list)
the kernel's task scheduler on another core sees that task waiting for a CPU and decides to run it.
(Scheduling is a distributed algorithm; each core is effectively running the kernel on that core very much like a multi-threaded process. One core can't tell another core what to do, only put data in memory where the kernel running on that core can look at it.)
This isn't a problem because:
Data caches (L1 and so on) are coherent across all cores that a thread could be scheduled on by an OS. Myths Programmers Believe about CPU Caches
Or on hypothetical and unlikely hardware + OS + JVM that runs threads across cores with non-coherent shared memory, the OS would have to flush dirty private cache back to actual shared memory at some point after stopping the task on one core, before putting it in a global task queue where the task scheduler on another core could run it.
Is there some mechanism (implicit memory barrier) when thread is taken off a CPU?
On a real-world system (coherent caches), the OS just has to make sure there's a full memory barrier that drains the store buffer on one core before another core can resume the task.
That barrier is not always implicit as part of something the OS was going to do anyway; the OS kernel might need to explicitly include a barrier just for this. However, saving the register state and marking the task as runnable probably needed at least release stores, so another core that could restore this task's state would also see all user-space stores that task had done.
Still, I've heard of the possibility of breaking a single-threaded process by migrating between CPUs without sufficient barriers. It is something to think about for an OS. It's not at all Java specific; it's about how to not break a single thread running any arbitrary machine code.
Only the registers are truly private, and yes compilers will keep variables in registers. I don't like the term "cached" for that; in asm registers are separate from memory. Compilers can keep the only currently-valid copy of a variable in a register for the duration of a loop, and store it back afterwards.
Every task has its own register state; this is called the "architectural state" and is the context that's saved/restored by a context switch.
Restarting execution of a thread on another core means restoring its saved register state from memory, ending with the program counter: i.e. jumping to the instruction it stopped at by restoring the saved value into the architectural program-counter register, e.g. RIP on x86-64 (the 64-bit version of the "Instruction Pointer" register).
Note that registers and (virtual) memory contents are the entire state of a user-space process (and open file descriptors and other kernel stuff associated with it). But cache state is not. Registers are not cache. Cache is transparent to software (memory reordering happens because of the store buffer and CPU memory parallelism to hide cache misses, not because of cache, on most ISAs). Registers are the asm equivalent of local variables.
Compiler terminology: "caching"
Hoisting a load out of a loop and keeping the value in a register is sometimes described as "caching" a value in a register, but that's just casual terminology that has you confused here. The compiler-developer terminology for that is "Enregistration" of variables.
Or just "hoisting" a load or "sinking" a store out of a loop; normally you need to load a value into a register before you can use it for other things (at least on a RISC that doesn't have memory-source ALU instructions). By hoisting a load of a loop-invariant value you only have to load once, ahead of the loop, and re-read the register multiple times.
Same for stores; if you know that no other threads are allowed to look at the memory location for a variable, only the final value needs to actually get stored with a store instruction. Any other stores of other values would be "dead" if nothing can read them and we know there's a later store. So you keep the variable in a register for the duration of the loop, then store once at the end. This is called "sinking" a store, and is related to dead store elimination.
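A minimal sketch of the "sinking a store" case described above (the class and field are hypothetical): if no other thread is allowed to observe counter mid-loop, a JIT is free to keep it in a register for the whole loop and store only the final value; the intermediate values are dead stores.

    class StoreSinking {
        long counter;   // plain field: intermediate values need never reach memory

        void countUp(int n) {
            for (int i = 0; i < n; i++) {
                counter++;   // may be kept entirely in a register ("enregistered")
            }
            // a single store of the final value at loop exit is all that is required
        }
    }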
Peter Cordes answered you from the implementation level, but it's worth mentioning the specification level:
"If x and y are actions of the same thread and x comes before y in program order, then hb(x, y)."
That pretty much says it all right there. The Java language specification guarantees that if x comes before y at the source code level of your program, then x "happens before" y for purposes of determining memory visibility.
All of the stuff that Peter Cordes said is what guarantees hb(x, y). If any JVM ever failed to do all of that stuff, then it would not be a valid implementation of Java.
Long story shortened: If your code only ever runs in a single thread, then you'll never have to worry about memory visibility.
I have been researching the cost of volatile writes in Java on x86 hardware.
I'm planning on using Unsafe's putLongVolatile method on a shared memory location. Looking into the implementation, putLongVolatile gets translated to Unsafe_SetLongVolatile (link) and subsequently into an AtomicWrite followed by a fence (link).
In short, every volatile write is converted to an atomic write followed by a full fence (MFENCE or a locked add instruction on x86).
Questions:
1) Why is a fence() required on x86? Isn't a simple compiler barrier sufficient because of store-store ordering? A full fence seems awfully expensive.
2) Is putLong instead of putLongVolatile of Unsafe a better alternative? Would it work well in a multi-threaded case?
Answer to question 1:
Without the full fence you do not have sequential consistency which is required for the JMM.
x86 provides TSO, so you get the following barriers for free: [LoadLoad][LoadStore][StoreStore]. The only one missing is [StoreLoad].
A load has acquire semantics
r1=X
[LoadLoad]
[LoadStore]
A store has release semantics
[LoadStore]
[StoreStore]
Y=r2
If you would do a store followed by a load you end up with this:
[LoadStore]
[StoreStore]
Y=r2
r1=X
[LoadLoad]
[LoadStore]
The issue is that the load and store can still be reordered, and hence it isn't sequentially consistent; and this is mandatory for the Java Memory Model. The only way to prevent this is with a [StoreLoad]. And the most logical place would be to add it to the write, since reads are normally more frequent than writes.
And this can be accomplished with an MFENCE or a lock addl $0, (%rsp).
Answer to question 2:
The problem with putLong is that not only can the CPU reorder instructions, the compiler can also change the code in such a way that it leads to instruction reordering.
Example: if you were doing a putLong in a loop, the compiler could decide to pull the write out of the loop, and the value would never become visible to other threads. If you want a low-overhead single-writer performance counter, you might want to have a look at putLongRelease/putLongOrdered (old name). This prevents the compiler from doing the above trick, and the release semantics you get for free on x86.
But it is very difficult to give a one-size-fits-all answer to your second question, because it depends on what your goal is.
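For the performance-counter case mentioned above, here is a hedged sketch using the public API rather than Unsafe (class and method names are illustrative): AtomicLong.lazySet is the release-style store, the documented equivalent of the old putLongOrdered, which on x86 compiles to a plain store without the [StoreLoad] fence.

    import java.util.concurrent.atomic.AtomicLong;

    class SingleWriterCounter {
        private final AtomicLong count = new AtomicLong();

        // Called only by the single writer thread.
        void increment() {
            count.lazySet(count.get() + 1);  // release store: no full fence, write still becomes visible
        }

        // May be called by any reader thread.
        long read() {
            return count.get();              // volatile load: observes a recent value
        }
    }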
My understanding is that the JSR-133 cookbook is a well-quoted guide on how to implement the Java memory model using a series of memory barriers (or at least the visibility guarantees).
It is also my understanding, based on the description of the different types of barriers, that StoreLoad is the only one that guarantees all CPU buffers are flushed to cache, and therefore ensures fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
I was looking at the table of specific barriers required for the different program-order interleavings of volatile/regular stores/loads.
From my intuition this table seems incomplete. For example, the Java memory model guarantees visibility, on the acquire action of a monitor, of all actions performed before its release in another thread, even if the values being updated are non-volatile. In the table in the link above, it seems as if the only actions that flush CPU buffers and propagate changes/allow new changes to be observed are a Volatile Store or MonitorExit followed by a Volatile Load or MonitorEnter. I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore, which to my understanding are only concerned with reordering within a thread and cannot enforce the happens-before guarantee (across threads, that is).
Where have I gone wrong in my understanding here? Or does this implementation only enforce happens-before and not the synchronization guarantees or extra actions on acquiring/releasing monitors?
Thanks
StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensure fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
This may be true for x86 architectures, but you shouldn't be thinking at that level of abstraction. It may be the case that cache coherence is costly for the processors to maintain.
Take mobile devices, for example: one important goal is to reduce the amount of battery that programs consume. In that case, the hardware may not participate in cache coherence, and StoreLoad loses this feature.
I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore which from my understanding are only concerned with re-ordering in a thread and cannot enforce the happens before guarantee (across threads that is).
Let's just consider a volatile field. How would a volatile load and store look? Well, Aleksey Shipilëv has a great write up on this, but I will take a piece of it.
A volatile store and then subsequent load would look like:
<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store
[StoreLoad] // Case (a): Guard after volatile stores
...
[StoreLoad] // Case (b): Guard before volatile loads
int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>
So, <other ops> can be non-volatile writes, but as you can see those writes are committed to memory prior to the volatile store. Then, when we are ready to read, the LoadLoad and LoadStore barriers force a wait until the volatile store succeeds.
Lastly, the StoreLoad before and after ensures the volatile load and store cannot be reordered if they immediately precede one another.
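Rendered as (hypothetical) Java, the sequence above corresponds to something like the following, with the cookbook barriers marked as comments at the points where the JIT conceptually emits them:

    class CookbookSequence {
        int a, b;          // <other ops>: plain fields
        volatile int x;

        void sequence() {
            a = 10;        // plain store
            b = 20;        // plain store
                           // [StoreStore] [LoadStore]
            x = 1;         // volatile store
                           // [StoreLoad]  (case (a): guard after volatile stores)
            int t = x;     // volatile load
                           // [LoadLoad] [LoadStore]
            System.out.println(t + a + b);
        }
    }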
The barriers in the document are abstract concepts that more-or-less map to different things on different CPUs. But they are only guidelines. The rules that the JVMs actually have to follow are those in JLS Chapter 17.
Barriers as a concept are also "global" in the sense that they order all prior and following instructions.
For example, the Java memory model guarantees visibility on the acquire action of a monitor to all actions performed before it's release in another thread, even if the values being updated are non volatile.
Acquiring a monitor is the monitor-enter in the cookbook, which only needs to be visible to other threads that contend on the lock. The monitor-exit is the release action, which prevents loads and stores prior to it from moving below it. You can see this in the cookbook tables, where the first operation is a normal load/store and the second is a volatile-store or monitor-exit.
On CPUs with Total Store Order, the store buffers, where available, have no impact on correctness; only on performance.
In any case, it's up to the JVM to use instructions that provide the atomicity and visibility semantics that the JLS demands. And that's the key take-away: If you write Java code, you code against the abstract machine defined in the JLS. You would only dive into the implementation details of the concrete machine, if coding only to the abstract machine doesn't give you the performance you need. You don't need to go there for correctness.
I'm not sure where you got that StoreLoad barriers are the only type that enforce some particular behavior. All of the barriers, abstractly, enforce exactly what they are defined to enforce. For example, LoadLoad prevents any prior load from reordering with any subsequent load.
There may be architecture specific descriptions of how a particular barrier is enforced: for example, on x86 all the barriers other than StoreLoad are no-ops since the chip architecture enforces the other orderings automatically, and StoreLoad is usually implemented as a store buffer flush. Still, all the barriers have their abstract definition which is architecture-independent and the cookbook is defined in terms of that, along with a mapping of the conceptual barriers to actual ISA-specific implementations.
In particular, even if a barrier is "no-op" on a particular platform, it means that the ordering is preserved and hence all the happens-before and other synchronization requirements are satisfied.
I know that writing to a volatile variable flushes it from the memory of all the CPUs; however, I want to know: are reads of a volatile variable as fast as normal reads?
Can volatile variables ever be placed in the CPU cache, or are they always fetched from main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, also on x86.
Test 1 is a parallel read and write to a non-volatile variable. There
is no visibility mechanism and the results of the reads are
potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. It demonstrates that the semantics of volatile mean the value can change with each loop iteration, so the JVM cannot optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, thus there is no actual "read" occurring.
Credit to Marc Brooker for running these tests.
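As a hedged sketch of the Test 3 scenario (not Brooker's actual benchmark code; the names are made up): a reader spinning on a volatile flag must re-load it on every iteration, so it terminates once the writer's store becomes visible, whereas with a plain field the read could legally be hoisted out of the loop and the loop might never exit.

    class SpinOnFlag {
        static volatile boolean stop = false;  // try removing volatile to see the hoisting hazard

        public static void main(String[] args) throws InterruptedException {
            Thread spinner = new Thread(() -> {
                while (!stop) { }              // volatile read each iteration, cannot be hoisted
                System.out.println("stopped");
            });
            spinner.start();
            Thread.sleep(100);                 // let the spinner run for a while
            stop = true;                       // volatile write becomes visible to the spinner
            spinner.join();
        }
    }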
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
See the JMM cookbook from Doug Lea, in particular the architecture table near the bottom.
To clarify: there is no additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers: LoadLoad, LoadStore, StoreLoad, and StoreStore. Depending on the architecture, some of these barriers correspond to a no-op, meaning no action is taken; others require a fence. There is no implicit cost associated with the load itself, though one may be incurred if a fence is in place. In the case of x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means that some assumptions about the nature of the variable can no longer be made, and some compiler optimizations will not be applied to it.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy-handed locking.
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
Contrary to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high-performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached, and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.