I am trying to understand the performance of volatile variables in Java.
I have seen https://brooker.co.za/blog/2012/09/10/volatile.html, and it seems volatile reads are slow when there is a writer involved. I have not seen any other arguments or benchmarks making the same point.
How would AtomicReference's lazySet affect volatile variable reads?
A regular read and a volatile read are equally cheap from the x86 hardware perspective. A volatile read requires acquire semantics, which the TSO memory model of x86 provides for every load, so a regular load and a volatile load are equally cheap at the hardware level. At the software level there is a difference, since a volatile read prohibits many compiler optimizations.
A lazy set will not change the performance of the reader; just the performance of the writer. On x86 a volatile write is a sequentially consistent write, so a [StoreLoad] is needed, and this requires stopping any loads from executing until the store buffer is drained. A lazySet, a.k.a. an ordered set, places the store in the store buffer and then continues, so it won't stall the CPU. This is purely a writer concern, not a reader concern; a reader will not go any faster or slower.
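To make the writer/reader asymmetry concrete, here is a minimal sketch using AtomicLong; the comments describe the x86 cost of each call under the explanation above:

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch contrasting the writer-side cost of set() vs lazySet().
// On x86 the reader side is identical in both cases: a plain load.
public class LazySetSketch {
    public static void main(String[] args) {
        AtomicLong seq = new AtomicLong();

        seq.set(1L);     // volatile write: lock-prefixed instruction; the CPU
                         // waits for the store buffer to drain ([StoreLoad])
        seq.lazySet(2L); // ordered write: placed in the store buffer, the
                         // writer continues immediately (no StoreLoad stall)

        long v = seq.get(); // volatile read: a plain MOV on x86
        System.out.println(v);
    }
}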
In your case: first determine whether it is actually a problem. In most cases many other issues are at play, and optimizing at this level makes code complex and introduces bugs. If it truly is a problem, I would focus more on contention on the cache line than on the overhead of reading/writing a volatile variable.
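If cache-line contention does turn out to be the problem, one classic (if crude) mitigation is to pad the hot field onto its own cache line. A hypothetical sketch, with illustrative field names; note the JVM is free to reorder fields, which is why the JDK has the internal @Contended annotation (requiring -XX:-RestrictContended for non-JDK classes) for the same purpose:

// Illustrative sketch of manual cache-line padding to reduce false sharing.
class PaddedCounter {
    long p1, p2, p3, p4, p5, p6, p7; // padding: keeps neighbours off the line
    volatile long value;             // the hot field, ideally alone on its 64-byte line
    long q1, q2, q3, q4, q5, q6, q7; // padding on the other side
}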
Related
I have been researching the cost of volatile writes in Java on x86 hardware.
I'm planning on using Unsafe's putLongVolatile method on a shared memory location. Looking into the implementation, putLongVolatile gets translated to Unsafe_SetLongVolatile in Link and subsequently into an AtomicWrite followed by a fence Link
In short, every volatile write is converted to an atomic write followed by a full fence (MFENCE or a locked ADD instruction on x86).
Questions:
1) Why is a fence() required on x86? Isn't a simple compiler barrier sufficient, given x86's store-store ordering? A full fence seems awfully expensive.
2) Is Unsafe's putLong a better alternative to putLongVolatile? Would it work well in a multi-threaded case?
Answer to question 1:
Without the full fence you do not have sequential consistency, which is required by the JMM.
x86 provides TSO, so you get the following barriers for free: [LoadLoad][LoadStore][StoreStore]. The only one missing is the [StoreLoad].
A load has acquire semantics:
r1=X
[LoadLoad]
[LoadStore]
A store has release semantics:
[LoadStore]
[StoreStore]
Y=r2
If you do a store followed by a load, you end up with this:
[LoadStore]
[StoreStore]
Y=r2
r1=X
[LoadLoad]
[LoadStore]
The issue is that the load and store can still be reordered, and hence the result isn't sequentially consistent, which is mandatory for the Java memory model. The only way to prevent this is with a [StoreLoad]. And the most logical place to add it is the write, since normally reads are more frequent than writes.
And this can be accomplished with an MFENCE or a lock addl $0, (%rsp).
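Here is a minimal sketch of the reordering that the [StoreLoad] barrier exists to forbid:

// With plain (non-volatile) fields, the outcome r1 == 0 && r2 == 0 is
// possible on x86: each thread's store may still sit in its own store
// buffer when the other thread performs its load. Making x and y volatile
// adds the [StoreLoad] fence after each store and rules that outcome out.
class StoreLoadDemo {
    int x, y;   // declare these volatile to restore sequential consistency
    int r1, r2;

    void thread1() { x = 1; r1 = y; }
    void thread2() { y = 1; r2 = x; }
}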
Answer to question 2:
The problem with a putLong is that not only can the CPU reorder instructions, the compiler can also change the code in such a way that it leads to instruction reordering.
Example: if you were doing a putLong in a loop, the compiler could decide to hoist the write out of the loop, and the value would never become visible to other threads. If you want a low-overhead single-writer performance counter, have a look at putLongRelease/putLongOrdered (the old name), as in the sketch below. This prevents the compiler from doing that trick, and on x86 you get the release semantics for free.
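As a sketch of that single-writer counter idea: in Java 9+ the putLongOrdered/lazySet behaviour is exposed as VarHandle.setRelease. The class and method names below are illustrative:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of a single-writer performance counter. setRelease is a store with
// release semantics (free on x86), and unlike a plain store the JIT may not
// hoist it out of a loop.
class PerfCounter {
    private static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup()
                    .findVarHandle(PerfCounter.class, "count", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    private volatile long count;

    void increment() {                      // called by the single writer only
        COUNT.setRelease(this, count + 1);  // ordered store: no StoreLoad stall
    }

    long read() {                           // called by any monitoring thread
        return count;                       // volatile read: cheap on x86
    }
}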
But it is very difficult to give a one-size-fits-all answer to your second question, because it depends on what your goal is.
The article "Atomic*.lazySet is a performance win for single writers," goes over how lazySet is a weak volatile write (in the sense that it acts as a store-store and not a store-load fence). But I don't understand how leveraging semi-volatile writes improves concurrent queue performance. How exactly does it offer extra low latency as claimed by Menta-queue?
I have already read up on its implementation and its claims in the Stack Overflow questions "How is lazySet in Java's Atomic* classes implemented" and "Atomic Integer's lazySet vs set".
The problem with a volatile write on x86 is that it issues a full memory barrier, which results in a stall until the store buffer is drained. Meanwhile, lazySet on x86 is a simple store: it does not require all previous stores waiting in the store buffer to be flushed, thus allowing the writing thread to proceed at full speed.
This is described a bit in Martin Thompson's article.
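The pattern those articles describe looks roughly like the following sketch (all names are illustrative): the single producer writes its data, then publishes a sequence with lazySet, paying only store-store ordering instead of a full fence; the consumer's volatile read of the sequence then makes the data visible:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative single-producer pattern. The producer writes the slot first,
// then publishes the sequence with lazySet; its store-store ordering means a
// consumer that observes the new sequence also observes the slot contents,
// while the producer never pays a StoreLoad stall.
class SingleWriterPublisher {
    private final long[] slots = new long[1024];
    private final AtomicLong published = new AtomicLong(-1);

    void publish(long seq, long value) {      // single producer only
        slots[(int) (seq & 1023)] = value;    // 1. write the data (plain store)
        published.lazySet(seq);               // 2. publish (ordered store)
    }

    long poll(long seq) {                     // consumer
        while (published.get() < seq) { }     // volatile read in a spin loop
        return slots[(int) (seq & 1023)];
    }
}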
I've been reading about volatile (https://www.ibm.com/developerworks/java/library/j-jtp06197/) and came across a passage saying that a volatile write is much more expensive than a nonvolatile write.
I can understand that there would be an increased cost associated with a volatile write, given that volatile is a means of synchronization, but I want to know exactly why a volatile write is so much more expensive than a nonvolatile write; does it perhaps have to do with visibility across different thread stacks at the time the volatile write is made?
Here's why, according to the article you have indicated:
Volatile writes are considerably more expensive than nonvolatile writes because of the memory fencing required to guarantee visibility but still generally cheaper than lock acquisition.
[...] volatile reads are cheap -- nearly as cheap as nonvolatile reads
And that is, of course, true: memory fence operations are bound to writes, and reads execute the same way regardless of whether the underlying variable is volatile or not.
However, volatile in Java is about much more than just volatile vs. nonvolatile memory read. In fact, in its essence it has nothing to do with that distinction: the difference is in the concurrent semantics.
Consider this notorious example:
volatile boolean runningFlag = true;

void run() {
    while (runningFlag) { /* do work */ }
}
If runningFlag wasn't volatile, the JIT compiler could essentially rewrite that code to
void run() {
    if (runningFlag) while (true) { /* do work */ }
}
The ratio of the overhead of reading runningFlag on each iteration to not reading it at all is, needless to say, enormous.
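For a runnable version of this hazard, here is a small sketch: with the volatile modifier commented out, the JIT may hoist the flag read and the program can hang; with it, the worker stops after about a second:

// Sketch of the hoisting hazard: without volatile, the reader loop may be
// compiled as `if (running) while (true) ...` and never observe the update.
public class HoistDemo {
    static /* volatile */ boolean running = true; // uncomment volatile to fix

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { /* do work */ }
            System.out.println("worker stopped");
        });
        worker.start();
        Thread.sleep(1000);
        running = false;  // without volatile, the worker may never see this
        worker.join();
    }
}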
It is about caching. Since modern processors use caches, if you don't mark the variable volatile, the data can stay in a cache and the write operation is fast (the cache is close to the processor). If the variable is marked volatile, the system needs to make the write visible beyond the local cache, as if it were written all the way out to memory, and that is a slower operation.
And yes, you are thinking along the right lines: it does have something to do with different thread stacks, since each is separate and reads from the SAME memory, but not necessarily from the same cache. Today's processors use many levels of caching, so this can be a big problem if multiple threads/processes are using the same data.
EDIT: If the data stays in a local cache, other threads/processes won't see the change until the data is written back to memory.
Most likely it has to do with the fact that a volatile write has to stall the pipeline.
All writes are queued to be written to the caches. You don't see this cost with non-volatile writes/reads, as the code can just get the value you just wrote without involving the cache.
When you use a volatile read, it has to go back to the cache, which means the write (as implemented) cannot continue until the store has been written to the cache (in case you do a write followed by a read).
One way around this is to use a lazy write, e.g. AtomicInteger.lazySet(), which can be ~10x faster than a volatile write as it doesn't wait.
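A micro-benchmark sketch of that comparison using JMH (assuming JMH is on the classpath; the exact speedup varies by hardware, so treat the 10x figure as indicative):

import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Compares a full volatile write against an ordered (lazySet) write.
@State(Scope.Thread)
public class WriteCostBench {
    private final AtomicLong lazy = new AtomicLong();
    private volatile long vol;
    private long next;

    @Benchmark
    public void volatileWrite() { vol = next++; }     // full fence per write

    @Benchmark
    public void lazyWrite() { lazy.lazySet(next++); } // ordered store, no stall
}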
Initially I thought a volatile variable was better than the synchronized keyword, as it did not involve BLOCKING or CONTEXT SWITCHING. But after reading this I am now confused.
Is volatile implemented in a non-blocking approach using low level atomic locks or no?
Is volatile implemented in a non-blocking approach using low level atomic locks or no?
Volatile's implementation varies between processors, but it is a non-blocking field load/store: it is usually implemented via memory fences but can also be managed with cache-coherence protocols.
I just read that post. That poster is actually incorrect in his explanation of the volatile vs. synchronized flow, and someone corrected him in a comment. Volatile will not hold a lock. You may read that a volatile store is similar to a synchronized release and a volatile load is similar to a synchronized acquire, but that pertains only to memory visibility, not to the actual implementation details.
Is volatile implemented in a non-blocking approach using low level atomic locks or no?
Use of volatile erects a memory barrier around the field in question. This does not cause a thread to be put into the BLOCKING state. However, when the volatile field is accessed, the program has to flush changes to main memory and update the cache, which takes cycles. It may result in a context switch but doesn't necessarily cause one.
It's true that volatile does not cause blocking.
However, the statement
a volatile variable was better than synchronized keyword as it did not
involve BLOCKING or CONTEXT SWITCHING.
is very debatable and depends heavily on what you are trying to do. volatile is not equivalent to a lock, and declaring a variable volatile gives no guarantees about the atomicity of operations in which that variable is involved, e.g. increment.
What volatile does is prevent the compiler and/or CPU from reordering instructions around the specific variable, or caching it. This is known as a memory fence. This nasty little mechanism is required to ensure that in a multithreaded environment all threads reading a specific variable have an up-to-date view of its value. This is called visibility and is different from atomicity.
Atomicity can only be guaranteed in the general case by the use of locks (synchronized) or atomic primitives.
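A small sketch of that distinction: two threads incrementing a volatile int can still lose updates, because ++ is a read-modify-write of three separate steps, while AtomicInteger performs the increment atomically:

import java.util.concurrent.atomic.AtomicInteger;

// volatile gives visibility, not atomicity: volatileCount++ can interleave
// between threads and lose updates; incrementAndGet cannot.
public class AtomicityDemo {
    static volatile int volatileCount = 0;
    static final AtomicInteger atomicCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                volatileCount++;               // NOT atomic: may lose updates
                atomicCount.incrementAndGet(); // atomic
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // volatileCount is typically < 200000; atomicCount is exactly 200000
        System.out.println(volatileCount + " vs " + atomicCount.get());
    }
}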
What can, however, be confusing is the fact that using synchronization mechanisms also generates an implicit memory fence, so declaring a variable volatile if you're only ever going to read/write it inside synchronized blocks is redundant.
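For example, in a class like the following sketch, the monitor's implicit fences already provide the visibility that volatile would, so adding volatile to the field buys nothing:

// Every access goes through the lock, so the monitor enter/exit fences
// already guarantee visibility; marking the field volatile is redundant.
class SyncBox {
    private int value; // no volatile needed: always accessed under the lock

    synchronized void set(int v) { value = v; } // release on monitor exit
    synchronized int get() { return value; }    // acquire on monitor enter
}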
volatile is a Java language modifier, and how it provides its guarantees comes down to the JVM implementation. Put simply, if you mark a primitive field volatile, you guarantee that whichever thread reads the field will read the most recent value. It basically prohibits certain behind-the-scenes JVM optimizations and forces all threads to cross the memory barrier when reading the volatile primitive.
BLOCKING: volatile is non-blocking in that threads don't wait for each other when reading the same volatile variable; they do so without mutual exclusion. They do, however, trigger fences at the hardware level so that "happens-before" semantics are observed (no memory reordering).
To make this clearer: a volatile variable is non-blocking because whenever it is read or written by multiple threads concurrently, the CPU cores tied to those threads communicate directly with main memory or via CPU cache coherency (depending on the hardware/JVM implementation), and no locking mechanism is put in place.
CONTEXT SWITCHING
The volatile keyword does not itself trigger context switching by its semantics, but one is possible, depending on lower-level implementations.
I know that writing to a volatile variable flushes it from the caches of all the CPUs; however, I want to know whether reads of a volatile variable are as fast as normal reads.
Can volatile variables ever be placed in the CPU cache, or are they always fetched from main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, also on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. What it demonstrates is that the semantics of volatile mean the value can change on each loop iteration, so the JVM cannot optimize the read by hoisting it out of the loop. In Test 1, it is likely the value was read and stored once, so no actual "read" occurred on each iteration.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
See the JMM Cookbook from Doug Lea, in particular the architecture table near the bottom.
To clarify: there is no additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers: LoadLoad, LoadStore, StoreLoad, and StoreStore. Depending on the architecture, some of these barriers correspond to a no-op, meaning no action is taken; others require a fence. There is no implicit cost associated with the load itself, though one may be incurred if a fence is in place. In the case of x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy-handed locking.
It is architecture dependent. What volatile does is tell the compiler not to optimise the variable away: it forces most operations to treat the variable's state as unknown, because it could be changed at any time by another thread or by some other hardware operation. So reads need to re-read the variable, and updates become operations of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch the current value as if from the actual memory address; the variable cannot simply be served from a register or a possibly stale cached copy.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by keeping its value in a CPU register; it must be accessed from memory. It may, however, be placed in a CPU cache: the cache will guarantee consistency with any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If the system was designed as such, the hardware will prevent that address space from being cached, and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write bypasses the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.