Increased cost of a volatile write over a nonvolatile write - java

I've been reading about volatile (https://www.ibm.com/developerworks/java/library/j-jtp06197/) and came across a bit that says that a volatile write is so much more expensive than a nonvolatile write.
I can understand that there would be an increased cost associated with a volatile write, given that volatile is a form of synchronization, but I want to know exactly how a volatile write is so much more expensive than a nonvolatile write. Does it perhaps have to do with visibility across different thread stacks at the time the volatile write is made?

Here's why, according to the article you have indicated:
Volatile writes are considerably more expensive than nonvolatile writes because of the memory fencing required to guarantee visibility but still generally cheaper than lock acquisition.
[...] volatile reads are cheap -- nearly as cheap as nonvolatile reads
And that is, of course, true: memory fence operations are always bound to writing and reads execute the same way regardless of whether the underlying variable is volatile or not.
However, volatile in Java is about much more than just volatile vs. nonvolatile memory read. In fact, in its essence it has nothing to do with that distinction: the difference is in the concurrent semantics.
Consider this notorious example:
volatile boolean runningFlag = true;
void run() {
    while (runningFlag) { /* do work */ }
}
If runningFlag weren't volatile, the JIT compiler could essentially rewrite that code to
void run() {
    if (runningFlag) while (true) { /* do work */ }
}
The ratio of overhead introduced by reading the runningFlag on each iteration against not reading it at all is, needless to say, enormous.
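To make the example above concrete, here is a minimal runnable sketch of the same stop-flag pattern. The class name StopFlagDemo, the 50 ms sleep, and the 5 second join timeout are my own choices for illustration and are not from the article.

```java
import java.util.concurrent.TimeUnit;

class StopFlagDemo {
    // volatile: the loop below must re-read the flag on every iteration.
    private static volatile boolean runningFlag = true;

    /** Starts a worker, asks it to stop, and reports whether it terminated. */
    static boolean runAndStop() {
        runningFlag = true;
        Thread worker = new Thread(() -> {
            long iterations = 0;
            while (runningFlag) {
                iterations++; // stand-in for "do work"
            }
        });
        worker.setDaemon(true); // never keep the JVM alive because of the demo
        worker.start();
        try {
            TimeUnit.MILLISECONDS.sleep(50);
            runningFlag = false; // volatile write, guaranteed to become visible
            worker.join(5_000);  // bounded wait so the demo cannot hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runAndStop());
    }
}
```

If the volatile modifier is removed, the JIT is allowed to hoist the read out of the loop, in which case the worker may never observe the update and the join may time out. Whether that actually happens depends on the JVM and on when the loop gets compiled.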

It is about caching. Since modern processors use caches, if you don't specify volatile the data stays in cache and the write is fast (since the cache is near the processor). If a variable is marked as volatile, the system needs to write it fully into memory, and that is a somewhat slower operation.
And yes, you are thinking along the right lines: it has something to do with different thread stacks, since each thread is separate and reads from the SAME memory, but not necessarily from the same cache. Today's processors use many levels of caching, so this can be a big problem if multiple threads/processes are using the same data.
EDIT: If data stays in local cache other threads/processes won't see change until data is written back in memory.

Most likely it has to do with the fact that a volatile write has to stall the pipeline.
All writes are queued to be written to the caches. You don't see this with non-volatile writes/reads as the code can just get the value you just wrote without involving the cache.
When you use a volatile read, it has to go back to the cache, and this means the write (as implemented) cannot continue until the write has been written to the cache (in case you do a write followed by a read).
One way around this is to use a lazy write e.g. AtomicInteger.lazySet() which can be 10x faster than a volatile write as it doesn't wait.

Related

JAVA volatile variable read performance with LazySet

I am trying to understand the performance of volatile variables in Java.
I read https://brooker.co.za/blog/2012/09/10/volatile.html and it seems volatile reads are slow when there is a writer involved. I have not seen any other arguments or benchmarks mentioning the same.
How would AtomicReference lazySet affect volatile variable reads?
From the hardware perspective on x86, a regular read and a volatile read are equally cheap. A volatile read requires acquire semantics, which the TSO memory model of x86 already provides. So both a regular load and a volatile load have acquire semantics and are equally cheap. At the software level there is a difference, since a volatile read prohibits many compiler optimizations.
A lazy set will not change the performance of the reader, just the performance of the writer. On x86 a volatile write is a sequentially consistent write, so a [StoreLoad] barrier is needed, and this requires stopping any loads from being executed until the store buffer is drained. A lazySet (a.k.a. orderedSet) places the store in the store buffer and then continues, so it won't stall the CPU. This is purely a writer concern, not a reader concern, so a reader will not go any faster or slower.
In your case: first determine whether it is actually a problem. In most cases many other issues are at play, and optimizing at this level makes code complex and introduces bugs. If it truly is a problem, I would be more focused on contention on the cache line than on the overhead of reading/writing a volatile variable.
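For reference, a small sketch of the two write flavours discussed above. It only shows the API surface (AtomicInteger.set vs. AtomicInteger.lazySet) and makes no timing claims. The class and method names are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LazySetDemo {
    /** Writes with set() and lazySet(), then reads the value from a second thread. */
    static int writeThenReadFromOtherThread() {
        AtomicInteger value = new AtomicInteger();
        value.set(1);     // volatile write: on x86 this implies a StoreLoad barrier
        value.lazySet(2); // ordered write: goes to the store buffer, no StoreLoad needed

        int[] seen = new int[1];
        // Thread.start() orders the writes above before the reader's get().
        Thread reader = new Thread(() -> seen[0] = value.get());
        reader.start();
        try {
            reader.join(); // join() makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(writeThenReadFromOtherThread());
    }
}
```

The reader side is identical in both cases (a plain get()), which matches the point that lazySet only changes what the writer pays.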

Memory barriers on entry and exit of Java synchronized block

I came across answers, here on SO, about Java flushing the working copy of variables within a synchronized block on exit. Similarly, it syncs all the variables from main memory once on entry into the synchronized section.
However, I have some fundamental questions around this:
What if I access mostly non-volatile instance variables inside my synchronized section? Will the JVM automatically cache those variables into the CPU registers at the time of entering into the block and then carry all the necessary computations before finally flushing them back?
I have a synchronized block as below:
The variables prefixed with an underscore, e.g. _callStartsInLastSecondTracker, are all instance variables which I access heavily in this critical section.
public CallCompletion startCall()
{
    long currentTime;
    Pending pending;
    synchronized (_lock)
    {
        currentTime = _clock.currentTimeMillis();
        _tracker.getStatsWithCurrentTime(currentTime);
        _callStartCountTotal++;
        _tracker._callStartCount++;
        if (_callStartsInLastSecondTracker != null)
            _callStartsInLastSecondTracker.addCall();
        _concurrency++;
        if (_concurrency > _tracker._concurrentMax)
        {
            _tracker._concurrentMax = _concurrency;
        }
        _lastStartTime = currentTime;
        _sumOfOutstandingStartTimes += currentTime;
        pending = checkForPending();
    }
    if (pending != null)
    {
        pending.deliver();
    }
    return new CallCompletionImpl(currentTime);
}
Does this mean that all these operations, e.g. +=, ++, > etc., require the JVM to interact with main memory repeatedly? If so, can I use local variables to cache them (preferably stack allocation for primitives), perform the operations, and in the end assign them back to the instance variables? Will that help to optimise the performance of this block?
I have such blocks in other places as well. When running JProfiler, I observed that most of the time threads are in the WAITING state and throughput is also very low. Hence the need for optimisation.
Appreciate any help here.
(I don't know Java that well, just the underlying locking and memory-ordering concepts that Java is exposing. Some of this is based on assumptions about how Java works, so corrections welcome.)
I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block.
i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply.
The normal rules for non-volatile vars are like in C++: the JIT-compiler can keep private copies / temporaries and do full optimization. The closing } makes any assignments visible before marking the lock as released, so any other thread that runs the same synchronized block will see those changes.
But if you read/write those variables outside a synchronized(_lock) block while this synchronized block is executing, there's no ordering guarantee and only whatever atomicity guarantee Java has. Only volatile would force a JVM to re-read a variable on every access.
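A minimal sketch of the rule described above: plain (non-volatile) fields that are only ever touched while holding the same lock. The class SynchronizedCounterDemo and its field names are loosely modelled on the question's code and are otherwise hypothetical.

```java
class SynchronizedCounterDemo {
    private final Object lock = new Object();
    // Plain fields: inside the block the JIT may keep them in registers.
    private long callStartCountTotal;
    private long concurrency;
    private long concurrentMax;

    void startCall() {
        synchronized (lock) { // acquire: see what the previous lock holder wrote
            callStartCountTotal++;
            concurrency++;
            if (concurrency > concurrentMax) {
                concurrentMax = concurrency;
            }
        } // release: all of the writes above are visible to the next lock holder
    }

    long total() {
        synchronized (lock) { // reads must take the same lock to get the guarantee
            return callStartCountTotal;
        }
    }

    /** Runs startCall() from several threads and returns the final total. */
    static long runThreads(int threads, int callsPerThread) {
        SynchronizedCounterDemo stats = new SynchronizedCounterDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int c = 0; c < callsPerThread; c++) {
                    stats.startCall();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stats.total();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 10_000));
    }
}
```

No update is lost even though none of the fields is volatile, because every access happens under the same lock. Copying the fields into locals by hand inside the block is something the JIT is already free to do.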
most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
The things you're worried about wouldn't really explain this. Inefficient code-gen inside the critical section would make it take somewhat longer, and that could lead to extra contention.
But there wouldn't be a big enough effect to make most threads be blocked waiting for locks (or I/O?) most of the time, compared to having most threads actively running most of the time.
@Kayaman's comment is most likely correct: this is a design issue, doing too much work inside one big mutex. I don't see loops inside your critical section, but presumably some of those methods you call contain loops or are otherwise expensive, and no other thread can enter this synchronized(_lock) block while one thread is in it.
The theoretical worst case slowdown for store/reload from memory (like compiling C in anti-optimized debug mode) vs. keeping a variable in a register would be for something like while (--shared_var >= 0) {}, giving maybe a 6x slowdown on current x86 hardware. (1 cycle latency for dec eax vs. that plus 5 cycle store-forwarding latency for a memory-destination dec). But that's only if you're looping on a shared var, or otherwise creating a dependency chain through repeated modification of it.
Note that a store buffer with store-forwarding still keeps it local to the CPU core without even having to commit to L1d cache.
In the much more likely case of code that just reads a var multiple times, anti-optimized code that really loads every time can have all those loads hit in L1d cache very efficiently. On x86 you'd probably barely notice the difference, with modern CPUs having 2/clock load throughput, and efficient handling of ALU instructions with memory source operands, like cmp eax, [rdi] being basically as efficient as cmp eax, edx.
(CPUs have coherent caches so there's no need for flushing or going all the way to DRAM to ensure you "see" data from other cores; a JVM or C compiler only has to make sure the load or store actually happens in asm, not optimized into a register. Registers are thread-private.)
As I said, there's no reason to expect that your JVM is doing this anti-optimization inside synchronized blocks. Even if it were, it might make a 25% slowdown.
You are accessing members on a single object. So when the CPU reads the _lock member, it needs to load the cache line containing _lock member first. So probably quite a few of the member variables will be on the same cache line which is already in your cache.
I would be more worried about the synchronized block itself IF you have determined it is actually a problem; it might not be a problem at all. For example Java uses quite a few lock optimization techniques like biased locking, adaptive spin lock to reduce the costs of locks.
But if it is a contended lock, you might want to make the duration of the lock shorter by moving as much out of the lock as possible and perhaps even get rid of the whole lock and switch to a lock free approach.
I would not trust JProfiler for a second.
http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html
So it might be that JProfiler is pointing you in the wrong direction.
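As a rough illustration of the "get rid of the lock" suggestion above, here is a sketch that keeps a few of the question's counters with java.util.concurrent.atomic classes instead of a synchronized block. The class LockFreeCallStats and the endCall method are hypothetical. Note the trade-off: the counters are no longer updated together atomically, so this only fits if readers can tolerate slightly inconsistent snapshots.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class LockFreeCallStats {
    private final LongAdder callStartCountTotal = new LongAdder(); // write-heavy counter
    private final AtomicLong concurrency = new AtomicLong();
    private final AtomicLong concurrentMax = new AtomicLong();

    void startCall() {
        callStartCountTotal.increment();
        long now = concurrency.incrementAndGet();
        concurrentMax.accumulateAndGet(now, Math::max); // atomic "max so far"
    }

    void endCall() {
        concurrency.decrementAndGet();
    }

    /** Returns {total calls, current concurrency, max concurrency} after all threads finish. */
    static long[] runThreads(int threads, int callsPerThread) {
        LockFreeCallStats stats = new LockFreeCallStats();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int c = 0; c < callsPerThread; c++) {
                    stats.startCall();
                    stats.endCall();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {
            stats.callStartCountTotal.sum(),
            stats.concurrency.get(),
            stats.concurrentMax.get()
        };
    }

    public static void main(String[] args) {
        long[] r = runThreads(4, 10_000);
        System.out.println("total=" + r[0] + " concurrency=" + r[1] + " max=" + r[2]);
    }
}
```

Whether this is actually faster than the lock depends on contention, so measure before and after.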

how/why does writing to a volatile variable cause flushing of other variables?

I was reading about concurrency in Java, including the volatile variable, for example here: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
The following quote is very interesting but I don't quite understand it yet:
In effect, because the new memory model places stricter constraints on
reordering [by e.g. the processor for efficiency] of volatile field accesses with other field accesses,
volatile or not, anything that was visible to thread A when it writes
to volatile field f becomes visible to thread B when it reads f.
I already understood that a volatile variable cannot be cached in registers, so any write by any thread will be immediately visible by all other threads. Also according to this (https://docs.oracle.com/javase/tutorial/essential/concurrency/atomic.html) reads and writes on volatile variables are atomic (not sure if that would include something like x++, but it's beside the point to this post).
But the quote I provided seems to imply something in addition to that. It says that anything visible to thread A will now be visible to thread B.
So just to make sure I have that right, does this mean that when a thread writes to a volatile variable, it does a full dump of its entire processor registers to main memory? Can you give some more context about how and why this happens? It might also help to compare/contrast this with synchronization (does it follow a similar mechanism or different?). Also, examples never hurt with something as complex as this :).
On x64, the JIT produces an instruction with a read or write barrier. The implementation is in hardware, not software.
does this mean that when a thread writes to a volatile variable, it does a full dump of its entire processor registers to main memory?
No, only data written to memory is flushed. Not registers.
Can you give some more context about how and why this happens?
The CPU implements this using an L2 cache coherency protocol (depending on the CPU)
Note: on a single cpu system, it doesn't need to do anything.
It would also help to compare/contrast this with synchronization (does it follow a similar mechanism or different?).
It uses the same instructions.
Also, examples never hurt with something as complex as this :).
When you read, it adds a read barrier.
When you write, it adds a write barrier.
The CPU then ensures the data stored in your L1 & L2 cache is appropriately synchronised with other CPUs.
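Since the question asked for an example, here is a minimal sketch of the guarantee quoted from the JSR-133 FAQ: a plain field written before a volatile write is visible to a thread that reads the volatile field and sees the new value. The class PiggybackDemo and its field names are made up for illustration. Thread.onSpinWait() requires Java 9 or later.

```java
class PiggybackDemo {
    private int payload;            // plain, non-volatile field
    private volatile boolean ready; // the volatile field "f" from the quote

    /** Publishes payload through the volatile flag and returns what the reader saw. */
    static int run() {
        PiggybackDemo shared = new PiggybackDemo();
        int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            while (!shared.ready) { // volatile read (thread B reads f)
                Thread.onSpinWait();
            }
            seen[0] = shared.payload; // guaranteed to see 42, not 0
        });
        reader.start();

        shared.payload = 42; // ordinary write, visible to thread A
        shared.ready = true; // volatile write (thread A writes f)

        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Nothing is dumped from registers here. The compiler and CPU are simply not allowed to move the write to payload after the write to ready, or the read of payload before the read of ready.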
Yes, you are correct. This is exactly what happens. It is related to passing a so-called memory barrier. More details here: https://dzone.com/articles/memory-barriersfences

Usage of lazySet on AtomicXXX in Java

From this question: AtomicInteger lazySet vs. set and from this link: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/package-summary.html
I could gather the following points:
lazySet could be faster than set
lazySet uses store-store barrier (writes before are honored but not the contended writes, which were yet to happen)
I could find one use-case where it could be applied, from the documentation :
Use lazySet when you want to null out a pointer to aid GC.
Are there any other practical use-cases for lazySet ?
Caffeine uses lazy or relaxed writes in many of its data structures.
When nulling out a field (e.g. ConcurrentLinkedStack)
When writing to volatile fields before publishing (e.g. SingleConsumerQueue)
When publish is safely delayable (e.g. BoundedBuffer)
When races are benign (e.g. cache expiration timestamps)
When inside a lock (e.g. BoundedLocalCache)
ConcurrentLinkedQueue uses relaxed writes prior to publishing a node and may lazily set a node's next field (prior to publishing or to indicate a stale traversal).
You may also enjoy reading the Linux Kernel Memory Barriers paper.
TL;DR How to use .lazySet()? With care, if at all.
The main problem here is that AtomicXXX.lazySet() is a low-level performance optimization and it is outside the current JLS. You can't prove the correctness of your concurrent code with JMM tools if you are using lazySet().
Why it is much faster than volatile write?
Main difference between set and lazySet is absence of StoreLoad barrier.
JSR-133 Cookbook for Compiler Writers:
StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind.
Moreover, on most popular x86-based hardware StoreLoad is the only explicit barrier (others are just no-op's and cost nothing), so with lazySet you eliminate all (explicit) memory barriers.
Guarantees of lazySet
From the point of JLS there isn't any.
But in practice you can reason about lazySet as a delayed write which cannot be reordered with any previous write and which will happen eventually. "Eventually" means finite time, provided your process makes any progress (e.g. any synchronization action occurs; in addition, the size of the processor's store buffer is finite). If the written value has become visible to another thread, you can be sure that all previous writes are visible too (although you cannot formally prove it). So you can treat it as a delayed happens-before relationship (but, of course, it's not even close to its strict and formal definition).
Usage
The most practical usage (apart from nulling out references) is making writes far cheaper in terms of progress. The simplest example is using lazySet() instead of set() within a synchronized block (but in this case there is no great performance impact). Or you can use it instead of ordinary writes in single-producer cases, where there is no competition on writes.
Disruptor developers are using lazySet exactly for this purpose in their lock-free implementation. Again, it's very hard to argue about correctness of such code, but it's good trick to be aware of.
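A rough sketch of the single-producer idea mentioned above: the producer fills a slot with a plain write and then publishes its index with lazySet, while the consumer polls with a normal get(). This is not the Disruptor's actual code. The class SingleProducerDemo and its layout are assumptions for illustration, and it leans on the release-style ordering that lazySet provides in practice.

```java
import java.util.concurrent.atomic.AtomicLong;

class SingleProducerDemo {
    /** Produces 1..count on one thread and sums them on the calling thread. */
    static long produceAndConsume(int count) {
        long[] slots = new long[count];
        AtomicLong published = new AtomicLong(-1); // index of the last published slot

        Thread producer = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                slots[i] = i + 1;     // plain write of the payload
                published.lazySet(i); // ordered publish, no StoreLoad barrier
            }
        });
        producer.start();

        long sum = 0;
        int next = 0;
        while (next < count) {
            long available = published.get(); // volatile read on the consumer side
            while (next <= available) {
                sum += slots[next++];         // slots up to 'available' are visible
            }
        }

        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(produceAndConsume(1_000));
    }
}
```

Only one thread ever writes to published, which is what makes the cheaper write reasonable here. With several producers you would need a CAS or another form of coordination.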
I would think many uses of AtomicBoolean would benefit from the usage of lazySet() because they are often used as flags to indicate whether something is complete or not, or an outer loop should finish.
This is because in this case the value is initially one value and it eventually becomes another value and then stays there. Obviously this argument applies to almost any atomic that is used in that way.
public void test() {
    final AtomicBoolean finished = new AtomicBoolean(false);
    new Thread(new Runnable() {
        @Override
        public void run() {
            while (!finished.get()) {
                // A long process.
                if (wereAllDone()) {
                    finished.lazySet(true);
                }
            }
        }
    }).start();
}

Are volatile variable 'reads' as fast as normal reads?

I know that writing to a volatile variable flushes it from the memory of all the cpus, however I want to know if reads to a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, even on x86.
Test 1 is a parallel read and write to a non-volatile variable. There
is no visibility mechanism and the results of the reads are
potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. It demonstrates that the semantics of volatile mean the value can change with each loop iteration. Thus the JVM cannot optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, so there is no actual "read" occurring.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.
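For readers who want to see the four JSR-133 barrier kinds in code, Java 9 and later expose explicit fences as static methods on java.lang.invoke.VarHandle. The sketch below is only an illustration of how those fences line up with the barrier names. The class FenceSketch is hypothetical, and real code should normally just use volatile or the atomic classes.

```java
import java.lang.invoke.VarHandle;

class FenceSketch {
    private int data;
    private int flag; // deliberately plain: ordering comes from the explicit fences

    void publish(int value) {
        data = value;
        VarHandle.releaseFence(); // LoadStore + StoreStore: earlier accesses stay before the flag store
        flag = 1;
        VarHandle.fullFence();    // includes StoreLoad, the one that costs a real fence on x86
    }

    /** Returns the published value, or -1 if nothing has been published yet. */
    int tryRead() {
        int f = flag;
        VarHandle.acquireFence(); // LoadLoad + LoadStore: later accesses stay after the flag load
        return f == 1 ? data : -1;
    }

    /** Single-threaded round trip, just to show the calls compile and run. */
    static int roundTrip(int value) {
        FenceSketch s = new FenceSketch();
        s.publish(value);
        return s.tryRead();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(7));
    }
}
```

On x86, per the cookbook table referenced above, only the StoreLoad part needs an actual fence instruction. The other fences mainly restrict what the JIT compiler may reorder.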
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.
