Memory barriers on entry and exit of Java synchronized block - java

I came across answers here on SO about Java flushing the working copies of variables within a synchronized block on exit. Similarly, it syncs all variables from main memory once, on entry into the synchronized section.
However, I have some fundamental questions around this:
What if I access mostly non-volatile instance variables inside my synchronized section? Will the JVM automatically cache those variables in CPU registers at the time of entering the block, carry out all the necessary computations, and only then flush them back?
I have a synchronized block as below:
The underscore-prefixed variables, e.g. _callStartsInLastSecondTracker, are all instance variables that I access heavily in this critical section.
public CallCompletion startCall()
{
  long currentTime;
  Pending pending;
  synchronized (_lock)
  {
    currentTime = _clock.currentTimeMillis();
    _tracker.getStatsWithCurrentTime(currentTime);
    _callStartCountTotal++;
    _tracker._callStartCount++;
    if (_callStartsInLastSecondTracker != null)
      _callStartsInLastSecondTracker.addCall();
    _concurrency++;
    if (_concurrency > _tracker._concurrentMax)
    {
      _tracker._concurrentMax = _concurrency;
    }
    _lastStartTime = currentTime;
    _sumOfOutstandingStartTimes += currentTime;
    pending = checkForPending();
  }
  if (pending != null)
  {
    pending.deliver();
  }
  return new CallCompletionImpl(currentTime);
}
Does this mean that all these operations (e.g. +=, ++, >) require the JVM to interact with main memory repeatedly? If so, can I cache them in local variables (preferably stack-allocated, for primitives), perform the operations, and assign them back to the instance variables at the end? Would that help optimise the performance of this block?
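For illustration, this is the kind of rewrite I have in mind (a hypothetical sketch only, reusing the field names from the snippet above):

synchronized (_lock)
{
  // copy the hot fields into locals and work on the locals...
  long callStartCountTotal = _callStartCountTotal;
  int concurrency = _concurrency;

  currentTime = _clock.currentTimeMillis();
  _tracker.getStatsWithCurrentTime(currentTime);
  callStartCountTotal++;
  _tracker._callStartCount++;
  if (_callStartsInLastSecondTracker != null)
    _callStartsInLastSecondTracker.addCall();
  concurrency++;
  if (concurrency > _tracker._concurrentMax)
  {
    _tracker._concurrentMax = concurrency;
  }
  _lastStartTime = currentTime;
  _sumOfOutstandingStartTimes += currentTime;

  // ...and write them back once, just before leaving the block
  _callStartCountTotal = callStartCountTotal;
  _concurrency = concurrency;

  pending = checkForPending();
}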
I have such blocks in other places as well. On running JProfiler, it has been observed that most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
Appreciate any help here.

(I don't know Java that well, just the underlying locking and memory-ordering concepts that Java is exposing. Some of this is based on assumptions about how Java works, so corrections welcome.)
I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block.
i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply.
The normal rules for non-volatile vars are like in C++: the JIT-compiler can keep private copies / temporaries and do full optimization. The closing } makes any assignments visible before marking the lock as released, so any other thread that runs the same synchronized block will see those changes.
But if you read/write those variables outside a synchronized(_lock) block while this synchronized block is executing, there's no ordering guarantee and only whatever atomicity guarantee Java has. Only volatile would force a JVM to re-read a variable on every access.
most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
The things you're worried about wouldn't really explain this. Inefficient code-gen inside the critical section would make it take somewhat longer, and that could lead to extra contention.
But there wouldn't be a big enough effect to make most threads be blocked waiting for locks (or I/O?) most of the time, compared to having most threads actively running most of the time.
#Kayaman's comment is most likely correct: this is a design issue, doing too much work inside one big mutex. I don't see loops inside your critical section, but presumably some of those methods you call contain loops or are otherwise expensive, and no other thread can enter this synchronized(_lock) block while one thread is in it.
The theoretical worst case slowdown for store/reload from memory (like compiling C in anti-optimized debug mode) vs. keeping a variable in a register would be for something like while (--shared_var >= 0) {}, giving maybe a 6x slowdown on current x86 hardware. (1 cycle latency for dec eax vs. that plus 5 cycle store-forwarding latency for a memory-destination dec). But that's only if you're looping on a shared var, or otherwise creating a dependency chain through repeated modification of it.
Note that a store buffer with store-forwarding still keeps it local to the CPU core without even having to commit to L1d cache.
In the much more likely case of code that just reads a var multiple times, anti-optimized code that really loads every time can have all those loads hit in L1d cache very efficiently. On x86 you'd probably barely notice the difference, with modern CPUs having 2/clock load throughput, and efficient handling of ALU instructions with memory source operands, like cmp eax, [rdi] being basically as efficient as cmp eax, edx.
(CPUs have coherent caches so there's no need for flushing or going all the way to DRAM to ensure you "see" data from other cores; a JVM or C compiler only has to make sure the load or store actually happens in asm, not optimized into a register. Registers are thread-private.)
But as I said, there's no reason to expect that your JVM is actually doing this anti-optimization inside synchronized blocks. And even if it were, it might only amount to something like a 25% slowdown.

You are accessing members of a single object. So when the CPU reads the _lock member, it first needs to load the cache line containing _lock, and quite a few of the other member variables will probably sit on that same cache line, which is then already in your cache.
I would be more worried about the synchronized block itself, IF you have determined it is actually a problem; it might not be a problem at all. For example, Java uses quite a few lock optimization techniques, like biased locking and adaptive spin locks, to reduce the cost of locks.
But if it is a contended lock, you might want to shorten the duration of the lock by moving as much as possible out of it, and perhaps even get rid of the whole lock and switch to a lock-free approach.
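To illustrate that last point, here is a hedged sketch of what a partly lock-free version of the counters from the question could look like, assuming those counters don't need to be updated atomically together with the rest of the tracker state (the class and method names here are made up):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Sketch: counters that don't have to change atomically with the rest of the
// tracker state can live outside the lock as lock-free primitives.
class CallStats {
    private final LongAdder callStartCountTotal = new LongAdder(); // cheap to increment under contention
    private final AtomicInteger concurrency = new AtomicInteger();
    private final AtomicInteger concurrentMax = new AtomicInteger();

    void onCallStart() {
        callStartCountTotal.increment();             // no lock needed
        int now = concurrency.incrementAndGet();
        // lock-free "update the maximum" via a CAS loop
        int max = concurrentMax.get();
        while (now > max && !concurrentMax.compareAndSet(max, now)) {
            max = concurrentMax.get();
        }
    }

    void onCallEnd() {
        concurrency.decrementAndGet();
    }

    long totalStarts() {
        return callStartCountTotal.sum();
    }
}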
I would not trust JProfiler for a second.
http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html
So it might be that JProfiler is putting you in the wrong direction.

Related

Using Java volatile keyword in non-thread scenario

I understand that the Java keyword volatile is used in a multi-threading context; the main purpose is to read from memory rather than from the cache, or, if reading from the cache, to make sure it is updated first.
In the example below, there is no multi-threading. I want to understand whether the variable i would be cached as part of code optimization and hence read from the CPU cache rather than from memory. If yes, and the variable is declared volatile, will it certainly be read from memory?
I have run the program multiple times, both with and without the volatile keyword, but since the for loop does not take a constant amount of time, I was unable to conclude whether more time is consumed when the variable is declared volatile.
All I want to see is that the time taken when reading from the CPU cache is actually less than when the variable is declared volatile.
Is my understanding even right? If yes, how can I see the concept in action, with a good record of the times for both CPU cache reads and memory reads?
import java.time.Duration;
import java.time.Instant;

public class Test {
    volatile static int i = 0;
    // static int i = 0;

    public static void main(String[] args) {
        Instant start = Instant.now();
        for (i = 0; i < 838_860_8; i++) { // 2 power 23; ~ 1 MB CPU Cache
            System.out.println("i:" + i);
        }
        Instant end = Instant.now();
        long timeElapsed = Duration.between(start, end).getSeconds();
        System.out.println("timeElapsed: " + timeElapsed + " seconds.");
    }
}
I think that the answer is "probably yes" ... for current Java implementations.
There are two reasons that we can't be sure.
The Java language specification doesn't actually say anything about registers, CPU caches or anything like that. What it actually says is that there is a happens-before relationship between one thread writing the volatile and another thread (subsequently) reading it.
While it is reasonable to assume that this will affect caching in the case where there are multiple threads, if the JIT compiler was able to deduce that the volatile variable was thread confined for a given execution of your application, it could reason that it can cache the variable.
That is the theory.
If there was a measurable performance difference, you would be able to measure it in a properly written benchmark. (Though you may get different results depending on the Java version and your hardware.)
However the current version of your benchmark has a number of flaws which would make any results it gives doubtful. If you want to get meaningful results, I strongly recommend that you read the following Q&A.
How do I write a correct micro-benchmark in Java?.
(Unfortunately some of the links in some of the answers seem to be broken ...)
The premise of your benchmark is flawed; the cache is the source of truth. The cache coherence protocol makes sure that CPU caches are coherent; memory is just a spill bucket for whatever doesn't fit into the cache since most caches are write-behind (not write-through). In other words; a write of a volatile doesn't need to be written to main memory; it is sufficient to write it to the cache.
A few examples where writing to the cache isn't desired:
I/O DMA: you want to prevent writing to cache because otherwise, main memory and CPU cache could become incoherent.
Non-temporal data: e.g. you are processing some huge data set and you only access it once, there is no point in caching it.
But this is outside of the reach of a regular Java volatile.
There is a price to pay using volatile:
Atomicity guarantees.
Loads and stores can't be optimized out. This rules out many compiler optimizations. Volatile will not prevent using CPU registers; it will only prevent 'caching' the content of a variable in a register 'indefinitely'.
Ordering guarantees in the form of fences. On x86, in the case above, the price is paid at the volatile store: the [StoreLoad] fence that follows it forces later loads to wait until that store and all earlier stores have committed to the cache.
Especially the last 2 will impact volatile performance.
Apart from that, your benchmark is flawed. First of all, I would switch to JMH, as others have already pointed out. It will take care of quite a few typical benchmarking errors like missing warmup and dead-code elimination. Also, you should not be using System.out inside the benchmark, since that will completely dominate the performance.
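For example, a minimal JMH sketch of the comparison you are after could look like the following (a hypothetical benchmark class; it assumes the JMH annotations and Blackhole are on the classpath, and the Blackhole is there to defeat dead-code elimination):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical JMH benchmark comparing plain vs. volatile field reads.
@State(Scope.Thread)
public class VolatileReadBenchmark {
    int plainField = 42;
    volatile int volatileField = 42;

    @Benchmark
    public void plainRead(Blackhole bh) {
        bh.consume(plainField);      // non-volatile read; JIT must still produce the value
    }

    @Benchmark
    public void volatileRead(Blackhole bh) {
        bh.consume(volatileField);   // read with volatile (acquire) semantics
    }
}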

Lock implementation deadlock

I have two threads, one with id 0 and the other with id 1. Based on this information I tried to construct a lock, but it seems to deadlock, and I don't understand why; can someone please help me?
I tried to construct a scenario in which this happens, but it is really difficult.
private int turn = 0;
private boolean[] flag = new boolean[2];

public void lock(int tid) {
    int other = (tid == 0) ? 1 : 0;
    flag[tid] = true;
    while (flag[other] == true) {
        if (turn == other) {
            flag[tid] = false;
            while (turn == other) {}
            flag[tid] = true;
        }
    }
}

public void unlock(int tid) {
    //turn = 1-t;
    turn = (tid == 0) ? 1 : 0;
    flag[tid] = false;
}
Yup, this code is broken. It's... a complicated topic. If you want to know more, dive into the Java Memory Model (JMM); that's a term you can search the web for.
Basically, every field in a JVM has the following property, unless it is marked volatile:
Any thread is free to make a local clone of this field, or not.
The VM is free to sync these clones up, in arbitrary fashion, at any time. Could be in 5 seconds, could be in 5 years. The VM is free to do whatever it wants, the spec intentionally does not make many guarantees.
To solve this problem, establish HB/HA (A Happens-Before/Happens-After relationship).
In other words, if 2 threads both look at the same field, and you haven't established HB/HA, your code is broken, and it's broken in a way that you can't reliably test for, because the JVM can do whatever it wants. This includes working as you expect during tests.
To establish HB/HA, you need volatile/synchronized, pretty much. There's a long list, though:
Calling x.start() on a thread is HB to the first statement that thread runs (HA).
Exiting a synchronized(x) {} block is HB to the start of a synchronized (x) {} block, assuming both x's are pointing at the same object.
Volatile access establishes HB/HA, but you can't really control in which direction (see the sketch below).
Note that many library calls, such as to ReadWriteLock and co, may well be synchronizing/using volatile under the hood, thus in passing establishing some form of HB/HA.
"Natural" HB/HA: Within a single thread, a line executed earlier is HB to a line executed later. One might say 'duh' on this one.
Note that HB/HA doesn't actually mean they actually happen before. It merely means that it is not possible to observe the state as it was before the HB line from the HA line, other than by timing attacks. This usually doesn't matter.
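Here is that sketch of the volatile rule: a volatile write in one thread paired with a volatile read in another establishes HB/HA, which also makes the earlier plain write visible (a hypothetical class, assuming the reader actually observes ready == true):

// Sketch: a volatile write in one thread, observed by a volatile read in another,
// establishes HB/HA, which also makes the plain 'data' write visible.
class Handoff {
    private int data;                 // plain field
    private volatile boolean ready;   // volatile flag

    void producer() {
        data = 42;      // happens-before the volatile write below
        ready = true;   // volatile write
    }

    void consumer() {
        if (ready) {                      // volatile read; if it sees 'true'...
            System.out.println(data);     // ...then this read is guaranteed to see 42
        }
    }
}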
The problem
Your code has two threads that both look at the same field, does not contain volatile or synchronized or otherwise establishes HB/HA anywhere, and doesn't use library calls that do so. Therefore, this code is broken.
The solution is to introduce that someplace. Not sure why you're rolling your own locks - the folks who wrote ReadWriteLock and friends in the java.util.concurrent package are pretty smart. There is zero chance you can do better than them, so just use that.
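For example, a minimal sketch using java.util.concurrent instead of a hand-rolled algorithm might look like this (a hypothetical wrapper; the int parameter is only kept to mirror the original signature):

import java.util.concurrent.locks.ReentrantLock;

// Sketch: the standard library lock already provides mutual exclusion
// and the happens-before edges on lock()/unlock().
class Mutex {
    private final ReentrantLock lock = new ReentrantLock();

    public void lock(int tid) {
        lock.lock();
    }

    public void unlock(int tid) {
        lock.unlock();
    }
}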
Dekker's algorithm is a great example to explain sequential consistency.
Lets first give you a bit of background information. When a CPU does a store, the CPU won't wait for the store to be inserted in the (coherent) cache; once it is inserted into the cache it is globally visible to all other CPUs. The reason why a CPU doesn't wait is that it can be a long time before the cache line that contained the field you want to write is in the appropriate state; the cache line first needs to be invalidated on all other CPU's before the write can be performed on the cache-line and this can stall the CPU.
To resolve this problem modern CPUs make use of store buffers. So the store is written to the store buffer and at some point in the future, the store will be committed to the cache.
If the store is followed by a load to a different address, it could be that the store and the load are reordered. E.g. the cache line for the later load might already be in the right state while the cache line for the earlier store isn't. So the load is 'globally' performed before the store is 'globally' performed; so we have a reordering.
Now lets have a look at your example:
flag[tid] = true;
while (flag[other] == true) {
So what we have here is a store followed by a load to a different address; hence the store and the load can be reordered, and the consequence is that the lock won't provide mutual exclusion, since two threads could be inside the critical section at the same time.
[PS]
It is likely that the 2 flags will be on the same cache line, but it isn't a guarantee.
This is typically what you see on x86: an older store can be reordered with a newer load. The memory model of x86 is called TSO. I will not go into the details because there is an optimization called store-to-load forwarding which complicates the TSO model. TSO is a relaxation of sequential consistency.
Dekker's and Peterson's mutex algorithms do not rely on atomic instructions like compare-and-set, but they do rely on a sequentially consistent system, and one of the requirements of sequential consistency is that loads and stores do not get reordered (or at least that nobody can prove they were reordered).
So what the compiler needs to do is add the appropriate CPU memory fence to make sure that the older store and the newer load to a different address don't get reordered, and in this case that would be an expensive [StoreLoad] fence. It depends on the ISA, but on Intel x86 the [StoreLoad] causes the load/store unit to stop executing loads until the store buffer has been drained.
You can't declare the elements of an array volatile. But there are a few options:
don't use an array, but explicit fields.
use an AtomicIntegerArray (see the sketch after this list)
start messing around with fences using VarHandle. Make sure the fields are accessed through a VarHandle with at least opaque memory order, to prevent the compiler from optimizing out any loads/stores.
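Here is that sketch of the AtomicIntegerArray option (a hypothetical, illustrative rewrite of the question's lock; AtomicIntegerArray.get/set have volatile semantics, which is what the algorithm needs):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: flags stored in an AtomicIntegerArray (1 = true, 0 = false),
// turn stored in an AtomicInteger, so every access has volatile semantics.
class DekkerStyleLock {
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
    private final AtomicInteger turn = new AtomicInteger(0);

    public void lock(int tid) {
        int other = 1 - tid;
        flag.set(tid, 1);
        while (flag.get(other) == 1) {
            if (turn.get() == other) {
                flag.set(tid, 0);
                while (turn.get() == other) { /* spin */ }
                flag.set(tid, 1);
            }
        }
    }

    public void unlock(int tid) {
        turn.set(1 - tid);
        flag.set(tid, 0);
    }
}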
Apart from what I said above, it isn't only the CPU that can cause problems. Because there is no appropriate happens before edge between writing and reading the flags, the JIT can also optimize this code.
For example:
if (!flag[other])
    return;
while (true) {
    if (turn == other) {
        flag[tid] = false;
        while (turn == other) {}
        flag[tid] = true;
    }
}
And now it is clear that this lock could wait indefinitely (deadlock) without ever needing to see a change in the flag. More optimizations are possible that could totally mess up your code.
if (turn == other) {
    while (true) {}
}
If you add the appropriate happens before edges, the compiler will add:
compiler fences
CPU memory fences.
And Dekker's algorithm should work as expected.
I would also suggest setting up a test using JCStress: https://github.com/openjdk/jcstress and have a look at this test:
https://github.com/openjdk/jcstress/blob/master/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/concurrency/mutex/Mutex_02_DekkerAlgorithm.java
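A hedged sketch of what such a JCStress test could look like for this lock (MyLock is a placeholder for the class from the question; the outcome ids assume both actors increment a shared counter inside the critical section, so (1, 1) can only happen when mutual exclusion is broken):

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

// Hypothetical JCStress test: two actors enter the critical section guarded by
// the hand-rolled lock and record the value of a shared counter they increment.
@JCStressTest
@Outcome(id = "1, 2", expect = Expect.ACCEPTABLE, desc = "Mutual exclusion held")
@Outcome(id = "2, 1", expect = Expect.ACCEPTABLE, desc = "Mutual exclusion held")
@Outcome(id = "1, 1", expect = Expect.FORBIDDEN, desc = "Both threads in the critical section")
@State
public class MutexTest {
    private final MyLock lock = new MyLock(); // placeholder for the lock from the question
    private int v;

    @Actor
    public void actor1(II_Result r) {
        lock.lock(0);
        r.r1 = ++v;
        lock.unlock(0);
    }

    @Actor
    public void actor2(II_Result r) {
        lock.lock(1);
        r.r2 = ++v;
        lock.unlock(1);
    }
}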

Java multithreading purpose of synchronized keyword

From the book Java in a Nutshell, 6th edition, one can read:
The reason we use the word synchronized as the keyword for “requires temporary exclusive access” is that in addition to acquiring the monitor, the JVM also rereads the current state of the object from the main memory when the block is entered. Similarly, when the synchronized block or method is exited, the JVM flushes any modified state of the object back to the main memory.
as well as:
Without synchronization, different CPU cores in the system may not see the same view of memory, and memory inconsistencies can damage the state of a running program, as we saw in our ATM example.
It suggests that when a synchronized method is entered, the object is loaded from main memory to maintain memory consistency.
But is this also the case for objects without the synchronized keyword? That is, if a normal object is modified on one CPU core, is it synchronized with main memory so that other cores can see it?
While the other answer talks about the importance of cache synchronisation and the memory hierarchy, the synchronized keyword doesn’t control the state of the object as a whole; rather, it is about the lock associated with that object.
Every object instance in Java can have an associated lock which prevents multiple threads running at the same time for those blocks which are synchronized on the lock. This is either implicit on the this instance or on the argument to the synchronized keyword (or the class if a static method).
While the JMM semantics say that the lock object itself is properly controlled across the cache levels, that doesn't necessarily mean the object as a whole is protected; fields read by other threads while a single thread is running inside a synchronized block or method aren't dealt with, for example.
In addition, the Java memory model defines “happens before” relationships for how data changes may become visible between threads, which you need to take into account; that is why the volatile keyword and the AtomicXxx types are present, as well as the relaxed memory orders available through VarHandles.
So when you talk about synchronized, you need to be aware that it only covers the state of the object's lock, not the state within the object that it is protecting.
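A small hypothetical sketch of that distinction:

// Sketch: synchronized only synchronizes with other blocks on the same lock.
// The unsynchronized reader below gets no visibility or ordering guarantees.
class Counter {
    private int count;

    synchronized void increment() {   // locks on 'this'
        count++;                      // safe against other synchronized methods
    }

    int unsafeRead() {
        return count;                 // no lock: may observe a stale value
    }

    synchronized int safeRead() {
        return count;                 // same lock as increment(): sees the latest value
    }
}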
First, this is similar to what happens with other misinformation going around, like:
Volatile is supposed to make the threads read the values from RAM
disabling thread cache
More detail about why that is not the case can be found in this SO thread.
That can be applied to the statements:
the JVM also rereads the current state of the object from the main memory when the block is entered
and
when the synchronized block or method is exited, the JVM flushes any modified state of the object back to the main memory
Citing David Schwarz, who kindly pointed out the following in the comments and allowed me to use it:
That does not happen on any modern system. These are things that the platform might, in theory, have to do to make synchronized work but if they're not necessary on the platform (and they aren't on any platform you're likely to use), they aren't done.
These statements are all in regard to a hypothetical system that has no hardware synchronization and might require these steps. Actual systems have very different hardware designs from this simplified hypothetical system and require very different things to be done. (For example, they typically just require optimization and memory barriers, not flushes to main memory or reads. This is because modern systems are optimized and use caches to avoid having to flush to or re-read from main memory because main memory is very slow and so modern CPUs have hardware optimizations to avoid it.)
Now going back to your question:
But is this also the case for objects without the synchronized keyword? That is, if a normal object is modified on one CPU core, is it synchronized with main memory so that other cores can see it?
TL;DR: it might or might not happen; it depends on the hardware and on whether the Object is read from the cache. However, with the use of synchronized, the JVM ensures that it will be.
More detailed answer
So if a normal object is modified on one core of the CPU, is it synchronized with main memory so that other cores can see it?
To keep it simple and concise: without synchronized, it depends on the hardware architecture (e.g., the cache-coherence protocol) the code will be executed on, and on whether the Object is (or is not) in the cache when it is read/updated.
If the architecture forces the data in each core to always be consistent with the other cores, then yes. Accessing the cache is much faster than accessing main memory, and accessing the first levels of cache (e.g., L1) is also faster than accessing the other levels.
Hence, for performance reasons, normally when the data (e.g., an Object) is loaded from main memory it gets stored in the cache (e.g., L1, L2, and L3) for quicker access in case that same data is needed again.
The first levels of cache tend to be private to each core. Therefore, it might happen that different cores have stored in their private cache (e.g., L1) different states of the "same Object". Consequently, Threads might also be reading/updating different states of the "same Object".
Notice that I wrote "same Object" because conceptually it is the same Object but in practice it is not the same entity but rather a copy of the same Object that was read from the main memory.

Increased cost of a volatile write over a nonvolatile write

I've been reading about volatile (https://www.ibm.com/developerworks/java/library/j-jtp06197/) and came across a bit that says that a volatile write is so much more expensive than a nonvolatile write.
I can understand that there would be an increased cost associated with a volatile write, given that volatile is a means of synchronization, but I want to know exactly how a volatile write is so much more expensive than a nonvolatile write; does it perhaps have to do with visibility across different thread stacks at the time the volatile write is made?
Here's why, according to the article you have indicated:
Volatile writes are considerably more expensive than nonvolatile writes because of the memory fencing required to guarantee visibility but still generally cheaper than lock acquisition.
[...] volatile reads are cheap -- nearly as cheap as nonvolatile reads
And that is, of course, true: memory fence operations are always bound to writing and reads execute the same way regardless of whether the underlying variable is volatile or not.
However, volatile in Java is about much more than just volatile vs. nonvolatile memory read. In fact, in its essence it has nothing to do with that distinction: the difference is in the concurrent semantics.
Consider this notorious example:
volatile boolean runningFlag = true;

void run() {
    while (runningFlag) { /* do work */ }
}
If runningFlag wasn't volatile, the JIT compiler could essentially rewrite that code to
void run() {
    if (runningFlag) while (true) { /* do work */ }
}
The ratio of overhead introduced by reading the runningFlag on each iteration against not reading it at all is, needless to say, enormous.
It is about caching. Since modern processors use caches, if you don't specify volatile the data stays in the cache and the write operation is fast (since the cache is near the processor). If the variable is marked volatile, the system needs to write it all the way to memory, and that is a somewhat slower operation.
And yes, you are thinking along the right lines: it has to do with the different thread stacks, since each is separate and reads from the SAME memory, but not necessarily from the same cache. Today's processors use many levels of caching, so this can be a big problem if multiple threads/processes are using the same data.
EDIT: if the data stays in a local cache, other threads/processes won't see the change until the data is written back to memory.
Most likely it has to do with the fact that a volatile write has to stall the pipeline.
All writes are queued to be written to the caches. You don't see this with non-volatile writes/reads as the code can just get the value you just wrote without involving the cache.
When you use a volatile read, it has to go back to the cache, and this means the write (as implemented) cannot continue until the write has been committed to the cache (in case you do a write followed by a read).
One way around this is to use a lazy write e.g. AtomicInteger.lazySet() which can be 10x faster than a volatile write as it doesn't wait.
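A hedged sketch of that idea (the class and field names are made up for illustration):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: lazySet() performs an ordered (release) store without the full
// StoreLoad fence that a volatile write would pay for.
class Progress {
    private final AtomicInteger lastProcessed = new AtomicInteger();

    void publish(int sequence) {
        lastProcessed.lazySet(sequence);   // cheaper than set(); readers may see it slightly later
    }

    int read() {
        return lastProcessed.get();        // volatile read
    }
}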

Are volatile variable 'reads' as fast as normal reads?

I know that writing to a volatile variable flushes it from the memory of all the cpus, however I want to know if reads to a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog argues that volatile reads can be a lot slower than non-volatile reads, also on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However worth noting that a contended volatile can be very slow.
Test 3 is a read to a volatile in a tight loop. Demonstrated is that the semantics of what it means to be volatile indicate that the value can change with each loop iteration. Thus the JVM can not optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, thus there is no actual "read" occurring.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick, especially on multi-core CPUs (but also only single-core).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to I/O, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached, and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.
