Java multithreading purpose of synchronized keyword


From the book Java in a Nutshell, 6th edition, one can read:
The reason we use the word synchronized as the keyword for “requires temporary
exclusive access” is that in addition to acquiring the monitor, the JVM also rereads
the current state of the object from the main memory when the block is entered. Similarly,
when the synchronized block or method is exited, the JVM flushes any modified
state of the object back to the main memory.
as well as:
Without synchronization, different CPU cores in the system may not see the same
view of memory and memory inconsistencies can damage the state of a running
program, as we saw in our ATM example.
This suggests that when a synchronized method is entered, the object is loaded from main memory to maintain memory consistency.
But is this also the case for objects accessed without the synchronized keyword? That is, when a normal object is modified on one CPU core, is it synchronized with main memory so that other cores can see it?

While the other answer talks about the importance of cache synchronisation and the memory hierarchy, the synchronized keyword doesn’t control the state of the object as a whole; rather, it is about the lock associated with that object.
Every object instance in Java can have an associated lock, which prevents multiple threads from running at the same time in those blocks that are synchronized on that lock. The lock is either implicit (the this instance for an instance method, or the class for a static method) or given as the argument to the synchronized keyword.
While the JMM semantics say that this lock object is properly controlled and available in cache levels, it doesn’t necessarily mean therefore that the object as a whole is protected; fields read from different threads while a single thread is running in a synchronized block or method aren’t dealt with, for example.
In addition, the Java memory model defines “happens-before” relationships that govern how data changes become visible between threads, and you need to take these into account. That is why the volatile keyword and the AtomicXxxx types exist, along with VarHandles and their more relaxed memory-ordering modes.
So when you talk about synchronized, you need to be aware that it is only about the state of the object’s lock and not the state within the object that it is protecting.
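As a minimal sketch of the point that the lock, not the object, is what synchronized controls: the hypothetical class below (all names are illustrative, not from the original posts) guards a counter with a private lock object. The guarantee only covers code that synchronizes on that same lock; a method that touched count without taking the lock would get no protection.

```java
// Illustrative sketch: the class and method names are assumptions.
class SynchronizedCounterDemo {
    private final Object lock = new Object(); // the monitor that guards count
    private int count = 0;

    void increment() {
        synchronized (lock) { // only one thread at a time can hold this monitor
            count++;
        }
    }

    int get() {
        synchronized (lock) { // reading under the same lock sees all earlier increments
            return count;
        }
    }

    // Starts the given number of threads, each incrementing the shared counter.
    static int runDemo(int threads, int incrementsPerThread) {
        SynchronizedCounterDemo counter = new SynchronizedCounterDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 50_000)); // no updates are lost: 200000
    }
}
```

If increment() dropped the synchronized block, the same run could lose updates, because count++ is a read-modify-write sequence rather than a single atomic step.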

First, this is similar to other misinformation going around, such as:
Volatile is supposed to make the threads read the values from RAM
disabling thread cache
More detail about why that is not the case can be found in this SO thread.
That can be applied to the statements:
the JVM also rereads the current state of the object from the main
memory when the block is entered
and
when the synchronized block or method is exited, the JVM flushes any
modified state of the object back to the main memory
Citing David Schwarz, who kindly pointed out the following in the comments and allowed me to use it:
That does not happen on any modern system. These are things that the platform might, in theory, have to do to make synchronized work but if they're not necessary on the platform (and they aren't on any platform you're likely to use), they aren't done.
These statements are all in regard to a hypothetical system that has no hardware synchronization and might require these steps. Actual systems have very different hardware designs from this simplified hypothetical system and require very different things to be done. (For example, they typically just require optimization and memory barriers, not flushes to main memory or reads. This is because modern systems are optimized and use caches to avoid having to flush to or re-read from main memory because main memory is very slow and so modern CPUs have hardware optimizations to avoid it.)
Now going back to your question:
But this is case for object without synchronized keyword also ? So in
case of normal object is modified in one core of CPU is synchronized
with main memory so that other core can see it?
TL;DR: It may or may not happen; it depends on the hardware and on whether the Object is read from cache. With synchronized, however, the JVM ensures that it will be.
More detailed answer
So in case of the normal object is modified in one core of CPU is
synchronized with main memory so that other core can see it?
To keep it simple and concise: without synchronized, it depends on the hardware architecture (e.g., the cache protocol) on which the code is executed, and on whether the Object is in the cache when it is read or updated.
If the architecture forces the data in each core to always be consistent with the other cores, then yes. Accessing the cache is much faster than accessing main memory, and accessing the first levels of cache (e.g., L1) is faster than accessing the other levels.
Hence, for performance reasons, normally when the data (e.g., an Object) is loaded from main memory it gets stored in the cache (e.g., L1, L2, and L3) for quicker access in case that same data is needed again.
The first levels of cache tend to be private to each core. Therefore, it might happen that different cores have stored in their private cache (e.g., L1) different states of the "same Object". Consequently, Threads might also be reading/updating different states of the "same Object".
Notice that I wrote "same Object" because conceptually it is the same Object but in practice it is not the same entity but rather a copy of the same Object that was read from the main memory.
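The guarantee mentioned in the TL;DR can be sketched as follows. This is a hypothetical example (class and method names are assumptions): a stop flag is only ever read and written inside synchronized methods on the same object, so the write made by one thread is guaranteed to become visible to the thread that polls it. With a plain field and no synchronization, the Java Memory Model would not promise that the worker ever observes the change.

```java
// Illustrative sketch: names are assumptions, not taken from the original post.
class StopFlagDemo {
    private boolean stopRequested = false; // guarded by "this"

    synchronized void requestStop() {
        stopRequested = true;
    }

    synchronized boolean isStopRequested() {
        return stopRequested;
    }

    // Returns true if the worker observed the flag and terminated.
    static boolean runDemo() {
        StopFlagDemo flag = new StopFlagDemo();
        Thread worker = new Thread(() -> {
            while (!flag.isStopRequested()) {
                Thread.onSpinWait(); // spin hint, available since Java 9
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if something goes wrong
        worker.start();
        flag.requestStop();
        try {
            worker.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runDemo());
    }
}
```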

Related

Memory barriers on entry and exit of Java synchronized block

I came across answers, here on SO, about Java flushing the working copy of variables within a synchronized block on exit. Similarly, it syncs all the variables from main memory once on entry into the synchronized section.
However, I have some fundamental questions around this:
What if I access mostly non-volatile instance variables inside my synchronized section? Will the JVM automatically cache those variables into the CPU registers at the time of entering into the block and then carry all the necessary computations before finally flushing them back?
I have a synchronized block as below:
The variables prefixed with an underscore, e.g. _callStartsInLastSecondTracker, are all instance variables which I heavily access in this critical section.
public CallCompletion startCall()
{
    long currentTime;
    Pending pending;
    synchronized (_lock)
    {
        currentTime = _clock.currentTimeMillis();
        _tracker.getStatsWithCurrentTime(currentTime);
        _callStartCountTotal++;
        _tracker._callStartCount++;
        if (_callStartsInLastSecondTracker != null)
            _callStartsInLastSecondTracker.addCall();
        _concurrency++;
        if (_concurrency > _tracker._concurrentMax)
        {
            _tracker._concurrentMax = _concurrency;
        }
        _lastStartTime = currentTime;
        _sumOfOutstandingStartTimes += currentTime;
        pending = checkForPending();
    }
    if (pending != null)
    {
        pending.deliver();
    }
    return new CallCompletionImpl(currentTime);
}
Does this mean that all these operations, e.g. +=, ++, > etc., require the JVM to interact with main memory repeatedly? If so, can I use local variables to cache them (preferably stack allocation for primitives), perform the operations, and in the end assign them back to the instance variables? Will that help to optimise the performance of this block?
I have such blocks in other places as well. On running a JProfiler, it has been observed that most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
Appreciate any help here.

(I don't know Java that well, just the underlying locking and memory-ordering concepts that Java is exposing. Some of this is based on assumptions about how Java works, so corrections welcome.)
I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block.
i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply.
The normal rules for non-volatile vars are like in C++: the JIT-compiler can keep private copies / temporaries and do full optimization. The closing } makes any assignments visible before marking the lock as released, so any other thread that runs the same synchronized block will see those changes.
But if you read/write those variables outside a synchronized(_lock) block while this synchronized block is executing, there's no ordering guarantee and only whatever atomicity guarantee Java has. Only volatile would force a JVM to re-read a variable on every access.
most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
The things you're worried about wouldn't really explain this. Inefficient code-gen inside the critical section would make it take somewhat longer, and that could lead to extra contention.
But there wouldn't be a big enough effect to make most threads be blocked waiting for locks (or I/O?) most of the time, compared to having most threads actively running most of the time.
@Kayaman's comment is most likely correct: this is a design issue, doing too much work inside one big mutex. I don't see loops inside your critical section, but presumably some of those methods you call contain loops or are otherwise expensive, and no other thread can enter this synchronized(_lock) block while one thread is in it.
The theoretical worst case slowdown for store/reload from memory (like compiling C in anti-optimized debug mode) vs. keeping a variable in a register would be for something like while (--shared_var >= 0) {}, giving maybe a 6x slowdown on current x86 hardware. (1 cycle latency for dec eax vs. that plus 5 cycle store-forwarding latency for a memory-destination dec). But that's only if you're looping on a shared var, or otherwise creating a dependency chain through repeated modification of it.
Note that a store buffer with store-forwarding still keeps it local to the CPU core without even having to commit to L1d cache.
In the much more likely case of code that just reads a var multiple times, anti-optimized code that really loads every time can have all those loads hit in L1d cache very efficiently. On x86 you'd probably barely notice the difference, with modern CPUs having 2/clock load throughput, and efficient handling of ALU instructions with memory source operands, like cmp eax, [rdi] being basically as efficient as cmp eax, edx.
(CPUs have coherent caches so there's no need for flushing or going all the way to DRAM to ensure you "see" data from other cores; a JVM or C compiler only has to make sure the load or store actually happens in asm, not optimized into a register. Registers are thread-private.)
But as I said, there's no reason to expect that your JVM is doing this anti-optimization inside synchronized blocks. Even if it were, it might cause something like a 25% slowdown.

You are accessing members on a single object. So when the CPU reads the _lock member, it needs to load the cache line containing _lock member first. So probably quite a few of the member variables will be on the same cache line which is already in your cache.
I would be more worried about the synchronized block itself IF you have determined it is actually a problem; it might not be a problem at all. For example Java uses quite a few lock optimization techniques like biased locking, adaptive spin lock to reduce the costs of locks.
But if it is a contended lock, you might want to make the duration of the lock shorter by moving as much out of the lock as possible and perhaps even get rid of the whole lock and switch to a lock free approach.
I would not trust JProfiler for a second.
http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html
So it might be that JProfiler is putting you in the wrong direction.
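The suggestion above to drop the lock and go lock free could look roughly like the following. This is a hedged sketch, not the asker's real class: LockFreeCallStats and its method names are hypothetical, and it only covers the simple counters from the question. Each statistic is updated atomically on its own, but the group of statistics is no longer updated as one atomic unit, which may or may not be acceptable for the real tracker.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the question's tracker; all names are assumptions.
class LockFreeCallStats {
    private final LongAdder callStartCountTotal = new LongAdder();
    private final AtomicInteger concurrency = new AtomicInteger();
    private final AtomicInteger concurrentMax = new AtomicInteger();

    void startCall() {
        callStartCountTotal.increment();                    // contention-friendly counter
        int current = concurrency.incrementAndGet();        // atomic ++
        concurrentMax.accumulateAndGet(current, Math::max); // atomic "max so far"
    }

    void endCall() {
        concurrency.decrementAndGet();
    }

    long totalStarts() { return callStartCountTotal.sum(); }
    int currentConcurrency() { return concurrency.get(); }
    int maxConcurrency() { return concurrentMax.get(); }

    // Each worker repeatedly starts and ends a call without taking any lock.
    static LockFreeCallStats runDemo(int threads, int callsPerThread) {
        LockFreeCallStats stats = new LockFreeCallStats();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) {
                    stats.startCall();
                    stats.endCall();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        LockFreeCallStats stats = runDemo(4, 10_000);
        System.out.println(stats.totalStarts() + " starts, max concurrency " + stats.maxConcurrency());
    }
}
```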

Why are objects visible to all threads, while a reading thread might not see a value written by another thread on a timely basis?

From Java in a Nutshell
In Java, all Java application threads in a process have their own
stacks (and local variables) but share a single heap. This makes it
very easy to share objects between threads, as all that is required is
to pass a reference from one thread to another.
This leads to a general design principle of Java—that objects are
visible by default. If I have a reference to an object, I can copy
it and hand it off to another thread with no restrictions. A Java
reference is essentially a typed pointer to a location in memory—and
threads share the same address space, so visible by default
is a natural model.
From Java Concurrency in Practice
Visibility is subtle because the things that can go wrong are so
counterintuitive. In a single-threaded environment, if you write a
value to a variable and later read that variable with no intervening
writes, you can expect to get the same value back. This seems only
natural. It may be hard to accept at first, but when the reads and
writes occur in different threads, this is simply not the case. In
general,
there is no guarantee that the reading thread will see a value written by another thread on a timely basis, or even at all. In
order to ensure visibility of memory writes across threads, you must
use synchronization.
When a thread reads a variable without synchronization, it may see a stale value.
So why does Java in a Nutshell say objects are visible to all threads, while Java Concurrency in Practice says there is no guarantee that a reading thread sees a value written by another thread on a timely basis? They don't seem consistent.
Thanks.

"So why does Java in a Nutshell says objects are visible to all threads" -->
As your quote says, in Java objects are allocated on the heap: a 'global' heap available to the entire JVM. In other languages (e.g. C++), objects can also be allocated on a stack. Objects on a heap can be passed to other threads, which use different stacks. Objects on a stack can only be used on the thread that owns that stack, as the stack's content will change beyond the control of any other thread.
"while Java Concurrency in Practice says no guarantee that a reading thread sees a value written by another thread on a timely basis?" -> This is another issue, as this is about the values of memory locations. Though they are reachable, compilers and CPUs try to optimize reading from or writing to these memory locations and will heavily cache the value by assuming "I'm the only one reading and writing to this memory location". So if one thread modifies a memory location's value, the other thread does not know it has changed and will not read it again. This makes the program much faster. By declaring a variable volatile you are telling the compiler that another thread may change the value at will, and the compiler will use this to create code that doesn't cache the value.
Finally, multithreading is much more difficult than adding volatile or using synchronized; one really needs to dive into the topic to understand the issues you will encounter when using multiple threads.

In Java, all Java application threads in a process have their own
stacks (and local variables) but share a single heap. This makes it
very easy to share objects between threads, as all that is required is
to pass a reference from one thread to another.
This leads to a general design principle of Java—that objects are
visible by default.
I suppose that these statements are strictly true ... but they are misleading because they don't convey the whole truth. For example, what does the author mean when he says "...that objects are visible by default."
Any thread executing on a Java JVM does not have de facto visibility to all the objects on the JVM's heap. If we define visibility as "the ability to access by reference", then a thread only has visibility to objects:
whose references have been published to that thread
whose references are in static fields or fields of objects to which the thread has access
In fact, an important and commonly used thread safety policy in Java concurrent programming is thread confinement. If a thread holds a reference to an object to which only it has access and which is not published to any other thread, then that object is thread safe. That object can be safely mutated by the thread in which it is confined without any further regard to visibility and atomicity ... as long as it is correctly thread confined.
In other words, an object that is thread confined, no matter where it is on the JVM heap, is not visible to any other thread that may be running on that same JVM by virtue of being inaccessible.
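Thread confinement as described above can be sketched like this. The example is hypothetical (class and method names are assumptions): each task builds and mutates its own ArrayList, which lives on the shared heap but is never published to another thread, so it needs no synchronization. Only the final int result is handed back, through a Future, which provides the required visibility.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of thread confinement; names are assumptions.
class ThreadConfinementDemo {
    static int sumOfSquares(int n) {
        // "scratch" is reachable only from the thread running this call,
        // so it can be mutated freely without locks.
        List<Integer> scratch = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            scratch.add(i * i);
        }
        int sum = 0;
        for (int value : scratch) {
            sum += value;
        }
        return sum;
    }

    static int runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int n = 1; n <= 3; n++) {
                final int limit = n;
                Callable<Integer> task = () -> sumOfSquares(limit);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // safe hand-off of each confined result
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // 1 + 5 + 14 = 20
    }
}
```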
since shared objects are stored in the heap shared by threads, why
some threads might not see the most updated value by other threads?
In this age of multi-core processors, each CPU core on which a JVM may be running has its own levels of local cache memory that no other core can see. This gets to the heart of why values written to variables in one thread are not guaranteed to be visible to another thread: the Java Memory Model makes no guarantees about when values written by one thread will become visible to other threads, because it does not specify when cached values will be written back from cache to memory.
It is, in fact, usual for the unsynchronized access of values to be stale (or inconsistent) when those values are accessed by many threads. Depending on the state transition that is occurring, thread safety in a concurrent environment in which many threads may be accessing the same value, may require:
mutual exclusion
atomicity protection
visibility guarantees
or all of the above
in order to achieve a thread safety policy that allows your program to be correct.

How does the JSR-133 cookbook enforce all the guarantees made by the Java Memory Model

My understanding is that the JSR-133 cookbook is a frequently quoted guide on how to implement the Java memory model (or at least its visibility guarantees) using a series of memory barriers.
It is also my understanding based on the description of the different types of barriers, that StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensure fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
I was looking at the table of specific barriers required for different program order inter-leavings of volatile/regular stores/loads and what memory barriers would be required.
From my intuition this table seems incomplete. For example, the Java memory model guarantees visibility on the acquire action of a monitor to all actions performed before its release in another thread, even if the values being updated are non-volatile. In the table in the link above, it seems as if the only actions that flush CPU buffers and propagate changes/allow new changes to be observed are a Volatile Store or MonitorExit followed by a Volatile Load or MonitorEnter. I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore, which from my understanding are only concerned with re-ordering within a thread and cannot enforce the happens-before guarantee (across threads, that is).
Where have I gone wrong in my understanding here? Or does this implementation only enforce happens before and not the synchronization guarantees or extra actions on acquiring/releasing monitors.
Thanks

StoreLoad is the only one that guarantees all CPU buffers are flushed to cache and therefore ensure fresh reads (by avoiding store forwarding) and guarantees the observation of the latest value due to cache coherency.
This may be true for x86 architectures, but you shouldn't be thinking on that level of abstraction. It may be the case that cache coherence can be costly for the processors to be executing.
Take mobile devices for example, one important goal is to reduce the amount of battery use programs consume. In that case, they may not participate in cache coherence and StoreLoad loses this feature.
I don't see how the barriers could guarantee visibility in my above example, when those operations (according to the table) only use LoadStore and StoreStore which from my understanding are only concerned with re-ordering in a thread and cannot enforce the happens before guarantee (across threads that is).
Let's just consider a volatile field. How would a volatile load and store look? Well, Aleksey Shipilëv has a great write up on this, but I will take a piece of it.
A volatile store and then subsequent load would look like:
<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store
[StoreLoad] // Case (a): Guard after volatile stores
...
[StoreLoad] // Case (b): Guard before volatile loads
int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>
So, <other ops> can be non-volatile writes, but as you see those writes are committed to memory prior to the volatile store. Then, when we are ready to read, the LoadLoad and LoadStore barriers will force a wait until the volatile store succeeds.
Lastly, the StoreLoad before and after ensures that the volatile load and store cannot be reordered if they immediately precede one another.
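At the Java source level, the barrier placement above corresponds to the familiar "write data, then set a volatile flag" idiom. The following is a minimal sketch under that assumption (the class and field names are hypothetical): the plain write to payload comes before the volatile store in program order, so a reader that observes ready == true is guaranteed by happens-before to also observe payload == 42. Which machine barriers a JVM actually emits is platform specific.

```java
// Illustrative sketch; class and field names are assumptions.
class VolatilePublicationDemo {
    private int payload = 0;                // plain (non-volatile) field
    private volatile boolean ready = false; // volatile flag that publishes payload

    void writer() {
        payload = 42; // ordinary store, ordered before the volatile store
        ready = true; // volatile store (release)
    }

    int reader() {
        while (!ready) {         // volatile load (acquire)
            Thread.onSpinWait(); // spin hint, available since Java 9
        }
        return payload; // guaranteed to see the value written before ready = true
    }

    static int runDemo() {
        VolatilePublicationDemo demo = new VolatilePublicationDemo();
        int[] seen = new int[1];
        Thread readerThread = new Thread(() -> seen[0] = demo.reader());
        readerThread.setDaemon(true);
        readerThread.start();
        demo.writer();
        try {
            readerThread.join(); // join() makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // 42
    }
}
```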

The barriers in the document are abstract concepts that more-or-less map to different things on different CPUs. But they are only guidelines. The rules that the JVMs actually have to follow are those in JLS Chapter 17.
Barriers as a concept are also "global" in the sense that they order all prior and following instructions.
For example, the Java memory model guarantees visibility on the acquire action of a monitor to all actions performed before its release in another thread, even if the values being updated are non-volatile.
Acquiring a monitor is the monitor-enter in the cookbook, which only needs to be visible to other threads that contend on the lock. The monitor-exit is the release action, which will prevent loads and stores prior to it from moving below it. You can see this in the cookbook tables where the first operation is a normal load/store, and the second is a volatile-store or monitor-exit.
On CPUs with Total Store Order, the store buffers, where available, have no impact on correctness; only on performance.
In any case, it's up to the JVM to use instructions that provide the atomicity and visibility semantics that the JLS demands. And that's the key take-away: If you write Java code, you code against the abstract machine defined in the JLS. You would only dive into the implementation details of the concrete machine, if coding only to the abstract machine doesn't give you the performance you need. You don't need to go there for correctness.

I'm not sure where you got that StoreLoad barriers are the only type that enforce some particular behavior. All of the barriers, abstractly, enforce exactly what they are defined to enforce. For example, LoadLoad prevents any prior load from reordering with any subsequent load.
There may be architecture specific descriptions of how a particular barrier is enforced: for example, on x86 all the barriers other than StoreLoad are no-ops since the chip architecture enforces the other orderings automatically, and StoreLoad is usually implemented as a store buffer flush. Still, all the barriers have their abstract definition which is architecture-independent and the cookbook is defined in terms of that, along with a mapping of the conceptual barriers to actual ISA-specific implementations.
In particular, even if a barrier is "no-op" on a particular platform, it means that the ordering is preserved and hence all the happens-before and other synchronization requirements are satisfied.

What are the performance implications of sharing objects among threads?

I know that reading from a single object across multiple threads is safe in Java, as long as the object is not written to. But what are the performance implications of doing that instead of copying the data per thread?
Do threads have to wait for others to finish reading the memory? Or is the data implicitly copied (the reason of existence of volatile)? But what would that do for the memory usage of the entire JVM? And how does it all differ when the object being read is older than the threads that read it, instead of created in their lifetime?

If you know that an object will not change (e.g. immutable objects such as String or Integer) and have, therefore, avoided using any of the synchronization constructs (synchronized, volatile), reading that object from multiple threads does not have any impact on performance. All threads will access the memory where the object is stored in parallel.
The JVM may choose, however, to cache some values locally in each thread for performance reasons. The use of volatile forbids just that behaviour - the JVM will have to explicitly and atomically access a volatile field each and every time.
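As a small sketch of lock-free parallel reads of immutable data (the class name and values are made up for illustration): the list below is created once with List.of, is never modified, and is read by several threads of a parallel stream without any synchronization.

```java
import java.util.List;
import java.util.stream.IntStream;

// Illustrative sketch; the class name and data are assumptions.
class SharedImmutableReadDemo {
    // Immutable list, safely published through a static final field.
    static final List<Integer> PRICES = List.of(3, 5, 7, 11);

    // Each of the "readers" tasks sums the shared list; no locks are involved.
    static int runDemo(int readers) {
        return IntStream.range(0, readers)
                .parallel()
                .map(i -> PRICES.stream().mapToInt(Integer::intValue).sum())
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(8)); // 8 readers * 26 = 208
    }
}
```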

If data is only being read there is no implication, because multiple threads can access the same memory concurrently. You will only receive a performance hit when writing occurs, because of the locking mechanisms. A note on volatile (I can't remember if it's the same in Java as in C): it's used for data that can change from underneath the program (like direct addressing of data in C) or if you want atomicity for your data. Copying the data would not make a difference in performance but would use more memory.

To have a shared state between multiple threads - you'll have to coordinate access to it using some synchronization mechanism - volatile, synchronization, cas. I'm not sure what you expect to hear on "performance implication" - it will depend on the concrete scenario and context. In general you will be paying some price for having to coordinate access to the shared object by multiple threads.

Does the JVM create a mutex for every object in order to implement the 'synchronized' keyword? If not, how?

As a C++ programmer becoming more familiar with Java, it's a little odd to me to see language level support for locking on arbitrary objects without any kind of declaration that the object supports such locking. Creating mutexes for every object seems like a heavy cost to be automatically opted into. Besides memory usage, mutexes are an OS limited resource on some platforms. You could spin lock if mutexes aren't available but the performance characteristics of that are significantly different, which I would expect to hurt predictability.
Is the JVM smart enough in all cases to recognize that a particular object will never be the target of the synchronized keyword and thus avoid creating the mutex? The mutexes could be created lazily, but that poses a bootstrapping problem that itself necessitates a mutex, and even if that were worked around I assume there's still going to be some overhead for tracking whether a mutex has already been created or not. So I assume if such an optimization is possible, it must be done at compile time or startup. In C++ such an optimization would not be possible due to the compilation model (you couldn't know if the lock for an object was going to be used across library boundaries), but I don't know enough about Java's compilation and linking to know if the same limitations apply.

Speaking as someone who has looked at the way that some JVMs implement locks ...
The normal approach is to start out with a couple of reserved bits in the object's header word. If the object is never locked, or if it is locked but there is no contention it stays that way. If and when contention occurs on a locked object, the JVM inflates the lock into a full-blown mutex data structure, and it stays that way for the lifetime of the object.
EDIT - I just noticed that the OP was talking about OS-supported mutexes. In the examples that I've looked at, the uninflated mutexes were implemented directly using CAS instructions and the like, rather than using pthread library functions, etc.

This is really an implementation detail of the JVM, and different JVMs may implement it differently. However, it is definitely not something that can be optimized at compile time, since Java links at runtime, and thus it is possible for previously unknown code to get hold of an object created in older code and start synchronizing on it.
Note that in Java lingo, the synchronization primitive is called "monitor" rather than mutex, and it is supported by special bytecode operations. There's a rather detailed explanation here.

You can never be sure that an object will never be used as a lock (consider reflection). Typically every object has a header with some bits dedicated to the lock. It is possible to implement it such that the header is only added as needed, but that gets a bit complicated and you probably need some header anyway (class (equivalent of "vtbl" and allocation size in C++), hash code and garbage collection).
Here's a wiki page on the implementation of synchronisation in the OpenJDK.
(In my opinion, adding a lock to every object was a mistake.)

Can't the JVM use a compare-and-swap instruction directly? Let's say each object has a field lockingThreadId storing the id of the thread that is locking it:
// compare_and_swap returns true if the field was null and is now thisThreadId
while (!compare_and_swap(obj.lockingThreadId, null, thisThreadId))
    // failed, someone else got it
    mark this thread as waiting on obj
    shelve this thread
// out of loop: this thread has now locked the object
do the work
obj.lockingThreadId = null;
wake up threads waiting on obj
This is a toy model, but it doesn't seem too expensive, and it does not rely on the OS.
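The toy model above can be written in runnable Java roughly as follows. This is only a sketch under simplifying assumptions: it uses an AtomicReference holding the owning Thread instead of a thread id, it spins instead of parking and waking waiters, and it is not reentrant. It is not how any particular JVM implements monitors.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy CAS-based lock; a simplified illustration, not a production lock.
class ToyCasLock {
    private final AtomicReference<Thread> owner = new AtomicReference<>();

    void lock() {
        Thread me = Thread.currentThread();
        // Try to swap null -> me; keep spinning while another thread owns the lock.
        while (!owner.compareAndSet(null, me)) {
            Thread.onSpinWait(); // spin hint, available since Java 9
        }
    }

    void unlock() {
        // Only the owning thread may release the lock.
        if (!owner.compareAndSet(Thread.currentThread(), null)) {
            throw new IllegalMonitorStateException("current thread does not own the lock");
        }
    }

    // Several threads increment a plain int while holding the toy lock.
    static int runDemo(int threads, int incrementsPerThread) {
        ToyCasLock lock = new ToyCasLock();
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 20_000)); // 80000
    }
}
```

Pure spinning wastes CPU time under heavy contention, which is one reason real implementations fall back to blocking once a lock is contended.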
