From the Performance and Scalability chapter of the JCIP book:
The synchronized mechanism is optimized for the uncontended
case (volatile is always uncontended), and at this writing, the
performance cost of a "fast-path" uncontended synchronization ranges
from 20 to 250 clock cycles for most systems.
What does the author mean by fast-path uncontended synchronization here?
There are two distinct concepts here.
Fast-path and Slow-path code
Uncontended and Contended synchronization
Slow-path vs Fast-path code
This distinction refers to what produced the machine-specific binary code.
In the HotSpot VM, slow-path code is binary code produced by the C++ runtime implementation, while fast-path code is code emitted by the JIT compiler.
In a general sense, fast-path code is far more optimised. To fully understand JIT compilers, Wikipedia is a good place to start.
Uncontended and Contended synchronization
Java's synchronization constructs (monitors) have the concept of ownership. When a thread tries to lock a monitor (gain ownership of it), the monitor can either be locked (owned by another thread) or unlocked.
Uncontended synchronization happens in two different scenarios:
Unlocked monitor (ownership gained straight away)
Monitor already owned by the same thread
Contended synchronization, on the other hand, means the thread will be blocked until the owner thread releases the monitor lock.
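As an illustration of the second uncontended scenario, here is a minimal sketch (class and method names are made up) where a thread re-acquires a monitor it already owns:

public class ReentrancyDemo {
    // One thread calling outer() locks this object's monitor once,
    // then re-enters it in inner(): uncontended synchronization both times.
    public synchronized void outer() {
        inner();
    }

    public synchronized void inner() {
        // Same monitor, same owning thread -- no blocking occurs.
    }
}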
Answering the question
By fast-path uncontended synchronization the author means the fastest translation of the bytecode (fast path) in the cheapest scenario (uncontended synchronization).
I'm not familiar with the topic of the book, but in general a “fast path” is a specific possible control flow branch which is significantly more efficient than the others and therefore preferred, but cannot handle complex cases.
I assume that the book is talking about Java's synchronized block/qualifier. In this case, the fast path is most likely one where it is easy to detect that there are no other threads accessing the same data. What the book is saying, then, is that the implementation of synchronized has been optimized to have the best performance in the case where only one thread is actually using the object, as opposed to the case where multiple threads are and the synchronization must actually mediate among them.
The first step of acquiring a synchronized lock is a single volatile write (to the monitor owner field). If the lock is uncontended, that is all that happens.
If the lock is contended, there will be context switches and other mechanisms that add many more clock cycles.
Related
AFAIK, every object in Java has a header whose first word (the mark word) is used for storing locking information: either a flag, if only one thread is acquiring the lock, or a pointer to a lock monitor object, if there is contention between different threads. In both cases, a compare-and-swap construct is used to acquire the lock.
But according to this link:
https://www.baeldung.com/lmax-disruptor-concurrency
To deal with the write contention, a queue often uses locks, which can cause a context switch to the kernel. When this happens the processor involved is likely to lose the data in its caches.
What am I missing?
Neither synchronized nor the standard Lock implementations require a context switch into the kernel when locking uncontended or when unlocking. These operations indeed boil down to an atomic CAS or write.
The performance-critical aspect is contention, i.e. trying to acquire the monitor or lock when it's not available. Waiting for the availability of the monitor or lock implies putting the thread into a waiting state and reactivating it when the resource becomes available. The performance impact of this is so large that you don't need to worry about CPU caches at all.
For this reason, typical implementations perform some amount of spinning, rechecking the availability of the monitor or lock in a loop for some time, when there is a chance of it becoming available in that time. This is usually tied to the number of CPU cores. When the resource becomes available in that time, these costs can be avoided. This, however, usually requires the acquisition to be allowed to be unfair, as a spinning acquisition may overtake an already waiting thread.
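To make that concrete, here is a simplified, hypothetical sketch of the spin-then-block idea in Java (real JVMs tune the spin count adaptively and park waiters in a proper queue rather than polling):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Illustrative only: spin briefly before falling back to sleeping.
public class SpinThenBlockLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private static final int SPIN_LIMIT = 64; // hypothetical constant

    public void lock() {
        // Fast path: spin for a bounded time, hoping the lock frees up soon.
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (locked.compareAndSet(false, true)) {
                return; // acquired while spinning; no descheduling cost paid
            }
            Thread.onSpinWait(); // hint to the CPU that this is a busy-wait
        }
        // Slow path: stop burning CPU and sleep between attempts.
        // Note the unfairness: a freshly spinning thread may grab the lock
        // ahead of one that has been waiting here longer.
        while (!locked.compareAndSet(false, true)) {
            LockSupport.parkNanos(1_000);
        }
    }

    public void unlock() {
        locked.set(false);
    }
}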
Note that the linked article says before your cited sentence:
Queues are typically always close to full or close to empty due to the differences in pace between consumers and producers.
In such a scenario, the faster threads will sooner or later enter a condition wait, waiting for new space or new items in a queue, even when they acquired the lock without contention. So in this specific scenario, the associated costs are indeed there and unavoidable when using a simple queue implementation.
I was asked by an interviewer about the disadvantages of using a thread-safe class like Hashtable in a single-threaded environment. Are there any disadvantages? If not, why were non-thread-safe classes introduced later?
I was asked by an interviewer about the disadvantages of using a thread-safe class like Hashtable in a single-threaded environment.
There are, although most of the disadvantages are around performance. Even single-threaded environments have multiple threads in them (think GC, finalizers, signal handlers, JMX, etc.) so the language still needs to obey the synchronization constructs such as synchronized, volatile, and the native lock implementations. These language features flush or invalidate memory caches and affect code reordering both of which can dramatically affect overall runtime performance.
If not, why were non-thread-safe classes introduced later?
Non-thread-safe objects generally perform better than their thread-safe counterparts, in both single- and multi-threaded applications. The ability to work with local CPU cache memory is one of the main speed advantages provided by modern hardware; if you don't have to reach out to the main memory bus, you can execute operations orders of magnitude faster. Synchronization constructs reduce how effectively that cache memory can be used.
Lastly, thread-safe classes are typically more complicated, both in terms of the data structures involved and the logic necessary for them to operate correctly in a multi-threaded application. This means that even if we ignore the synchronization constructs, they may use more memory and run slower, although the degree to which this is the case depends very much on the particular class in question.
They are slower in a single-threaded environment. Modern JITs are very effective at optimising synchronized classes in a single-threaded environment, but they are not perfect.
They are much slower in a multi-threaded environment. An immutable collection can be used safely from different threads, whereas a synchronized collection will work much slower.
[design] Their locking semantics are mostly useless, so additional synchronization is needed anyway. You rarely need just a read or just a write; most of the time you read then write, and you want that compound action to be atomic. Or you want to allow multiple simultaneous reads.
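To illustrate that last point, a small hypothetical example: each individual Hashtable call is synchronized, yet the compound read-then-write sequence is still a race without external locking:

import java.util.Hashtable;
import java.util.Map;

public class CheckThenAct {
    private final Map<String, Integer> counts = new Hashtable<>();

    // BROKEN despite Hashtable being "thread-safe": the get and the put
    // are each synchronized, but the read-then-write sequence is not atomic,
    // so two threads can both read the same old value and lose an update.
    public void incrementBroken(String key) {
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    // Correct: hold one lock across the whole compound action.
    public void incrementSafe(String key) {
        synchronized (counts) {
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
    }
}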
Is there a difference between 'ReentrantLock' and 'synchronized' on how it's implemented on CPU level?
Or do they use the same 'CAS' approach?
If we are talking about ReentrantLock vs synchronized (also known as the "intrinsic lock"), then it's a good idea to look at the Lock documentation:
All Lock implementations must enforce the same memory synchronization semantics as provided by the built-in monitor lock:
A successful lock operation acts like a successful monitorEnter action
A successful unlock operation acts like a successful monitorExit action
So in general, consider synchronized to be just an easy-to-use and concise approach to locking. You can achieve exactly the same synchronization effects by writing code with ReentrantLock, at the cost of a bit more code (but it offers more options and flexibility).
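For example (an illustrative sketch, not taken from the documentation), these two methods have the same locking and memory-visibility effects:

import java.util.concurrent.locks.ReentrantLock;

public class Equivalent {
    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();
    private int counter;

    public void withSynchronized() {
        synchronized (monitor) {   // implicit monitorEnter / monitorExit
            counter++;
        }
    }

    public void withReentrantLock() {
        lock.lock();               // same memory semantics as monitorEnter
        try {
            counter++;
        } finally {
            lock.unlock();         // must release explicitly, hence try/finally
        }
    }
}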
Some time ago, ReentrantLock was way faster under certain conditions (high contention, for example), but now Java uses different optimization techniques (like lock coarsening and adaptive spinning) to make the performance differences barely visible to the programmer in many typical scenarios.
A great deal of work was also done to optimize the intrinsic lock in low-contention cases (e.g. biased locking). The authors of the Java platform like the synchronized keyword and the intrinsic-locking approach; they want programmers not to fear using this handy tool (and to prevent possible bugs). That's why synchronized optimizations and busting the "synchronization is slow" myth were such a big deal for Sun and Oracle.
"CPU-part" of the question:
synchronized uses a locking mechanism that is built into the JVM and the MONITORENTER / MONITOREXIT bytecode instructions. So the underlying implementation is JVM-specific (that is why it is called an intrinsic lock) and, AFAIK, usually (subject to change) uses a pretty conservative strategy: once a lock is "inflated" after threads collide while acquiring it, synchronized switches to OS-based locking ("fat locking") instead of the fast CAS ("thin locking"), and does not "like" to switch back to CAS soon (even after the contention is gone).
The ReentrantLock implementation is based on AbstractQueuedSynchronizer and is coded in pure Java (it uses CAS instructions and the thread-descheduling support introduced in Java 5), so it is more stable across platforms, offers more flexibility, and tries the fast CAS approach for acquiring the lock every time (falling back to queueing and parking the thread if that fails).
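To give a feel for the AQS approach, here is a bare-bones, non-reentrant mutex loosely following the pattern shown in the AbstractQueuedSynchronizer javadoc; the real ReentrantLock adds reentrancy, fairness modes and much more:

import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class SimpleMutex {
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int ignored) {
            // Fast path: a single CAS on the AQS state field.
            return compareAndSetState(0, 1);
        }

        @Override
        protected boolean tryRelease(int ignored) {
            setState(0);
            return true;
        }
    }

    private final Sync sync = new Sync();

    // acquire() tries the CAS first; if it fails, AQS enqueues the
    // thread and parks it until the lock is released.
    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}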
So, the main difference between these lock implementations, in terms of performance, is the lock-acquisition strategy (which may not even differ in a specific JVM implementation or situation).
And there is no general answer as to which lock is better; it is also subject to change over time and across platforms. You should look at the specific problem and its nature to pick the most suitable solution (as usual in Java).
PS: you're pretty curious, so I highly recommend looking at the HotSpot sources to go deeper (and to find the exact implementation for a specific platform version). It may really help. A starting point is somewhere here: http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/runtime/synchronizer.cpp
The ReentrantLock class, which implements Lock, has the same concurrency and memory semantics as synchronized, but also adds features like lock polling, timed lock waits, and interruptible lock waits. Additionally, it offers far better performance under heavy contention.
Source
The answer above is an extract from Brian Goetz's article. You should read the entire article; it helped me understand the differences between the two.
I can see that ReentrantLock is around 50% faster than synchronized, and AtomicInteger 100% faster. Why is there such a difference in execution time between these three synchronization methods: synchronized blocks, ReentrantLock and AtomicInteger (or any class from the atomic package)?
Are there any other popular and widely used synchronization methods besides these?
A number of factors affect this.
the version of Java. Java 5.0 was much faster for ReentrantLock; Java 7, not so much
the level of contention. synchronized works best (as does locking in general) with low contention rates. ReentrantLock works better with higher contention rates. YMMV
how much optimisation the JIT can do. The JIT optimises synchronized in ways ReentrantLock is not optimised. If this is not possible, you won't see the advantage.
synchronized is GC-free in its actions. ReentrantLock can create garbage, which can make it slower and trigger GCs depending on how it is used.
AtomicInteger uses the same primitives that locking uses, but does a busy wait. It relies on compareAndSet (also called compare-and-swap), i.e. it is much simpler in what it does (and much more limited as well)
The ConcurrentXxxx and CopyOnWriteArrayXxxx collections are very popular. These provide concurrency without needing to use locks directly (and in some cases no locks at all)
AtomicInteger is much faster than the other two synchronization methods on your hardware because it is lock-free. On architectures where the CPU provides basic facilities for lock-free concurrency, AtomicInteger's operations are performed entirely in hardware, with the critical update usually taking a single CPU instruction. In contrast, ReentrantLock and synchronized use multiple instructions to perform their task, so you see some considerable overhead associated with them.
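Conceptually, AtomicInteger's read-modify-write operations are an optimistic retry loop around that single hardware CAS instruction. The sketch below (hand-written for illustration) mirrors what getAndIncrement does internally:

import java.util.concurrent.atomic.AtomicInteger;

public class CasLoopDemo {
    private final AtomicInteger value = new AtomicInteger();

    // An optimistic retry loop around a single hardware CAS:
    // no lock is held and no thread is ever parked.
    public int getAndIncrementByHand() {
        for (;;) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current; // CAS succeeded; nobody raced us
            }
            // CAS failed: another thread updated the value first; retry.
        }
    }
}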
I think you are making a common mistake in evaluating those three elements for comparison.
Basically, ReentrantLock gives you more flexibility when synchronizing blocks compared with the synchronized keyword, while the atomic classes adopt a different approach, based on CAS (compare-and-swap), to manage updates in a concurrent context.
I suggest you read, in depth, the bible of concurrency for the Java platform:
Java Concurrency in Practice - Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes & Doug Lea
There's a big difference between having deep knowledge of concurrency and merely knowing what a language offers you to solve concurrency problems and take advantage of multithreading.
In terms of performance, it depends on the current scenario.
When objects are locked in languages like C++ and Java, where (on a low-level scale) is this actually performed? I don't think it has anything to do with the CPU/cache or RAM. My best guesstimate is that this occurs somewhere in the OS. Would it be within the same part of the OS that performs context switching?
I am referring to locking objects, synchronizing on method signatures (Java) etc.
It could be that the answer depends on which particular locking mechanism is used?
Locking involves a synchronisation primitive, typically a mutex. While, naively speaking, a mutex is just a boolean flag that says "locked" or "unlocked", the devil is in the detail: the mutex value has to be read, compared and set atomically, so that multiple threads trying for the same mutex don't corrupt its state.
But apart from that, instructions have to be ordered properly so that the effects of a read and write of the mutex variable are visible to the program in the correct order and that no thread inadvertently enters the critical section when it shouldn't because it failed to see the lock update in time.
There are two aspects to memory access ordering: One is done by the compiler, which may choose to reorder statements if that's deemed more efficient. This is relatively trivial to prevent, since the compiler knows when it must be careful. The far more difficult phenomenon is that the CPU itself, internally, may choose to reorder instructions, and it must be prevented from doing so when a mutex variable is being accessed for the purpose of locking. This requires hardware support (e.g. a "lock bit" which causes a pipeline flush and a bus lock).
Finally, if you have multiple physical CPUs, each CPU will have its own cache, and it becomes important that state updates are propagated to all CPU caches before any executing instructions make further progress. This again requires dedicated hardware support.
As you can see, synchronisation is a (potentially) expensive business that really gets in the way of concurrent processing. That, however, is simply the price you pay for having one single block of memory on which multiple independent contexts perform work.
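As a minimal illustration of the atomic-flag idea described above, here is a naive test-and-set spinlock in Java (illustrative only; it busy-waits and provides no fairness). compareAndSet maps to an atomic hardware instruction and also provides the memory-ordering guarantees discussed above:

import java.util.concurrent.atomic.AtomicBoolean;

public class NaiveSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Atomically flip the flag from "unlocked" to "locked";
        // losers spin until the holder releases it.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // acceptable only for very short critical sections
        }
    }

    public void unlock() {
        // The volatile write publishes the critical section's effects
        // to other CPUs before they can acquire the lock.
        locked.set(false);
    }
}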
There is no concept of object locking in C++. You will typically implement your own on top of OS-specific functions or use synchronization primitives provided by libraries (e.g. boost::scoped_lock). If you have access to C++11, you can use the locks provided by the threading library, which has a similar interface to boost; take a look.
In Java the same is done for you by the JVM.
The java.lang.Object class has a monitor built into it; that's what is used for locking by the synchronized keyword. Java 5 added the java.util.concurrent package, which gives you more fine-grained choices.
This has a nice explanation:
http://www.artima.com/insidejvm/ed2/threadsynch.html
I haven't written C++ in a long time, so I can't speak to how to do it in that language. It wasn't supported by the language when I last wrote it. I believe it was all 3rd party libraries or custom code.
It does depend on the particular locking mechanism (typically a semaphore), but you cannot be sure, since it is implementation-dependent.
All architectures I know of use an atomic compare-and-swap to implement their synchronization primitives. See, for example, AbstractQueuedSynchronizer, which was used in some JDK versions to implement Semaphore and ReentrantLock.