Suppose the following two counter implementations:
import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    private final AtomicInteger atomic = new AtomicInteger(0);
    private int i = 0;

    public void incrementAtomic() {
        atomic.incrementAndGet();
    }

    public synchronized void increment() {
        i++;
    }
}
At first glance atomics should be faster and more scalable. And they are, I believe. But are they faster than synchronized blocks all the time? Or do situations exist where this rule is broken (e.g. SMP vs. single-CPU machines, different CPU ISAs, OSes, etc.)?
incrementAndGet may well be implemented as a CAS loop. In highly contended situations that can result in n-1 of your n threads failing their CAS and retrying, an O(n) cost per successful increment.
(For the geeks:

Typically, getAndIncrement may be implemented something like this:

int old;
do {
    old = get();
} while (!compareAndSet(old, old + 1));
return old;

Imagine you have n threads executing this code on the same atomic, and they happen to be in step with each other. The first round does kn work, but only one CAS will succeed. The other n-1 threads will repeat the exercise until there is only one left. So the total work is O(n^2) in the worst case instead of O(n).)
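To make that concrete, here is roughly how getAndIncrement was implemented against the real AtomicInteger API in older JDKs (current JDKs use a getAndAddInt intrinsic instead, but the retry structure is the same idea; treat this as a sketch, not the actual JDK source):

import java.util.concurrent.atomic.AtomicInteger;

class CasLoopDemo {
    private final AtomicInteger value = new AtomicInteger(0);

    int getAndIncrement() {
        for (;;) {
            int current = value.get();               // read the current value
            int next = current + 1;
            if (value.compareAndSet(current, next))  // fails if another thread won the race
                return current;                      // success: our increment is published
            // CAS failed: another thread updated value first; loop and retry
        }
    }
}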
Having said that, acquiring a lock needs to do something similar at best, and locks aren't at their best when heavily contended. You're not likely to see much of an advantage for locks until you have a CAS loop that requires significant computation between the get and the compareAndSet.
Or do situations exist where this rule is broken (e.g. SMP vs. single-CPU machines, different CPU ISAs, OSes, etc.)?
I don't know of any. (And I stand ready to be corrected ... if someone knows of a concrete counter-example.)
However (and this is my main point) there is no theoretical reason why you couldn't have a hardware architecture or a poorly implemented JVM in which synchronized is the same speed or faster than the atomic types. (The relative speed of these two forms of synchronization is an implementation issue, and as such can only be quantified for existing implementations.)
And of course, this doesn't mean you should never use synchronized. The synchronized construct has many use-cases that the atomic classes don't address.
It's implementation-dependent - so ultimately you'll need to benchmark on your particular platform / JVM / configuration.
Having said that, atomics should always be faster for the following reasons:
Atomics are designed so that the JVM can take advantage of atomic machine instructions, which are the fastest atomic operations available on most platforms.
synchronized makes use of relatively heavyweight locking schemes with monitor objects, which are designed to protect potentially large blocks of code. This form of locking is inherently more complicated than atomic operations, so you would expect it to have a higher runtime cost.
As others have said, this is implementation-dependent. But keep in mind that if your program invariants involve more than one variable, then you have to use synchronization to update them together. You can't perform an atomic operation on two related variables just because they are of an Atomic type. In that case your only friend is synchronized.
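For example (a minimal sketch of my own, not from the original answer): suppose the invariant is lower <= upper at all times. Two independent AtomicIntegers cannot preserve it, because a reader can observe the window between the two updates; a synchronized method updates both as one unit:

class Range {
    private int lower = 0;
    private int upper = 10;

    // Both writes happen under one lock, so no reader ever sees
    // lower moved while upper has not moved yet.
    synchronized void shift(int delta) {
        lower += delta;
        upper += delta;
    }

    synchronized boolean contains(int x) {
        return lower <= x && x <= upper;
    }
}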
Atomic variables will always be faster.
You can see that the java.util.concurrent package relies heavily on atomic variables rather than synchronized blocks.
Related
if (var != X)
    var = X;
Is it sensible or not? Will the compiler always optimize-out the if statement? Are there any use cases that would benefit from the if statement?
What if var is a volatile variable?
I'm interested in both C++ and Java answers, since volatile has different semantics in the two languages. Also, Java's JIT compilation can make a difference.
The if statement introduces branching and an additional read that wouldn't happen if we always overwrote var with X, so it's bad. On the other hand, if var == X then with this optimization we perform only a read and no write, which could have some effect on the cache. Clearly, there are trade-offs here. I'd like to know how it looks in practice. Has anyone done any testing on this?
EDIT:
I'm mostly interested in how this looks in a multi-processor environment. In a trivial situation there doesn't seem to be much sense in checking the variable first, but when cache coherency has to be kept between processors/cores, the extra check might actually be beneficial. I just wonder how big an impact it can have. Also, shouldn't the processor do such an optimization itself? If var == X, assigning it the value X once more should not 'dirty' the cache. But can we rely on this?
Yes, there are definitely cases where this is sensible, and as you suggest, volatile variables are one of those cases - even for single threaded access!
Volatile writes are expensive, both from a hardware and a compiler/JIT perspective. At the hardware level, these writes might be 10x-100x more expensive than a normal write, since write buffers have to be flushed (on x86, the details will vary by platform). At the compiler/JIT level, volatile writes inhibit many common optimizations.
Speculation, however, can only get you so far - the proof is always in the benchmarking. Here's a microbenchmark that tries your two strategies. The basic idea is to copy values from one array to another (pretty much System.arraycopy), with two variants - one which copies unconditionally, and one that checks to see if the values are different first.
Here are the copy routines for the simple, non-volatile case (full source here):
// no check
for (int i = 0; i < ARRAY_LENGTH; i++) {
    target[i] = source[i];
}

// check, then set if unequal
for (int i = 0; i < ARRAY_LENGTH; i++) {
    int x = source[i];
    if (target[i] != x) {
        target[i] = x;
    }
}
The results using the above code to copy an array length of 1000, using Caliper as my microbenchmark harness, are:
benchmark    arrayType     ns  linear runtime
CopyNoCheck  SAME         470  =
CopyNoCheck  DIFFERENT    460  =
CopyCheck    SAME        1378  ===
CopyCheck    DIFFERENT   1856  ====
This also includes about 150ns of overhead per run to reset the target array each time. Skipping the check is much faster - about 0.47 ns per element (or around 0.32 ns per element after we remove the setup overhead, so pretty much exactly 1 cycle on my box).
Checking is about 3x slower when the arrays are the same, and 4x slower when they are different. I'm surprised at how bad the check is, given that it is perfectly predicted. I suspect that the culprit is largely the JIT - with a much more complex loop body, it may be unrolled fewer times, and other optimizations may not apply.
Let's switch to the volatile case. Here, I've used AtomicIntegerArray as my arrays of volatile elements, since Java doesn't have any native array types with volatile elements. Internally, this class is just writing straight through to the array using sun.misc.Unsafe, which allows volatile writes. The assembly generated is substantially similar to normal array access, other than the volatile aspect (and possibly range check elimination, which may not be effective in the AIA case).
Here's the code:
// no check
for (int i = 0; i < ARRAY_LENGTH; i++) {
    target.set(i, source[i]);
}

// check, then set if unequal
for (int i = 0; i < ARRAY_LENGTH; i++) {
    int x = source[i];
    if (target.get(i) != x) {
        target.set(i, x);
    }
}
And here are the results:
arrayType  benchmark         us  linear runtime
SAME       CopyCheckAI     2.85  =======
SAME       CopyNoCheckAI  10.21  ===========================
DIFFERENT  CopyCheckAI    11.33  ==============================
DIFFERENT  CopyNoCheckAI  11.19  =============================
The tables have turned. Checking first is ~3.5x faster than the usual method. Everything is much slower overall - in the check case, we are paying ~3 ns per loop, and in the worst cases ~10 ns (the times above are in microseconds, and cover the copy of the whole 1000-element array). Volatile writes really are more expensive. There is about 1 ns of overhead included in the DIFFERENT case to reset the array on each iteration (which is why even the simple copy is slightly slower for DIFFERENT). I suspect a lot of the overhead in the "check" case is actually bounds checking.
This is all single-threaded. If you actually had cross-core contention over a volatile, the results would be much, much worse for the simple method, and about as good as the above for the check case (the cache line would just sit in the shared state - no coherency traffic needed).
I've also only tested the extremes of "every element equal" vs "every element different". This means the branch in the "check" algorithm is always perfectly predicted. If you had a mix of equal and different, you wouldn't get just a weighted combination of the times for the SAME and DIFFERENT cases - you do worse, due to misprediction (both at the hardware level, and perhaps also at the JIT level, which can no longer optimize for the always-taken branch).
So whether it is sensible, even for volatile, depends on the specific context - the mix of equal and unequal values, the surrounding code and so on. I'd usually not do it for volatile alone in a single-threaded scenario, unless I suspected a large number of sets are redundant. In heavily multi-threaded structures, however, reading and then doing a volatile write (or other expensive operation, like a CAS) is a best practice, and you'll see it in quality code such as the java.util.concurrent structures.
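As a concrete instance of that best practice, here is a minimal sketch of my own (not code from java.util.concurrent): pay the cheap volatile read every time, and the expensive volatile write only when the value actually changes.

import java.util.concurrent.atomic.AtomicInteger;

class RedundantWriteGuard {
    // Skip the write when it would be redundant; with concurrent writers this
    // is a benign race, since the set is idempotent for equal values.
    static void setIfChanged(AtomicInteger target, int newValue) {
        if (target.get() != newValue) { // volatile read: roughly the cost of a plain load
            target.set(newValue);       // volatile write: store-buffer flush, much pricier
        }
    }
}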
Is it a sensible optimization to check whether a variable holds a specific value before writing that value?
Are there any use cases that would benefit from the if statement?
It is when assignment is significantly more costly than an inequality comparison returning false.
An example would be a large* std::set, which may require many heap allocations to duplicate.
*for some definition of "large"
Will the compiler always optimize-out the if statement?
That's a fairly safe "no", as are most questions that contain both "optimize" and "always".
The C++ standard makes rare mention of optimizations, but never demands one.
What if var is a volatile variable?
Then it may perform the if, although volatile doesn't achieve what most people assume.
In general the answer is no: if you have a simple data type, the compiler can perform any necessary optimizations itself, and for types with a heavyweight operator= it is the responsibility of operator= to choose the optimal way to assign a new value.
There are situations where even a trivial assignment of, say, a pointer-sized variable can be more expensive than a read and branch (especially if the branch is predictable).
Why? Multithreading. If several threads are only reading the same value, they can all share that value in their caches. But as soon as you write to it, you have to invalidate the cache line, and readers have to fetch the updated value the next time they want it to keep their caches coherent. Both situations lead to more traffic between the cores and add latency to the reads.
If the branch is pretty unpredictable, though, it's probably still slower.
In C++, assigning a SIMPLE variable (that is, a normal integer or float variable) is definitely and always faster than checking if it already has that value and then setting it if it didn't. I would be very surprised if this weren't true in Java too, but I don't know how complicated or simple things are in Java - I've only written a few hundred lines of it, and haven't actually studied how bytecode and JIT-compiled bytecode work.
Clearly, if the variable is very easy to check but complicated to set - which could be the case for classes and other such things - then there may be value in the check. The typical case is code where the "value" is some sort of index or hash, and if it's not a match, a whole lot of work is required. One example would be a task switch:
if (current_process != new_process_to_run)
    current_process = new_process_to_run;
Because here, a "process" is a complex object to alter, but the != comparison can be done on just the ID of the process.
Whether the object is simple or complex, the compiler will almost certainly not understand what you are trying to do here, so it will probably not optimize it away - but compilers are more clever than you think SOMETIMES, and more stupid at other times, so I wouldn't bet either way.
volatile should always force the compiler to read and write the variable, whether it "thinks" that is necessary or not, so it will definitely READ the variable and WRITE the variable. Of course, if the variable is volatile, it probably means that it can change or that it represents some hardware, so you should be EXTRA careful with how you treat it yourself. An extra read of a PCI-X card could incur several bus cycles (bus cycles being an order of magnitude slower than the processor speed!), which is likely to affect performance much more. But writing to a hardware register may (for example) cause the hardware to do something unexpected, and checking that we have that value first MAY make it faster, because "some operation starts over", or something like that.
It would be sensible if you had read-write locking semantics involved, whenever reading is usually less disruptive than writing.
In Objective-C you have the situation where assigning an object address to a pointer variable may require that the object be "retained" (reference count incremented). In such a case it makes sense to check whether the value being assigned is the same as the value currently in the pointer variable, to avoid the relatively expensive increment/decrement operations.
Other languages that use reference counting likely have similar scenarios.
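A sketch of that pattern in Java-like code (RefCounted, retain(), and release() are hypothetical here; Java itself does not use reference counting):

interface RefCounted {
    void retain();   // increment the reference count
    void release();  // decrement it; the object is freed at zero
}

class Holder {
    private RefCounted current;

    void assign(RefCounted newValue) {
        if (current == newValue) return;          // same object: skip the costly retain/release pair
        if (newValue != null) newValue.retain();
        if (current != null) current.release();
        current = newValue;
    }
}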
But when assigning, say, an int or a boolean to a simple variable (outside of the multiprocessor cache scenario mentioned elsewhere), the test is rarely merited. The speed of a store on most processors is at least as fast as the load/test/branch.
In Java the answer is always no, since all assignments you can do in Java are of primitives or references, both of which are cheap to copy. In C++, the answer is still pretty much always no - if copying is so much more expensive than an equality check, the class in question should do that equality check itself.
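For instance (my own sketch, with a hypothetical rebuildCache() standing in for the expensive work): a class whose assignment has a costly side effect can own the equality check, so callers never write the if themselves:

import java.util.Objects;

class CachedLabel {
    private String label;

    void setLabel(String newLabel) {
        if (Objects.equals(label, newLabel)) return; // unchanged: skip the expensive rebuild
        label = newLabel;
        rebuildCache(); // hypothetical costly side effect worth avoiding
    }

    private void rebuildCache() { /* ... */ }
}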
Is a nested synchronized block faster to get into than a normal synchronized block? Or for example, which of the following routines is faster:
void routine1(SyncClass a) {
    a.syncMethod1();
    a.syncMethod2();
    a.syncMethod1();
}

void routine2(SyncClass a) {
    synchronized (a) {
        a.syncMethod1();
        a.syncMethod2();
        a.syncMethod1();
    }
}
The methods are synchronized. I am considering the use of a thread-safe object in a situation where thread safety is not needed, so the level of concurrency is not affected.
Also, is the answer platform-dependent?
You're better off synchronizing the smallest code elements you can, performance-wise, regardless of the platform.
Wrapping a number of synchronized calls in a synchronized block will reduce concurrency (and so, performance). Only do it if you need that particular sequence of calls to be synchronized.
If you're concerned about the performance impact beyond that which derives from concurrency, I don't know which is faster. However, I would expect the difference in performance between the two methods you describe to be imperceptible.
It appears that the answer is yes, as per comments left on the question. But with two caveats.
1) Due to fewer opportunities for parallel execution, threads may wait on each other more often.
2) The compiler may optimize this way automatically.
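On caveat 2: Java monitors are reentrant, so once routine2 holds the lock on a, each nested syncMethod call re-acquires a monitor the thread already owns, which is a cheap path (and HotSpot can sometimes coarsen adjacent locks on the same object automatically). A minimal sketch of reentrancy (my illustration):

class SyncClass {
    synchronized void syncMethod1() {
        // ... critical work ...
    }

    synchronized void syncMethod2() {
        syncMethod1(); // reentrant: same thread, same monitor, no blocking
    }
}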
This servlet seems to fetch an object from Ehcache, from an Element which holds this object: http://code.google.com/p/adwhirl/source/browse/src/obj/HitObject.java?repo=servers-mobile
It then goes on to increment the counter which is an atomic long:
http://code.google.com/p/adwhirl/source/browse/src/servlet/MetricsServlet.java?repo=servers-mobile#174
// Atomically record the hit
if (i_hitType == AdWhirlUtil.HITTYPE.IMPRESSION.ordinal()) {
    ho.impressions.incrementAndGet();
} else {
    ho.clicks.incrementAndGet();
}
This doesn't seem thread-safe to me, as multiple threads could be fetching from the cache, and if both increment at the same time you might lose a click/impression count.
Do you agree that this is not thread-safe?
AtomicLong and AtomicInteger use a CAS internally -- compare and set (or compare-and-swap). The idea is that you tell the CAS two things: the value you expect the long/int to have, and the value you want to update it to. If the long/int has the value you say it should have, the CAS will atomically make the update and return true; otherwise, it won't make the update, and it'll return false. Many modern chips support CAS very efficiently at the machine-code level; if the JVM is running in an environment that doesn't have a CAS, it can use mutexes (what Java calls synchronization) to implement the CAS. Regardless, once you have a CAS, you can safely implement an atomic increment via this logic (in pseudocode):
long incrementAndGet(atomicLong, byIncrement)
    do
        oldValue = atomicLong.get()              // 1
        newValue = oldValue + byIncrement
    while ! atomicLong.cas(oldValue, newValue)   // 2
    return newValue
If another thread comes in and does its own increment between lines // 1 and // 2, the CAS will fail and the loop will try again. Otherwise, the CAS will succeed.
There's a gamble in this kind of approach: if there's low contention, a CAS is faster than a synchronized block and isn't as likely to cause a thread context switch. But if there's a lot of contention, some threads will have to go through multiple loop iterations per increment, which obviously amounts to wasted work. Generally speaking, though, incrementAndGet is going to be faster under most common loads.
The increment is thread-safe, since AtomicInteger and family guarantee that. But there is a problem with the insertion into and fetching from the cache, where two (or more) HitObjects could be created and inserted. That could cause some hits to be lost the first time a given HitObject is accessed. As @denis.solonenko has pointed out, there is already a TODO in the code to fix this.
However, I'd like to point out that this code only suffers from a race on first access to a given HitObject. Once the HitObject is in the cache (and no more threads are creating or inserting it), this code is perfectly thread-safe. So this is only a very limited concurrency problem, and probably that's the reason it has not yet been fixed.
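A minimal sketch of how that first-access race could be closed, assuming the cache is (or is fronted by) a ConcurrentMap - this is my illustration, not AdWhirl's actual fix, and the HitObject stub below stands in for the real class:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

class HitObject { // minimal stub for the class from the question
    final AtomicLong impressions = new AtomicLong();
    final AtomicLong clicks = new AtomicLong();
}

class HitObjectCache {
    private final ConcurrentMap<String, HitObject> cache = new ConcurrentHashMap<String, HitObject>();

    // putIfAbsent makes create-and-insert atomic, so every thread agrees on a
    // single HitObject per key and no early hits are lost.
    HitObject getOrCreate(String key) {
        HitObject ho = cache.get(key);
        if (ho == null) {
            HitObject fresh = new HitObject();
            HitObject prev = cache.putIfAbsent(key, fresh); // null means ours went in
            ho = (prev != null) ? prev : fresh;             // loser adopts the winner's object
        }
        return ho;
    }
}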
I've got a gigantic Trove map and a method that I need to call very often from multiple threads. Most of the time this method shall return true. The threads are doing heavy number crunching, and I noticed that there was some contention due to the following method (it's just an example; my actual code is a bit different):
synchronized boolean containsSpecial() {
    return troveMap.contains(key);
}
Note that it's an "append-only" map: once a key is added, it stays in there forever (which is important for what comes next, I think).
I noticed that by changing the above to:
boolean containsSpecial() {
    if (troveMap.contains(key)) {
        // most of the time (>90%) we pass here, dodging lock acquisition
        return true;
    }
    synchronized (this) {
        return troveMap.contains(key);
    }
}
I get a 20% speedup in my number crunching (verified over lots of runs, long running times, etc.).
Does this optimization look correct (knowing that once a key is there it shall stay there forever)?
What is the name for this technique?
EDIT
The code that updates the map is called way less often than the containsSpecial() method and looks like this (I've synchronized the entire method):
synchronized void addSpecialKeyValue(key, value) {
    ....
}
This code is not correct.
Trove doesn't handle concurrent use itself; it's like java.util.HashMap in that regard. So, like HashMap, even seemingly innocent, read-only methods like containsKey() could throw a runtime exception or, worse, enter an infinite loop if another thread modifies the map concurrently. I don't know the internals of Trove, but with HashMap, rehashing when the load factor is exceeded, or removing entries can cause failures in other threads that are only reading.
If the operation takes a significant amount of time compared to lock management, using a read-write lock to eliminate the serialization bottleneck will improve performance greatly. In the class documentation for ReentrantReadWriteLock, there are "Sample usages"; you can use the second example, for RWDictionary, as a guide.
In this case, the map operations may be so fast that the locking overhead dominates. If that's the case, you'll need to profile on the target system to see whether a synchronized block or a read-write lock is faster.
Either way, the important point is that you can't safely remove all synchronization, or you'll have consistency and visibility problems.
It's called wrong locking ;-) Actually, it is a variant of the double-checked locking approach, and the original version of that approach is just plain wrong in Java.
Java threads are allowed to keep private copies of variables in their local memory (think: core-local cache of a multi-core machine). Any Java implementation is allowed to never write changes back into the global memory unless some synchronization happens.
So, it is very well possible that one of your threads has a local memory in which troveMap.contains(key) evaluates to true. Therefore, it never synchronizes and it never gets the updated memory.
Additionally, what happens when contains() sees an inconsistent memory state of the troveMap data structure?
Look up the Java memory model for the details, or have a look at this book: Java Concurrency in Practice.
This looks unsafe to me. Specifically, the unsynchronized calls will be able to see partial updates, either due to memory visibility (a previous put not getting fully published, since you haven't told the JMM it needs to be) or due to a plain old race. Imagine if TroveMap.contains has some internal variable that it assumes won't change during the course of contains. This code lets that invariant break.
Regarding the memory visibility, the problem with that isn't false negatives (you use the synchronized double-check for that), but that trove's invariants may be violated. For instance, if they have a counter, and they require that counter == someInternalArray.length at all times, the lack of synchronization may be violating that.
My first thought was to make troveMap's reference volatile, and to re-write the reference every time you add to the map:
synchronized (this) {
    troveMap.put(key, value);
    troveMap = troveMap;
}
That way, you're setting up a memory barrier such that anyone who reads the troveMap will be guaranteed to see everything that had happened to it before its most recent assignment -- that is, its latest state. This solves the memory issues, but it doesn't solve the race conditions.
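If you also want to fix the race, one standard alternative (my sketch, and a plainly-named swap: copy-on-write, shown with a plain HashMap rather than Trove to avoid depending on Trove's API) is to have writers clone, mutate, and publish, so readers only ever see fully-built maps. The O(n) copy per write only pays off because writes here are rare, and it may be too costly for a truly gigantic map:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class AppendOnlyLookup<K, V> {
    private final AtomicReference<Map<K, V>> ref =
            new AtomicReference<Map<K, V>>(new HashMap<K, V>());

    boolean containsSpecial(K key) {
        return ref.get().containsKey(key); // lock-free read of an immutable snapshot
    }

    synchronized void addSpecialKeyValue(K key, V value) {
        Map<K, V> copy = new HashMap<K, V>(ref.get()); // clone the current snapshot
        copy.put(key, value);
        ref.set(copy);                                 // atomic publication
    }
}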
Depending on how quickly your data changes, maybe a Bloom filter could help? Or some other structure that's more optimized for certain fast paths?
Under the conditions you describe, it's easy to imagine a map implementation for which you can get false negatives by failing to synchronize. The only way I can imagine obtaining false positives is an implementation in which key insertions are non-atomic and a partial key insertion happens to look like another key you are testing for.
You don't say what kind of map you have implemented, but the stock map implementations store keys by assigning references. According to the Java Language Specification:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.
If your map implementation uses object references as keys, then I don't see how you can get in trouble.
EDIT
The above was written in ignorance of Trove itself. After a little research, I found the following post by Rob Eden (one of the developers of Trove) on whether Trove maps are concurrent:
Trove does not modify the internal structure on retrievals. However, this is an implementation detail not a guarantee so I can't say that it won't change in future versions.
So it seems like this approach will work for now but may not be safe at all in a future version. It may be best to use one of Trove's synchronized map classes, despite the penalty.
I think you would be better off with a ConcurrentHashMap, which doesn't need explicit locking and allows concurrent reads:
boolean containsSpecial() {
    return troveMap.containsKey(key);
}

void addSpecialKeyValue(key, value) {
    troveMap.putIfAbsent(key, value);
}
Another option is using a ReadWriteLock, which allows concurrent reads but no concurrent writes:
ReadWriteLock rwlock = new ReentrantReadWriteLock();

boolean containsSpecial() {
    rwlock.readLock().lock();
    try {
        return troveMap.contains(key);
    } finally {
        rwlock.readLock().unlock();
    }
}

void addSpecialKeyValue(key, value) {
    rwlock.writeLock().lock();
    try {
        //...
        troveMap.put(key, value);
    } finally {
        rwlock.writeLock().unlock();
    }
}
Why reinvent the wheel?
Simply use ConcurrentHashMap.putIfAbsent.
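A minimal usage sketch (mine, not from the answer): putIfAbsent returns the previous mapping when the key was already present, so the caller keeps whichever object won the race; on Java 8+, computeIfAbsent folds that dance into one call:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PutIfAbsentDemo {
    private final ConcurrentMap<String, Long> map = new ConcurrentHashMap<String, Long>();

    Long getOrCreate(String key) {
        Long fresh = Long.valueOf(0L);
        Long prev = map.putIfAbsent(key, fresh); // null means our value was inserted
        return (prev != null) ? prev : fresh;
        // Java 8+ equivalent: return map.computeIfAbsent(key, k -> 0L);
    }
}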
This question has been discussed in two blog posts (http://dow.ngra.de/2008/10/27/when-systemcurrenttimemillis-is-too-slow/, http://dow.ngra.de/2008/10/28/what-do-we-really-know-about-non-blocking-concurrency-in-java/), but I haven't heard a definitive answer yet. If we have one thread that does this:
public class HeartBeatThread extends Thread {
    public static int counter = 0;
    public static volatile int cacheFlush = 0;

    public HeartBeatThread() {
        setDaemon(true);
    }

    static {
        new HeartBeatThread().start();
    }

    public void run() {
        while (true) {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            counter++;
            cacheFlush++;
        }
    }
}
And many clients that run the following:
if (counter == HeartBeatThread.counter) return;
counter = HeartBeatThread.cacheFlush;
is it thread-safe or not?
Within the Java memory model? No, you are not OK.
I've seen a number of attempts to head towards a very 'soft flush' approach like this, but without an explicit fence, you're definitely playing with fire.
The 'happens before' semantics in
http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7
start referring to purely inter-thread actions as 'actions' at the end of 17.4.2. This drives a lot of confusion, since prior to that point the spec distinguishes between inter- and intra-thread actions. Consequently, the intra-thread action of manipulating counter isn't explicitly synchronized across the volatile action by the happens-before relationship. You have two threads of reasoning to follow about synchronization: one governs local consistency and is subject to all the nice tricks of alias analysis, etc., to shuffle operations; the other is about global consistency and is only defined for inter-thread operations.
One is the intra-thread logic that says reads and writes within a thread may be consistently reordered, and the other is the inter-thread logic that says things like volatile reads/writes and synchronization starts/ends are appropriately fenced.
The problem is that the visibility of the non-volatile write is undefined, as it is an intra-thread operation and therefore not covered by the specification. The processor it's running on should see it as if you had executed those statements serially, but its sequencing for inter-thread purposes is potentially undefined.
Now, the reality of whether or not this can affect you is another matter entirely.
Running Java on x86 and x86-64 platforms? Technically you're in murky territory, but practically the very strong guarantees x86 places on reads and writes - including the total order on the read/write across the access to cacheFlush and the local ordering on the two writes and the two reads - should enable this code to execute correctly, provided it makes it through the compiler unmolested. That assumes the compiler doesn't step in and use the freedom it is permitted under the standard to reorder operations on you, given the provable lack of aliasing between the two intra-thread operations.
If you move to a memory model with weaker release semantics, like IA-64? Then you're back on your own.
A compiler could in perfectly good faith break this program in java on any platform, however. That it functions right now is an artifact of current implementations of the standard, not of the standard.
As an aside, in the CLR, the runtime model is stronger, and this sort of trick is legal because the individual writes from each thread have ordered visibility, so be careful trying to translate any examples from there.
Well, I don't think it is.
The first if-statement:
if (counter == HeartBeatThread.counter)
    return;
does not access any volatile field and is not synchronized. So you might read stale data forever and never get to the point of accessing the volatile field.
Quoting from one of the comments in the second blog entry: "anything that was visible to thread A when it writes to volatile field f becomes visible to thread B when it reads f."
But in your case B (the client) only reads f (= cacheFlush) after it has already checked counter, and may return before ever reading it. So changes to HeartBeatThread.counter do not have to become visible to the client.
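A minimal fix sketch building on that quote (my illustration, not from the blog posts): if the client reads the volatile field first, the happens-before edge from the writer's cacheFlush++ guarantees that its earlier counter++ is visible too:

class Client {
    private int counter;

    void poll() {
        int flush = HeartBeatThread.cacheFlush;         // volatile read FIRST
        if (counter == HeartBeatThread.counter) return; // now sees a counter at least as fresh as flush
        counter = flush;
    }
}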