Why is ConcurrentHashMap.putIfAbsent safe?

I have been reading about concurrency since yesterday and I don't know much yet, but some things are starting to become clear.
I understand why double-checked locking isn't safe (I wonder how probable the rare failure condition is), and that volatile fixes the issue in Java 1.5+.
But I wonder whether the same problem occurs with putIfAbsent, for example:
myObj = new myObject("CodeMonkey");
cHashM.putIfAbsent("keyy",myObj);
Does this ensure that myObj is 100% initialized when another thread does a cHashM.get()? The other thread could obtain a reference to an object that isn't completely initialized (the double-checked locking problem).

If you invoke concurrentHashMap.get(key) and it returns an object, that object is guaranteed to be fully initialized. Each put (or putIfAbsent) will obtain a bucket-specific lock and will append the element to the bucket's entries.
Now you may go through the code and notice that the get method doesn't obtain this same lock. So you could argue that there can be an out-of-date read, but that isn't true either. The reason is that the value within the entry itself is volatile, so you will be sure to get the most up-to-date read.

The putIfAbsent method in ConcurrentHashMap is a check-if-absent-then-set method, and it is an atomic operation. But to answer the following part: "Does this ensure that myObj is 100% initialized when another thread does a cHashM.get()?", it depends on when the object is put into the map. There is a happens-before ordering: if the caller calls get before the object is placed in the map, then null is returned; otherwise the value is returned.
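The check-then-set behaviour described above shows up in the return value of putIfAbsent. A minimal sketch, with the class name PutIfAbsentDemo and the second value "Other" made up for illustration:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; only ConcurrentHashMap.putIfAbsent/get are real API.
class PutIfAbsentDemo {
    // Returns { result of 1st call, result of 2nd call, value finally stored }.
    static String[] run() {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        // Key absent: the value is stored and null is returned.
        String first = map.putIfAbsent("keyy", "CodeMonkey");
        // Key present: the map is left unchanged and the existing value is returned.
        String second = map.putIfAbsent("keyy", "Other");
        return new String[] { String.valueOf(first), second, map.get("keyy") };
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // null CodeMonkey CodeMonkey
    }
}
```

The check and the store happen as one atomic step, so two threads racing on the same key cannot both "win".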

The relevant part of the documentation is this:
Memory consistency effects: As with
other concurrent collections, actions
in a thread prior to placing an object
into a ConcurrentMap as a key or value
happen-before actions subsequent to
the access or removal of that object
from the ConcurrentMap in another
thread.
-- java.util.concurrent.ConcurrentMap
So, yes you have your happens-before relationship.
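To make the quoted guarantee concrete, here is a small sketch in which one thread publishes an object with plain (non-final, non-volatile) fields through putIfAbsent and another thread reads it back. The class names SafePublicationDemo and MyObject are assumptions for illustration. A passing run does not prove the guarantee; it only shows the pattern the documentation says is safe.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo; the guarantee itself comes from the ConcurrentMap documentation.
class SafePublicationDemo {
    // Deliberately plain fields: no final, no volatile.
    static final class MyObject {
        String name;
        int length;
        MyObject(String name) {
            this.name = name;
            this.length = name.length();
        }
    }

    // Returns true if the reader observed a fully initialized object.
    static boolean run() {
        ConcurrentMap<String, MyObject> map = new ConcurrentHashMap<>();
        boolean[] ok = new boolean[1];

        Thread reader = new Thread(() -> {
            MyObject seen;
            // Spin until the writer's putIfAbsent becomes visible.
            while ((seen = map.get("keyy")) == null) {
                Thread.yield();
            }
            // Everything the writer did before putIfAbsent happens-before this point.
            ok[0] = "CodeMonkey".equals(seen.name) && seen.length == 10;
        });
        reader.start();

        map.putIfAbsent("keyy", new MyObject("CodeMonkey"));

        try {
            reader.join(); // join makes ok[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok[0];
    }
}
```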

I'm not an expert on this, but looking at the implementation of Segment in ConcurrentHashMap I see that the volatile field count appears to be used to ensure proper visibility between threads. All read operations have to read the count field and all write operations have to write to it. From comments in the class:
Read operations can thus proceed without locking, but rely
on selected uses of volatiles to ensure that completed
write operations performed by other threads are
noticed. For most purposes, the "count" field, tracking the
number of elements, serves as that volatile variable
ensuring visibility. This is convenient because this field
needs to be read in many read operations anyway:
- All (unsynchronized) read operations must first read the
"count" field, and should not look at table entries if
it is 0.
- All (synchronized) write operations should write to
the "count" field after structurally changing any bin.
The operations must not take any action that could even
momentarily cause a concurrent read operation to see
inconsistent data. This is made easier by the nature of
the read operations in Map. For example, no operation
can reveal that the table has grown but the threshold
has not yet been updated, so there are no atomicity
requirements for this with respect to reads.

Related

ConcurrentHashMap and its operations

Suppose there is a ConcurrentHashMap and there are two threads.
If both threads are reading some data from the same bucket, then my understanding says that both can read that bucket concurrently, as CHM does not block reading operations.
But suppose one thread is writing (put) to a bucket. Then, can a second thread simultaneously read (get) from the same bucket or will the second thread have to wait for the put operation to complete?
If it were a Hashtable, then get would have to wait until the put operation completed. But how does CHM behave in this case?

There is no need for speculation. The source code for ConcurrentHashMap is open, and anyone can read it. (This is JDK 8 build 128, the first JDK 8 release candidate.)
You should have no trouble understanding it, as it's only 6,300 lines long. :-) Actually, a good fraction of this is comments, and most of the code goes toward handling edge cases. The straightforward paths of get() and put() aren't terribly complicated and are only a few dozen lines of code.
Your understanding of read operations (get(), contains()) is correct; there is no blocking. Hashing to a bucket and searching within the bucket, if necessary, is straightforward, with no locking. Memory visibility is ensured by volatile reads. (At lines 622-623, the val and next fields of Node are volatile.) Read operations proceed concurrently with other reads and also with writes to the same bucket.
The policy for removing and replacing values is fairly straightforward in that the head of the bucket is locked while the bucket is being searched and modified. See the synchronized block at line 1117 of replaceNode. A put that adds to an existing bucket is similar; see the synchronized block at line 1027 of putVal. These operations will of course block other threads attempting to remove, replace, or add entries to this same bucket. If a value is in the midst of being replaced, a thread that is getting the value for this key will see either the old value or the new value, depending on whether the reading thread finds the node before or after the value is replaced by the writing thread.
There is a special case for putting the first element into a bucket. At lines 1018-1020, if putVal finds a bucket empty, it will create a new Node and CAS (compare-and-swap) it into place. If this succeeds, the operation is complete. If two threads are attempting to add nodes into the same bucket more-or-less simultaneously, the CAS for the first will succeed, and the CAS for the second will fail. But note that this code is within a for-loop (line 1014). The thread whose CAS has failed simply goes around the loop and retries. In fact, all the other write operations are within a loop. The general approach is that operations proceed optimistically but are checked for concurrent writers. If the optimistic attempt fails, the operation is retried and goes through a (possibly) different path based on the now updated state.
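The optimistic "CAS into an empty bucket, otherwise lock the head, and retry on failure" pattern can be sketched with a single bin. This is a simplified illustration and not the JDK's code: the class SingleBinSketch, its Node type and the run helper are all hypothetical, and resizing, removal and hashing are left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical one-bucket sketch of the strategy described above.
class SingleBinSketch<K, V> {
    static final class Node<K, V> {
        final K key;
        volatile V val;           // volatile so lock-free readers see updates
        volatile Node<K, V> next;
        Node(K key, V val) { this.key = key; this.val = val; }
    }

    private final AtomicReference<Node<K, V>> head = new AtomicReference<>();

    V put(K key, V value) {
        for (;;) { // retry loop: a failed CAS simply goes around again
            Node<K, V> h = head.get();
            if (h == null) {
                // Empty bin: try to install the first node without locking.
                if (head.compareAndSet(null, new Node<>(key, value))) {
                    return null;
                }
                // Another thread won the race; retry and take the locked path.
            } else {
                synchronized (h) { // writers to a non-empty bin lock its head
                    for (Node<K, V> e = h; ; e = e.next) {
                        if (e.key.equals(key)) {
                            V old = e.val;
                            e.val = value;
                            return old;
                        }
                        if (e.next == null) {
                            e.next = new Node<>(key, value);
                            return null;
                        }
                    }
                }
            }
        }
    }

    // Readers never lock; they rely on the volatile reads of head, next and val.
    V get(K key) {
        for (Node<K, V> e = head.get(); e != null; e = e.next) {
            if (e.key.equals(key)) return e.val;
        }
        return null;
    }

    // Inserts threads * perThread distinct keys concurrently, returns how many are found.
    static int run(int threads, int perThread) {
        SingleBinSketch<String, Integer> bin = new SingleBinSketch<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) bin.put(id + "-" + i, i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        int found = 0;
        for (int t = 0; t < threads; t++) {
            for (int i = 0; i < perThread; i++) {
                if (Integer.valueOf(i).equals(bin.get(t + "-" + i))) found++;
            }
        }
        return found;
    }
}
```

In this sketch the head node is never replaced once installed, which is why locking it is enough; the real class also has to cope with removals and table resizing.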

Hi, as per my knowledge ConcurrentHashMap allows multiple readers to read concurrently without any blocking. This is achieved by partitioning the map into different parts based on the concurrency level and locking only a portion of the map during updates (this describes the segment-based implementation used before Java 8). The default concurrency level is 16, so the map is divided into 16 parts and each part is governed by a different lock. This means 16 threads can operate on the map simultaneously, as long as they are operating on different parts of it. This makes ConcurrentHashMap high-performance while keeping thread safety intact. It comes with a caveat, though: since retrievals do not lock against update operations like put(), remove(), putAll() or clear(), a concurrent retrieval may not reflect the most recent change to the map.
I hope this will help..

This is from the JavaDocs of ConcurrentHashMap class:
"Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset"

In Hashtable, concurrent operations lock the whole collection, but in ConcurrentHashMap only one bucket is locked.

From the doc:
A hash table supporting full concurrency of retrievals and adjustable
expected concurrency for updates. This class obeys the same functional
specification as Hashtable, and includes versions of methods
corresponding to each method of Hashtable. However, even though all
operations are thread-safe, retrieval operations do not entail
locking, and there is not any support for locking the entire table in
a way that prevents all access. This class is fully interoperable with
Hashtable in programs that rely on its thread safety but not on its
synchronization details.
Retrieval operations (including get) generally do not block, so may
overlap with update operations (including put and remove). Retrievals
reflect the results of the most recently completed update operations
holding upon their onset. For aggregate operations such as putAll and
clear, concurrent retrievals may reflect insertion or removal of only
some entries. Similarly, Iterators and Enumerations return elements
reflecting the state of the hash table at some point at or since the
creation of the iterator/enumeration. They do not throw
ConcurrentModificationException. However, iterators are designed to be
used by only one thread at a time.
So you shouldn't expect operations to synchronize exactly as in a Hashtable, but the same (series of) operations are thread-safe. The sentence "Retrievals reflect the results of the most recently completed update operations holding upon their onset" does not strictly imply, but in my opinion strongly suggests, what is going on here: a put in progress, i.e. not finished, will not block a get; the get will simply not see the changes yet.
Although I have not worked myself through the whole CHM class, this piece of documentation supports my hypothesis (taken from OpenJDK 6)
static final class Segment<K,V> extends ReentrantLock implements Serializable {
/*
* Segments maintain a table of entry lists that are always
* kept in a consistent state, so can be read (via volatile
* reads of segments and tables) without locking. This
* requires replicating nodes when necessary during table
* resizing, so the old lists can be traversed by readers
* still using old version of table.
When an update is "complete" doesn't seem to be explicitly defined; generally as soon as the new bucket is linked into the list of buckets, I guess. CHM also makes heavy use of volatile fields to ensure that threads read the most recent buckets in the list.
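The documented iterator behaviour (no ConcurrentModificationException, elements reflecting some state at or since the iterator's creation) can be observed even in a single thread. A minimal sketch, with the class name WeaklyConsistentIterationDemo and the keys chosen for illustration:

```java
import java.util.ConcurrentModificationException;
import java.util.Map;

// Hypothetical demo comparing fail-fast and weakly consistent iterators.
class WeaklyConsistentIterationDemo {
    // Inserts a new key while iterating over the key set.
    // Returns true if the loop completed, false if the iterator failed fast.
    static boolean iterateWhileModifying(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        try {
            for (String key : map.keySet()) {
                // Structural change on the first pass, plain overwrite afterwards.
                map.put("c", 3);
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }
}
```

Passing a java.util.HashMap makes the loop fail fast, while passing a ConcurrentHashMap lets it finish; whether the iterator then also visits "c" is left unspecified by the documentation.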

How do I get the latest view of a ConcurrentHashMap?

You can ensure that changes one thread makes to a variable can be seen on other threads by making the variable volatile, or by having both threads synchronize on something. If the thing being changed is a java.util.ConcurrentHashMap, does it make sense to create a memory barrier by declaring the type of the variable holding this map as volatile, or are readers accessing the map (say via myMap.values()) going to get the latest possible view anyway? For context I have a heavy reading, light writing scenario where I am switching my lock free read solution to a ConcurrentHashMap.

ConcurrentHashMap guarantees that there is a happens-before relationship between writes and subsequent reads. So yes, when you read (get), you will see the most recent changes that have been "committed" (put has returned).
Note: this does not apply to iterators as explained in the javadoc.

The variable "holding" the map is a reference or pointer to the map object (respectively (simplified) to the memory address where the map is stored). Making it volatile would only affect the pointer, not the map object itself. As long as you always use the same Map-Object and ensure that the map is fully initialized before the threads use it, you don't have to use "volatile references" to it. The concurrency is handled transparently inside the concurrent hash map.

Yes, ConcurrentHashMap gives the latest view. If you refer to the Javadoc at http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html#get(java.lang.Object)
it is clearly written that
Retrievals reflect the results of the most recently completed update
operations holding upon their onset
It has some more details and I would suggest you go and read it.
Besides, as already noted using volatile is not what you want as it will only affect the pointer and not the actual contents of the map.

All you need to do is make sure that the reference holding the map is final, so you get a final field fence that guarantees you see a properly initialised map and that the reference itself is not changed.
As others point out, ConcurrentHashMap will guarantee visibility/happens-before of writes internally, as all of the java.util.concurrent.* collections do. You should however use the conditional writes exposed on the ConcurrentMap interface to avoid data-races in your writes.
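A sketch of that advice, assuming a hypothetical HitCounter class: the map reference is final, and the update uses merge (an atomic read-modify-write on ConcurrentHashMap since Java 8) instead of a separate get followed by put, which could lose updates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical example class; names are made up for illustration.
class HitCounter {
    // final: every thread that sees the HitCounter sees a fully constructed map.
    private final ConcurrentMap<String, Integer> hits = new ConcurrentHashMap<>();

    void record(String page) {
        // Atomic per key: inserts 1 if absent, otherwise stores old + 1.
        hits.merge(page, 1, Integer::sum);
    }

    int count(String page) {
        return hits.getOrDefault(page, 0);
    }

    // Records threads * perThread hits on one key from several threads.
    static int run(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) counter.record("home");
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter.count("home");
    }
}
```

On Java versions before 8, the same effect needs a retry loop around putIfAbsent and replace(key, oldValue, newValue).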

In Java can I depend on reference assignment being atomic to implement copy on write?

If I have an unsynchronized java collection in a multithreaded environment, and I don't want to force readers of the collection to synchronize[1], is a solution where I synchronize the writers and use the atomicity of reference assignment feasible? Something like:
private Collection global = new HashSet(); // start threading after this
void allUpdatesGoThroughHere(Object exampleOperand) {
    // My hypothesis is that this prevents operations in the block being re-ordered
    synchronized(global) {
        Collection copy = new HashSet(global);
        copy.remove(exampleOperand);
        // Given my hypothesis, we should have a fully constructed object here. So a
        // reader will either get the old or the new Collection, but never an
        // inconsistent one.
        global = copy;
    }
}
// Do multithreaded reads here. All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
Rolling your own solution seems to often fail in these type of situations, so I'd be interested in knowing other patterns, collections or libraries I could use to prevent object creation and blocking for my data consumers.
[1] The reasons being a large proportion of time spent in reads compared to writes, combined with the risk of introducing deadlocks.
Edit: A lot of good information in several of the answers and comments, some important points:
A bug was present in the code I posted. Synchronizing on global (a badly named variable) can fail to protect the synchronized block after a swap.
You could fix this by synchronizing on the enclosing instance (moving the synchronized keyword to the method), but there may be other bugs. A safer and more maintainable solution is to use something from java.util.concurrent.
There is no "eventual consistency guarantee" in the code I posted, one way to make sure that readers do get to see the updates by writers is to use the volatile keyword.
On reflection the general problem that motivated this question was trying to implement lock free reads with locked writes in java, however my (solved) problem was with a collection, which may be unnecessarily confusing for future readers. So in case it is not obvious the code I posted works by allowing one writer at a time to perform edits to "some object" that is being read unprotected by multiple reader threads. Commits of the edit are done through an atomic operation so readers can only get the pre-edit or post-edit "object". When/if the reader thread gets the update, it cannot occur in the middle of a read as the read is occurring on the old copy of the "object". A simple solution that had probably been discovered and proved to be broken in some way prior to the availability of better concurrency support in java.

Rather than trying to roll out your own solution, why not use a ConcurrentHashMap as your set and just set all the values to some standard value? (A constant like Boolean.TRUE would work well.)
I think this implementation works well with the many-readers-few-writers scenario. There's even a constructor that lets you set the expected "concurrency level".
Update: Veer has suggested using the Collections.newSetFromMap utility method to turn the ConcurrentHashMap into a Set. Since the method takes a Map<E,Boolean> my guess is that it does the same thing with setting all the values to Boolean.TRUE behind-the-scenes.
Update: Addressing the poster's example
That is probably what I will end up going with, but I am still curious about how my minimalist solution could fail. – MilesHampson
Your minimalist solution would work just fine with a bit of tweaking. My worry is that, although it's minimal now, it might get more complicated in the future. It's hard to remember all of the conditions you assume when making something thread-safe—especially if you're coming back to the code weeks/months/years later to make a seemingly insignificant tweak. If the ConcurrentHashMap does everything you need with sufficient performance then why not use that instead? All the nasty concurrency details are encapsulated away and even 6-months-from-now you will have a hard time messing it up!
You do need at least one tweak before your current solution will work. As has already been pointed out, you should probably add the volatile modifier to global's declaration. I don't know if you have a C/C++ background, but I was very surprised when I learned that the semantics of volatile in Java are actually much more complicated than in C. If you're planning on doing a lot of concurrent programming in Java then it'd be a good idea to familiarize yourself with the basics of the Java memory model. If you don't make the reference to global a volatile reference then it's possible that no thread will ever see any changes to the value of global until they try to update it, at which point entering the synchronized block will flush the local cache and get the updated reference value.
However, even with the addition of volatile there's still a huge problem. Here's a problem scenario with two threads:
We begin with the empty set, or global={}. Threads A and B both have this value in their thread-local cached memory.
Thread A obtains the synchronized lock on global and starts the update by making a copy of global and adding the new key to the set.
While Thread A is still inside the synchronized block, Thread B reads its local value of global onto the stack and tries to enter the synchronized block. Since Thread A is currently inside the monitor Thread B blocks.
Thread A completes the update by setting the reference and exiting the monitor, resulting in global={1}.
Thread B is now able to enter the monitor and makes a copy of the global={1} set.
Thread A decides to make another update, reads in its local global reference and tries to enter the synchronized block. Since Thread B currently holds the lock on {} there is no lock on {1} and Thread A successfully enters the monitor!
Thread A also makes a copy of {1} for purposes of updating.
Now Threads A and B are both inside the synchronized block and they have identical copies of the global={1} set. This means that one of their updates will be lost! This situation is caused by the fact that you're synchronizing on an object stored in a reference that you're updating inside your synchronized block. You should always be very careful which objects you use to synchronize. You can fix this problem by adding a new variable to act as the lock:
private volatile Collection global = new HashSet(); // start threading after this
private final Object globalLock = new Object(); // final reference used for synchronization
void allUpdatesGoThroughHere(Object exampleOperand) {
    // My hypothesis is that this prevents operations in the block being re-ordered
    synchronized(globalLock) {
        Collection copy = new HashSet(global);
        copy.remove(exampleOperand);
        // Given my hypothesis, we should have a fully constructed object here. So a
        // reader will either get the old or the new Collection, but never an
        // inconsistent one.
        global = copy;
    }
}
This bug was insidious enough that none of the other answers have addressed it yet. It's these kinds of crazy concurrency details that cause me to recommend using something from the already-debugged java.util.concurrent library rather than trying to put something together yourself. I think the above solution would work—but how easy would it be to screw it up again? This would be so much easier:
private final Set<Object> global = Collections.newSetFromMap(new ConcurrentHashMap<Object,Boolean>());
Since the reference is final you don't need to worry about threads using stale references, and since the ConcurrentHashMap handles all the nasty memory model issues internally you don't have to worry about all the nasty details of monitors and memory barriers!

According to the relevant Java Tutorial,
We have already seen that an increment expression, such as c++, does not describe an atomic action. Even very simple expressions can define complex actions that can decompose into other actions. However, there are actions you can specify that are atomic:
Reads and writes are atomic for reference variables and for most primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long and double variables).
This is reaffirmed by Section §17.7 of the Java Language Specification
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
It appears that you can indeed rely on reference access being atomic; however, recognize that this does not ensure that all readers will read an updated value for global after this write -- i.e. there is no memory ordering guarantee here.
If you use an implicit lock via synchronized on all access to global, then you can forge some memory consistency here... but it might be better to use an alternative approach.
You also appear to want the collection in global to remain immutable... luckily, there is Collections.unmodifiableSet which you can use to enforce this. As an example, you should likely do something like the following...
private volatile Collection global = Collections.unmodifiableSet(new HashSet());
... that, or using AtomicReference,
private AtomicReference<Collection> global = new AtomicReference<>(Collections.unmodifiableSet(new HashSet()));
You would then use Collections.unmodifiableSet for your modified copies as well.
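A minimal sketch of the AtomicReference variant, assuming a hypothetical CopyOnWriteHolder class with String elements. Each update builds a modified copy, wraps it with Collections.unmodifiableSet, and swaps it in with updateAndGet, which retries the function if another writer got in first.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder class; the element type String is an assumption.
class CopyOnWriteHolder {
    private final AtomicReference<Set<String>> global =
            new AtomicReference<>(Collections.unmodifiableSet(new HashSet<String>()));

    void add(String element) {
        // The function may run more than once under contention, so it must
        // not have side effects beyond building the new copy.
        global.updateAndGet(current -> {
            Set<String> copy = new HashSet<>(current);
            copy.add(element);
            return Collections.unmodifiableSet(copy);
        });
    }

    void remove(String element) {
        global.updateAndGet(current -> {
            Set<String> copy = new HashSet<>(current);
            copy.remove(element);
            return Collections.unmodifiableSet(copy);
        });
    }

    // Readers get an immutable snapshot; later updates never change it.
    Set<String> snapshot() {
        return global.get();
    }
}
```

Unlike the synchronized version, writers here do not block each other; a writer that loses the race simply redoes its copy.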
// ... All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
You should know that making a copy here is redundant, as internally for (Object elm : global) creates an Iterator as follows...
final Iterator it = global.iterator();
while (it.hasNext()) {
    Object elm = it.next();
}
There is therefore no chance of switching to an entirely different value for global in the midst of reading.
All that aside, I agree with the sentiment expressed by DaoWen... is there any reason you're rolling your own data structure here when there may be an alternative available in java.util.concurrent? I figured maybe you're dealing with an older Java, since you use raw types, but it won't hurt to ask.
You can find copy-on-write collection semantics provided by CopyOnWriteArrayList, or its cousin CopyOnWriteArraySet (which implements a Set using the former).
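The snapshot behaviour of these copy-on-write collections can be seen in a short sketch (the class name SnapshotIterationDemo is made up): an iterator keeps reading the array that existed when it was created, regardless of later writes.

```java
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical demo of snapshot iteration.
class SnapshotIterationDemo {
    // Returns { elements seen by the old iterator, elements in the set afterwards }.
    static String[] run() {
        CopyOnWriteArraySet<String> set = new CopyOnWriteArraySet<>();
        set.add("a");
        set.add("b");

        Iterator<String> it = set.iterator(); // snapshot of [a, b]

        // Writes after the iterator was created copy the backing array.
        set.add("c");
        set.remove("a");

        StringBuilder seen = new StringBuilder();
        while (it.hasNext()) seen.append(it.next());

        StringBuilder current = new StringBuilder();
        for (String s : set) current.append(s);

        return new String[] { seen.toString(), current.toString() };
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println(r[0] + " " + r[1]); // ab bc
    }
}
```

Every write copies the whole backing array, so these classes fit the mostly-read, rarely-written case from the question.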
Also, as DaoWen suggested, have you considered using a ConcurrentHashMap? Its iterators are safe to use while the map is being updated, so a for loop like the one in your example will neither fail nor see a corrupted structure:
Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.
Internally, an Iterator is used for enhanced for over an Iterable.
You can craft a Set from this by utilizing Collections.newSetFromMap as follows:
final Set<E> safeSet = Collections.newSetFromMap(new ConcurrentHashMap<E, Boolean>());
...
/* guaranteed to reflect the state of the set at read-time */
for (final E elem : safeSet) {
...
}

I think your original idea was sound, and DaoWen did a good job getting the bugs out. Unless you can find something that does everything for you, it's better to understand these things than hope some magical class will do it for you. Magical classes can make your life easier and reduce the number of mistakes, but you do want to understand what they are doing.
ConcurrentSkipListSet might do a better job for you here. It could get rid of all your multithreading problems.
However, it is usually slower than a HashSet (though HashSets and skip lists/trees are hard to compare). If you are doing a lot of reads for every write, what you've got will be faster. More importantly, if you update more than one entry at a time, your reads could see inconsistent results. If you expect that whenever there is an entry A there is an entry B, and vice versa, the skip list could give you one without the other.
With your current solution, to the readers, the contents of the map are always internally consistent. A read can be sure there's an A for every B. It can be sure that the size() method gives the precise number of elements that will be returned by the iterator. Two iterations will return the same elements in the same order.
In other words, allUpdatesGoThroughHere and ConcurrentSkipListSet are two good solutions to two different problems.

Can you use the Collections.synchronizedSet method? From HashSet Javadoc http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html
Set s = Collections.synchronizedSet(new HashSet(...));
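One caveat with this approach, taken from the Collections.synchronizedSet documentation: individual calls are synchronized by the wrapper, but iteration must be synchronized manually on the returned set. A minimal sketch, with the class name SynchronizedSetDemo and the String element type assumed for illustration:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper class showing where manual locking is still needed.
class SynchronizedSetDemo {
    private final Set<String> set = Collections.synchronizedSet(new HashSet<String>());

    void add(String s) {
        set.add(s); // single call: the wrapper locks internally
    }

    int totalLength() {
        int total = 0;
        // Iteration is a sequence of calls, so the caller must hold the
        // set's own lock for the whole loop.
        synchronized (set) {
            for (String s : set) {
                total += s.length();
            }
        }
        return total;
    }
}
```

This means readers block each other and block writers while iterating, which is the cost the question was trying to avoid.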

Replace the synchronized by making global volatile and you'll be alright as far as the copy-on-write goes.
Although the assignment is atomic, in other threads it is not ordered with the writes to the object referenced. There needs to be a happens-before relationship which you get with a volatile or synchronising both reads and writes.
The problem of multiple updates happening at once is separate - use a single thread or whatever you want to do there.
If you used a synchronized for both reads and writes then it'd be correct but the performance may not be great with reads needing to hand-off. A ReadWriteLock may be appropriate, but you'd still have writes blocking reads.
Another approach to the publication issue is to use final field semantics to create an object that is (in theory) safe to be published unsafely.
Of course, there are also concurrent collections available.

Trying to understand the scope of the ConcurrentHashMap

ConcurrentHashMap provides thread safety, but the docs state:
"However, even though all operations are thread-safe, retrieval operations do not entail locking"
So from this I understand that getting or setting a key and value is thread-safe, but modifying the actual VALUE stored under a given key isn't (by value I actually mean the state of that object).
I'm just confused about how this works; at the moment I think things work like this:
The ConcurrentHashMap only guarantees that setting and getting keys is thread-safe. But the object you put inside the map has to guard against concurrent access by itself.
Is this correct?

But the object you put inside the map has to guard against concurrent access by itself.
Your understanding is correct.
From the documentation:
However, even though all operations are thread-safe, retrieval operations do not entail locking, and there is not any support for locking the entire table in a way that prevents all access.
What the above is also saying is that there is no built-in mechanism for automatic locking of the hash map while the reading takes place. In particular, this means that get() operations can overlap with concurrent modifications performed by other threads.
The document goes on to explain the concurrency semantics:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries. Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.

What you say is true by default: there is no way for the map to enforce the thread safety of either its keys or its values, since these are objects that come from outside. What you read about the retrieval of objects, however, has nothing to do with that fact. The map doesn't block while retrieving a value, so another update may be happening at the same time (these operations can overlap).
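One way to act on this is to store values that are themselves thread-safe. A minimal sketch, assuming a hypothetical ValueSafetyDemo class that keeps AtomicInteger values: the map makes the lookup and insertion safe, and the AtomicInteger makes the mutation of the value safe.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo; key name and thread counts are arbitrary.
class ValueSafetyDemo {
    // Increments one shared counter value from several threads.
    static int run(int threads, int perThread) {
        ConcurrentHashMap<String, AtomicInteger> map = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // computeIfAbsent creates the value at most once per key;
                    // incrementAndGet is the value guarding its own state.
                    map.computeIfAbsent("k", key -> new AtomicInteger()).incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return map.get("k").get();
    }
}
```

Had the value been a plain mutable object such as an int[] or an ArrayList, the map operations would still be safe but the concurrent modifications of the value would not.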

The basic idea of the ConcurrentHashMap is that only modifications use a lock, while retrieval-only operations don't. This is possible because the entire data structure and the operations on it are defined in a way that allows get() to only ever see a "consistent enough" state of the map to do its work. If there's currently an insert operation in progress, then get() either sees the result or it doesn't, but it won't ever see a partial result or even temporarily invalid data.

Is this thread-safe?

I want to make my class thread-safe without large overhead.
The instances will be seldom used concurrently, but it may happen.
Most of the class is immutable, there's only one mutable member used as a cache:
private volatile SoftReference<Map<String, Something>> cache
= new SoftReference(null);
which gets assigned in the constructor (not shared) like
Map<String, Something> tmp = new HashMap<String, Something>();
tmp.put("a", new Something("a");
tmp.put("b", new Something("b");
cache = new SoftReference(tmp);
After the assignment, the map never gets modified.
It's no problem when two threads compute the cache in parallel, since the value will be the same.
The additional overhead of the work done twice is acceptable.
If a thread didn't see the value computed by another thread, it would compute it unnecessarily, and this is acceptable.
This shouldn't happen because of volatile.
When a thread sees the value computed by another thread, it's fine.
The only possible problem would be a thread seeing inconsistent state (e.g. a partly filled map).
Can this happen?
Notes:
I really want the whole map being softly referenced, there's no use for a map using soft keys or values here.
I know about ConcurrentHashMap and will maybe use it anyway, but I'm curious, if using volatile only works.

The only possible problem would be a
thread seeing inconsistent state (e.g.
a partly filled map). Can this happen?
No. Actions performed within a thread must appear to execute in program order. A write to a volatile variable happens-before every subsequent read of that variable. Hence, the initialization of the map happens-before any thread's read of the map reference from the field.
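The pattern this answer relies on can be sketched as follows. The class CachedLookup, the getCache method and the String values are assumptions standing in for the question's Something type: the map is fully populated first, and only then published by the volatile write.

```java
import java.lang.ref.SoftReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; String values replace the question's Something type.
class CachedLookup {
    private volatile SoftReference<Map<String, String>> cache =
            new SoftReference<Map<String, String>>(null);

    Map<String, String> getCache() {
        // One volatile read of the field, then dereference the soft reference.
        Map<String, String> map = cache.get();
        if (map == null) {
            // Build the map completely before anyone else can see it.
            Map<String, String> tmp = new HashMap<>();
            tmp.put("a", "value-a");
            tmp.put("b", "value-b");
            map = Collections.unmodifiableMap(tmp);
            // Volatile write: everything above happens-before any later read
            // of this field, so no thread can observe a partly filled map.
            cache = new SoftReference<>(map);
        }
        return map; // strong reference keeps the map alive for the caller
    }
}
```

Two threads may both find the cache empty and both rebuild it, which the question already accepts as harmless duplicate work.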

The problem with using a soft reference is that you can lose the whole map/cache after a GC. This means the performance of your application can be hit very hard. You are better off using a cache with an eviction policy so that you never have this problem.
The volatile doesn't make any operation safe here.
You haven't shown all your code, perhaps we could offer some suggestion on how you could improve your code e.g. your sample code should compile ;)
