Java concurrent hashMap retrieval

Java concurrent hashMap retrieval - java

ConcurrentHashMap documentation says:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. (More formally, an update operation for a given key bears a happens-before relation with any (non-null) retrieval for that key reporting the updated value.)
But I am not able to understand how retrieval operation do not block with update/remove operation for same key ?
"an update operation for a given key bears a happens-before relation with any (non-null) retrieval for that key"
that sort of indicate that concurrent hash map has some way create sequence for read/update/remove operations on same key. But that mean read operations are blocked by update/remove operations.
Not sure what I am missing here.

But I am not able to understand how retrieval operation do not block with update/remove operation for same key ?
First of all, "how" it does it is an implementation detail. If you really want / need to understand "how", you need to look at the source code. It is available from various places.
But what this is actually saying is that:
get doesn't block, and
what you see when you do a get is the value from the most recently completed put operation for that key.
So if you do a get while a put is in progress, you may see the value from the previous put.
Note that the 2nd bullet point explains (implicitly) why it is possible for get to be non-blocking. There is no need for get(42) to wait for a put(42, value) that is currently in progress to finish ... since the get call is allowed to return the previous value.
TL;DR - there is no need to block.
The stuff about "happens before" relationships is relating this to the semantics of the Java Memory Model. This may be important if you are doing a deep analysis of your code. But for a shallow understanding of what ConcurrentHashMap does, you can ignore it.
... that sort of indicate that concurrent hash map has some way create sequence for read/update/remove operations on same key.
It doesn't imply that at all. But to understand what the statement you quoted really means, you need a need a good understanding of the Java Memory Model. And I don't think it is feasible to convey that to you in a StackOverflow Q&A. I recommend you take a few hours to read and understand the JMM ... by reading the JLS. Then, do it again.
This is NOT stuff where an intuitive understanding is sufficient.

Related

How to know if a method is thread safe

Suppose I have a method that checks for a id in the db and if the id doesn't exit then inserts a value with that id. How do I know if this is thread safe and how do I ensure that its thread safe. Are there any general rules that I can use to ensure that it doesn't contain race conditions and is generally thread safe.
public TestEntity save(TestEntity entity) {
if (entity.getId() == null) {
entity.setId(UUID.randomUUID().toString());
}
Map<String, TestEntity > map = dbConnection.getMap(DB_NAME);
map.put(entity.getId(), entity);
return map.get(entity.getId());
}

This is a how long is a piece of string question...
A method will be thread safe if it uses the synchronized keyword in its declaration.
However, even if your setId and getId methods used synchronized keyword, your process of setting the id (if it has not been previously initialized) above is not. .. but even then there is an "it depends" aspect to the question. If it is impossible for two threads to ever get the same object with an uninitialised id then you are thread safe because you would never be attempting to concurrently modifying the id.
It is entirely possible, given the code in your question, that there could be two calls to the thread safe getid at the same time for the same object. One by one they get the return value (null) and immediately get pre-empted to let the other thread run. This means both will then run the thread safe setId method - again one by one.
You could declare the whole save method as synchronized, but if you do that the entire method will be single threaded which defeats the purpose of using threads in the first place. You tend to want to minimize the synchronized code to the bare minimum to maximize concurrency.
You could also put a synchronized block around the critical if statement and minimise the single threaded part of the processing, but then you would also need to be careful if there were other parts of the code that might also set the Id if it wasn't previously initialized.
Another possibility which has various pros and cons is to put the initialization of the Id into the get method and make that method synchronized, or simply assign the Id when the object is created in the constructor.
I hope this helps...
Edit...
The above talks about java language features. A few people mentioned facilities in the java class libraries (e.g. java.util.concurrent) which also provide support for concurrency. So that is a good add on, but there are also whole packages which address the concurrency and other related parallel programming paradigms (e.g. parallelism) in various ways.
To complete the list I would add tools such as Akka and Cats-effect (concurrency) and more.
Not to mention the books and courses devoted to the subject.
I just reread your question and noted that you are asking about databases. Again the answer is it depends. Rdbms' usually let you do this type of operation with record locks usually in a transaction. Some (like teradata) use special clauses such as locking row for write select * from some table where pi_cols = 'somevalues' which locks the rowhash to you until you update it or certain other conditions. This is known as pessimistic locking.
Others (notebly nosql) have optimistic locking. This is when you read the record (like you are implying with getid) there is no opportunity to lock the record. Then you do a conditional update. The conditional update is sort of like this: write the id as x provided that when you try to do so the Id is still null (or whatever the value was when you checked). These types of operations are usually down through an API.
You can also do optimistics locking in an RDBMs as follows:
SQL
Update tbl
Set x = 'some value',
Last_update_timestamp = current_timestamp()
Where x = bull AND last_update_timestamp = 'same value as when I last checked'
In this example the second part of the where clause is the critical bit which basically says "only update the record if no one else did and I trust that everyone else will update the last update to when they do". The "trust" bit can sometimes be replaced by triggers.
These types of database operations (if available) are guaranteed by the database engine to be "thread safe".
Which loops me back to the "how long is a piece of string" observation at the beginning of this answer...

Test-and-set is unsafe
a method that checks for a id in the db and if the id doesn't exit then inserts a value with that id.
Any test-and-set pair of operations on a shared resource is inherently unsafe, vulnerable to a race condition. If the two operations are separate (not atomic), then they must be protected as a pair. While one thread completes the test but has not yet done the set, another thread could sneak in and do both the test and the set. The first thread now completes its set without knowing a duplicate action has occurred.
Providing that necessary protection is too broad a topic for an Answer on Stack Overflow, as others have said here.
UPSERT
However, let me point out that an alternative approach to to make the test-and-set atomic.
In the context of a database, that can be done using the UPSERT feature. Also known as a Merge operation. For example, in Postgres 9.5 and later we have the INSERT INTO … ON CONFLICT command. See this explanation for details.
In the context of a Boolean-style flag, a semaphore makes the test-and-set atomic.

In general, when we say "a method is thread-safe" when there is no race-condition to the internal and external data structure of the object it belongs to. In other words, the order of the method calls are strictly enforced.
For example, let's say you have a HashMap object and two threads, thread_a and thread_b.
thread_a calls put("a", "a") and thread_b calls put("a", "b").
The put method is not thread-safe (refer to its documentation) in the sense that while thread_a is executing its put, thread_b can also go in and execute its own put.
A put contains reading and writing part.
thread_a.read("a")
thread_b.read("a")
thread_b.write("a", "b")
thread_a.write("a", "a")
If above sequence happens, you can say ... a method is not thread-safe.
How to make a method thread-safe is by ensuring the state of the whole object cannot be perturbed while the thread-safe method is executing. An easier way is to put "synchronized" keyword in method declarations.
If you are worried about performance, use manual locking using synchronized blocks with a lock object. Further performance improvement can be achieved using a very well designed semaphores.

When using forEach on a ConcurrentHashMap, are updates reflected or does it behave like a failsafe Iterator?

If I start a forEach action on a ConcurrentHashMap, and other threads are still performing puts on this map, will I see new updates into other bins?
The reason for this is I am trying to find the most effective way to broadcast the contents of a ConcurrentHashMap to listeners, without causing contention to writers of new data to the map. But I want all listeners to receive the same snapshot of the Map when I notify the listeners.

It's not fail safe, but updates are not reflected in way that you think they are; so if you have already seen a bin before you will not get updates from that bin anymore.
If you know how spread works internally you can even cause an OOM(this was an excellent comment from a question I've answered from Holger, but I can't seem to find it right now... )
ConcurrentHashMap<Integer, Integer> chm = new ConcurrentHashMap<>(500_000_000);
chm.put(1, 1);
chm.forEach((key, value) -> chm.put(++value^(value>>>16), value));

The class-level API docs have this to say:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. (More formally, an update operation for a given key bears a happens-before relation with any (non-null) retrieval for that key reporting the updated value.) For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries. Similarly, Iterators, Spliterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.
(Emphasis added.) That does not explicitly address forEach(), but I would expect its behavior to be similar to that achievable via an Iterator over the map's entry set. That is, the forEach() iteration would reflect the map's contents as of some fixed point in time. Therefore, I do not think it at all safe to suppose that modifications to the map by other threads will be seen by the forEach(). I would in fact expect modifications by other threads generally not to be reflected in the forEach()'s behavior, though there is room in the spec to allow it seeing some modifications.

To provide a snapshot of a map you need to copy the map at a given point. If the iterator would simply be a snapshot, it would have to create a copy initially as well. As that costs extra memory and computation it won't do that, and then there are architectural reasons why that might not be desirable anyways.
Any non-null result returned from get(key) and related access methods bears a happens-before relation with the associated insertion or update. The result of any bulk operation reflects the composition of these per-element relations
In these lines it is (not so clearly) stated, that any changes that happened before any get (iterator or single call) are already included and reflected by that get operation. So a forEach bulk operation will work on the most recent state of the map at any given time.
You already gave the (only) solution to this problem in your question: create a local snap-shot using the copy constructor of the map prior to distribution. It is an extra memory overhead, but that is the only way to get a snapshot.

ConcurrentHashMap and its operations

Suppose there is a ConcurrentHashMap and there are two threads.
If both threads are reading some data from the same bucket, then my understanding says that both can read that bucket concurrently, as CHM does not block reading operations.
But suppose one thread is writing (put) to a bucket. Then, can a second thread simultaneously read (get) from the same bucket or will the second thread have to wait for the put operation to complete?
If it were Hashtable then get will have to wait until the put operation is complete. But in case of CHM how it will behave?

There is no need for speculation. The source code for ConcurrentHashMap is open, and anyone can read it. (This is JDK 8 build 128, the first JDK 8 release candidate.)
You should have no trouble understanding it, as it's only 6,300 lines long. :-) Actually, a good fraction of this is comments, and most of the code goes toward handling edge cases. The straightforward paths of get() and put() aren't terribly complicated and are only a few dozen lines of code.
Your understanding of read operations (get(), contains()) is correct; there is no blocking. Hashing to a bucket and searching within the bucket, if necessary, is straightforward, with no locking. Memory visibility is ensured by volatile reads. (At lines 622-623, the val and next fields of Node are volatile.) Read operations proceed concurrently with other reads and also with writes to the same bucket.
The policy for removing and replacing values is fairly straightforward in that the head of the bucket is locked while the bucket is being searched and modified. See the synchronized block at line 1117 of replaceNode. A put that adds to an existing bucket is similar; see the synchronized block at line 1027 of putVal. These operations will of course block other threads attempting to remove, replace, or add entries to this same bucket. If a value is in the midst of being replaced, a thread that is getting the value for this key will see either the old value or the new value, depending on whether the reading thread finds the node before or after the value is replaced by the writing thread.
There is a special case for putting the first element into a bucket. At lines 1018-1020, if putVal finds a bucket empty, it will create a new Node and CAS (compare-and-swap) it into place. If this succeeds, the operation is complete. If two threads are attempting to add nodes into the same bucket more-or-less simultaneously, the CAS for the first will succeed, and the CAS for the second will fail. But note that this code is within a for-loop (line 1014). The thread whose CAS has failed simply goes around the loop and retries. In fact, all the other write operations are within a loop. The general approach is that operations proceed optimistically but are checked for concurrent writers. If the optimistic attempt fails, the operation is retried and goes through a (possibly) different path based on the now updated state.

Hi as Per my knowledge ConcurrentHashMap allows multiple readers to read concurrently without any blocking. This is achieved by partitioning Map into different parts based on concurrency level and locking only a portion of Map during updates. Default concurrency level is 16, and accordingly Map is divided into 16 part and each part is governed with different lock. This means, 16 thread can operate on Map simultaneously, until they are operating on different part of Map. This makes ConcurrentHashMap high performance despite keeping thread-safety intact. Though, it comes with caveat. Since update operations like put(), remove(), putAll() or clear() is not synchronized, concurrent retrieval may not reflect most recent change on Map.
I hope this will help..

This is from the JavaDocs of ConcurrentHashMap class:
"Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset"

In Hastable concurrent operations will lock the whole collection, but in ConcurrentHashMap only one bucket will be locked.

From the doc:
A hash table supporting full concurrency of retrievals and adjustable
expected concurrency for updates. This class obeys the same functional
specification as Hashtable, and includes versions of methods
corresponding to each method of Hashtable. However, even though all
operations are thread-safe, retrieval operations do not entail
locking, and there is not any support for locking the entire table in
a way that prevents all access. This class is fully interoperable with
Hashtable in programs that rely on its thread safety but not on its
synchronization details.
Retrieval operations (including get) generally do not block, so may
overlap with update operations (including put and remove). Retrievals
reflect the results of the most recently completed update operations
holding upon their onset. For aggregate operations such as putAll and
clear, concurrent retrievals may reflect insertion or removal of only
some entries. Similarly, Iterators and Enumerations return elements
reflecting the state of the hash table at some point at or since the
creation of the iterator/enumeration. They do not throw
ConcurrentModificationException. However, iterators are designed to be
used by only one thread at a time.
So, you shouldn't expect operations to synchronize exactly as a Hashtable, but the same (series of) operation are threadsafe. The second highlighted sentence does not imply, but in my opinion strongly suggest, what is going on here: a put in progress, i.e. not finished, will not block a get - the get will simply not see the changes yet.
Although I have not worked myself through the whole CHM class, this piece of documentation supports my hypothesis (taken from OpenJDK 6)
static final class Segment<K,V> extends ReentrantLock implements Serializable {
/*
* Segments maintain a table of entry lists that are always
* kept in a consistent state, so can be read (via volatile
* reads of segments and tables) without locking. This
* requires replicating nodes when necessary during table
* resizing, so the old lists can be traversed by readers
* still using old version of table.
When an update is "complete" doesn't seem to be explicitly defined; generally as soon as the new bucket is linked into the list of buckets, I guess. CHM also makes heavy use of volatile fields to ensure that threads read the most recent buckets in the list.

Trying to understand the scope of the ConcurrentHashMap

The ConcurrentHashMap provides thread-safe but the docs state:
" However, even though all operations are thread-safe, retrieval operations do not entail locking"
So from this I understand that getting or setting a key and value are thread-safe, but modifying the actual VALUE of any given key isn't (by value I actaully mean the value or state of that object).
I'm just confused on how this works, at the moment I think things work like this.
The ConcurrentHashMap only gaurantees the key's are thread-safe in terms setting/getting them. But the object you put inside the map has to gaurd for concurrency by itself.
Is this correct?

But the object you put inside the map has to gaurd for concurrency by itself.
Your understanding is correct.
From the documentation:
However, even though all operations are thread-safe, retrieval operations do not entail locking, and there is not any support for locking the entire table in a way that prevents all access.
What the above is also saying is that there is no built-in mechanism for automatic locking of the hash map while the reading takes place. In particular, this means that get() operations can overlap with concurrent modifications performed by other threads.
The document goes on to explain the concurrency semantics:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries. Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.

What you say is true by default -- there would be no way for the map to enforce the thread safety of either its keys or its values since these are objects that come from outside. What you read about the retrieval of objects, however, has nothing to do with that fact. The map doesn't block while retrieving a value so another update may be happening at the same time (these operation can overlap).

The basic idea of the ConcurrentHashMap is that only modifications use a lock, while retrieval-only operations don't. This is possible because the entire data structure and the operations on it are defined in a way that allows get() to only ever see a "consistent enough" state of the map to do its work. If there's currently an insert operation in progress, then get() either sees the result or it doesn't, but it won't ever see a partial result or even temporarily invalid data.

Why ConcurrentHashMap.putifAbsent is safe?

I have been reading for concurency since yesterday and i dont know much things... However some things are starting to getting clear...
I understand why double check locking isnt safe (i wonder what is the propability the rare condition to occur) but volatile fixes the issue in 1.5 +....
But i wonder if this occurs with putifAbsent
like...
myObj = new myObject("CodeMonkey");
cHashM.putIfAbsent("keyy",myObj);
Then does this ensures that myObj would be 100% intialiased when another thread does a cHashM.get() ??? Because it could have a reference isnt completely initialised (the double check lock problem)

If you invoke concurrentHashMap.get(key) and it returns an object, that object is guaranteed to be fully initialized. Each put (or putIfAbsent) will obtain a bucket specific lock and will append the element to the bucket's entries.
Now you may go through the code and notice that the get method doesnt obtain this same lock. So you can argue that there can be an out of date read, that isn't true either. The reason here is that value within the entry itself is volatile. So you will be sure to get the most up to date read.

putIfAbsent method in ConcurrentHashMap is check-if-absent-then-set method. It's an atomic operation. But to answer the following part: "Then does this ensures that myObj would be 100% intialiased when another thread does a cHashM.get() ", it would depend on when the object is put into the HashMap. Usually there is a happens-before precedence, i.e., if the caller gets first before the object is placed in the map, then null would be returned, else the value would be returned.

The relevant part of the documentation is this:
Memory consistency effects: As with
other concurrent collections, actions
in a thread prior to placing an object
into a ConcurrentMap as a key or value
happen-before actions subsequent to
the access or removal of that object
from the ConcurrentMap in another
thread.
-- java.util.ConcurrentMap
So, yes you have your happens-before relationship.

I'm not an expert on this, but looking at the implementation of Segment in ConcurrentHashMap I see that the volatile field count appears to be used to ensure proper visibility between threads. All read operations have to read the count field and all write operations have to write to it. From comments in the class:
Read operations can thus proceed without locking, but rely
on selected uses of volatiles to ensure that completed
write operations performed by other threads are
noticed. For most purposes, the "count" field, tracking the
number of elements, serves as that volatile variable
ensuring visibility. This is convenient because this field
needs to be read in many read operations anyway:
- All (unsynchronized) read operations must first read the
"count" field, and should not look at table entries if
it is 0.
- All (synchronized) write operations should write to
the "count" field after structurally changing any bin.
The operations must not take any action that could even
momentarily cause a concurrent read operation to see
inconsistent data. This is made easier by the nature of
the read operations in Map. For example, no operation
can reveal that the table has grown but the threshold
has not yet been updated, so there are no atomicity
requirements for this with respect to reads.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.