How to prevent threads from reading inconsistent data from ConcurrentHashMap? - java

I'm using a ConcurrentHashMap<String, String> as a cache. Read operations check whether an element is already in the cache, and write operations add an element to it.
So, my question is: what are the best practices to always read the most recent ConcurrentHashMap values?
I want to ensure data consistency and not have cases like:
With the map.get("key") method, the first thread validates that this key does not yet exist in the map, then it calls map.put("key", "value")
The second thread reads the data before the first thread puts the element on the map, leading to inconsistent data.
Code example:
Optional<String> cacheValue = Optional.ofNullable(cachedMap.get("key"));
if (cacheValue.isPresent()) {
// Perform actions
} else {
cachedMap.putIfAbsent("key", "value");
// Perform actions
}
How can I ensure that my ConcurrentHashMap is synchronized and doesn't retrieve inconsistent data?
Should I perform these map operations inside a synchronized block?

You probably need to do it this way:
if (cachedMap.putIfAbsent("key", "value") == null) {
// Perform actions "IS NOT PRESENT"
} else {
// Perform actions "IS PRESENT"
}
Doing it in two checks is obviously not atomic, so if you're having problems with the wrong values getting put in the cache, then that's likely your problem.
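To illustrate why the single putIfAbsent call is enough, here is a minimal sketch (the class and method names are made up for the illustration). Several threads race to insert the same key, and only the one whose putIfAbsent returns null takes the "IS NOT PRESENT" branch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class PutIfAbsentDemo {
    // Starts `threads` threads that all race to insert the same key and
    // returns how many of them observed "key was absent" (a null return).
    static int countWinners(int threads) {
        ConcurrentMap<String, String> cachedMap = new ConcurrentHashMap<>();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            String value = "value-" + i;
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at roughly the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // The check and the insert happen as one atomic step.
                if (cachedMap.putIfAbsent("key", value) == null) {
                    winners.incrementAndGet(); // "IS NOT PRESENT" branch
                }
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("threads that inserted: " + countWinners(8));
    }
}
```

With a separate get followed by put, more than one thread could take the "not present" branch; with putIfAbsent exactly one does.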

what are the best practices to always read the most recent ConcurrentHashMap values?
Oracle's Javadoc for ConcurrentHashMap says, "Retrievals reflect the results of the most recently completed update operations holding upon their onset." In other words, any time you call map.get(...) or any other method on the map, you are always working with the "most recent" content.
*BUT*
Is that enough? Maybe not. If your program threads expect any kind of consistency between two or more keys in the map, or if your threads expect any kind of consistency between something that is stored in the map and something that is stored elsewhere, then you are going to need to provide some explicit higher-level synchronization between the threads.
I can't provide an example that would be specific to the problem that's puzzling you because your question doesn't really say what that problem is.
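As a generic illustration only (not specific to the asker's problem), here is a sketch of what "higher-level synchronization" can look like when two keys must always be consistent with each other. The keys "host" and "port" and the class name are assumptions made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

class PairedEntries {
    private final Object lock = new Object();
    // Guarded by `lock`; a plain HashMap suffices because every access holds the lock.
    private final Map<String, String> map = new HashMap<>();

    // Writes both keys as one unit, so readers never see only one of them updated.
    void updateBoth(String host, String port) {
        synchronized (lock) {
            map.put("host", host);
            map.put("port", port);
        }
    }

    // Reads both keys under the same lock and returns them as "host:port".
    String readBoth() {
        synchronized (lock) {
            return map.get("host") + ":" + map.get("port");
        }
    }
}
```

A ConcurrentHashMap alone cannot give this guarantee, because each get and put is atomic only by itself.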

Related

Java 8 updating a map in parallel stream

I have two loops. In the inner loop, I hit a database, get the result, perform some computations on it (which involves calling other private methods) and put the result in a map.
Will this approach cause any problems, like putting null for any of the keys?
No two threads will update the same value, i.e. the computed key will be unique. (If it loops n times, there will be n keys.)
Map<String,String> m = new ConcurrentHashMap<>();
obj1.getProp().parallelStream().forEach(k1 -> { //obj.getProp() returns a list
obj2.parallelStream().forEach(k2-> { //obj2 is a list
String key = constructKey(k1,k2);
//Hit a DB and get the result
//Computations on the result
//Call some other methods
m.put(key, result);
});
});
You should not use the Stream API unless you’ve fully understood that it is more than an alternative spelling for loops. Generally, if your code contains a forEach on a stream, you should ask yourself at least once whether this is really the best solution for your task, but if your code contains nested forEach calls, you should know that it can’t be the right thing.
It might work, as it does when adding to a concurrent map like in your question; however, it defeats the purpose of the Stream API.
Besides that, arrays don’t have a parallelStream() method, thus, when the result type of obj.getProp() and the type of obj2 are arrays, as your comments say, you have to use Arrays.stream(…) to construct a stream.
What you want to do can be implemented as
Map<String,String> m =
Arrays.stream(obj1.getProp()).parallel()
.flatMap(k1 -> Arrays.stream(obj2).map(k2 -> constructKey(k1, k2)))
.collect(Collectors.toConcurrentMap(key -> key, key -> {
//Hit a DB and get the result
//Computations on the result
//Call some other methods
return result;
}));
The benefit of this is not only a better utilization of parallel processing, but also that it works even if you use Collectors.toMap, creating a non-concurrent Map, instead of Collectors.toConcurrentMap; the framework will take care of producing it in a thread-safe manner.
So unless you definitely need a concurrent map for later concurrent processing, you can use either; which one will perform better depends on factors whose discussion would exceed the scope of this answer.
So with the correct usage of the Stream API, it will be thread safe, regardless of which Map type you produce, and the remaining question is whether the database access is thread safe, which, as already explained in this answer depends on a lot of factors which you didn’t include in your question, so we can’t answer that.
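If the inputs are lists rather than arrays, the same idea can be sketched as follows. This is only an illustration: constructKey and compute here are hypothetical stand-ins for the question's key construction and database work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CrossProductMap {
    // Hypothetical stand-in for the question's constructKey(k1, k2).
    static String constructKey(String k1, String k2) {
        return k1 + "-" + k2;
    }

    // Hypothetical stand-in for "hit a DB and compute the result".
    static String compute(String key) {
        return key.toUpperCase();
    }

    // Builds one entry per (outer, inner) pair; the collector does the map
    // insertion, so the lambda bodies never touch shared state themselves.
    static Map<String, String> build(List<String> outer, List<String> inner) {
        return outer.parallelStream()
                .flatMap(k1 -> inner.stream().map(k2 -> constructKey(k1, k2)))
                .collect(Collectors.toConcurrentMap(key -> key, CrossProductMap::compute));
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("a", "b"), Arrays.asList("x", "y")));
    }
}
```

Note that toConcurrentMap (like toMap) throws an IllegalStateException on duplicate keys, which matches the question's assumption that every computed key is unique.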
Your question boils down to the parts "can I add to a concurrent hash map from multiple threads?" and "can I access my database in parallel?"
The answer to the first is: "yes", the answer to the second is "it depends"
Or a little longer: the two parallel streams which you use basically just start the inner lambda on multiple threads in the execution pool. The adding to the map itself is not a problem, that is what the concurrent hash map was made for.
Regarding the database, it depends on how you query it and on which level you share the object. If you use a connection pool with a different connection for each thread, you will probably be fine. For most databases, sharing a connection and getting a new statement per thread is also fine. Sharing a statement and getting a new result set leads to problems for quite a number of database drivers.

Sort array based on constantly changing map

I have a ConcurrentHashMap which is asynchronously updated to mirror the data in a database. I am attempting to sort an array based on this data which works fine most of the time but if the data updates while sorting then things can get messy.
I have thought of copying the map and then sorting with the copied map but due to the frequency I need to sort and the size of the map this is not a possibility.
I'm not sure I understood your requirements perfectly, so I'll deal with two separate cases.
Let's say your async "update" operation requires you to update 2 keys in your map.
Scenario 1 is : it's OK if a "sort" operation occurs while only 1 of the two updates is visible.
Scenario 2 is : you need the 2 updates to be visible simultaneously or not at all (which is called atomic behavior).
Case 1 : you do not need atomic bulk updates
In this case, ConcurrentHashMap is OK as is, seeing iterators are guaranteed not to fail upon modification of the map. From ConcurrentHashMap documentation (emphasis mine) :
Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration. They do not throw ConcurrentModificationException. However, iterators are designed to be used by only one thread at a time.
So you are guaranteed that you can iterate through the map even while it is being modified, without the iteration failing because of concurrent modifications. But (see the emphasis) you are NOT guaranteed that all modifications made concurrently to the map are immediately visible, nor, if only some of them are, in which order they become visible.
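A small single-threaded sketch makes the difference from HashMap visible (the class and method names are made up for the illustration). Adding a key in the middle of an iteration trips HashMap's fail-fast check, while the ConcurrentHashMap iterator just carries on and may or may not show the new key:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IterationDemo {
    // Adds a new key while iterating; returns true if the iteration
    // failed with ConcurrentModificationException.
    static boolean failsWhenModified(Map<String, Integer> map) {
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        try {
            for (String key : map.keySet()) {
                map.put("d", 4); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap fails: " + failsWhenModified(new HashMap<>()));
        System.out.println("ConcurrentHashMap fails: " + failsWhenModified(new ConcurrentHashMap<>()));
    }
}
```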
Case 2 : you need bulk updates to be atomic
Furthermore, with ConcurrentHashMap you do not have any guarantee that bulk operations (putAll) will behave atomically :
For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries.
So I see two solutions for this case, each of which entails locking.
Solution 1 : building a copy
Building a "frozen" copy can help you only if this copy is built during a phase where all other updates are locked, because the copying of your map implies iterating through it, and our hypothesis is that iteration is not safe if we have concurrent modification.
This could look like :
ConcurrentMap<String, String> map = new ConcurrentHashMap<String, String>();
AtomicReference<Map<String, String>> frozenCopy = new AtomicReference<Map<String, String>>(map);
public void sortOperation() {
sortUsingFrozenCopy();
}
public void updateOperation() {
synchronized (map) { // Exclusive access to the map instance
updateMap();
Map<String, String> newCopy = new HashMap<String, String>();
newCopy.putAll(map); // You build the copy. This is safe thanks to the exclusive access.
frozenCopy.set(newCopy); // And you update the reference to the copy
}
}
This solution could be refined...
Seeing that your 2 operations (map reads and map writes) are totally asynchronous, one can assume that your read operations cannot know (and should not care) whether the previous write operation occurred 0.1 sec before or will occur 0.1 sec after.
So having your read operations depend on a "frozen copy" of the map that is actually updated once every 1 (or 2, or 5, or 10) seconds (or update events) instead of each time may be a possibility for your case.
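A rough sketch of that refinement, under the assumption that something else (a timer, or a counter of update events) decides when to call refresh(). The class and method names are invented for the example:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SnapshotHolder {
    private final ConcurrentMap<String, String> live = new ConcurrentHashMap<>();
    // Readers only ever see a fully built, unmodifiable copy.
    private volatile Map<String, String> frozen = Collections.emptyMap();

    void update(String key, String value) {
        live.put(key, value); // cheap: does not rebuild the copy
    }

    // Intended to be called on a timer or every N updates, not on every write.
    // Copying a live ConcurrentHashMap is only weakly consistent, so writers
    // would still need to be locked out here if bulk updates must appear atomic.
    void refresh() {
        frozen = Collections.unmodifiableMap(new HashMap<>(live));
    }

    // The sort operation would work against this copy.
    Map<String, String> snapshot() {
        return frozen;
    }
}
```

The trade-off is staleness: a sort may run against data that is up to one refresh interval old.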
Solution 2 : lock the map for updates
Locking the Map without copying it is a solution. You'd want a ReadWriteLock (or StampedLock in Java 8) so as to have multiple sorts possible, and a mutual exclusion of read and write operations.
Solution 2 is actually easy to implement. You'd have something like
ReadWriteLock lock = new ReentrantReadWriteLock();
public void sortOperation() {
lock.readLock().lock();
// read lock granted, which prevents the write lock from being granted
try {
sort(); // This is safe, nobody can write
} finally {
lock.readLock().unlock();
}
}
public void updateOperation() {
lock.writeLock().lock();
// Write lock granted, no other writeLock (except to myself) can be granted
// nor any readLock
try {
updateMap(); // Nobody is reading, that's OK.
} finally {
lock.writeLock().unlock();
}
}
With a ReadWriteLock, multiple reads can occur simultaneously, or a single write, but not multiple writes nor reads and writes.
You'd have to consider the possibility of using a fair variant of the lock, so that you are sure that every read and write process will eventually have a chance of being executed, depending on your usage pattern.
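For reference, the fair variant is selected through the constructor argument of ReentrantReadWriteLock. A minimal sketch (the holder class is just for illustration):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FairLockExample {
    // Passing true requests the fair ordering policy: waiting threads are
    // granted the lock in approximately arrival order.
    static final ReentrantReadWriteLock FAIR_LOCK = new ReentrantReadWriteLock(true);

    // The no-argument constructor gives the default, non-fair policy.
    static final ReentrantReadWriteLock DEFAULT_LOCK = new ReentrantReadWriteLock();

    public static void main(String[] args) {
        System.out.println("fair: " + FAIR_LOCK.isFair());
        System.out.println("default: " + DEFAULT_LOCK.isFair());
    }
}
```

Fairness typically costs some throughput, so it is worth measuring before switching.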
(NB : if you use Locking/synchronized, your Map may not need to be concurrent, as write and read operations will be exclusive, but this is another topic).

How to optimize concurrent operations in Java?

I'm still quite shaky on multi-threading in Java. What I describe here is at the very heart of my application and I need to get this right. The solution needs to work fast and it needs to be practically safe. Will this work? Any suggestions/criticism/alternative solutions welcome.
Objects used within my application are somewhat expensive to generate but change rarely, so I am caching them in *.temp files. It is possible for one thread to try and retrieve a given object from cache, while another is trying to update it there. Cache operations of retrieve and store are encapsulated within a CacheService implementation.
Consider this scenario:
Thread 1: retrieve cache for objectId "page_1".
Thread 2: update cache for objectId "page_1".
Thread 3: retrieve cache for objectId "page_2".
Thread 4: retrieve cache for objectId "page_3".
Thread 5: retrieve cache for objectId "page_4".
Note: thread 1 appears to retrieve an obsolete object, because thread 2 has a newer copy of it. This is perfectly OK so I do not need any logic that will give thread 2 priority.
If I synchronize retrieve/store methods on my service, then I'm unnecessarily slowing things down for threads 3, 4 and 5. Multiple retrieve operations will be effective at any given time but the update operation will be called rarely. This is why I want to avoid method synchronization.
I gather I need to synchronize on an object that is exclusively common to thread 1 and 2, which implies a lock object registry. Here, an obvious choice would be a Hashtable but again, operations on Hashtable are synchronized, so I'm trying a HashMap. The map stores a string object to be used as a lock object for synchronization and the key/value would be the id of the object being cached. So for object "page_1" the key would be "page_1" and the lock object would be a string with a value of "page_1".
If I've got the registry right, then additionally I want to protect it from being flooded with too many entries. Let's not get into details why. Let's just assume, that if the registry has grown past defined limit, it needs to be reinitialized with 0 elements. This is a bit of a risk with an unsynchronized HashMap but this flooding would be something that is outside of normal application operation. It should be a very rare occurrence and hopefully never takes place. But since it is possible, I want to protect myself from it.
@Service
public class CacheServiceImpl implements CacheService {
private static ConcurrentHashMap<String, String> objectLockRegistry=new ConcurrentHashMap<>();
public Object getObject(String objectId) {
String objectLock=getObjectLock(objectId);
if(objectLock!=null) {
synchronized(objectLock) {
// read object from objectInputStream
}
}
public boolean storeObject(String objectId, Object object) {
String objectLock=getObjectLock(objectId);
synchronized(objectLock) {
// write object to objectOutputStream
}
}
private String getObjectLock(String objectId) {
int objectLockRegistryMaxSize=100_000;
// reinitialize registry if necessary
if(objectLockRegistry.size()>objectLockRegistryMaxSize) {
// hoping to never reach this point but it is not impossible to get here
synchronized(objectLockRegistry) {
if(objectLockRegistry.size()>objectLockRegistryMaxSize) {
objectLockRegistry.clear();
}
}
}
// add lock to registry if necessary
objectLockRegistry.putIfAbsent(objectId, new String(objectId));
String objectLock=objectLockRegistry.get(objectId);
return objectLock;
}
If you are reading from disk, lock contention is not going to be your performance issue.
You can have both threads grab the lock for the entire cache, do a read, if the value is missing, release the lock, read from disk, acquire the lock, and then if the value is still missing write it, otherwise return the value that is now there.
The only issue you will have with that is the concurrent reads thrashing the disk... but the OS caches will be hot, so the disk shouldn't be overly thrashed.
If that is an issue then switch your cache to holding a Future<V> in place of a <V>.
The get method will become something like:
public V get(K key) {
Future<V> future;
synchronized(this) {
future = backingCache.get(key);
if (future == null) {
future = executorService.submit(new LoadFromDisk(key));
backingCache.put(key, future);
}
}
return future.get();
}
Yes that is a global lock... but you're reading from disk, and don't optimize until you have a proved performance bottleneck...
Oh. First optimization, replace the map with a ConcurrentHashMap and use putIfAbsent and you'll have no lock at all! (BUT only do that when you know this is an issue)
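A sketch of what that lock-free variant might look like, assuming the load can run on the calling thread via a FutureTask instead of the executor service used above. The loader function is a hypothetical stand-in for LoadFromDisk:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

class FutureCache<K, V> {
    private final ConcurrentMap<K, Future<V>> backingCache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical stand-in for LoadFromDisk

    FutureCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        Future<V> future = backingCache.get(key);
        if (future == null) {
            FutureTask<V> task = new FutureTask<>(() -> loader.apply(key));
            future = backingCache.putIfAbsent(key, task);
            if (future == null) { // this thread won the race, so it runs the load
                future = task;
                task.run();
            }
        }
        try {
            return future.get(); // other threads block here until the load completes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Each key is loaded at most once even when several threads ask for it at the same moment. A failed load stays cached in this sketch; a real cache would probably remove the entry on failure.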
The complexity of your scheme has already been discussed. That leads to hard to find bugs. For example, not only do you lock on non-final variables, but you even change them in the middle of synchronized blocks that use them as a lock. Multi-threading is very hard to reason about, this kind of code makes it almost impossible:
synchronized(objectLockRegistry) {
if(objectLockRegistry.size() > objectLockRegistryMaxSize) {
objectLockRegistry = new HashMap<>(); //brrrrrr...
}
}
In particular, 2 simultaneous calls to get a lock on a specific string might actually return 2 different instances of the same string, each stored in a different instance of your hashmap (unless they are interned), and you won't be locking on the same monitor.
You should either use an existing library or keep it a lot simpler.
If your question includes the keywords "optimize", "concurrent", and your solution includes a complicated locking scheme ... you're doing it wrong. It is possible to succeed at this sort of venture, but the odds are stacked against you. Prepare to diagnose bizarre concurrency bugs, including but not limited to, deadlock, livelock, cache incoherency... I can spot multiple unsafe practices in your example code.
Pretty much the only way to create a safe and effective concurrent algorithm without being a concurrency god is to take one of the pre-baked concurrent classes and adapt them to your need. It's just too hard to do unless you have an exceptionally convincing reason.
You might take a look at ConcurrentMap. You might also like CacheBuilder.
Using Threads and synchronized directly is covered at the beginning of most tutorials about multithreading and concurrency. However, many real-world examples require more sophisticated locking and concurrency schemes, which are cumbersome and error-prone if you implement them yourself. To prevent reinventing the wheel over and over again, the Java concurrency library was created. There, you can find many classes that will be of great help to you. Try googling for tutorials about Java concurrency and locks.
As an example for a lock which might help you, see http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html .
Rather than roll your own cache I would take a look at Google's MapMaker. Something like this will give you a lock cache that automatically expires unused entries as they are garbage collected:
ConcurrentMap<String,String> objectLockRegistry = new MapMaker()
.softValues()
.makeComputingMap(new Function<String,String>() {
public String apply(String s) {
return new String(s);
}
});
With this, the whole getObjectLock implementation is simply return objectLockRegistry.get(objectId) - the map takes care of all the "create if not already present" stuff for you in a safe way.
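Without Guava, on Java 8 or later, a similar "create if not already present" registry can be sketched with ConcurrentHashMap.computeIfAbsent. This is an alternative technique, not what the answer above uses, and unlike softValues() it never expires entries. The class and method names are made up:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class LockRegistry {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    // Returns the single lock object for this id, creating it atomically on first use.
    Object lockFor(String objectId) {
        return locks.computeIfAbsent(objectId, id -> new Object());
    }
}
```

Callers would then write synchronized (registry.lockFor(objectId)) { ... } around the file read or write. Two threads asking for the same id always get the same monitor.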
I would do it similarly to you: just create a map of lock objects (new Object()).
But unlike you, I would use a TreeMap<String, Object>
or a HashMap.
Call that the lockMap. It holds one entry per file to lock and is publicly available to all participating threads.
Each read and write to a specific file gets the lock from the map and uses synchronized(lock) on that lock object.
If the lockMap is not fixed and its content can change, then reading and writing to the map must be synchronized, too. (synchronized (this.lockMap) {....})
But your getObjectLock() is not safe; synchronize all of it with your lock. (Double-checked locking is not thread-safe in Java!) A recommended book: Doug Lea, Concurrent Programming in Java

concurrent HashMap: checking size

ConcurrentHashMap solves the synchronization issues seen with HashMap, so adding and removing should be faster than using the synchronized keyword with a HashMap. What about checking the map's size when multiple threads call size() on a ConcurrentHashMap? Do we still need the synchronized keyword, as in the following:
public static synchronized int getSize(){
return aConcurrentHashmap.size();
}
concurrentHashMap.size() will return the size known at the moment of the call, but it might be a stale value by the time you use that number, because another thread may have added or removed items in the meantime.
However, the whole purpose of ConcurrentMaps is that you don't need to synchronize them, as they are thread-safe collections.
You can simply call aConcurrentHashmap.size(). However, you have to bear in mind that by the time you get the answer it might already be obsolete. This would happen if another thread were to concurrently modify the map.
You don't need to use synchronized with ConcurrentHashMap except in very rare occasions where you need to perform multiple operations atomically.
To just get the size, you can call it without synchronization.
To clarify when I would use synchronization with ConcurrentHashMap...
Say you have an expensive object you want to create on demand. You want concurrent reads, but also want to ensure that values are only created once.
public ExpensiveObject get(String key) {
return map.get(key); // can work concurrently.
}
public void put(String key, ExpensiveBuilder builder) {
// cannot use putIfAbsent because it needs the object before checking.
synchronized(map) {
if (!map.containsKey(key))
map.put(key, builder.create());
}
}
Note: This requires that all writes are synchronized, but reads can still be concurrent.
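On Java 8 and later, computeIfAbsent covers this case without an explicit synchronized block, because the value is only created if the key turns out to be absent. A sketch, with a Supplier standing in for builder.create() and invented class and method names:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

class LazyCache<V> {
    private final ConcurrentMap<String, V> map = new ConcurrentHashMap<>();

    V get(String key) {
        return map.get(key); // can work concurrently
    }

    // The supplier (a stand-in for builder.create()) runs at most once per key;
    // if the key is already present, it is not invoked at all.
    V putIfMissing(String key, Supplier<V> builder) {
        return map.computeIfAbsent(key, k -> builder.get());
    }
}
```

The mapping function should be short and must not modify the same map, since other updates to that map may be blocked while it runs.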
The designers of ConcurrentHashMap chose to prioritize individual operations such as get(), put() and remove() over methods that operate on the whole map, such as isEmpty() or size(). This is because the chances of these whole-map methods being called are (in general) lower than for the individual operations.
Synchronization for size() is not needed here. We can get the size by calling the concurrentHashMap.size() method. This method may return a stale value, as another thread might modify the map in the meantime. This is an explicitly accepted trade-off, as these operations are deprioritized.
ConcurrentHashMap iterators are fail-safe (weakly consistent): they won't throw ConcurrentModificationException. It works well for multi-threaded operations.
ConcurrentHashMap is a hash table like HashMap, but HashMap does no locking at all; its iterators are fail-fast and throw ConcurrentModificationException when the map is structurally modified during iteration.
In ConcurrentHashMap, locking happens at the bucket level (at the segment level before Java 8), and its iterators tolerate concurrent updates, so ConcurrentModificationException is not thrown.
So to answer your question: synchronizing the size check of a ConcurrentHashMap doesn't help, because the size keeps changing with whatever modifications other code performs on the map. The size() method has the same signature as the one in HashMap.

What is the name of this locking technique?

I've got a gigantic Trove map and a method that I need to call very often from multiple threads. Most of the time this method shall return true. The threads are doing heavy number crunching and I noticed that there was some contention due to the following method (it's just an example, my actual code is bit different):
synchronized boolean containsSpecial() {
return troveMap.contains(key);
}
Note that it's an "append only" map: once a key is added, it stays in there forever (which is important for what comes next I think).
I noticed that by changing the above to:
boolean containsSpecial() {
if ( troveMap.contains(key) ) {
// most of the time (>90%) we shall pass here, dodging lock-acquisition
return true;
}
synchronized (this) {
return troveMap.contains(key);
}
}
I get a 20% speedup on my number crunching (verified on lots of runs, running during long times etc.).
Does this optimization look correct (knowing that once a key is there it shall stay there forever)?
What is the name for this technique?
EDIT
The code that updates the map is called way less often than the containsSpecial() method and looks like this (I've synchronized the entire method):
synchronized void addSpecialKeyValue( key, value ) {
....
}
This code is not correct.
Trove doesn't handle concurrent use itself; it's like java.util.HashMap in that regard. So, like HashMap, even seemingly innocent, read-only methods like containsKey() could throw a runtime exception or, worse, enter an infinite loop if another thread modifies the map concurrently. I don't know the internals of Trove, but with HashMap, rehashing when the load factor is exceeded, or removing entries can cause failures in other threads that are only reading.
If the operation takes a significant amount of time compared to lock management, using a read-write lock to eliminate the serialization bottleneck will improve performance greatly. In the class documentation for ReentrantReadWriteLock, there are "Sample usages"; you can use the second example, for RWDictionary, as a guide.
In this case, the map operations may be so fast that the locking overhead dominates. If that's the case, you'll need to profile on the target system to see whether a synchronized block or a read-write lock is faster.
Either way, the important point is that you can't safely remove all synchronization, or you'll have consistency and visibility problems.
It's called wrong locking ;-) Actually, it is some variant of the double-checked locking approach. And the original version of that approach is just plain wrong in Java.
Java threads are allowed to keep private copies of variables in their local memory (think: core-local cache of a multi-core machine). Any Java implementation is allowed to never write changes back into the global memory unless some synchronization happens.
So, it is very well possible that one of your threads has a local memory in which troveMap.contains(key) evaluates to true. Therefore, it never synchronizes and it never gets the updated memory.
Additionally, what happens when contains() sees an inconsistent memory state of the troveMap data structure?
Look up the Java memory model for the details. Or have a look at this book: Java Concurrency in Practice.
This looks unsafe to me. Specifically, the unsynchronized calls will be able to see partial updates, either due to memory visibility (a previous put not getting fully published, since you haven't told the JMM it needs to be) or due to a plain old race. Imagine if TroveMap.contains has some internal variable that it assumes won't change during the course of contains. This code lets that invariant break.
Regarding the memory visibility, the problem with that isn't false negatives (you use the synchronized double-check for that), but that trove's invariants may be violated. For instance, if they have a counter, and they require that counter == someInternalArray.length at all times, the lack of synchronization may be violating that.
My first thought was to make troveMap's reference volatile, and to re-write the reference every time you add to the map:
synchronized (this) {
troveMap.put(key, value);
troveMap = troveMap;
}
That way, you're setting up a memory barrier such that anyone who reads the troveMap will be guaranteed to see everything that had happened to it before its most recent assignment -- that is, its latest state. This solves the memory issues, but it doesn't solve the race conditions.
Depending on how quickly your data changes, maybe a Bloom filter could help? Or some other structure that's more optimized for certain fast paths?
Under the conditions you describe, it's easy to imagine a map implementation for which you can get false negatives by failing to synchronize. The only way I can imagine obtaining false positives is an implementation in which key insertions are non-atomic and a partial key insertion happens to look like another key you are testing for.
You don't say what kind of map you have implemented, but the stock map implementations store keys by assigning references. According to the Java Language Specification:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.
If your map implementation uses object references as keys, then I don't see how you can get in trouble.
EDIT
The above was written in ignorance of Trove itself. After a little research, I found the following post by Rob Eden (one of the developers of Trove) on whether Trove maps are concurrent:
Trove does not modify the internal structure on retrievals. However, this is an implementation detail not a guarantee so I can't say that it won't change in future versions.
So it seems like this approach will work for now but may not be safe at all in a future version. It may be best to use one of Trove's synchronized map classes, despite the penalty.
I think you would be better off with a ConcurrentHashMap which doesn't need explicit locking and allows concurrent reads
boolean containsSpecial() {
return troveMap.containsKey(key); // on ConcurrentHashMap, contains(Object) tests values, not keys
}
void addSpecialKeyValue( key, value ) {
troveMap.putIfAbsent(key,value);
}
another option is using a ReadWriteLock which allows concurrent reads but no concurrent writes
ReadWriteLock rwlock = new ReentrantReadWriteLock();
boolean containsSpecial() {
rwlock.readLock().lock();
try{
return troveMap.contains(key);
}finally{
rwlock.readLock().unlock();
}
}
void addSpecialKeyValue( key, value ) {
rwlock.writeLock().lock();
try{
//...
troveMap.put(key,value);
}finally{
rwlock.writeLock().unlock();
}
}
Why reinvent the wheel?
Simply use ConcurrentHashMap.putIfAbsent
