I want to use a Comparator-based key-value Map. This will have reads and a rare write operation (once every 3 months, through a scheduler). The initial load of the collection will be done at application startup.
Also note that the write will:
Add a single entry to the Map.
Not modify any existing entry in the Map.
Will ConcurrentSkipListMap be a good candidate for this? Does its get operation allow access by multiple threads simultaneously? I'm looking for concurrent, non-blocking reads but atomic writes.
ConcurrentHashMap is exactly what you're looking for. From the Javadoc:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. (More formally, an update operation for a given key bears a happens-before relation with any (non-null) retrieval for that key reporting the updated value.)
That sounds like it satisfies your requirement for "concurrent, non-blocking reads but atomic writes".
Since you're doing so few writes, you may want to specify a high loadFactor and an appropriate initialCapacity when creating the ConcurrentHashMap, which will prevent table resizing as you're populating the map, though this is a modest benefit at best. (You could also set a concurrencyLevel of 1, though since Java 8 that argument is used only as a hint for internal sizing.)
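As a sketch (the class name, keys, capacity, and load factor below are made-up illustrations, not recommendations):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StartupLoad {
    // Pre-size so the initial population never triggers a resize.
    // 1024 and 0.9f are placeholder values; derive them from your real data.
    // concurrencyLevel 1 reflects the single scheduled writer.
    private static final ConcurrentMap<String, String> MAP =
            new ConcurrentHashMap<>(1024, 0.9f, 1);

    public static void main(String[] args) {
        MAP.put("someKey", "someValue");        // the rare scheduled write
        System.out.println(MAP.get("someKey")); // concurrent, non-blocking read
    }
}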
If you absolutely must have a SortedMap or NavigableMap, then ConcurrentSkipListMap is the out-of-the-box way to go. But I would double-check that you actually need the functionality provided by those interfaces (getting the first/last key, submaps, finding nearby entries, etc.) before using them. You will pay a steep price (O(log n) vs. expected constant time for most operations).
Since you are looking for concurrent operations, you basically have three competitors:
Hashtable, ConcurrentHashMap, and ConcurrentSkipListMap (or Collections.synchronizedMap(), but that's not efficient).
Of these three, the latter two are better suited to concurrent operation, as they avoid locking the entire map the way Hashtable does: ConcurrentHashMap locks only the portion of the map being updated, and ConcurrentSkipListMap uses lock-free operations.
Of the latter two, ConcurrentSkipListMap uses a skip list data structure, which gives expected O(log n) performance for searches and a variety of other operations.
It also offers a number of operations that ConcurrentHashMap can't, e.g. ceilingEntry()/ceilingKey(), floorEntry()/floorKey(), etc. It also maintains a sort order which would otherwise have to be calculated.
Thus, if you had asked only for faster search, I'd have suggested ConcurrentHashMap; but since you also mentioned rare write operations and a desired sort order, I think ConcurrentSkipListMap wins the race.
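As an illustration, a short sketch of a Comparator-backed ConcurrentSkipListMap and the navigation operations mentioned above (the keys and values are made up):

import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListDemo {
    public static void main(String[] args) {
        // Any Comparator can be supplied; natural ordering shown here.
        ConcurrentSkipListMap<Integer, String> map =
                new ConcurrentSkipListMap<>(Comparator.naturalOrder());
        map.put(10, "ten");
        map.put(20, "twenty");
        map.put(30, "thirty");

        // Navigation operations ConcurrentHashMap cannot offer:
        System.out.println(map.ceilingEntry(15)); // 20=twenty (least key >= 15)
        System.out.println(map.floorKey(15));     // 10 (greatest key <= 15)
        System.out.println(map.firstKey());       // 10
        System.out.println(map.headMap(25));      // {10=ten, 20=twenty}
    }
}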
If you are willing to try third-party code, you could consider a copy-on-write version of a map, which is ideal for infrequent writes. Here's one that came up via Googling:
https://bitbucket.org/atlassian/atlassian-util-concurrent/wiki/CopyOnWrite%20Maps
Never tried it myself, so caveat emptor.
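The underlying idea is straightforward to sketch, though. Here is an illustrative toy (not the Atlassian implementation): reads go to an immutable snapshot without locking, and each rare write copies the whole backing map:

import java.util.HashMap;
import java.util.Map;

// Toy copy-on-write map: reads are lock-free against a published snapshot;
// each (rare) write copies the whole backing map under a lock.
public class CowMap<K, V> {
    private volatile Map<K, V> snapshot = new HashMap<>();

    public V get(K key) {
        return snapshot.get(key); // never blocks, sees the last published snapshot
    }

    public synchronized void put(K key, V value) {
        Map<K, V> copy = new HashMap<>(snapshot);
        copy.put(key, value);
        snapshot = copy; // volatile write publishes the new snapshot atomically
    }
}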
Related
If I have multiple threads accessing a HashMap, but can guarantee that they'll never be accessing the same key at the same time, could that still lead to a race condition?
In @dotsid's answer, he says this:
If you change a HashMap in any way then your code is simply broken.
He is correct. A HashMap that is updated without synchronization will break even if the threads are using disjoint sets of keys. Here are just some¹ of the things that can go wrong.
If one thread does a put, then another thread may see a stale value for the hashmap's size.
If one thread does a put with a key that is (currently) in the same hash bucket as the second thread's key, the second thread's map entry might get lost, temporarily or permanently. It depends on how the hash chains (or whatever) are implemented.
When a thread does a put that triggers a rebuild of the table, another thread may see transient or stale versions of the hashtable array reference, its size, its contents or the hash chains. Chaos may ensue.
When a thread does a put for a key that collides with some key used by some other thread, and the latter thread does a put for its key, then the latter might see a stale copy of the hash chain reference. Chaos may ensue.
When one thread probes the table with a key that collides with one of some other thread's keys, it may encounter that key on the chain. It will call equals on that key, and if the threads are not synchronized, the equals method may encounter stale state in that key.
And if you have two threads simultaneously doing put or remove requests, there are numerous opportunities for race conditions.
I can think of three solutions:
Use a ConcurrentHashMap.
Use a regular HashMap but synchronize on the outside; e.g. using primitive mutexes, Lock objects, etcetera (see the sketch after this answer). But beware that this could lead to a concurrency bottleneck due to lock contention.
Use a different HashMap for each thread. If the threads really have a disjoint set of keys, then there should be no need (from an algorithmic perspective) for them to share a single Map. Indeed, if your algorithms involve the threads iterating the keys, values or entries of the map at some point, splitting the single map into multiple maps could give a significant speedup for that part of the processing.
¹ - We cannot enumerate all of the possible things that could go wrong. For a start, we can't predict how all JVMs will handle the unspecified aspects of the JMM ... on all platforms. But you should NOT be relying on that kind of information anyway. All you need to know is that it is fundamentally wrong to use a HashMap like this. An application that does this is broken ... even if you haven't observed the symptoms of the brokenness yet.
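To illustrate option 2 above, a minimal external-synchronization sketch (the class and field names are made up): one mutex guards every access, reads included:

import java.util.HashMap;
import java.util.Map;

public class GuardedMap {
    private final Map<String, Integer> map = new HashMap<>();
    private final Object lock = new Object(); // one mutex guards ALL access

    public void put(String key, int value) {
        synchronized (lock) {
            map.put(key, value);
        }
    }

    public Integer get(String key) {
        synchronized (lock) { // reads must take the lock too
            return map.get(key);
        }
    }
}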
Just use a ConcurrentHashMap. The ConcurrentHashMap uses multiple locks which cover a range of hash buckets to reduce the chances of a lock being contested. There is a marginal performance impact to acquiring an uncontested lock.
To answer your original question: according to the javadoc, as long as the structure of the map doesn't change, you are fine. This means no removing elements at all and no adding new keys that are not already in the map. Replacing the value associated with existing keys is fine.
If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.)
It makes no guarantees about visibility, though, so you have to be willing to accept occasionally retrieving stale associations.
It depends on what you mean by "accessing". If you are just reading, you can even read the same keys, as long as visibility of the data is guaranteed under the "happens-before" rules. This means that the HashMap shouldn't change, and all changes (the initial construction) should be completed before any reader starts accessing the HashMap.
If you change a HashMap in any way then your code is simply broken. @Stephen C provides a very good explanation of why.
EDIT: If the first case is your actual situation, I recommend you use Collections.unmodifiableMap() to be sure that your HashMap is never changed. Objects pointed to by the HashMap should not change either, so aggressive use of the final keyword can help you.
And as @Lars Andren says, ConcurrentHashMap is the best choice in most cases.
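A small sketch of the pattern recommended above (the keys and values are invented): construct the map fully before any reader thread starts, wrap it, and publish only the unmodifiable view:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Config {
    // Built once, before any reader thread starts; then never changed.
    private static final Map<String, String> SETTINGS;
    static {
        Map<String, String> m = new HashMap<>();
        m.put("mode", "production");
        m.put("region", "eu-west");
        // The wrapper throws UnsupportedOperationException on any write.
        SETTINGS = Collections.unmodifiableMap(m);
    }

    public static String setting(String key) {
        return SETTINGS.get(key); // safe for any number of reader threads
    }
}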
Modifying a HashMap without proper synchronization from two threads may easily lead to a race condition.
When a put() leads to a resize of the internal table, this takes some time, and the other thread may continue writing to the old table.
Two put() calls for different keys update the same bucket if the keys' hashcodes are equal modulo the table size. (Actually, the relation between hashcode and bucket index is more complicated, but collisions may still occur.)
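If you want to see this for yourself, here is a small non-deterministic demo (it may lose entries, print the full count, or on some JVMs even hang; the point is that no outcome is guaranteed):

import java.util.HashMap;
import java.util.Map;

public class HashMapRaceDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();

        // Two writers using disjoint key ranges: still unsafe.
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) map.put(i, i);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 100_000; i < 200_000; i++) map.put(i, i);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Frequently prints less than 200000: entries were lost during
        // concurrent resizes. Results vary from run to run and JVM to JVM.
        System.out.println(map.size());
    }
}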
I'm a newcomer to concurrency. I've read about Guava's Cache and Multimap, and I'm looking for something that combines some capabilities of both:
From Cache I want auto-removal after ACCESS_TIMEOUT or WRITE_TIMEOUT has expired.
From Multimap I want multiple values associated with one key.
All of that must be concurrent.
I have multiple writers and multiple readers. I want to add values with random keys and remove them.
Question: Is there map implementation that fits my needs?
UPDATED: Striped<Lock> solution
The more I read about Striped<Lock>, the more attractive it seems to me. But it raised even more questions in my head:
If I use something like Striped<Lock> with a Guava Cache, which already uses a ConcurrentHashMap-like structure, I may face problems with deadlocks or performance decline. Am I wrong?
If I use Striped<Lock> over the Cache, it still doesn't resolve the question of multiple values per key.
Does Striped<Lock> eliminate the need for a concurrent map in my case? (I suppose the answer is YES, but on GitHub I saw the contrary.)
You could start with a Cache<SomeKey, Collection<SomeValue>> (so you still get the expiration) and use synchronized collections (Collections.synchronized*()) as the values.
But the real question here is the type of concurrent access you need on the collections:
Is it enough that the operations are synchronized so the collections don't get corrupted, or do you need higher-level semantics like what ConcurrentMap.putIfAbsent() offers?
Do you need to do multiple operations on the collections of values in an atomic way? Like if you need to do
if (c.contains(v)) {
    c.remove(v);
} else {
    c.add(v);
}
you usually want to put that into a synchronized(c) { } block.
If so, you'll probably want to wrap the collection inside a class exposing those higher-level semantics and managing the lock around multiple operations to get the atomicity you need, and use that class as the value: Cache<SomeKey, SomeValuesContainer>.
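A minimal sketch of such a container (String stands in for your SomeValue type; the method names are made up):

import java.util.HashSet;
import java.util.Set;

// Higher-level atomic operations on the values of one key,
// with the lock hidden inside the class.
public class SomeValuesContainer {
    private final Set<String> values = new HashSet<>();

    // The contains/remove-or-add sequence is atomic to outside observers.
    public synchronized void toggle(String v) {
        if (!values.remove(v)) { // remove() returns false if v was absent
            values.add(v);
        }
    }

    public synchronized boolean contains(String v) {
        return values.contains(v);
    }
}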
As mentioned in the comments, Striped<Lock> can be used to synchronize the access to multiple Caches/ConcurrentHashMaps without imposing a single lock and its performance impact in case of even moderate contention.
If you need multiple Caches/ConcurrentHashMaps, that is: why don't the Peers (or a wrapper around it) actually contain that information?
1. Deadlocks, performance
Guava's Cache is similar to ConcurrentHashMap, but it doesn't use it. However, both work in the same way by having segments which can be locked independently, thus reducing contention when accessing the map concurrently (especially when updating). Using a Striped<Lock> around the access to either one cannot cause a deadlock, which only happens if you're not locking multiple locks in a consistent order: that can't happen here, as you'll always lock your Lock obtained from Striped<Lock> before calling the Cache or ConcurrentHashMap, which then locks its segment (invisible to you).
As to performance, yes, locking has a cost but it really depends on the level of contention (and that can be tuned with the number of stripes in a Striped<Lock> or the concurrencyLevel in a Cache). However, you need proper concurrency support anyway since without it you can get invalid results (or corrupt your data), so you have to do something (using either locking or a lock-free algorithm).
2. Multiple values per key
My original answer still stands. It's difficult to get an exact idea of what you're trying to do from your multiple questions (it's better if you can provide a complete, consistent context in one question), but I think you don't need more than concurrent modification of the multiple values per key, so the synchronized collections should be enough (but you need at least that). You'll have to reason about your access patterns as you add them to make sure they still fit the model, though: make sure your replaceAll*() methods lock what they need, for example.
3. Is ConcurrentMap still needed with Striped<Lock>?
YES! Especially with Striped<Lock> vs a single Lock, because you'll still get concurrent updates for keys which don't use the same stripe (that's the whole point of Striped<Lock>) so you need data structures which support concurrent modification. If you use a simple HashMap, you have every chance of corrupting it under enough load (and cause infinite loops, for example).
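For illustration, a sketch of the pattern (assumes Guava on the classpath; the class, field, and method names are made up, and 64 stripes is just a tuning choice):

import com.google.common.util.concurrent.Striped;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;

public class StripedExample {
    private final Striped<Lock> stripes = Striped.lock(64);
    private final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();

    // Check-then-act made atomic per key; keys on different stripes still
    // proceed in parallel, hence the need for a concurrent map underneath.
    public void incrementIfPresent(String key) {
        Lock lock = stripes.get(key);
        lock.lock();
        try {
            Integer v = map.get(key);
            if (v != null) {
                map.put(key, v + 1);
            }
        } finally {
            lock.unlock();
        }
    }
}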
From the Javadocs of ConcurrentHashMap:
The allowed concurrency among update operations is guided by the
optional concurrencyLevel constructor argument (default 16), which is
used as a hint for internal sizing.
I do not understand the part that says "which is used as a hint for internal sizing". What does this mean? What is the best practice for setting this value, and what guarantees does it give us?
Take a look at the very next sentences in the Javadoc:
The table is internally partitioned to try to permit the indicated
number of concurrent updates without contention. Because placement
in hash tables is essentially random, the actual concurrency will
vary. Ideally, you should choose a value to accommodate as many
threads as will ever concurrently modify the table. Using a
significantly higher value than you need can waste space and time,
and a significantly lower value can lead to thread contention. But
overestimates and underestimates within an order of magnitude do
not usually have much noticeable impact. A value of one is
appropriate when it is known that only one thread will modify and
all others will only read. Also, resizing this or any other kind of
hash table is a relatively slow operation, so, when possible, it is
a good idea to provide estimates of expected table sizes in
constructors.
So in other words, a concurrencyLevel of 16 means that the ConcurrentHashMap internally creates 16 separate hashtables in which to store data. Operations that modify data in one hashtable do not require locking the other hashtables, which allows somewhat-concurrent access to the overall Map.
You might want to try reading the source of ConcurrentHashMap.
The concurrency level is roughly the number of operations that can be invoked on the map concurrently without hitting the internal locking mechanism. As maat b says, ConcurrentHashMap has N internal hashtables, so operations working on different hashtables don't require additional locking; if operations work on the same internal hashtable, then ConcurrentHashMap uses additional internal locking on them.
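For example (the numbers here are illustrative, not recommendations):

import java.util.concurrent.ConcurrentHashMap;

public class ConcurrencyLevelExample {
    public static void main(String[] args) {
        // Hint that up to 32 threads may update concurrently; since Java 8
        // this only influences initial table sizing, not a segment count.
        ConcurrentHashMap<String, Integer> map =
                new ConcurrentHashMap<>(10_000, 0.75f, 32);
        map.put("a", 1);
        System.out.println(map.get("a"));
    }
}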
I was reading the official Oracle documentation about Concurrency in Java and I was wondering what could be the difference between a Collection returned by
public static <T> Collection<T> synchronizedCollection(Collection<T> c);
and using for example a
ConcurrentHashMap. I'm assuming that I use synchronizedCollection(Collection<T> c) on a HashMap. I know that in general a synchronized collection is essentially just a decorator for my HashMap so it is obvious that a ConcurrentHashMap has something different in its internals. Do you have some information about those implementation details?
Edit: I realized that the source code is publicly available:
ConcurrentHashMap.java
I would read the source of ConcurrentHashMap, as it is rather complicated in the details. In short, it has:
Multiple partitions which can be locked independently. (16 by default)
Concurrent lock operations for thread safety instead of synchronized.
Thread-safe iterators; synchronizedCollection's iterators are not thread safe.
Does not expose the internal locks. synchronizedCollection does.
The ConcurrentHashMap is very similar to the java.util.Hashtable class, except that ConcurrentHashMap offers better concurrency than Hashtable or synchronizedMap does. ConcurrentHashMap does not lock the Map while you are reading from it. Additionally, ConcurrentHashMap does not lock the entire Map when writing to it. It only locks the part of the Map that is being written to, internally.
Another difference is that ConcurrentHashMap does not throw ConcurrentModificationException if it is changed while being iterated. The Iterator is not designed to be used by more than one thread, though, whereas a synchronizedMap's iterator may throw ConcurrentModificationException if the map is modified during iteration.
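A small demo of that behavior (the keys are arbitrary): the ConcurrentHashMap iterator is weakly consistent, so concurrent modification is allowed but may or may not be visible:

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IterationDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        Iterator<String> it = map.keySet().iterator();
        map.put("c", 3); // modify during iteration: no ConcurrentModificationException
        while (it.hasNext()) {
            // "c" may or may not be seen; the iterator is weakly consistent.
            System.out.println(it.next());
        }
    }
}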
This is the article that helped me understand it: Why ConcurrentHashMap is better than Hashtable and just as good as a HashMap
Hashtables offer concurrent access to their entries, with a small caveat: the entire map is locked to perform any sort of operation.
While this overhead is ignorable in a web application under normal
load, under heavy load it can lead to delayed response times and
overtaxing of your server for no good reason.
This is where ConcurrentHashMaps step in. They offer all the features
of Hashtable with a performance almost as good as a HashMap.
ConcurrentHashMaps accomplish this by a very simple mechanism.
Instead of a map wide lock, the collection maintains a list of 16
locks by default, each of which is used to guard (or lock on) a single
bucket of the map. This effectively means that 16 threads can modify
the collection at a single time (as long as they’re all working on
different buckets). In fact, there is no operation performed by this
collection that locks the entire map. The concurrency level of the
collection, the number of threads that can modify it at the same time
without blocking, can be increased. However a higher number means more
overhead of maintaining this list of locks.
The "scalability issues" for Hashtable are present in exactly the same way in Collections.synchronizedMap(Map) - they use very simple synchronization, which means that only one thread can access the map at the same time.
This is not much of an issue when you have simple inserts and lookups (unless you do it extremely intensively), but becomes a big problem when you need to iterate over the entire Map, which can take a long time for a large Map - while one thread does that, all others have to wait if they want to insert or lookup anything.
The ConcurrentHashMap uses very sophisticated techniques to reduce the need for synchronization, allowing parallel read access by multiple threads without synchronization and, more importantly, providing an Iterator that requires no synchronization and even allows the Map to be modified during iteration (though it makes no guarantees about whether or not elements that were inserted during iteration will be returned).
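For contrast, this is the iteration pattern that the Collections.synchronizedMap() Javadoc requires (the keys here are arbitrary):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncMapIteration {
    public static void main(String[] args) {
        Map<String, Integer> m = Collections.synchronizedMap(new HashMap<>());
        m.put("a", 1);

        // Hold the wrapper's monitor for the whole iteration,
        // blocking all other access to the map in the meantime.
        synchronized (m) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}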
What synchronizedCollection() returns is an object all of whose methods are synchronized on this, so all concurrent operations on such a wrapper are serialized. ConcurrentHashMap is a truly concurrent container with fine-grained locking optimized to keep contention as low as possible. Have a look at the source code and you will see what is inside.
ConcurrentHashMap implements ConcurrentMap, which provides the concurrency guarantees.
Deep internally, its iterators are designed to be used by only one thread at a time, which maintains the synchronization.
This map is widely used in concurrent code.
I have used a LinkedHashMap with accessOrder set to true, allowing a maximum of 500 entries at any time, as an LRU cache for data. But due to scalability issues I want to move to some thread-safe alternative. ConcurrentHashMap seems good in that regard, but it lacks the accessOrder and removeEldestEntry(Map.Entry e) features found in LinkedHashMap. Can anyone point me to some link or help me ease the implementation?
I did something similar recently with ConcurrentHashMap<String,CacheEntry>, where CacheEntry wraps the actual item and adds cache eviction statistics: expiration time, insertion time (for FIFO/LIFO eviction), last used time (for LRU/MRU eviction), number of hits (for LFU/MFU eviction), etc. The actual eviction is synchronized and creates an ArrayList<CacheEntry> and does a Collections.sort() on it using the appropriate Comparator for the eviction strategy. Since this is expensive, each eviction then lops off the bottom 5% of the CacheEntries. I'm sure performance tuning would help though.
In your case, since you're doing FIFO, you could keep a separate ConcurrentLinkedQueue. When you add an object to the ConcurrentHashMap, do a ConcurrentLinkedQueue.add() of that object. When you want to evict an entry, do a ConcurrentLinkedQueue.poll() to remove the oldest object, then remove it from the ConcurrentHashMap as well.
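A rough sketch of that FIFO scheme (the class name is made up; note that the map update and the queue update are two separate operations, so eviction is only approximate under concurrency, a caveat raised in a later answer):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

public class FifoCache<K, V> {
    private final int maxEntries;
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<K> order = new ConcurrentLinkedQueue<>();

    public FifoCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    public void put(K key, V value) {
        if (map.put(key, value) == null) {
            order.add(key); // record insertion order for new keys only
        }
        while (map.size() > maxEntries) {
            K oldest = order.poll(); // evict the oldest entry
            if (oldest == null) break;
            map.remove(oldest);
        }
    }

    public V get(K key) {
        return map.get(key);
    }
}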
Update: Other possibilities in this area include a Java Collections synchronization wrapper and the Java 1.6 ConcurrentSkipListMap.
Have you tried using one of the many caching solutions like ehcache?
You could try using a LinkedHashMap with a ReadWriteLock. This would give you concurrent read access. (One caveat: with accessOrder set to true, even get() structurally modifies the map, so those reads must take the write lock.)
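A sketch of that approach under the caveat above (the class name and sizes are made up; MAX_ENTRIES of 500 mirrors the question):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LruCache<K, V> {
    private static final int MAX_ENTRIES = 500;

    private final Map<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_ENTRIES; // evict oldest once over capacity
        }
    };
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        // accessOrder=true means get() reorders entries, i.e. it WRITES,
        // so it must take the write lock; containsKey() could use the read lock.
        lock.writeLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void put(K key, V value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}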
This might seem old now, but at least just for my own history tracking, I'm going to add my solution here: I combined a ConcurrentHashMap that maps K to a subclass of WeakReference, a ConcurrentLinkedQueue, and an interface that defines deserialization of the value objects based on K, to run LRU caching correctly. The queue holds strong refs, and the GC will evict the values from memory when appropriate. Tracking the queue size involved an AtomicInteger, as you can't really inspect the queue to determine when to evict. The cache will handle eviction from/adding to the queue, as well as map management. If the GC evicted the value from memory, the implementation of the deserialization interface will handle retrieving the value back. I also had another implementation that involved spooling to disk/re-reading what was spooled, but that was a lot slower than the solution I posted here, as I had to synchronize spooling/reading.
You mention wanting to solve scalability problems with a "thread-safe" alternative. "Thread safety" here means that the structure is tolerant of attempts at concurrent access, in that it won't suffer corruption by concurrent use without external synchronization. However, such tolerance does not necessarily help to improve "scalability". In the simplest -- though usually misguided -- approach, you'll try to synchronize your structure internally and still leave non-atomic check-then-act operations unsafe.
LRU caches require at least some awareness of the total structure. They need something like a count of the members or the size of the members to decide when to evict, and then they need to be able to coordinate the eviction with concurrent attempts to read, add, or remove elements. Trying to reduce the synchronization necessary for concurrent access to the "main" structure fights against your eviction mechanism, and forces your eviction policy to be less precise in its guarantees.
The currently accepted answer mentions "when you want to evict an entry". Therein lies the rub. How do you know when you want to evict an entry? Which other operations do you need to pause in order to make this decision?
The moment you use another data structure along with a ConcurrentHashMap, the atomicity of operations such as adding a new item to the ConcurrentHashMap and adding it to the other data structure can't be guaranteed without additional synchronization, such as a ReadWriteLock, which will degrade performance.
Wrap the map in a Collections.synchronizedMap(). If you need to call additional methods, then synchronize on the map that you got back from this call, and invoke the original method on the original map (see the javadocs for an example). The same applies when you iterate over the keys, etc.