Is there some optimal value for ConcurrencyLevel beyond which ConcurrentHashMap's performance starts degrading?
If yes, what's that value, and what's the reason for the performance degradation? (This question originates from trying to find out what practical limitations a ConcurrentHashMap may have.)
The Javadoc offers pretty detailed guidance:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing.
The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read.
To summarize: the optimal value depends on the number of expected concurrent updates. A value within an order of magnitude of that should work well. Values outside that range can be expected to lead to performance degradation.
You have to ask yourself two questions:
1. How many CPUs do I have?
2. What percentage of the time will a useful program be accessing the same map?
The first question tells you the maximum number of threads which can access the map at once. You can have 10000 threads, but if you have only 4 cpus, at most 4 will be running at once.
The second question tells you the most any of those threads will be accessing the map AND doing something useful. You can optimise the map to do something useless (e.g. a micro-benchmark) but there is no point tuning for this IMHO. Say you have a useful program which uses the map a lot. It might be spending 90% of the time doing something else e.g. IO, accessing other maps, building keys or values, doing something with the values it gets from the map.
Say you spend 10% of the time accessing a map on a machine with 4 CPUs. This means that, on average, 0.4 threads will be accessing the map at any moment (or one thread about 40% of the time). In this case a concurrency level of 1-4 is fine.
In any case, making the concurrency level higher than the number of cpus you have is likely to be unnecessary, even for a micro-benchmark.
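The arithmetic above can be sketched as a small helper. This is only an illustration of the reasoning, not a tuning rule: the class and the `estimate` method are hypothetical names, and the 10% figure is just the example value used in this answer.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrencyLevelEstimate {
    // Hypothetical helper: expected number of threads inside the map at the same
    // moment, given the CPU count and the fraction of time spent in the map.
    static int estimate(int cpus, double fractionOfTimeInMap) {
        double expected = cpus * fractionOfTimeInMap;
        // The constructor requires a positive value, so never go below 1.
        return Math.max(1, (int) Math.ceil(expected));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int level = estimate(cpus, 0.10); // assumed: 10% of the time is spent in the map
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>(16, 0.75f, level);
        map.put("k", "v");
        System.out.println("cpus=" + cpus + ", concurrencyLevel=" + level);
    }
}
```

With 4 CPUs and 10% map time the helper returns 1, which sits inside the 1-4 range mentioned above.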
As of Java 8, ConcurrentHashMap's constructor parameter for concurrencyLevel is effectively unused, and remains primarily for backwards-compatibility. The implementation was re-written to use the first node within each hash bin as the lock for that bin, rather than a fixed number of segments/stripes as was the case in earlier versions.
In short, starting in Java 8, don't worry about setting the concurrencyLevel parameter, as long as you set a positive (non-zero, non-negative) value, per the API contract.
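The "positive value" part of the contract can be checked directly. The sketch below only exercises the documented behaviour of the three-argument constructor, which throws IllegalArgumentException for a nonpositive concurrencyLevel; the class and method names are made up for the example.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrencyLevelContract {
    // Returns true if the three-argument constructor rejects the given concurrencyLevel.
    static boolean rejects(int concurrencyLevel) {
        try {
            new ConcurrentHashMap<String, String>(16, 0.75f, concurrencyLevel);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects(0));  // true: zero is not allowed
        System.out.println(rejects(-1)); // true: negative values are not allowed
        System.out.println(rejects(1));  // false: any positive value is accepted
    }
}
```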
Related
I see how Java's AtomicInteger works internally with the CAS (compare-and-swap) operation. Basically, when multiple threads try to update the value, the JVM uses the underlying CAS mechanism to attempt the update. If the update fails, it tries again with the new value, but it never blocks.
In Java 8, Oracle introduced a new class, LongAdder, which seems to perform better than AtomicInteger under high contention. Some blog posts claim that LongAdder performs better by maintaining internal cells. Does that mean LongAdder aggregates the values internally and updates it later? Could you please help me understand how LongAdder works?
Does that mean LongAdder aggregates the values internally and updates it later?
Yes, if I understand your statement correctly.
Each Cell in a LongAdder is a variant of an AtomicLong. Having multiple such cells is a way of spreading out the contention and thus increasing throughput.
When the final result (sum) is to be retrieved, it just adds together the values of each cell.
Much of the logic around how the cells are organized, how they are allocated etc can be seen in the source: http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/f398670f3da7/src/java.base/share/classes/java/util/concurrent/atomic/Striped64.java
In particular the number of cells is bound by the number of CPUs:
/** Number of CPUS, to place bound on table size */
static final int NCPU = Runtime.getRuntime().availableProcessors();
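The idea of spreading updates over several cells and summing them on demand can be illustrated with a deliberately simplified counter. This is not how Striped64 is implemented (the real class pads its cells, grows the table lazily on contention and picks cells via a per-thread probe value); the class below, its fixed cell count and its thread-id indexing are assumptions made purely for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

final class StripedCounterSketch {
    private final AtomicLong[] cells;

    StripedCounterSketch(int cellCount) {
        cells = new AtomicLong[cellCount];
        for (int i = 0; i < cellCount; i++) {
            cells[i] = new AtomicLong();
        }
    }

    // Each thread updates "its" cell, so different threads usually hit different cells.
    void add(long x) {
        long id = Thread.currentThread().getId();
        int index = (int) Math.floorMod(id, (long) cells.length);
        cells[index].addAndGet(x);
    }

    // The total is only computed when asked for, by adding up all cells.
    long sum() {
        long total = 0;
        for (AtomicLong cell : cells) {
            total += cell.get();
        }
        return total;
    }

    // Runs the given number of threads, each adding 1 the given number of times.
    static long demo(int threads, int addsPerThread) {
        StripedCounterSketch counter = new StripedCounterSketch(4);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    counter.add(1);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.sum();
    }
}
```

Calling `StripedCounterSketch.demo(4, 1000)` returns 4000 once all threads have finished, no matter how the threads were distributed across the cells.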
The primary reason it is "faster" is its contended performance. This is important because:
Under low update contention, the two classes have similar characteristics.
You'd use a LongAdder for very frequent updates, in which atomic CAS and native calls to Unsafe would cause contention. (See source and volatile reads.) Not to mention cache misses/false sharing on multiple AtomicLongs (although I have not looked at the class layout yet, there doesn't appear to be sufficient memory padding before the actual long field).
under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption.
The implementation extends Striped64, which is a data holder for 64-bit values. The values are held in cells, which are padded (or striped), hence the name. Each operation made upon the LongAdder will modify the collection of values present in the Striped64. When contention occurs, a new cell is created and modified, so the old thread can finish concurrently with the contending one. When you need the final value, the values of all cells are simply added up.
Unfortunately, performance comes with a cost, which in this case is memory (as is often the case). The Striped64 can grow considerably if a large number of threads and updates are thrown at it.
Quote source:
Javadoc for LongAdder
AtomicLong uses CAS, which under heavy contention can lead to many wasted CPU cycles.
LongAdder, on the other hand, uses a very clever trick to reduce contention between threads when they are incrementing it.
So when we call increment(), behind the scenes LongAdder maintains an array of counters that can grow on demand.
And so, when more threads are calling increment(), the array will be longer. Each record in the array can be updated separately – reducing the contention. Due to that fact, the LongAdder is a very efficient way to increment a counter from multiple threads.
The result of the counter in the LongAdder is not available until we call the sum() method.
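A minimal usage sketch of the pattern described here: several threads call increment(), and the total is read once with sum() after they finish. The class name, method name and thread counts are arbitrary choices for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class LongAdderDemo {
    // Increments a shared LongAdder from several threads and returns the final sum.
    static long count(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    adder.increment();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // sum() adds up the base value and all cells; called after all updates are done.
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(count(8, 10_000));
    }
}
```

Note that sum() called while updates are still in flight is not an atomic snapshot, which is why this sketch waits for the pool to terminate first.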
I am developing an application where we have a large number of threads and have to add hundreds of values atomically. I am using AtomicLong, which works well, but I still need to improve the performance. Is there something that offers better performance than AtomicLong?
You can use LongAdder. LongAdder offers much better performance than AtomicLong under contention. I would suggest reading this article, where the author published benchmarking results and explained quite a few details regarding LongAdder performance. In a nutshell, LongAdder extends Striped64, which handles contention quite well by using a hash table of cells. So when two threads try to add a value, there is a good probability that they will end up updating different cells. The Cell class uses a padding strategy to reduce CPU cache contention. Moreover, if you take a look at the source code, you will find that the Cell class uses CAS.
Unsafe.compareAndSwap operations are atomic. They take a pointer to a
chunk of memory (in this case comprised of this and valueOffset which
together point to value), a compare value and a swap value. If the JVM
finds that the value of the addressed memory is equal to the compare
value, then it stores the swap value in the addressed memory and
returns true. This means that CAS operations are a fast and thread
safe way to update the value of a variable and get feedback on whether
the operation was successful or whether there was contention.
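The compare-and-swap behaviour described in this quote is what a typical retry loop builds on. The sketch below uses the public AtomicLong.compareAndSet rather than Unsafe directly; `addWithCas` is a hypothetical helper written for illustration, not code taken from the JDK.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasLoopSketch {
    // Adds delta using an explicit CAS retry loop and returns the new value.
    static long addWithCas(AtomicLong value, long delta) {
        while (true) {
            long current = value.get();   // read the current value
            long next = current + delta;  // compute the desired value
            // Succeeds only if nobody changed the value in between; otherwise retry.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong(5);
        System.out.println(addWithCas(counter, 3)); // prints 8
    }
}
```

Under heavy contention many threads fail the compareAndSet and loop again, which is where the wasted cycles come from.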
Section 15.3.2 of JCIP:
at high contention levels locking tends to outperform atomic
variables, but at more realistic contention levels atomic variables
outperform locks.
You could either try a back-off scheme to improve the performance of your atomic variables or you could switch to using fully-fledged ReentrantLocks.
I created a ConcurrentHashMap with the following values:
ConcurrentHashMap<String,String> concurrentHashMap = new ConcurrentHashMap<>(10,.9F,1);
The above means that only one thread can update the map at a given point in time. If this is the case, can I say that it will work like a HashMap with respect to concurrency, i.e. only one write operation will be performed at a given point in time?
Is my understanding correct or am I missing something here?
The concurrencyLevel is just a hint to help size internal data structures. There's no guarantee that 1 would be the actual value used. The map does not start behaving like a regular HashMap: it remains thread-safe, but it may be less efficient if you actually use it from more than one thread.
From the Javadoc:
Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention.
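As a quick check that a map built with a concurrencyLevel of 1 is still safe to update from several threads, here is a small sketch. It reuses the constructor arguments from the question; the class name, method name, key and thread count are arbitrary choices for the example.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrencyLevelOneDemo {
    // Four threads bump the same counter in a map created with concurrencyLevel = 1.
    // Assumes updatesPerThread > 0, otherwise the key is never created.
    static int countWithFourWriters(int updatesPerThread) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>(10, 0.9f, 1);
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            writers[t] = new Thread(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    map.merge("hits", 1, Integer::sum); // atomic read-modify-write
                }
            });
            writers[t].start();
        }
        for (Thread w : writers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(countWithFourWriters(1000)); // prints 4000
    }
}
```

No updates are lost, which would not be guaranteed with a plain HashMap shared between the same four threads.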
Your question is actually about the concurrencyLevel:
concurrencyLevel - the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
Basically, a ConcurrentHashMap (in implementations before Java 8) is chunked into segments. Each segment can only be modified by one thread at a time. Simply put, the more segments you have, the more concurrency you get. Yet you also end up using much more memory because each segment has its own memory overhead.
Therefore if you know that only one thread will access your map, setting the concurrencyLevel to 1 will only create 1 segment in the map, thus making it more memory-efficient.
If the value is too high, more memory will be used and some time will be used finding the right segment for every object you want to read/write in the map.
From the Javadocs of ConcurrentHashMap :
The allowed concurrency among update operations is guided by the
optional concurrencyLevel constructor argument (default 16), which is
used as a hint for internal sizing.
I do not understand the part that says "which is used as a hint for internal sizing". What does this mean? What is the best practice for setting this value, and what guarantee does it give us?
Take a look at the very next sentences in the Javadoc:
The table is internally partitioned to try to permit the indicated
number of concurrent updates without contention. Because placement
in hash tables is essentially random, the actual concurrency will
vary. Ideally, you should choose a value to accommodate as many
threads as will ever concurrently modify the table. Using a
significantly higher value than you need can waste space and time,
and a significantly lower value can lead to thread contention. But
overestimates and underestimates within an order of magnitude do
not usually have much noticeable impact. A value of one is
appropriate when it is known that only one thread will modify and
all others will only read. Also, resizing this or any other kind of
hash table is a relatively slow operation, so, when possible, it is
a good idea to provide estimates of expected table sizes in
constructors.
So in other words, a concurrencyLevel of 16 means that the ConcurrentHashMap internally creates 16 separate hashtables in which to store data. Operations that modify data in one hashtable do not require locking the other hashtables, which allows somewhat-concurrent access to the overall Map.
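The "separate hashtables" idea can be sketched with a toy map that keeps one plain HashMap per segment and locks only the segment a key falls into. This is a simplified illustration of lock striping, not the actual ConcurrentHashMap code: the class name is hypothetical, and unlike the real pre-Java 8 implementation this sketch also locks on reads to stay short.

```java
import java.util.HashMap;
import java.util.Map;

final class SegmentedMapSketch<K, V> {
    private final Map<K, V>[] segments;

    @SuppressWarnings({"unchecked", "rawtypes"})
    SegmentedMapSketch(int concurrencyLevel) {
        segments = (Map<K, V>[]) new Map[concurrencyLevel];
        for (int i = 0; i < concurrencyLevel; i++) {
            segments[i] = new HashMap<>();
        }
    }

    // Picks the segment for a key from its (slightly mixed) hash code.
    private Map<K, V> segmentFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return segments[Math.floorMod(h, segments.length)];
    }

    V put(K key, V value) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) { // only this segment is locked, the others stay available
            return segment.put(key, value);
        }
    }

    V get(Object key) {
        Map<K, V> segment = segmentFor(key);
        synchronized (segment) {
            return segment.get(key);
        }
    }

    int size() {
        int total = 0;
        for (Map<K, V> segment : segments) {
            synchronized (segment) {
                total += segment.size();
            }
        }
        return total;
    }
}
```

Two threads writing keys that land in different segments never wait for each other; two threads writing to the same segment are serialized by that segment's lock.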
You might want to try reading the source of ConcurrentHashMap.
The concurrency level is roughly the number of operations on the map that can run concurrently without contending for an internal lock. As maat b says, the ConcurrentHashMap will have N internal hashtables, so operations working on different hashtables don't require additional locking; if operations are working on the same internal hashtable, ConcurrentHashMap uses internal locking between them.