How LongAdder performs better than AtomicLong - java

I see how Java's AtomicInteger works internally with the CAS (Compare And Swap) operation. Basically, when multiple threads try to update the value, the JVM internally uses the underlying CAS mechanism to attempt the update. If the update fails, it tries again with the new value, but it never blocks.
In Java 8, Oracle introduced a new class, LongAdder, which seems to perform better than AtomicInteger under high contention. Some blog posts claim that LongAdder performs better by maintaining internal cells - does that mean LongAdder aggregates the values internally and updates them later? Could you please help me understand how LongAdder works?

does that mean LongAdder aggregates the values internally and updates them later?
Yes, if I understand your statement correctly.
Each Cell in a LongAdder is a variant of an AtomicLong. Having multiple such cells is a way of spreading out the contention and thus increasing throughput.
When the final result (sum) is to be retrieved, it just adds together the values of each cell.
Much of the logic around how the cells are organized, how they are allocated, etc. can be seen in the source: http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/f398670f3da7/src/java.base/share/classes/java/util/concurrent/atomic/Striped64.java
In particular, the number of cells is bounded by the number of CPUs:
/** Number of CPUS, to place bound on table size */
static final int NCPU = Runtime.getRuntime().availableProcessors();
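As a toy model of that design (not the JDK's code - the real Striped64 grows the cell table lazily and routes each thread to a cell via a per-thread hash probe), the striping idea can be sketched like this:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of striping: spread updates across several
// AtomicLong "cells", then fold them together on read.
class ToyAdder {
    private final AtomicLong[] cells =
            new AtomicLong[Runtime.getRuntime().availableProcessors()];

    ToyAdder() {
        for (int i = 0; i < cells.length; i++) cells[i] = new AtomicLong();
    }

    void increment() {
        // A random pick is enough to show the idea; Striped64 uses a
        // per-thread probe and reassigns it on CAS failure.
        cells[ThreadLocalRandom.current().nextInt(cells.length)].incrementAndGet();
    }

    long sum() {
        long sum = 0;
        for (AtomicLong cell : cells) sum += cell.get(); // add up every cell
        return sum;
    }
}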

The primary reason it is "faster" is its contended performance. This is important because:
Under low update contention, the two classes have similar characteristics.
You'd use a LongAdder for very frequent updates, where atomic CAS and native calls to Unsafe would cause contention (see the source and its volatile reads). Not to mention cache misses/false sharing on multiple AtomicLongs (although I have not looked at the class layout yet, there doesn't appear to be sufficient memory padding before the actual long field).
Under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption.
The implementation extends Striped64, which is a data holder for 64-bit values. The values are held in cells, which are padded (or striped), hence the name. Each operation made upon the LongAdder will modify the collection of values present in the Striped64. When contention occurs, a new cell is created and modified, so the old thread can finish concurrently with the contending one. When you need the final value, the values of the cells are simply summed up.
Unfortunately, performance comes at a cost, which in this case is memory (as it often is). The Striped64 can grow very large if a heavy load of threads and updates is being thrown at it.
Quote source:
Javadoc for LongAdder

AtomicLong uses CAS, which under heavy contention can lead to many wasted CPU cycles.
LongAdder, on the other hand, uses a very clever trick to reduce contention between the threads incrementing it.
So when we call increment(), behind the scenes LongAdder maintains an array of counters that can grow on demand.
And so, when more threads are calling increment(), the array will be longer. Each record in the array can be updated separately, reducing contention. Due to that fact, the LongAdder is a very efficient way to increment a counter from multiple threads.
The result of the counter in the LongAdder is not available until we call the sum() method.
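A minimal usage sketch of the increment()/sum() pattern described above (everything shown is part of the public java.util.concurrent.atomic.LongAdder API):

import java.util.concurrent.atomic.LongAdder;

LongAdder counter = new LongAdder();
counter.increment();          // hot path: usually touches a single cell
counter.add(5);               // arbitrary deltas are supported too
long total = counter.sum();   // folds base + all cells; not an atomic snapshot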

Related

What are LongAdders in java? [duplicate]


Multithreading summing large number of values atomically

I am developing an application where we have a large number of threads and have to add hundreds of values atomically. I am using AtomicLong, which works well, but I still need to improve the performance. Is there something that offers better performance than AtomicLong?
You can use LongAdder. LongAdder offers much better performance than AtomicLong. I would suggest reading this article, where the author published benchmarking results and explained quite a few details about LongAdder's performance. In a nutshell, LongAdder extends Striped64, which handles contention quite well by using a hash table of cells. So when two threads try to add a value, there is a good probability that they will end up putting the value in different cells. The Cell class uses a padding strategy to reduce CPU cache contention. Moreover, if you take a look at the source code, you will find that the Cell class uses CAS.
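For the scenario in the question, many threads adding values with the total read once at the end, usage looks like this (a sketch; the thread and iteration counts are made up for illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class SumDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder total = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++)
                    total.add(1); // contended updates get spread across cells
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(total.sum()); // read once, after all updates are done
    }
}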
Unsafe.compareAndSwap operations are atomic. They take a pointer to a chunk of memory (in this case comprised of this and valueOffset, which together point to value), a compare value, and a swap value. If the JVM finds that the value of the addressed memory is equal to the compare value, then it stores the swap value in the addressed memory and returns true. This means that CAS operations are a fast and thread-safe way to update the value of a variable and get feedback on whether the operation was successful or whether there was contention.
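Written against the public AtomicLong API instead of Unsafe, the retry loop that the quote describes looks roughly like this (a sketch, not the JDK's actual code):

import java.util.concurrent.atomic.AtomicLong;

AtomicLong counter = new AtomicLong();

long incrementViaCas() {
    long prev, next;
    do {
        prev = counter.get();   // read the current value
        next = prev + 1;        // compute the desired value
    } while (!counter.compareAndSet(prev, next)); // false means another thread won; retry
    return next;
}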
Section 15.3.2 of JCIP:
at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks.
You could either try a back-off scheme to improve the performance of your atomic variables or you could switch to using fully-fledged ReentrantLocks.
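A hedged sketch of one such back-off scheme, bounded spinning between failed CAS attempts (the shape of this loop is my own illustration, not from any particular library; Thread.onSpinWait() requires Java 9+):

import java.util.concurrent.atomic.AtomicLong;

AtomicLong counter = new AtomicLong();

void addWithBackoff(long delta) {
    int spins = 1;
    for (;;) {
        long prev = counter.get();
        if (counter.compareAndSet(prev, prev + delta))
            return;                            // success
        for (int i = 0; i < spins; i++)
            Thread.onSpinWait();               // hint to the CPU that we are busy-waiting
        spins = Math.min(spins << 1, 1 << 10); // exponential back-off, capped
    }
}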

What guarantee does concurrencyLevel argument of ConcurrentHashMap give us?

From the Javadocs of ConcurrentHashMap :
The allowed concurrency among update operations is guided by the
optional concurrencyLevel constructor argument (default 16), which is
used as a hint for internal sizing.
I do not understand the part that says "which is used as a hint for internal sizing." What does this mean? What is the best practice for setting this value, and what guarantee does it give us?
Take a look at the very next sentences in the Javadoc:
The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read. Also, resizing this or any other kind of hash table is a relatively slow operation, so, when possible, it is a good idea to provide estimates of expected table sizes in constructors.
So in other words, a concurrencyLevel of 16 means that the ConcurrentHashMap internally creates 16 separate hashtables in which to store data. Operations that modify data in one hashtable do not require locking the other hashtables, which allows somewhat-concurrent access to the overall Map.
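For example, the concurrencyLevel is passed via the three-argument constructor (a sketch; the sizing numbers here are made up):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sizing: ~10,000 entries expected, ~32 threads writing at once.
Map<String, Long> hits = new ConcurrentHashMap<>(
        10_000,  // initialCapacity
        0.75f,   // loadFactor
        32);     // concurrencyLevel: hint for the expected number of concurrent writers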
You might want to try reading the source of ConcurrentHashMap.
The concurrency level is roughly equal to the number of operations that can be invoked on the map concurrently without triggering the internal locking mechanism. As maat b says, ConcurrentHashMap will have N internal hashtables, so operations that work on different hashtables don't require additional locking; otherwise, if operations work on the same internal hashtable, then ConcurrentHashMap uses additional internal locking on them.

Java: ConcurrencyLevel value for ConcurrentHashMap

Is there some optimal value for ConcurrencyLevel beyond which ConcurrentHashMap's performance starts degrading?
If yes, what's that value, and what's the reason for performance degradation? (This question originates from trying to find out any practical limitations that a ConcurrentHashMap may have.)
The Javadoc offers pretty detailed guidance:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing.
The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read.
To summarize: the optimal value depends on the number of expected concurrent updates. A value within an order of magnitude of that should work well. Values outside that range can be expected to lead to performance degradation.
You have to ask yourself two questions:
how many cpus do I have?
what percentage of the time will a useful program be accessing the same map?
The first question tells you the maximum number of threads which can access the map at once. You can have 10000 threads, but if you have only 4 cpus, at most 4 will be running at once.
The second question tells you how much of the time any of those threads will be accessing the map AND doing something useful. You can optimise the map to do something useless (e.g. a micro-benchmark) but there is no point tuning for this IMHO. Say you have a useful program which uses the map a lot. It might be spending 90% of the time doing something else, e.g. IO, accessing other maps, building keys or values, doing something with the values it gets from the map.
Say you spend 10% of the time accessing a map on a machine with 4 CPUs. This means that, on average, 0.4 threads will be accessing the map (or one thread about 40% of the time). In this case a concurrency level of 1-4 is fine.
In any case, making the concurrency level higher than the number of cpus you have is likely to be unnecessary, even for a micro-benchmark.
As of Java 8, ConcurrentHashMap's constructor parameter for concurrencyLevel is effectively unused, and remains primarily for backwards-compatibility. The implementation was re-written to use the first node within each hash bin as the lock for that bin, rather than a fixed number of segments/stripes as was the case in earlier versions.
In short, starting in Java 8, don't worry about setting the concurrencyLevel parameter, as long as you set a positive value, per the API contract.

Performance ConcurrentHashmap vs HashMap

How does the performance of ConcurrentHashMap compare to HashMap, especially for the .get() operation (I'm especially interested in the case of only a few items, in the range of maybe 0-5000)?
Is there any reason not to use ConcurrentHashMap instead of HashMap?
(I know that null values aren't allowed)
Update
Just to clarify: obviously the performance in the case of actual concurrent access will suffer, but how does the performance compare in the case of no concurrent access?
I was really surprised to find this topic to be so old and yet no one has provided any tests regarding the case. Using ScalaMeter I have created tests of add, get and remove for both HashMap and ConcurrentHashMap in two scenarios:
using single thread
using as many threads as I have cores available. Note that because HashMap is not thread-safe, I simply created a separate HashMap for each thread, but used one shared ConcurrentHashMap.
Code is available on my repo.
The results are as follows:
The X axis (size) shows the number of elements written to the map(s); the Y axis (value) shows the time in milliseconds.
The summary
If you want to operate on your data as fast as possible, use all the threads available. That seems obvious, each thread has 1/nth of the full work to do.
If you choose single-threaded access, use HashMap; it is simply faster. For the add method it is even as much as 3x more efficient. Only get is faster on ConcurrentHashMap, but not by much.
When operating on ConcurrentHashMap with many threads it is similarly effective to operating on separate HashMaps for each thread. So there is no need to partition your data in different structures.
To sum up, the performance of ConcurrentHashMap is worse when you use it with a single thread, but adding more threads to do the work will definitely speed up the process.
Testing platform
AMD FX6100, 16GB Ram
Xubuntu 16.04, Oracle JDK 8 update 91, Scala 2.11.8
Thread safety is a complex question. If you want to make an object thread safe, do it consciously, and document that choice. People who use your class will thank you if it is thread safe when it simplifies their usage, but they will curse you if an object that once was thread safe becomes not so in a future version. Thread safety, while really nice, is not just for Christmas!
So now to your question:
ConcurrentHashMap (at least in Sun's current implementation) works by dividing the underlying map into a number of separate buckets. Getting an element does not require any locking per se, but it does use atomic/volatile operations, which implies a memory barrier (potentially very costly, and interfering with other possible optimisations).
Even if all the overhead of atomic operations can be eliminated by the JIT compiler in a single-threaded case, there is still the overhead of deciding which of the buckets to look in - admittedly this is a relatively quick calculation, but nevertheless, it is impossible to eliminate.
As for deciding which implementation to use, the choice is probably simple.
If this is a static field, you almost certainly want to use ConcurrentHashMap, unless testing shows this is a real performance killer. Your class has different thread safety expectations from the instances of that class.
If this is a local variable, then chances are a HashMap is sufficient - unless you know that references to the object can leak out to another thread. By coding to the Map interface (see the sketch after this list), you allow yourself to change it easily later if you discover a problem.
If this is an instance field, and the class hasn't been designed to be thread safe, then document it as not thread safe, and use a HashMap.
If you know that this instance field is the only reason the class isn't thread safe, and are willing to live with the restrictions that promising thread safety implies, then use ConcurrentHashMap, unless testing shows significant performance implications. In that case, you might consider allowing a user of the class to choose a thread safe version of the object somehow, perhaps by using a different factory method.
In either case, document the class as being thread safe (or conditionally thread safe) so people who use your class know they can use objects across multiple threads, and people who edit your class know that they must maintain thread safety in future.
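To illustrate the "code to the Map interface" advice above: only the construction site needs to change if the thread-safety requirement changes later (a sketch):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Declare against the interface...
Map<String, Integer> counts = new HashMap<>();
// ...so switching to a thread-safe implementation later is a one-line change:
// Map<String, Integer> counts = new ConcurrentHashMap<>();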
I would recommend you measure it, since (for one reason) there may be some dependence on the hashing distribution of the particular objects you're storing.
The standard hashmap provides no concurrency protection, whereas the concurrent hashmap does. Before it was available, you could wrap the hashmap to get thread-safe access, but this was coarse-grained locking and meant all concurrent access got serialised, which could really impact performance.
The concurrent hashmap uses lock striping and only locks the items affected by a particular lock. If you're running on a modern VM such as HotSpot, the VM will try to use lock biasing, coarsening and elision if possible, so you'll only pay the penalty for the locks when you actually need them.
In summary, if your map is going to be accessed by concurrent threads and you need to guarantee a consistent view of its state, use the concurrent hashmap.
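The two approaches mentioned above, wrapping versus striping, look like this side by side (a sketch using only the standard library):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Pre-ConcurrentHashMap approach: one coarse lock serialises every operation.
Map<String, String> coarse = Collections.synchronizedMap(new HashMap<>());

// Lock-striped approach: independent writes rarely block each other.
Map<String, String> striped = new ConcurrentHashMap<>();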
In the case of a 1000-element hash table, using 10 locks for the whole table saves close to half the time when 10000 threads are inserting and 10000 threads are deleting from it.
The interesting run time difference is here
Always use a concurrent data structure, except when the downside of striping (mentioned below) becomes a frequent operation. In that case you will have to acquire all the locks; I read that the best way to do this is by recursion.
Lock striping is useful when there is a way of breaking a highly contended lock into multiple locks without compromising data integrity. Whether this is possible should take some thought, and it is not always the case; the data structure is also a contributing factor to the decision. So if we use a large array for implementing a hash table, using a single lock for the entire hash table to synchronize it will lead to threads sequentially accessing the data structure. If they are accessing the same location on the hash table, then this is necessary; but what if they are accessing the two extremes of the table?
The downside of lock striping is that it is difficult to get the state of the data structure affected by striping. In the example, the size of the table, or trying to list/enumerate the whole table, may be cumbersome, since we need to acquire all of the striped locks.
What answer are you expecting here?
It is obviously going to depend on the number of reads happening at the same time as writes and how long a normal map must be "locked" on a write operation in your app (and whether you would make use of the putIfAbsent method on ConcurrentMap). Any benchmark is going to be largely meaningless.
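The putIfAbsent method mentioned above gives you an atomic check-then-insert that a plain HashMap would need external locking for; for example (a sketch):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();
// Insert a counter only if the key is not present yet - atomically.
counters.putIfAbsent("requests", new AtomicLong());
counters.get("requests").incrementAndGet();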
It's not clear what you mean. If you need thread safety, you have almost no choice: only ConcurrentHashMap. And it definitely has performance/memory penalties on a get() call: access to volatile variables, and a lock if you're unlucky.
Of course a Map without any lock system wins against one with thread-safe behavior which needs more work.
The point of the Concurrent one is to be thread safe without using synchronized so to be faster than HashTable.
The same graphics would be very interesting for ConcurrentHashMap vs Hashtable (which is synchronized).
