Lock-free guard for synchronized acquire/release

Lock-free guard for synchronized acquire/release - java

I have a shared tempfile resource that is divided into chunks of 4K (or some such value). Each 4K in the file is represented by an index starting from zero. For this shared resource, I track the 4K chunk indices in use and always return the lowest indexed 4K chunk not in use, or -1 if all are in use.
This ResourceSet class for the indices has a public acquire and release method, both of which use synchronized lock whose duration is about like that of generating 4 random numbers (expensive, cpu-wise).
Therefore as you can see from the code that follows, I use an AtomicInteger "counting semaphore" to prevent a large number of threads from entering the critical section at the same time on acquire(), returning -1 (not available right now) if there are too many threads.
Currently, I am using a constant of 100 for the tight CAS loop to try to increment the atomic integer in acquire, and a constant of 10 for the maximum number of threads to then allow into the critical section, which is long enough to create contention. My question is, what should these constants be for a moderate to highly loaded servlet engine that has several threads trying to get access to these 4K chunks?
public class ResourceSet {
// ??? what should this be
// maximum number of attempts to try to increment with CAS on acquire
private static final int CAS_MAX_ATTEMPTS = 50;
// ??? what should this be
// maximum number of threads contending for lock before returning -1 on acquire
private static final int CONTENTION_MAX = 10;
private AtomicInteger latch = new AtomicInteger(0);
... member variables to track free resources
private boolean aquireLatchForAquire ()
{
for (int i = 0; i < CAS_MAX_ATTEMPTS; i++) {
int val = latch.get();
if (val == -1)
throw new AssertionError("bug in ResourceSet"); // this means more threads than can exist on any system, so its a bug!
if (!latch.compareAndSet(val, val+1))
continue;
if (val < 0 || val >= CONTENTION_MAX) {
latch.decrementAndGet();
// added to fix BUG that comment pointed out, thanks!
return false;
}
}
return false;
}
private void aquireLatchForRelease ()
{
do {
int val = latch.get();
if (val == -1)
throw new AssertionError("bug in ResourceSet"); // this means more threads than can exist on any system, so its a bug!
if (latch.compareAndSet(val, val+1))
return;
} while (true);
}
public ResourceSet (int totalResources)
{
... initialize
}
public int acquire (ResourceTracker owned)
{
if (!aquireLatchForAquire())
return -1;
try {
synchronized (this) {
... algorithm to compute minimum free resoource or return -1 if all in use
return resourceindex;
}
} finally {
latch.decrementAndGet();
}
}
public boolean release (ResourceIter iter)
{
aquireLatchForRelease();
try {
synchronized (this) {
... iterate and release all resources
}
} finally {
latch.decrementAndGet();
}
}
}

Writting a good and performant spinlock is actually pretty complicated and requires a good understanding of memory barriers. Merely picking a constant is not going to cut it and will definitely not be portable. Google's gperftools has an example that you can look at but is probably way more complicated then what you'd need.
If you really want to reduce contention on the lock, you might want to consider using a more fine-grained and optimistic scheme. A simple one could be to divide your chunks into n groups and associate a lock with each group (also called stripping). This will help reduce contention and increase throughput but it won't help reduce latency. You could also associate an AtomicBoolean to each chunk and CAS to acquire it (retry in case of failure). Do be careful when dealing with lock-free algorithms because they tend to be tricky to get right. If you do get it right, it could considerably reduce the latency of acquiring a chunk.
Note that it's difficult to propose a more fine-grained approach without knowing what your chunk selection algorithm looks like. I also assume that you really do have a performance problem (it's been profiled and everything).
While I'm at it, your spinlock implementation is flawed. You should never spin directly on a CAS because you're spamming memory barriers. This will be incredibly slow with any serious amount of contention (related to the thundering-herd problem). A minimum would be to first check the variable for availability before your CAS (simple if on a no barrier read will do). Even better would be to not have all your threads spinning on the same value. This should avoid the associated cache-line from ping-pong-ing between your cores.
Note that I don't know what type of memory barriers are associated with atomic ops in Java so my above suggestions might not be optimal or correct.
Finally, The Art Of Multiprocessor Programming is a fun book to read to get better acquainted with all the non-sense I've been spewing in this answer.

I'm not sure if it's necessary to forge your own Lock class for this scenario. As JDK provided ReentrantLock, which also leverage CAS instruction during lock acquire. The performance should be pretty good when compared with your personal Lock class.

You can use Semaphore's tryAcquire method if you want your threads to balk on no resource available.
I for one would simply substitute your synchronized keyword with a ReentrantLock and use the tryLock() method on it. If you want to let your threads wait a bit, you can use tryLock(timeout) on the same class. Which one to choose and what value to use for timeout, needs to be determined by way of a performance test.
Creating an explicit gate seems as you seem to be doing seems unnecessary to me. I'm not saying that it can never help, but IMO it's more likely to actually hurt performance, and it's an added complication for sure. So unless you have an performance issue around here (based on a test you did) and you found that this kind of gating helps, I'd recommend to go with the simplest implementation.

Related

Thread Safety: Maximum efficiency

Good Evening,
I am trying to understand how I am using multi-threading and how to implement thread-safety in the context.
When I want to achieve maximum speed of my threads do I use:
public void addMarketOrder(MarketOrder marketOrder) {
if (marketOrder.id != this.id) {
return;
}
synchronized (this) {
ordered += marketOrder.ordered;
}
}
or just synchronized the entire method?
public synchronized void addMarketOrder(MarketOrder marketOrder) {
if (marketOrder.id != this.id) {
return;
}
ordered += marketOrder.ordered;
}

Assuming the ids do not change, the first case is preferable to the second. The first case avoids synchronization if the ids do not match. The second case synchronizes even if there isn't any write operation.

If you want an efficient multi-threaded system, then do not let threads communicate with each other. If the threads contend for the same data, you can get significant slows downs even if you use fast alternatives like volatiles or Atomics.
I'm not sure which parts of your code need to be thread-safe. If it is only a matter of atomically increasing the counter, then making the 'ordered' field an AtomicLong and calling getAndAdd would be a reasonably fast solution that doesn't make use of any locks.

What you are hinting towards is double check locking. The correct form is:
public void addMarketOrder(MarketOrder marketOrder) {
if (marketOrder.id != this.id) {
return;
}
synchronized (this) {
if (marketOrder.id != this.id) {
ordered += marketOrder.ordered;
}
}
}
Because you shouldn't assume that because the condition became true that it will continue to be true.
Also if you read the id without synchronization it should be volatile because the compiler may optimize away memory reads under certain circumstances and the value that is being held in one thread could be different from another. Also the when not volatile the compiler can change to order of operations assuming a single thread that can make your code misbehave when running with multiple threads.
Volatility rule: any variable that is accessed outside of a synchronized block by multiple threads MUST be final or volatile. Or you will get thread visibility problems in certain circumstances.
You can also synchronize the whole method without any problem. But synchronizing the whole method will make it less efficient (which could matter if you have a lot of processors and this is a highly contested method).

Java - cache coherence between successive parallel streams?

Consider the following piece of code (which isn't quite what it seems at first glance).
static class NumberContainer {
int value = 0;
void increment() {
value++;
}
int getValue() {
return value;
}
}
public static void main(String[] args) {
List<NumberContainer> list = new ArrayList<>();
int numElements = 100000;
for (int i = 0; i < numElements; i++) {
list.add(new NumberContainer());
}
int numIterations = 10000;
for (int j = 0; j < numIterations; j++) {
list.parallelStream().forEach(NumberContainer::increment);
}
list.forEach(container -> {
if (container.getValue() != numIterations) {
System.out.println("Problem!!!");
}
});
}
My question is: In order to be absolutely certain that "Problem!!!" won't be printed, does the "value" variable in the NumberContainer class need to be marked volatile?
Let me explain how I currently understand this.
In the first parallel stream, NumberContainer-123 (say) is incremented by ForkJoinWorker-1 (say). So ForkJoinWorker-1 will have an up-to-date cache of NumberContainer-123.value, which is 1. (Other fork-join workers, however, will have out-of-date caches of NumberContainer-123.value - they will store the value 0. At some point, these other workers' caches will be updated, but this doesn't happen straight away.)
The first parallel stream finishes, but the common fork-join pool worker threads aren't killed. The second parallel stream then starts, using the very same common fork-join pool worker threads.
Suppose, now, that in the second parallel stream, the task of incrementing NumberContainer-123 is assigned to ForkJoinWorker-2 (say). ForkJoinWorker-2 will have its own cached value of NumberContainer-123.value. If a long period of time has elapsed between the first and second increments of NumberContainer-123, then presumably ForkJoinWorker-2's cache of NumberContainer-123.value will be up-to-date, i.e. the value 1 will be stored, and everything is good. But what if the time elapsed between first and second increments if NumberContainer-123 is extremely short? Then perhaps ForkJoinWorker-2's cache of NumberContainer-123.value might be out of date, storing the value 0, causing the code to fail!
Is my description above correct? If so, can anyone please tell me what kind of time delay between the two incrementing operations is required to guarantee cache consistency between the threads? Or if my understanding is wrong, then can someone please tell me what mechanism causes the thread-local caches to be "flushed" in between the first parallel stream and the second parallel stream?

It should not need any delay. By the time you're out of ParallelStream's forEach, all the tasks have finished. That establishes a happens-before relation between the increment and the end of forEach. All the forEach calls are ordered by being called from the same thread, and the check, similarly, happens-after all the forEach calls.
int numIterations = 10000;
for (int j = 0; j < numIterations; j++) {
list.parallelStream().forEach(NumberContainer::increment);
// here, everything is "flushed", i.e. the ForkJoinTask is finished
}
Back to your question about the threads, the trick here is, the threads are irrelevant. The memory model hinges on the happens-before relation, and the fork-join task ensures happens-before relation between the call to forEach and the operation body, and between the operation body and the return from forEach (even if the returned value is Void)
See also Memory visibility in Fork-join
As #erickson mentions in comments,
If you can't establish correctness through happens-before relationships,
no amount of time is "enough." It's not a wall-clock timing issue; you
need to apply the Java memory model correctly.
Moreover, thinking about it in terms of "flushing" the memory is wrong, as there are many more things that can affect you. Flushing, for instance, is trivial: I have not checked, but can bet that there's just a memory barrier on the task completion; but you can get wrong data because the compiler decided to optimise non-volatile reads away (the variable is not volatile, and is not changed in this thread, so it's not going to change, so we can allocate it to a register, et voila), reorder the code in any way allowed by the happens-before relation, etc.
Most importantly, all those optimizations can and will change over time, so even if you went to the generated assembly (which may vary depending on the load pattern) and checked all the memory barriers, it does not guarantee that your code will work unless you can prove that your reads happen-after your writes, in which case Java Memory Model is on your side (assuming there's no bug in JVM).
As for the great pain, it's the very goal of ForkJoinTask to make the synchronization trivial, so enjoy. It was (it seems) done by marking the java.util.concurrent.ForkJoinTask#status volatile, but that's an implementation detail you should not care about or rely upon.

How is LongAccumulator implemented, so that it is more efficient?

I understand that the new Java (8) has introduced new sychronization tools such as LongAccumulator (under the atomic package).
In the documentation it says that the LongAccumulator is more efficient when the variable update from several threads is frequent.
I wonder how is it implemented to be more efficient?

That's a very good question, because it shows a very important characteristic of concurrent programming with shared memory. Before going into details, I have to make a step back. Take a look at the following class:
class Accumulator {
private final AtomicLong value = new AtomicLong(0);
public void accumulate(long value) {
this.value.addAndGet(value);
}
public long get() {
return this.value.get();
}
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, then the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two magnitudes slower.
You have to take a look at the memory architecture to understand what happens. Most systems nowadays have a non-uniform memory access. In particular, each core has its own L1 cache, which is typically structured into cache lines with 64 octets. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive, if it has no exclusive access yet, due to the required coordination with all other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
class Accumulator {
private final AtomicLong[] values = {
new AtomicLong(0),
new AtomicLong(0),
new AtomicLong(0),
new AtomicLong(0),
};
public void accumulate(long value) {
int index = getMagicValue();
this.values[index % values.length].addAndGet(value);
}
public long get() {
long result = 0;
for (AtomicLong value : values) {
result += value.get();
}
return result;
}
}
At first glance, this class seems to be more expensive due to the additional operations. However, it might be several times faster than the first class, because it has a higher probability, that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things:
The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose, and only use the indexes 0, 8, 16 and 24.
The number of counters have to be chosen wisely. If there are too few different counters, there are still too many cache switches. if there are too many counters, you waste space in the L1 caches.
The method getMagicValue should return a value with an affinity to the core id.
To sum up, LongAccumulator is more efficient for some use cases, because it uses redundant memory for frequently used write operations, in order to reduce the number of times, that cache lines have to be exchange between cores. On the other hand, read operations are slightly more expensive, because they have to create a consistent result.

by this
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.

In Java what is the performance of AtomicInteger compareAndSet() versus synchronized keyword?

I was implementing a FIFO queue of requests instances (preallocated request objects for speed) and started with using the "synchronized" keyword on the add method. The method was quite short (check if room in fixed size buffer, then add value to array). Using visualVM it appeared the thread was blocking more often than I liked ("monitor" to be precise). So I converted the code over to use AtomicInteger values for things such as keeping track of the current size, then using compareAndSet() in while loops (as AtomicInteger does internally for methods such as incrementAndGet()). The code now looks quite a bit longer.
What I was wondering is what is the performance overhead of using synchronized and shorter code versus longer code without the synchronized keyword (so should never block on a lock).
Here is the old get method with the synchronized keyword:
public synchronized Request get()
{
if (head == tail)
{
return null;
}
Request r = requests[head];
head = (head + 1) % requests.length;
return r;
}
Here is the new get method without the synchronized keyword:
public Request get()
{
while (true)
{
int current = size.get();
if (current <= 0)
{
return null;
}
if (size.compareAndSet(current, current - 1))
{
break;
}
}
while (true)
{
int current = head.get();
int nextHead = (current + 1) % requests.length;
if (head.compareAndSet(current, nextHead))
{
return requests[current];
}
}
}
My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc), even though the code is shorter.
Thanks!

My guess was the synchronized keyword is worse because of the risk of blocking on the lock (potentially causing thread context switches etc)
Yes, in the common case you are right. Java Concurrency in Practice discusses this in section 15.3.2:
[...] at high contention levels locking tends to outperform atomic variables, but at more realistic contention levels atomic variables outperform locks. This is because a lock reacts to contention by suspending threads, reducing CPU usage and synchronization traffic on the shared memory bus. (This is similar to how blocking producers in a producer-consumer design reduces the load on consumers and thereby lets them catch up.) On the other hand, with atomic variables, contention management is pushed back to the calling class. Like most CAS-based algorithms, AtomicPseudoRandom reacts to contention by trying again immediately, which is usually the right approach but in a high-contention environment just creates more contention.
Before we condemn AtomicPseudoRandom as poorly written or atomic variables as a poor choice compared to locks, we should realize that the level of contention in Figure 15.1 is unrealistically high: no real program does nothing but contend for a lock or atomic variable. In practice, atomics tend to scale better than locks because atomics deal more effectively with typical contention levels.
The performance reversal between locks and atomics at differing levels of contention illustrates the strengths and weaknesses of each. With low to moderate contention, atomics offer better scalability; with high contention, locks offer better contention avoidance. (CAS-based algorithms also outperform lock-based ones on single-CPU systems, since a CAS always succeeds on a single-CPU system except in the unlikely case that a thread is preempted in the middle of the read-modify-write operation.)
(On the figures referred to by the text, Figure 15.1 shows that the performance of AtomicInteger and ReentrantLock is more or less equal when contention is high, while Figure 15.2 shows that under moderate contention the former outperforms the latter by a factor of 2-3.)
Update: on nonblocking algorithms
As others have noted, nonblocking algorithms, although potentially faster, are more complex, thus more difficult to get right. A hint from section 15.4 of JCiA:
Good nonblocking algorithms are known for many common data structures, including stacks, queues, priority queues, and hash tables, though designing new ones is a task best left to experts.
Nonblocking algorithms are considerably more complicated than their lock-based equivalents. The key to creating nonblocking algorithms is figuring out how to limit the scope of atomic changes to a single variable while maintaining data consistency. In linked collection classes such as queues, you can sometimes get away with expressing state transformations as changes to individual links and using an AtomicReference to represent each link that must be updated atomically.

I wonder if jvm already does a few spin before really suspending the thread. It anticipate that well written critical sections, like yours, are very short and complete almost immediately. Therefore it should optimistically busy-wait for, I don't know, dozens of loops, before giving up and suspending the thread. If that's the case, it should behave the same as your 2nd version.
what a profiler shows might be very different from what's realy happending in a jvm at full speed, with all kinds of crazy optimizations. it's better to measure and compare throughputs without profiler.

Before doing this kind of synchronization optimizations, you really need a profiler to tell you that it's absolutely necessary.
Yes, synchronized under some conditions may be slower than atomic operation, but compare your original and replacement methods. The former is really clear and easy to maintain, the latter, well it's definitely more complex. Because of this there may be very subtle concurrency bugs, that you will not find during initial testing. I already see one problem, size and head can really get out of sync, because, though each of these operations is atomic, the combination is not, and sometimes this may lead to an inconsistent state.
So, my advise:
Start simple
Profile
If performance is good enough, leave simple implementation as is
If you need performance improvement, then start to get clever (possibly using more specialized lock at first), and TEST, TEST, TEST

Here's code for a busy wait lock.
public class BusyWaitLock
{
private static final boolean LOCK_VALUE = true;
private static final boolean UNLOCK_VALUE = false;
private final static Logger log = LoggerFactory.getLogger(BusyWaitLock.class);
/**
* #author Rod Moten
*
*/
public class BusyWaitLockException extends RuntimeException
{
/**
*
*/
private static final long serialVersionUID = 1L;
/**
* #param message
*/
public BusyWaitLockException(String message)
{
super(message);
}
}
private AtomicBoolean lock = new AtomicBoolean(UNLOCK_VALUE);
private final long maximumWaitTime ;
/**
* Create a busy wait lock with that uses the default wait time of two minutes.
*/
public BusyWaitLock()
{
this(1000 * 60 * 2); // default is two minutes)
}
/**
* Create a busy wait lock with that uses the given value as the maximum wait time.
* #param maximumWaitTime - a positive value that represents the maximum number of milliseconds that a thread will busy wait.
*/
public BusyWaitLock(long maximumWaitTime)
{
if (maximumWaitTime < 1)
throw new IllegalArgumentException (" Max wait time of " + maximumWaitTime + " is too low. It must be at least 1 millisecond.");
this.maximumWaitTime = maximumWaitTime;
}
/**
*
*/
public void lock ()
{
long startTime = System.currentTimeMillis();
long lastLogTime = startTime;
int logMessageCount = 0;
while (lock.compareAndSet(UNLOCK_VALUE, LOCK_VALUE)) {
long waitTime = System.currentTimeMillis() - startTime;
if (waitTime - lastLogTime > 5000) {
log.debug("Waiting for lock. Log message # {}", logMessageCount++);
lastLogTime = waitTime;
}
if (waitTime > maximumWaitTime) {
log.warn("Wait time of {} exceed maximum wait time of {}", waitTime, maximumWaitTime);
throw new BusyWaitLockException ("Exceeded maximum wait time of " + maximumWaitTime + " ms.");
}
}
}
public void unlock ()
{
lock.set(UNLOCK_VALUE);
}
}

java concurrency: many writers, one reader

I need to gather some statistics in my software and i am trying to make it fast and correct, which is not easy (for me!)
first my code so far with two classes, a StatsService and a StatsHarvester
public class StatsService
{
private Map<String, Long> stats = new HashMap<String, Long>(1000);
public void notify ( String key )
{
Long value = 1l;
synchronized (stats)
{
if (stats.containsKey(key))
{
value = stats.get(key) + 1;
}
stats.put(key, value);
}
}
public Map<String, Long> getStats ( )
{
Map<String, Long> copy;
synchronized (stats)
{
copy = new HashMap<String, Long>(stats);
stats.clear();
}
return copy;
}
}
this is my second class, a harvester which collects the stats from time to time and writes them to a database.
public class StatsHarvester implements Runnable
{
private StatsService statsService;
private Thread t;
public void init ( )
{
t = new Thread(this);
t.start();
}
public synchronized void run ( )
{
while (true)
{
try
{
wait(5 * 60 * 1000); // 5 minutes
collectAndSave();
}
catch (InterruptedException e)
{
e.printStackTrace();
}
}
}
private void collectAndSave ( )
{
Map<String, Long> stats = statsService.getStats();
// do something like:
// saveRecords(stats);
}
}
At runtime it will have about 30 concurrent running threads each calling notify(key) about 100 times. Only one StatsHarvester is calling statsService.getStats()
So i have many writers and only one reader. it would be nice to have accurate stats but i don't care if some records are lost on high concurrency.
The reader should run every 5 Minutes or whatever is reasonable.
Writing should be as fast as possible. Reading should be fast but if it locks for about 300ms every 5 minutes, its fine.
I've read many docs (Java concurrency in practice, effective java and so on), but i have the strong feeling that i need your advice to get it right.
I hope i stated my problem clear and short enough to get valuable help.
EDIT
Thanks to all for your detailed and helpful answers. As i expected there is more than one way to do it.
I tested most of your proposals (those i understood) and uploaded a test project to google code for further reference (maven project)
http://code.google.com/p/javastats/
I have tested different implementations of my StatsService
HashMapStatsService (HMSS)
ConcurrentHashMapStatsService (CHMSS)
LinkedQueueStatsService (LQSS)
GoogleStatsService (GSS)
ExecutorConcurrentHashMapStatsService (ECHMSS)
ExecutorHashMapStatsService (EHMSS)
and i tested them with x number of Threads each calling notify y times, results are in ms
10,100 10,1000 10,5000 50,100 50,1000 50,5000 100,100 100,1000 100,5000
GSS 1 5 17 7 21 117 7 37 254 Summe: 466
ECHMSS 1 6 21 5 32 132 8 54 249 Summe: 508
HMSS 1 8 45 8 52 233 11 103 449 Summe: 910
EHMSS 1 5 24 7 31 113 8 67 235 Summe: 491
CHMSS 1 2 9 3 11 40 7 26 72 Summe: 171
LQSS 0 3 11 3 16 56 6 27 144 Summe: 266
At this moment i think i will use ConcurrentHashMap, as it offers good performance while it is quite easy to understand.
Thanks for all your input!
Janning

As jack was eluding to you can use the java.util.concurrent library which includes a ConcurrentHashMap and AtomicLong. You can put the AtomicLong in if absent else, you can increment the value. Since AtomicLong is thread safe you will be able to increment the variable without worry about a concurrency issue.
public void notify(String key) {
AtomicLong value = stats.get(key);
if (value == null) {
value = stats.putIfAbsent(key, new AtomicLong(1));
}
if (value != null) {
value.incrementAndGet();
}
}
This should be both fast and thread safe
Edit: Refactored sligthly so there is only at most two lookups.

Why don't you use java.util.concurrent.ConcurrentHashMap<K, V>? It handles everything internally avoiding useless locks on the map and saving you a lot of work: you won't have to care about synchronizations on get and put..
From the documentation:
A hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates. This class obeys the same functional specification as Hashtable, and includes versions of methods corresponding to each method of Hashtable. However, even though all operations are thread-safe, retrieval operations do not entail locking, and there is not any support for locking the entire table in a way that prevents all access.
You can specify its concurrency level:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing. The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read. Also, resizing this or any other kind of hash table is a relatively slow operation, so, when possible, it is a good idea to provide estimates of expected table sizes in constructors.
As suggested in comments read carefully the documentation of ConcurrentHashMap, especially when it states about atomic or not atomic operations.
To have the guarantee of atomicity you should consider which operations are atomic, from ConcurrentMap interface you will know that:
V putIfAbsent(K key, V value)
V replace(K key, V value)
boolean replace(K key,V oldValue, V newValue)
boolean remove(Object key, Object value)
can be used safely.

I would suggest taking a look at Java's util.concurrent library. I think you can implement this solution a lot cleaner. I don't think you need a map here at all. I would recommend implementing this using the ConcurrentLinkedQueue. Each 'producer' can freely write to this queue without worrying about others. It can put an object on the queue with the data for its statistics.
The harvester can consume the queue continually pulling data off and processsing it. It can then store it however it needs.

Chris Dail's answer looks like a good approach.
Another alternative would be to use a concurrent Multiset. There is one in the Google Collections library. You could use this as follows:
private Multiset<String> stats = ConcurrentHashMultiset.create();
public void notify ( String key )
{
stats.add(key, 1);
}
Looking at the source, this is implemented using a ConcurrentHashMap and using putIfAbsent and the three-argument version of replace to detect concurrent modifications and retry.

A different approach to the problem is to exploit the (trivial) thread safety via thread confinement. Basically create a single background thread that takes care of both reading and writing. It has a pretty good characteristics in terms of scalability and simplicity.
The idea is that instead of all the threads trying to update the data directly, they produce an "update" task for the background thread to process. The same thread can also do the read task, assuming some lags in processing updates is tolerable.
This design is pretty nice because the threads will no longer have to compete for a lock to update data, and since the map is confined to a single thread you can simply use a plain HashMap to do get/put, etc. In terms of implementation, it would mean creating a single threaded executor, and submitting write tasks which may also perform the optional "collectAndSave" operation.
A sketch of code may look like the following:
public class StatsService {
private ExecutorService executor = Executors.newSingleThreadExecutor();
private final Map<String,Long> stats = new HashMap<String,Long>();
public void notify(final String key) {
Runnable r = new Runnable() {
public void run() {
Long value = stats.get(key);
if (value == null) {
value = 1L;
} else {
value++;
}
stats.put(key, value);
// do the optional collectAndSave periodically
if (timeToDoCollectAndSave()) {
collectAndSave();
}
}
};
executor.execute(r);
}
}
There is a BlockingQueue associated with an executor, and each thread that produces a task for the StatsService uses the BlockingQueue. The key point is this: the locking duration for this operation should be much shorter than the locking duration in the original code, so the contention should be much less. Overall it should result in a much better throughput and latency.
Another benefit is that since only one thread reads and writes to the map, plain HashMap and primitive long type can be used (no ConcurrentHashMap or atomic types involved). This also simplifies the code that actually processes it a great deal.
Hope it helps.

Have you looked into ScheduledThreadPoolExecutor? You could use that to schedule your writers, which could all write to a concurrent collection, such as the ConcurrentLinkedQueue mentioned by #Chris Dail. You can have a separately schedule job to read from the Queue as necessary, and the Java SDK should handle pretty much all your concurrency concerns, no manual locking needed.

If we ignore the harvesting part and focus on the writing, the main bottleneck of the program is that the stats are locked at a very coarse level of granularity. If two threads want to update different keys, they must wait.
If you know the set of keys in advance, and can preinitialize the map so that by the time an update thread arrives the key is guaranteed to exist, you would be able to do locking on the accumulator variable instead of the whole map, or use a thread-safe accumulator object.
Instead of implementing this yourself, there are map implementations that are designed specifically for concurrency and do this more fine-grained locking for you.
One caveat though are the stats, since you would need to get locks on all the accumulators at roughly the same time. If you use an existing concurrency-friendly map, there might be a construct for getting a snapshot.

Another alternative for implement both methods using ReentranReadWriteLock. This implementation protects against race conditions at getStats method, if you need to clear the counters. Also it removes the mutable AtomicLong from the getStats an uses an immutable Long.
public class StatsService {
private final Map<String, AtomicLong> stats = new HashMap<String, AtomicLong>(1000);
private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
private final Lock r = rwl.readLock();
private final Lock w = rwl.writeLock();
public void notify(final String key) {
r.lock();
AtomicLong count = stats.get(key);
if (count == null) {
r.unlock();
w.lock();
count = stats.get(key);
if(count == null) {
count = new AtomicLong();
stats.put(key, count);
}
r.lock();
w.unlock();
}
count.incrementAndGet();
r.unlock();
}
public Map<String, Long> getStats() {
w.lock();
Map<String, Long> copy = new HashMap<String, Long>();
for(Entry<String,AtomicLong> entry : stats.entrySet() ){
copy.put(entry.getKey(), entry.getValue().longValue());
}
stats.clear();
w.unlock();
return copy;
}
}
I hope this helps, any comments are welcome!

Here is how to do it with minimal impact on the performance of the threads being measured. This is the fastest solution possible in Java, without resorting to special hardware registers for performance counting.
Have each thread output its stats independently of the others, that is with no synchronization, to some stats object. Make the field containing the count volatile, so it is memory fenced:
class Stats
{
public volatile long count;
}
class SomeRunnable implements Runnable
{
public void run()
{
doStuff();
stats.count++;
}
}
Have another thread, that holds a reference to all the Stats objects, periodically go around them all and add up the counts across all threads:
public long accumulateStats()
{
long count = previousCount;
for (Stats stat : allStats)
{
count += stat.count;
}
long resultDelta = count - previousCount;
previousCount = count;
return resultDelta;
}
This gatherer thread also needs a sleep() (or some other throttle) added to it. It can periodically output counts/sec to the console for example, to give you a "live" view of how your application is performing.
This avoids the synchronization overhead about as much as you can.
The other trick to consider is padding the Stats objects to 128 (or 256 bytes on SandyBridge or later), so as to keep the different threads counts on different cache lines, or there will be caching contention on the CPU.
When only one thread reads and one writes, you do not need locks or atomics, a volatile is sufficient. There will still be some thread contention, when the stats reader thread interacts with the CPU cache line of the thread being measured. This cannot be avoided, but it is the way to do it with minimal impact on the running thread; read the stats maybe once a second or less.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.