Synchronized code performs faster than unsynchronized one - java

I came up with this stunning result which I absolutely cannot explain:
I have two methods, shortened here to:
private static final ConcurrentHashMap<Double, Boolean> mapBoolean =
        new ConcurrentHashMap<Double, Boolean>();
private static final ConcurrentHashMap<Double, LinkedBlockingQueue<Runnable>> map =
        new ConcurrentHashMap<Double, LinkedBlockingQueue<Runnable>>();

protected static <T> Future<T> execute(final Double id, Callable<T> call) {
    // id is the ID number of each thread
    synchronized (id) {
        mapBoolean.get(id); // then do something with the result
        map.get(id);        // then do something with the result
    }
    // ... rest of the method omitted
}

protected static <T> Future<T> executeLoosely(final Double id, Callable<T> call) {
    mapBoolean.get(id); // then do something with the result
    map.get(id);        // then do something with the result
    // ... rest of the method omitted
}
On profiling with over 500 threads, each thread calling each of the above methods 400 times, I found that execute(..) performs at least 500 times better than executeLoosely(..), which is weird because executeLoosely is not synchronized and hence more threads can run the code simultaneously.
Any reasons?

The overhead of using 500 threads on a machine which I assume doesn't have 500 cores, to run tasks which take about 100-1000x as long as a Map lookup, to execute code which the JVM could detect doesn't do anything, is likely to produce a random outcome. ;)
Another problem you could have is that a test which is fastest when performed with one thread can benefit from using synchronized, because synchronization biases access to one thread; i.e. it turns your multi-threaded test back into a single-threaded one, which was the fastest in the first place.
You should compare the timings you get with a single thread doing a loop. If this is faster (which I believe it would be) then it's not a useful multi-threaded test.
My guess is that you are running the synchronized code after the unsynchronized code, i.e. after the JVM has warmed up a little. Swap the order in which you perform these tests, run them many times, and you will get different results.
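As a concrete illustration of that advice, here is a hedged sketch of a timing harness (not from the question) that warms both methods up first and then times them back to back; the calls reference the execute/executeLoosely methods above, and the iteration counts are arbitrary:

// Hypothetical harness: warm up both paths, then time them in the same run so
// neither variant systematically benefits from running after JIT warm-up.
static void benchmark() throws Exception {
    for (int warmup = 0; warmup < 10_000; warmup++) {
        execute(1.0, () -> null);
        executeLoosely(1.0, () -> null);
    }
    for (int round = 0; round < 5; round++) {
        long t1 = System.nanoTime();
        for (int i = 0; i < 400; i++) execute(1.0, () -> null);
        long t2 = System.nanoTime();
        for (int i = 0; i < 400; i++) executeLoosely(1.0, () -> null);
        long t3 = System.nanoTime();
        System.out.println("synchronized: " + (t2 - t1) + " ns, loose: " + (t3 - t2) + " ns");
    }
}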

In the non-synchronized scenario:
1) wait to acquire the lock on a segment of the map, lock, perform the operation on the map, unlock; then wait to acquire the lock on a segment of the other map, lock, perform the operation on the other map, unlock.
The segment-level locking is performed only in cases of concurrent writes to the segment, which doesn't appear to be the case in your example.
In the synchronized scenario:
1) wait for the lock, perform both operations, unlock.
Could the time taken for context switching have an impact? How many cores does the machine running the test have?
How are the maps structured? The same sort of keys?

Race condition with Java executor service

Issue: iterating through files in a directory and scanning them for findings. An ExecutorService is used to create a thread pool with a fixed number of threads, and the submit method is invoked like this:
final List<Future<List<ObjectWithResults>>> futures = Files.walk(baseDirObj)
        .map(baseDirObj::relativize)
        .filter(pathMatcher::matches)
        .filter(filePathObj -> isScannableFile(baseDirObj, filePathObj))
        .map(filePathObj -> executorService.submit(() -> scanFileMethod(baseDirObj, filePathObj, resultMetricsObj, countDownLatchObj)))
        .collect(ImmutableList.toImmutableList());
The scanFile method calls 3 concurrent scans that each return a list of results. These results are added using:
resultsListObj.addAll(scanMethod1);
resultsListObj.addAll(scanMethod2);
resultsListObj.addAll(scanMethod3);
followed by:
countDownLatch.countDown()
In the method that calls executorService.submit() when iteratively walking through files, I call:
boolean completed = countDownLatch.await(200, TimeUnit.MILLISECONDS);
if (completed)
    executorService.shutdown();
I made the static members used in an unsynchronized context volatile so they are read from main memory rather than from a thread's cached copy. Initially there were 5 to 10% failures (like 22 out of 473); this brought it down to less than 1%.
I changed to thread-safe data structures, like ConcurrentHashMap, CopyOnWriteArrayList, etc.
The elements added to these thread-safe lists, maps, etc. are bound to variables declared final, which means they should ideally be thread-safe.
I introduced a countdown-latch mechanism to decrement the thread count and wait for a bit before calling the executor service's shutdown method.
I also added an if (!future.isDone()) check, which returns true, meaning some future tasks are taking longer; in these cases I used the overloaded flavor of future.get with a timeout to wait longer, but I still get 2-5 failures in 1000 iterations.
I want to know whether declaring the variables that hold elements added to thread-safe data structures as final or volatile is better. I have read a lot about them, but it is still unclear.
Result:
For test iterations greater than 500, I always see 0.4 to 0.7% failures.
Note: if I synchronize the main scanFile() method, it works without a single failure, but that negates the multi-threaded asynchronous performance benefit and takes 3 times longer.
What I tried:
Added a countdown-latch mechanism.
Declared the variables holding values added to thread-safe lists and maps volatile or final.
Expected 0 failures after using thread-safe data structures like ConcurrentHashMap and CopyOnWriteArrayList, but I still get 1-3 failures every 1000 runs.
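Not an answer from the thread, but a hedged sketch of one way to avoid the fixed 200 ms wait: since the snippet already collects the Futures, the caller can block on each of them and only shut the executor down once every scan has finished (the names mirror the question's code):

// Wait for every submitted scan instead of relying on a fixed-timeout latch.
for (Future<List<ObjectWithResults>> future : futures) {
    try {
        List<ObjectWithResults> results = future.get(); // blocks until this scan finishes
        // aggregate 'results' here
    } catch (ExecutionException e) {
        // a scan threw an exception; log it and decide whether to fail the run
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}
executorService.shutdown(); // safe: all submitted work has completed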

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads, my throughput is much lower (75% lower) than single-threaded, for obvious reasons (e.g. contended CAS).
Use case:
A counter variable which
is updated by multiple threads concurrently
has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
So I tried to replace AtomicLong with a LongAdder, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).
However I'm not sure I correctly understand the way LongAdder works.
The JavaDoc says:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
and for sum()
Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.
What is meant by fine-grained synchronization control ...
From looking at this so question and the source of AtomicLong and Striped64, I think I understand that if the update on an AtomicLong is blocked because of a CAS instruction issued by another thread, the update is stored thread-local and accumulated later to get some eventual consistency. So without further synchronization and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:
private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)
If this is possible, LongAdder is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".
And then with my write/read^n pattern I probably can't do better than AtomicLong?
LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.
In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:
the incrementAndGet in LongAdder is not atomic
... but I don't see incrementAndGet at all in LongAdder. Where are you looking?
Your idea of:
usage in a thread will consist of a write with an immediate read afterwards. Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value. It is not required that each retrieved counter value is increasing in the same order as the different threads (writers) increment the value.
Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.
So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().
Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transitions the object through a specific series of values, and while there is no guarantee about which specific value a thread will see, all observed values come from the true set of transition values.
For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respectively. The final value will always be 4, and only two transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred, and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3, and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all; they may all see 0 or 4).
LongAdder doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a non-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values: you should never read a "2", for example.
Now that's a bit of a subtle concept, and the Javadoc doesn't get it across well. It goes with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions), since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongAdder.sum() have pretty much the same guarantees.
Something Useful
OK, so I'll give you something that might be useful. You can still implement what you want efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order in which they were read.
Re-purpose the LongAdder Idea
You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID [2], and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity and a large cost for reads.
Now, the way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value you could use the same code, except with the top or bottom N bits [3] set uniquely per counter.
Now the write can return the prior value, like getAndIncrement, and it will be unique because the fixed bits keep it unique among all the counters in that object.
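A hedged sketch of that idea (not the actual Striped64 code): a small fixed array of counters where each cell stamps its own index into the low bits, so values produced by different cells can never collide. The cell choice by thread ID is a crude stand-in for the real cell-assignment logic, and all names are hypothetical:

import java.util.concurrent.atomic.AtomicLongArray;

final class StripedUniqueIds {
    private static final int N = 16; // number of cells, a power of two
    private static final AtomicLongArray cells = new AtomicLongArray(N);

    static long nextId() {
        int cell = (int) (Thread.currentThread().getId() & (N - 1)); // crude cell choice
        long local = cells.getAndIncrement(cell); // contention only within one cell
        return (local << 4) | cell;               // low lg2(16) = 4 bits identify the cell
    }
}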
Thread-local Counters
A very fast and simple way is to use a unique value per thread plus a thread-local counter. When the thread-local is initialized, it gets a unique ID from a shared counter (only once per thread), and you then combine that ID with the thread-local counter; for example, the bottom 24 bits for the ID and the top 40 bits for the local counter [1]. This should be very fast and, more importantly, have essentially zero contention.
The downside is that the counter values won't have any specific relationship across threads (although they are still strictly increasing within a thread). For example, a thread which has only recently requested a counter value may get a much smaller one than values handed out long ago on other threads. You haven't described how you'll use these, so I don't know if that is a problem.
Also, you don't have a single place from which to read the "total" number of counters allocated; you have to examine all the local counters to get that. This is doable if your application requires it (and it has some of the same caveats as the LongAdder.sum() function).
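A minimal sketch of that scheme, using the 24/40-bit split mentioned above; all of the names here are hypothetical:

import java.util.concurrent.atomic.AtomicInteger;

final class ThreadLocalUniqueIds {
    private static final AtomicInteger nextThreadId = new AtomicInteger();

    // state[0] = this thread's unique ID (assigned once), state[1] = its local counter
    private static final ThreadLocal<long[]> state = ThreadLocal.withInitial(
            () -> new long[] { nextThreadId.getAndIncrement(), 0L });

    static long nextId() {
        long[] s = state.get();
        long threadId = s[0]; // assumed to fit in the bottom 24 bits
        long local = s[1]++;  // strictly increasing within this thread
        return (local << 24) | threadId;
    }
}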
A different solution, if you want the numbers to be "generally increasing with time" across threads, and you know that every thread requests counter values reasonably frequently, is to use a single global counter from which threads request a local "allocation" of a batch of IDs, out of which they then hand out individual IDs in a thread-local manner. For example, threads may request 10 IDs at a time, so that three threads would be allocated the ranges 0-9, 10-19, and 20-29. They then allocate out of that range until it is exhausted, at which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-locally.
This approach keeps the IDs roughly in increasing order over time, and each thread's IDs are strictly increasing as well. It doesn't offer any strict guarantees though: a thread that is allocated the range 0-9 could very well sleep for hours after using 0 and then use 1 when the counters on other threads are much higher. With a batch size of 10 it reduces contention by a factor of 10.
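And a hedged sketch of the chunked allocation, assuming a batch size of 10 as in the example; again the names are hypothetical:

import java.util.concurrent.atomic.AtomicLong;

final class ChunkedIds {
    private static final int CHUNK = 10;
    private static final AtomicLong global = new AtomicLong();

    // range[0] = next ID to hand out, range[1] = end of the current chunk (exclusive)
    private static final ThreadLocal<long[]> range =
            ThreadLocal.withInitial(() -> new long[] { 0L, 0L });

    static long nextId() {
        long[] r = range.get();
        if (r[0] == r[1]) {                 // first use, or chunk exhausted
            r[0] = global.getAndAdd(CHUNK); // one contended operation per CHUNK IDs
            r[1] = r[0] + CHUNK;
        }
        return r[0]++;
    }
}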
There are a variety of other approaches you could use, and most of them trade off contention reduction against the "accuracy" of the counter assignment relative to real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to real time (assuming the OS is synchronizing the counters).
[1] The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits for the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).
[2] The optimal choice would be the "CPU ID", but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK), so the thread ID is used as a proxy.
[3] Where N is lg2(number of counters).
There's a subtle difference between the two implementations.
An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.
A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.
LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case I wrote the following, which confirmed that around 1 in 10 sums was repeated (which renders LongAdder unusable for your use case).
public static void main(String[] args) throws Exception
{
    LongAdder adder = new LongAdder();
    ExecutorService executor = Executors.newFixedThreadPool(10);
    Map<Long, Integer> count = new ConcurrentHashMap<>();
    for (int i = 0; i < 10; i++)
    {
        executor.execute(() -> {
            for (int j = 0; j < 1000000; j++)
            {
                adder.add(1);
                count.merge(adder.longValue(), 1, Integer::sum);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS);
    count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
}

Java - Synchronized methods causes program to slow down massively

I'm trying to learn about threads and synchronization. I made this test program:
public class Test {
    static List<Thread> al = new ArrayList<>();

    public static void main(String[] args) throws IOException, InterruptedException {
        long startTime = System.currentTimeMillis();
        al.add(new Thread(() -> fib1(47)));
        al.add(new Thread(() -> fib2(47)));
        for (Thread t : al)
            t.start();
        for (Thread t : al)
            t.join();
        long totalTime = System.currentTimeMillis() - startTime;
        System.out.println(totalTime);
    }

    public static synchronized int fib1(int x) {
        return x <= 2 ? 1 : fib1(x - 2) + fib1(x - 1);
    }

    public static synchronized int fib2(int x) {
        return x <= 2 ? 1 : fib2(x - 2) + fib2(x - 1);
    }
}
This program takes around 273 seconds to finish, but if I remove both of the synchronized it runs in 7 seconds instead. What causes this massive difference?
EDIT:
I'm aware that I'm using a terribly slow algorithm for calculating Fibonacci numbers. And I'm also aware that the threads don't share resources and thus the methods don't need to be synchronized. However, this is just a test program where I'm trying to figure out how synchronized works, and I chose a slow algorithm on purpose so I could measure the time taken in milliseconds.
Your program does not get stuck - it's just terribly slow.
This is due to two reasons:
1. Algorithm Complexity
As others and you yourself have mentioned, the way you compute the Fibonacci numbers is really slow because it computes the same values over and over again. Using a smaller input will bring the runtime down to a reasonable value. But this is not what your question is about.
2. Synchronized
This slows down your program in 2 ways:
First of all, making the methods synchronized is not necessary since they do not modify anything outside of the method itself. In fact it prevents both threads from running at the same time: because the methods are static, they synchronize on the class, so no two threads can be in either of them at the same time.
So your code is effectively using only one thread, not two.
Also, synchronized adds a significant overhead to the methods, since it requires acquiring a lock when entering the method, or at least checking whether the current thread already holds the lock.
These operations are quite expensive and they have to be performed every single time one of the methods is entered. Since, due to the recursion, this happens a lot, it has an extreme impact on the program's performance.
Interestingly the performance is much better when you run it with just a single thread - even with the methods being synchronized.
The reason is the runtime optimizations done by the JVM.
If you are using just one thread, the JVM can optimize the synchronized checks away since there cannot be a conflict. This reduces the runtime significantly, though not quite to the value it would have without synchronized, due to starting with 'cold code' and some remaining runtime checks.
When running with 2 threads on the other hand, the JVM cannot do this optimization, therefore leaving the expensive synchronized operations that cause the code to be so terribly slow.
Btw: fib1 and fib2 are identical, delete one of them
When you put static synchronized on a method that means that, in order for a thread to execute that method, it first has to acquire the lock for the class (which here is Test). The two static fib methods use the same lock. One thread gets the lock, executes the fib method, and releases the lock, then the other thread gets to execute the method. Which thread gets the lock first is up to the OS.
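For reference, a static synchronized method is equivalent to wrapping the method body in synchronized (Test.class), which is why both fib methods end up contending for the same lock (fib1Explicit below is just an illustrative name):

// Both forms acquire the monitor of Test.class.
public static synchronized int fib1(int x) {
    return x <= 2 ? 1 : fib1(x - 2) + fib1(x - 1);
}

public static int fib1Explicit(int x) {
    synchronized (Test.class) {
        return x <= 2 ? 1 : fib1Explicit(x - 2) + fib1Explicit(x - 1);
    }
}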
It was already mentioned that the locks are re-entrant and there's no problem with calling a synchronized method recursively. The thread holds the lock from the time it first calls the fib method; that call doesn't complete until all the recursive calls have completed, so the method runs to completion before the thread releases the lock.
The main thread isn't doing anything but waiting, and only one of the threads calling a fib method can run at a time. It does make sense that removing the synchronized modifier would speed up things, without locking the two threads can run concurrently, possibly using different processors.
The methods do not modify any shared state so there's no reason to synchronize them. Even if they did need to be synchronized there would still be no reason to have two separate fib methods here, because in any case invoking either the fib1 or fib2 method requires acquiring the same lock.
Using synchronized without static means that the object instance, not the class, is used as the lock. The reason all the synchronized methods use the same lock is that the point is to protect shared state: an object might have various methods that modify its internal state, and to protect that state from concurrent modification no more than one thread should be executing any one of these methods at a time.
Your program is not deadlocked, and it also isn't appreciably slower because of unnecessary synchronization. Your program appears "stuck" because of the branching factor of your recursive function.
Branching Factor of Recursion
When N >= 4, you recurse twice. In other words, on average, your recursion has a branching factor of two, meaning if you are computing the N-th Fibonacci number recursively, you will call your function about 2^N times. 2^47 is a HUGE number (like, in the hundreds of trillions). As others have suggested, you can cut this number WAY down by saving intermediate results and returning them instead of recomputing them.
More on synchronization
Acquiring locks is expensive. However, in Java, if a thread has a lock and re-enters the same synchronized block that it already owns the lock for, it doesn't have to reacquire the lock. Since each thread already owns the respective lock for each function they enter, they only have to acquire one lock apiece for the duration of your program. The cost of acquiring one lock is weensy compared to recursing hundreds of trillions of times :)
@MartinS is correct that synchronized is not necessary here because you have no shared state. That is, there is no data that you are trying to prevent being accessed concurrently by multiple threads.
However, you are slowing your program down by the addition of the synchronized call. My guess is that without synchronized, you should see two cores spinning at 100% for however long it takes to compute this method. When you add synchronized, whichever thread grabs the lock first gets to spin at 100%. The other one sits there waiting for the lock. When the first thread finishes, the second one gets to go.
You can test this by timing your program (start with smaller values to keep it to a reasonable time). The program should run in approximately half the time without synchronized than it does with.
When the fib1 (or fib2) method recurses, it doesn't release the lock. Moreover, it acquires the lock again (re-entering a lock the thread already holds is faster than the initial acquisition).
The good news is that synchronized methods in Java are reentrant.
You are better off not synchronizing the recursion itself.
Split each of your recursive methods into two:
one recursive, non-synchronized method (it should be private, as it is not thread-safe on its own);
one public synchronized method without recursion per se, which calls the first method (see the sketch below).
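A minimal sketch of that split, based on the Test class from the question (only fib1 shown; fib1Recursive is an illustrative name):

public class Test {
    // Public entry point: acquires the lock once per top-level call.
    public static synchronized int fib1(int x) {
        return fib1Recursive(x);
    }

    // Private recursion: not thread-safe on its own, so callers must go
    // through the synchronized wrapper above.
    private static int fib1Recursive(int x) {
        return x <= 2 ? 1 : fib1Recursive(x - 2) + fib1Recursive(x - 1);
    }
}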
Try measuring such code; you should get about 14 seconds, because both threads still synchronize on the same lock, Test.class.
The issue you see is because a static synchronized method synchronizes on the Class. So your two Threads spend an extraordinary amount of time fighting over the single lock on Test.class.
For the purposes of this learning exercise, the best way to speed it up would be to create two explicit lock objects. In Test, add
static final Object LOCK1 = new Object();
static final Object LOCK2 = new Object();
and then, in fib1() and fib2(), use a synchronized block on those two objects. e.g.
public static int fib1(int x) {
    synchronized (LOCK1) {
        return x <= 2 ? 1 : fib1(x - 2) + fib1(x - 1);
    }
}

public static int fib2(int x) {
    synchronized (LOCK2) {
        return x <= 2 ? 1 : fib2(x - 2) + fib2(x - 1);
    }
}
Now the first thread only needs to grab LOCK1, with no contention, and the second thread only grabs LOCK2, again with no contention (as long as you only have those two threads). This should run only slightly slower than the completely unsynchronized code.

Performance comparison between compare-and-swap and blocking algorithm

I have a ConcurrentLinkedQueue that I use as the underlying data structure. On every put call, I add a unique incremented value to the list. I have both a synchronized and a compare-and-swap version of this method. When I have few threads (e.g., 5) and do 10 million puts in total, I see that the synchronized version works much better. When I have many threads (e.g., 2000) and do the same number of puts in total, I see that CAS works much better. Why does CAS underperform compared to the blocking algorithm with fewer threads?
// AtomicReference<Foo> latestValue that is initialized
public void put(Double value) {
    Foo currentValue;
    while (true) {
        currentValue = latestValue.get();
        Foo newValue = new Foo(value);
        if (latestValue.compareAndSet(currentValue, newValue)) {
            historyList.add(newValue);
            return;
        }
    }
}
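The question mentions a synchronized version but only shows the CAS one; a hedged sketch of what the blocking counterpart presumably looks like (the field names mirror the snippet above):

// Hypothetical blocking version for comparison: one lock guards both the
// latest-value update and the history append.
public synchronized void putBlocking(Double value) {
    Foo newValue = new Foo(value);
    latestValue.set(newValue);
    historyList.add(newValue);
}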
Statistics:
NON-BLOCKING: 2000 threads, 10000 puts per thread, put time average 208493309
BLOCKING: 2000 threads, 10000 puts per thread, put time average 2370823534
NON-BLOCKING: 2 threads, 10000000 puts per thread, put time average 13117487385
BLOCKING: 2 threads, 10000000 puts per thread, put time average 4201127857
TL;DR: because in the uncontended case the JVM will optimize synchronized and replace it with a CAS-based lock.
In your CAS case you have overhead: you are trying to do some computation even if your CAS will fail. Of course that's nothing in comparison to real mutex acquisition, which is what usually happens when you use synchronized.
But the JVM isn't stupid, and when it can see that the lock you are acquiring is uncontended, it just replaces the real mutex with a CAS-based lock (or even with a simple store in the case of biased locking).
So for two threads, in the synchronized case you are measuring just a CAS, but in the case of your own CAS implementation you're also measuring the time to allocate the Foo instance, the compareAndSet, and the get().
For 2000 threads the JVM doesn't perform this CAS optimization, so your implementation outperforms mutex acquisition as expected.

Is this java code thread-safe?

I am planning to use this schema in my application, but I was not sure whether this is safe.
To give a little background, a bunch of servers will compute results of sub-tasks that belong to a single task and report them back to a central server. This piece of code is used to register the results and also to check whether all the subtasks for the task have completed, and if so, report that fact only once.
The important point is that each task must be reported once and only once, as soon as it is completed (i.e. all subTaskResults are set).
Can anybody help? Thank you! (Also, if you have a better idea to solve this problem, please let me know!)
*Note that I simplified the code for brevity.
Solution I
class Task {
    // Populated with a bunch of (Long, new AtomicReference()) pairs.
    // The actual app uses a read-only HashMap.
    Map<Id, AtomicReference<SubTaskResult>> subtasks = populatedMap();
    Semaphore permission = new Semaphore(1);

    public Task set(Id id, SubTaskResult result) {
        // null check omitted
        subtasks.get(id).set(result);
        return check() ? this : null;
    }

    private boolean check() {
        for (AtomicReference<SubTaskResult> ref : subtasks.values()) {
            if (ref.get() == null) {
                return false;
            }
        }
        return permission.tryAcquire();
    }
}
Stephen C kindly suggested using a counter. Actually, I had considered that once, but I reasoned that the JVM could reorder the operations and thus a thread could observe a decremented counter (by another thread) before the result is set in the AtomicReference (by that other thread).
*EDIT: I now see this is thread safe. I'll go with this solution. Thanks, Stephen!
Solution II
class Task {
    // Populated with a bunch of (Long, new AtomicReference()) pairs.
    // The actual app uses a read-only HashMap.
    Map<Id, AtomicReference<SubTaskResult>> subtasks = populatedMap();
    AtomicInteger counter = new AtomicInteger(subtasks.size());

    public Task set(Id id, SubTaskResult result) {
        // null check omitted
        subtasks.get(id).set(result);
        // In the actual app: if (!subtasks.get(id).compareAndSet(null, result)) return null;
        return check() ? this : null;
    }

    private boolean check() {
        return counter.decrementAndGet() == 0;
    }
}
I assume that your use case is that there are multiple threads calling set, but for any given value of id the set method will be called only once. I'm also assuming that populatedMap creates the entries for all used id values, and that subtasks and permission are really private.
If so, I think that the code is thread-safe.
Each thread should see the initialized state of the subtasks Map, complete with all keys and all AtomicReference references. This state never changes, so subtasks.get(id) will always give the right reference. The set(result) call operates on an AtomicReference, so the subsequent get() method calls in check() will give the most up-to-date values ... in all threads. Any potential races with multiple threads calling check seem to sort themselves out.
However, this is a rather complicated solution. A simpler solution would be to use a concurrent counter; e.g. replace the Semaphore with an AtomicInteger and use decrementAndGet instead of repeatedly scanning the subtasks map in check.
In response to this comment in the updated solution:
Actually, I have considered that once, but I reasoned that the JVM could reorder the operations and thus, a thread can observe a decremented counter (by another thread) before the result is set in AtomicReference (by that other thread).
The AtomicInteger and AtomicReference by definition are atomic. Any thread that tries to access one is guaranteed to see the "current" value at the time of the access.
In this particular case, each thread calls set on the relevant AtomicReference before it calls decrementAndGet on the AtomicInteger. This cannot be reordered. Actions performed by a thread are performed in order. And since these are atomic actions, the effects will be visible to other threads in order as well.
In other words, it should be thread-safe ... AFAIK.
The atomicity guaranteed (per class documentation) explicitly for AtomicReference.compareAndSet extends to set and get methods (per package documentation), so in that regard your code appears to be thread-safe.
I am not sure, however, why you have Semaphore.tryAcquire as a side effect there; without complementary code to release the semaphore, that part of your code looks wrong.
The second solution does provide a thread-safe latch, but it's vulnerable to calls to set() that provide an ID that's not in the map, which would trigger a NullPointerException, or to more than one call to set() with the same ID. The latter would mistakenly decrement the counter too many times and falsely report completion while there are presumably other subtask IDs for which no result has been submitted. My criticism isn't with regard to the thread safety, but rather to the invariant maintenance; the same flaw would be present even without the thread-related concern.
Another way to solve this problem is with AbstractQueuedSynchronizer, but it's somewhat gratuitous: you can implement a stripped-down counting semaphore where each call to set() calls releaseShared(), decrementing the counter via a spin on compareAndSetState(), and tryAcquireShared() only succeeds when the count is zero. That's more or less what you implemented above with the AtomicInteger, but you'd be reusing a facility that offers more capabilities you can use for other portions of your design.
To flesh out the AbstractQueuedSynchronizer-based solution requires adding one more operation to justify the complexity: being able to wait on the results from all the subtasks to come back, such that the entire task is complete. That's Task#awaitCompletion() and Task#awaitCompletion(long, TimeUnit) in the code below.
Again, it's possibly overkill, but I'll share it for the purpose of discussion.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

final class Task
{
    private static final class Sync extends AbstractQueuedSynchronizer
    {
        public Sync(int count)
        {
            setState(count);
        }

        @Override
        protected int tryAcquireShared(int ignored)
        {
            return 0 == getState() ? 1 : -1;
        }

        @Override
        protected boolean tryReleaseShared(int ignored)
        {
            int current;
            do
            {
                current = getState();
                if (0 == current)
                    return true;
            }
            while (!compareAndSetState(current, current - 1));
            return 1 == current;
        }
    }

    public Task(int count)
    {
        if (count < 0)
            throw new IllegalArgumentException();
        sync_ = new Sync(count);
    }

    public boolean set(int id, Object result)
    {
        // Ensure that "id" refers to an incomplete task. Doing so requires
        // additional synchronization over the structure mapping subtask
        // identifiers to results.
        // Store result somehow.
        return sync_.releaseShared(1);
    }

    public void awaitCompletion()
        throws InterruptedException
    {
        sync_.acquireSharedInterruptibly(0);
    }

    public void awaitCompletion(long time, TimeUnit unit)
        throws InterruptedException
    {
        sync_.tryAcquireSharedNanos(0, unit.toNanos(time));
    }

    private final Sync sync_;
}
I have a weird feeling reading your example program, but it depends on the larger structure of your program what to do about that. A set function that also checks for completion is almost a code smell. :-) Just a few ideas.
If you have synchronous communication with your servers, you might use an ExecutorService with the same number of threads as the number of servers doing the communication. From this you get a bunch of Futures, and you can naturally proceed with your calculation; the get calls will block the moment a result is needed but not yet there.
If you have asynchronous communication with the servers, you might instead use a CountDownLatch after submitting the tasks to the servers. The await call blocks the main thread until the completion of all subtasks, and other threads can receive the results and call countDown for each received result.
With all these methods you don't need any special thread-safety measures other than that the concurrent storing of the results in your structure is thread-safe. And I bet there are even better patterns for this.
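A minimal sketch of the CountDownLatch variant, with a hypothetical fetchSubTaskResult standing in for the per-server work; all names here are illustrative:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TaskRunner {
    void runTask(int subTaskCount) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(subTaskCount);
        ExecutorService pool = Executors.newFixedThreadPool(subTaskCount);
        for (int i = 0; i < subTaskCount; i++) {
            pool.execute(() -> {
                try {
                    // store fetchSubTaskResult(...) in a thread-safe structure (hypothetical call)
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();    // blocks until every subtask has counted down
        pool.shutdown();
        // Report completion exactly once here.
    }
}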
