Do "chained" CompletableFuture instances stay in memory? - java

Let's say that in Java I have a method doSomethingAsync(input) that schedules some work via an executor service and returns a CompletableFuture<FooBar>. Let's say that I have a billion (or some other huge number of) distinct inputs. And let's say I chain the CompletableFuture<FooBar> instances together using thenCombine(), but I don't keep a reference to the previous CompletableFuture<FooBar> instance. Something like this:
CompletableFuture<FooBar> future = doSomethingAsync(0);
for (int i = 1; i < 1_000_000_000; i++) {
    future = future.thenCombine(doSomethingAsync(i), (foo, bar) -> bar);
}
future.join();
The interesting thing is that I can then do future.join() to wait until they all finish. And I can set a bound (e.g. 100) on the queue of the executor service inside doSomethingAsync() so that submission blocks when there are too many unfinished tasks in play. That provides some back-pressure so that I don't run out of memory from all 1,000,000,000 tasks being submitted to the executor service at the same time.
At the end of the process, the logic only has a reference to a single CompletableFuture<>: the final one, representing the outcome of the final submission to doSomethingAsync(). But there were one billion of them chained together. Here's the big question: will all 1,000,000,000 CompletableFuture<> instances stick around in memory until the last one finishes because they were chained using thenCombine(), or will the earlier CompletableFuture<> instances be garbage collected after they complete and the executor service submits the "next" task via thenCombine()?
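(For concreteness, here is a rough sketch of what doSomethingAsync() and its bounded executor could look like; the pool size, the FooBar record, and the blocking rejection handler are illustrative assumptions, not part of the question:)
import java.util.concurrent.*;

public class BoundedAsync {
    record FooBar(int value) {}  // stand-in for the question's FooBar type

    // Bounded queue (here: 100) plus a rejection handler that blocks the
    // submitter until space frees up: the back-pressure described above.
    static final ThreadPoolExecutor EXECUTOR = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            (task, pool) -> {
                try {
                    pool.getQueue().put(task);  // block until the queue has room
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RejectedExecutionException(e);
                }
            });

    static CompletableFuture<FooBar> doSomethingAsync(int input) {
        return CompletableFuture.supplyAsync(() -> new FooBar(input), EXECUTOR);
    }
}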

There's nothing in the public documentation about whether a CompletableFuture can be garbage collected after completion, assuming no other "external" strong references to it exist. However, if you look at the source code then you'll find this comment1:
/*
[...]
* Without precautions, CompletableFutures would be prone to
* garbage accumulation as chains of Completions build up, each
* pointing back to its sources. So we null out fields as soon as
* possible. The screening checks needed anyway harmlessly ignore
* null arguments that may have been obtained during races with
* threads nulling out fields. We also try to unlink non-isLive
* (fired or cancelled) Completions from stacks that might
* otherwise never be popped: Method cleanStack always unlinks non
* isLive completions from the head of stack; others may
* occasionally remain if racing with other cancellations or
* removals.
[...]
*/
And if you look through the implementation, you'll see lines like this one:
src = null; dep = null; fn = null;
So, at least as currently implemented, it looks like a CompletableFuture becomes eligible for garbage collection after it completes, assuming you don't maintain a separate strong reference to it yourself, regardless of the subsequent chain.
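If you want to check this empirically rather than take the comment on faith, a rough probe along these lines (my own sketch, not from the JDK) watches the first stage of a chain through a WeakReference. GC is non-deterministic, so a cleared reference is evidence rather than proof:
import java.lang.ref.WeakReference;
import java.util.concurrent.CompletableFuture;

public class ChainGcProbe {
    public static void main(String[] args) throws Exception {
        CompletableFuture<Integer> first = CompletableFuture.completedFuture(0);
        WeakReference<CompletableFuture<Integer>> probe = new WeakReference<>(first);

        CompletableFuture<Integer> future = first;
        for (int i = 1; i < 1_000; i++) {
            future = future.thenCombine(CompletableFuture.completedFuture(i), (a, b) -> b);
        }
        first = null;   // drop the only external strong reference to the first stage
        future.join();

        System.gc();    // a hint only; the JVM may ignore it
        Thread.sleep(100);
        System.out.println("first stage collected: " + (probe.get() == null));
    }
}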
1. That link is to the source code tagged jdk-20+8. But it looks like that comment (and the associated improvements) was added as part of this commit from September 2014, perhaps as part of JDK-8056249, which was "fixed" for version 9 but looks to have been backported to Java 8.

Related

Race condition with Java executor service

Issue: iterating through files in a directory and scanning them for findings. An ExecutorService is used to create a thread pool with a fixed number of threads, and the submit method is invoked like this:
final List<Future<List<ObjectWithResults>>> futures = Files.walk(baseDirObj)
    .map(baseDirObj::relativize)
    .filter(pathMatcher::matches)
    .filter(filePathObj -> isScannableFile(baseDirObj, filePathObj))
    .map(filePathObj -> executorService.submit(
        () -> scanFileMethod(baseDirObj, filePathObj, resultMetricsObj, countDownLatchObj)))
    .collect(ImmutableList.toImmutableList());
The scanFile method calls 3 concurrent scans that each return a list of results. These results are added using:
resultsListObj.addAll(scanMethod1());
resultsListObj.addAll(scanMethod2());
resultsListObj.addAll(scanMethod3());
followed by:
countDownLatchObj.countDown();
In the method that calls executorService.submit() when iteratively walking through files, I call:
boolean completed = countDownLatchObj.await(200, TimeUnit.MILLISECONDS);
if (completed)
    executorService.shutdown();
Made static members used in an unsynchronized context volatile so they will be read from main memory and not from a thread's cache. Initially there were 5 to 10% failures (like 22 out of 473); making those static variables volatile helped bring the failures down to less than 1%.
Changed to thread-safe data structures, like ConcurrentHashMap, CopyOnWriteArrayList, etc.
The elements added to these thread-safe lists, maps, etc. are bound to variables declared final, which means they should ideally be thread-safe.
I introduced a count-down-latch mechanism to decrement the thread count and wait for a bit before calling the executor service's shutdown method.
I also added an if (!future.isDone()) check, which returns true, meaning some future tasks are taking longer; in these cases I used the overloaded flavor of future.get with a timeout to wait longer. Still I get 2-5 failures in 1000 iterations.
I want to know whether declaring the variables holding elements added to thread-safe data structures final or volatile is better. I read a lot about them, but it is still unclear.
Result:
For test iterations greater than 500, I always see 0.4 to 0.7% failures.
Note: If I synchronize the main scanFile() method, it works without a single failure, but that negates the multi-threaded asynchronous performance benefit and takes 3 times longer.
What I tried:
Added a count-down-latch mechanism.
Declared variables holding values added to thread-safe lists and maps volatile or final.
Expected 0 failures after using thread-safe data structures like ConcurrentHashMap and CopyOnWriteArrayList, but I still get 1-3 failures every 1000 runs.
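For comparison, here is a latch-free sketch of the same pipeline (ObjectWithResults and the executor come from the question; everything else is illustrative). Blocking on each returned Future makes it impossible to call shutdown() while scans are still running:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch: wait on the futures themselves instead of a CountDownLatch.
static List<ObjectWithResults> collectAll(
        List<Future<List<ObjectWithResults>>> futures,
        ExecutorService executorService) throws Exception {
    List<ObjectWithResults> allResults = new ArrayList<>();
    for (Future<List<ObjectWithResults>> f : futures) {
        allResults.addAll(f.get());  // blocks until that scan completes
    }
    executorService.shutdown();      // safe: every task has finished
    return allResults;
}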

CompletableFuture takes more time - Java 8

I have two snippets of code which are technically the same, but the second one takes 1 sec more than the first one. The first one executes in 6 sec and the second in 7.
Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
    return currencyConv * yearlyEarnings;
});
The above one takes 6s and the next takes 7s
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
    Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
    return currencyConv * yearlyEarnings;
});
Please explain why the second snippet consistently takes 1s more than the first one.
Below is the signature of the method getYearlyEarningForUserWithEmployer. Just sharing, but it should not have any effect:
Double getYearlyEarningForUserWithEmployer(long userId, long employerId);
Here is the link to code
Your question is horribly incomplete, but from what we can guess, it's entirely plausible that the second variant takes longer, if we assume that currencyConvCF represents an asynchronous operation which might be running concurrently while your code fragments are executed, and that you're talking about the overall time it takes to complete all operations, including the one represented by the CompletableFuture returned by thenApplyAsync (earlyEarningsInHomeCountryCF).
In the first variant you are invoking getYearlyEarningForUserWithEmployer while the operation represented by currencyConvCF might still be running concurrently. The multiplication will happen once both operations have completed.
In the second variant, the getYearlyEarningForUserWithEmployer invocation is part of the operation passed to currencyConvCF.thenApplyAsync, thus it will not start before the operation represented by currencyConvCF has completed, so no operations run concurrently. If we assume that getYearlyEarningForUserWithEmployer takes a significant time, say one second, and has no internal dependencies on the other operation, it's not surprising that the overall operation takes longer in that variant.
It seems, what you actually want to do is something like:
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenCombineAsync(
    CompletableFuture.supplyAsync(
        () -> employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())),
    (currencyConv, yearlyEarnings) -> currencyConv * yearlyEarnings);
so getYearlyEarningForUserWithEmployer is not executed sequentially in the initiating thread but both source operations can run asynchronously before the final multiplication applies.
However, when you are invoking get right afterwards in the initiating thread, like in your linked code on github, that asynchronous processing of the second operation has no benefit. Instead of waiting for the completion, your initiating thread can just perform the independent operation, as the second code variant of your question already does, and you will likely be even faster when not spawning an asynchronous operation for something as simple as a single multiplication, i.e. use instead:
CompletableFuture<Double> currencyConvCF = /* a true asynchronous operation */;
return employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())
    * currencyConvCF.join();
Whatever Holger said does make sense, but not for the problem I posted. I do agree that the question is not written in the best way.
The problem was that the order in which the futures were written was causing a consistent increase in time.
Ideally the order of the futures should not matter, as long as the code is written in a correct reactive fashion.
The reason for the problem was the default ForkJoinPool of Java; Java uses this pool by default to run all CompletableFutures. If I run all the CompletableFutures with a custom pool, I get almost the same time, irrespective of the order in which the future statements were written.
I still need to find what the limitations of ForkJoinPool are, and why my custom pool of 20 threads performs better.
I'll update my answer when I find the right reason.
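For reference, every *Async method on CompletableFuture has an overload that takes an Executor; a minimal sketch of the custom-pool variant (pool size 20 as mentioned above, with stand-in work instead of the real service calls) could look like this:
import java.util.concurrent.*;

public class CustomPoolExample {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(20);

        // Stand-in values; in the real code these would be the service calls.
        CompletableFuture<Double> currencyConvCF =
                CompletableFuture.supplyAsync(() -> 1.17, pool);
        CompletableFuture<Double> yearlyEarningsCF =
                CompletableFuture.supplyAsync(() -> 50_000.0, pool);

        // Pass the pool again so the combining step also avoids the common pool.
        CompletableFuture<Double> result =
                currencyConvCF.thenCombineAsync(yearlyEarningsCF, (c, e) -> c * e, pool);

        System.out.println(result.join());
        pool.shutdown();
    }
}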

How to use the method "removeIf" using a Predicate in an ArrayBlockingQueue

I have the following classes:
WorkerTask.java
public interface WorkerTask extends Task {
    // Constants
    public static final short WORKERTASK_SPIDER  = 1;
    public static final short WORKERTASK_PARSER  = 2;
    public static final short WORKERTASK_PRODUCT = 3;

    public int getType();
}
WorkerPool.java
class WorkerPool {
    private ThreadPoolExecutor executorPool_;

    //----------------------------------------------------
    public WorkerPool(int poolSize) {
        executorPool_ = new ThreadPoolExecutor(
            poolSize, 5, 10, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(10000000, false),
            Executors.defaultThreadFactory());
    }

    //----------------------------------------------------
    public void assign(WorkerTask workerTask) {
        executorPool_.execute(new WorkerThread(workerTask));
    }

    //----------------------------------------------------
    public void removeTasks(int siteID) {
        executorPool_.getQueue().removeIf(...);
    }
}
I want to call the method removeTasks to remove a certain number of pending tasks, but I have no idea how to use the method removeIf. Its documentation says: "Removes all of the elements of this collection that satisfy the given predicate", but I have no idea how to create the Predicate parameter. Any idea?
If you had a Queue<WorkerTask>, you could do something like this:
queue.removeIf(task -> task.getSiteID() == siteID)
There are several problems. One problem is that the queue you get from getQueue() is BlockingQueue<Runnable> and not Queue<WorkerTask>. If you're submitting Runnable instances to the pool, the queue might contain references to your actual tasks; if so, you could downcast them to WorkerTask. However, this isn't guaranteed. Furthermore, the class doc for ThreadPoolExecutor says (under "Queue maintenance"):
Method getQueue() allows access to the work queue for purposes of monitoring and debugging. Use of this method for any other purpose is strongly discouraged. Two supplied methods, remove(Runnable) and purge() are available to assist in storage reclamation when large numbers of queued tasks become cancelled.
Looking at the remove(Runnable) method, its doc says
It may fail to remove tasks that have been converted into other forms before being placed on the internal queue.
This suggests that you should hang onto the Runnable instances that have been submitted in order to call remove() on them later. Or, call submit(Runnable) to get a Future and save those instances around in order to cancel them.
But there is also a second problem that probably renders this approach inadequate. Suppose you've found a way to remove or cancel the matching tasks from the queue. Another thread might have decided to submit a new task that matches, but hasn't submitted it yet. There's a race condition here. You might be able to cancel the enqueued tasks, but after you've done so, you can't guarantee that new matching tasks haven't been submitted.
Here's an alternative approach. Presumably, when you cancel (or whatever) a site ID, there's some logic somewhere to stop submitting new tasks that match that site ID. The problem is how to deal with matching tasks that are "in-flight," that is, that are in the queue or are about to be enqueued.
Instead of trying to cancel the matching tasks, change the task so that if its site ID has been canceled, the task turns into a no-op. You could record the cancellation of a site ID in, say, a ConcurrentHashMap. Any task would check this map before beginning its work, and if the site ID is present, it'd simply return. Adding a site ID to the map would have the immediate effect of ensuring that no new task on that site ID will commence. (Tasks that have already started will run to completion.) Any in-flight tasks will eventually drain from the queue without causing any actual work to occur.
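A rough sketch of that idea (WorkerTask comes from the question; getSiteID() and the rest are assumed for illustration):
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Record cancelled site IDs; in-flight tasks for a cancelled site become no-ops.
class SiteCancellations {
    private final Set<Integer> cancelled = ConcurrentHashMap.newKeySet();

    public void cancel(int siteID)         { cancelled.add(siteID); }
    public boolean isCancelled(int siteID) { return cancelled.contains(siteID); }
}

class WorkerThread implements Runnable {
    private final WorkerTask task;
    private final SiteCancellations cancellations;

    WorkerThread(WorkerTask task, SiteCancellations cancellations) {
        this.task = task;
        this.cancellations = cancellations;
    }

    @Override
    public void run() {
        if (cancellations.isCancelled(task.getSiteID())) {
            return;  // site was cancelled while this task was in flight: do nothing
        }
        // ... do the real work ...
    }
}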
A predicate is a function that receives an input and returns a boolean value.
If you are using Java 8 you can use a lambda expression:
elem -> elem.id == siteID

Using a semaphore inside a nested Java 8 parallel stream action may DEADLOCK. Is this a bug?

Consider the following situation: We are using a Java 8 parallel stream to perform a parallel forEach loop, e.g.,
IntStream.range(0,20).parallel().forEach(i -> { /* work done here */})
The number of parallel threads is controlled by the system property "java.util.concurrent.ForkJoinPool.common.parallelism" and is usually equal to the number of processors.
Now assume that we would like to limit the number of parallel executions for a specific piece of work - e.g. because that part is memory intensive and memory constraints imply a limit on parallel executions.
An obvious and elegant way to limit parallel executions is to use a Semaphore (suggested here); e.g., the following piece of code limits the number of parallel executions to 5:
final Semaphore concurrentExecutions = new Semaphore(5);
IntStream.range(0, 20).parallel().forEach(i -> {
    concurrentExecutions.acquireUninterruptibly();
    try {
        /* WORK DONE HERE */
    }
    finally {
        concurrentExecutions.release();
    }
});
This works just fine!
However: Using any other parallel stream inside the worker (at /* WORK DONE HERE */) may result in a deadlock.
For me this is an unexpected behavior.
Explanation: Since Java streams use a ForkJoin pool, the inner forEach is forking, and the join appears to wait forever. However, this behavior is still unexpected. Note that parallel streams even work if you set "java.util.concurrent.ForkJoinPool.common.parallelism" to 1.
Note also that it may not be transparent if there is an inner parallel forEach.
Question: Is this behavior in accordance with the Java 8 specification (in that case it would imply that the use of Semaphores inside parallel streams workers is forbidden) or is this a bug?
For convenience: Below is a complete test case. Any combinations of the two booleans work, except "true, true", which results in the deadlock.
Clarification: To make the point clear, let me stress one aspect: The deadlock does not occur at the acquire of the semaphore. Note that the code consists of
acquire semaphore
run some code
release semaphore
and the deadlock occurs at 2. if that piece of code is using ANOTHER parallel stream. Then the deadlock occurs inside that OTHER stream. As a consequence it appears that it is not allowed to use nested parallel streams and blocking operations (like a semaphore) together!
Note that it is documented that parallel streams use a ForkJoinPool and that ForkJoinPool and Semaphore belong to the same package - java.util.concurrent (so one would expect that they interoperate nicely).
/*
 * (c) Copyright Christian P. Fries, Germany. All rights reserved. Contact: email@christian-fries.de.
 *
 * Created on 03.05.2014
 */
package net.finmath.experiments.concurrency;

import java.util.concurrent.Semaphore;
import java.util.stream.IntStream;

/**
 * This is a test of Java 8 parallel streams.
 *
 * The idea behind this code is that the Semaphore concurrentExecutions
 * should limit the parallel executions of the outer forEach (which is an
 * <code>IntStream.range(0,numberOfTasks).parallel().forEach</code>); for example,
 * the parallel executions of the outer forEach should be limited due to a
 * memory constraint.
 *
 * Inside the execution block of the outer forEach we use another parallel stream
 * to create an inner forEach. The number of concurrent
 * executions of the inner forEach is not limited by us (it is however limited by the
 * system property "java.util.concurrent.ForkJoinPool.common.parallelism").
 *
 * Problem: If the semaphore is used AND the inner forEach is active, then
 * the execution will be DEADLOCKED.
 *
 * Note: A practical application is the implementation of the parallel
 * LevenbergMarquardt optimizer in
 * {@link http://finmath.net/java/finmath-lib/apidocs/net/finmath/optimizer/LevenbergMarquardt.html}
 * In one application the number of tasks in the outer and inner loop is very large (>1000)
 * and due to memory limitations the outer loop should be limited to a small (5) number
 * of concurrent executions.
 *
 * @author Christian Fries
 */
public class ForkJoinPoolTest {

    public static void main(String[] args) {
        // Any combination of the booleans works, except (true,true)
        final boolean isUseSemaphore   = true;
        final boolean isUseInnerStream = true;

        final int numberOfTasksInOuterLoop = 20;  // In real applications this can be a large number (e.g. > 1000).
        final int numberOfTasksInInnerLoop = 100; // In real applications this can be a large number (e.g. > 1000).
        final int concurrentExecusionsLimitInOuterLoop = 5;
        final int concurrentExecutionsLimitForStreams  = 10;

        final Semaphore concurrentExecutions = new Semaphore(concurrentExecusionsLimitInOuterLoop);
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", Integer.toString(concurrentExecutionsLimitForStreams));
        System.out.println("java.util.concurrent.ForkJoinPool.common.parallelism = " + System.getProperty("java.util.concurrent.ForkJoinPool.common.parallelism"));

        IntStream.range(0, numberOfTasksInOuterLoop).parallel().forEach(i -> {
            if (isUseSemaphore) {
                concurrentExecutions.acquireUninterruptibly();
            }
            try {
                System.out.println(i + "\t" + concurrentExecutions.availablePermits() + "\t" + Thread.currentThread());
                if (isUseInnerStream) {
                    runCodeWhichUsesParallelStream(numberOfTasksInInnerLoop);
                }
                else {
                    try {
                        Thread.sleep(10 * numberOfTasksInInnerLoop);
                    } catch (Exception e) {
                    }
                }
            }
            finally {
                if (isUseSemaphore) {
                    concurrentExecutions.release();
                }
            }
        });

        System.out.println("D O N E");
    }

    /**
     * Runs code in a parallel forEach using streams.
     *
     * @param numberOfTasksInInnerLoop Number of tasks to execute.
     */
    private static void runCodeWhichUsesParallelStream(int numberOfTasksInInnerLoop) {
        IntStream.range(0, numberOfTasksInInnerLoop).parallel().forEach(j -> {
            try {
                Thread.sleep(10);
            } catch (Exception e) {
            }
        });
    }
}
Any time you are decomposing a problem into tasks, where those tasks could be blocked on other tasks, and try and execute them in a finite thread pool, you are at risk for pool-induced deadlock. See Java Concurrency in Practice 8.1.
This is unquestionably a bug -- in your code. You're filling up the FJ pool with tasks that are going to block waiting for the results of other tasks in the same pool. Sometimes you get lucky and things manage to not deadlock (just like not all lock-ordering errors result in deadlock all the time), but fundamentally you're skating on some very thin ice here.
I ran your test in a profiler (VisualVM) and I agree: threads are waiting on the semaphore and in awaitJoin() in the F/J pool.
This framework has serious problems where join() is concerned. I’ve been writing a critique about this framework for four years now. The basic join problem starts here.
awaitJoin() has similar problems. You can peruse the code yourself. When the framework gets to the bottom of the work deque it issues a wait(). What it all comes down to is that this framework has no way of doing a context switch.
There is a way of getting this framework to create compensation threads for the threads that are stalled. You need to implement the ForkJoinPool.ManagedBlocker interface. How you can do this, I have no idea. You’re running a basic API with streams. You’re not implementing the Streams API and writing your own code.
I stick to my comment above: once you turn the parallelism over to the API, you relinquish your ability to control the inner workings of that parallel mechanism. There is no bug in the API (other than it using a faulty framework for parallel operations). The problem is that semaphores, or any other method of controlling parallelism within the API, are hazardous ideas.
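For what it's worth, a rough sketch of a ManagedBlocker around the semaphore (my illustration, not from this answer) looks like this; ForkJoinPool.managedBlock() lets the pool create a compensation thread while the caller is blocked:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Semaphore;

// Sketch: acquire a semaphore inside a ForkJoinPool without starving the pool.
class SemaphoreBlocker implements ForkJoinPool.ManagedBlocker {
    private final Semaphore semaphore;
    private boolean acquired;

    SemaphoreBlocker(Semaphore semaphore) {
        this.semaphore = semaphore;
    }

    @Override
    public boolean block() throws InterruptedException {
        if (!acquired) {
            semaphore.acquire();
            acquired = true;
        }
        return true;  // no further blocking is necessary
    }

    @Override
    public boolean isReleasable() {
        // Try to avoid blocking entirely by grabbing a permit opportunistically.
        return acquired || (acquired = semaphore.tryAcquire());
    }
}

// Usage inside the stream body, instead of acquireUninterruptibly():
// ForkJoinPool.managedBlock(new SemaphoreBlocker(concurrentExecutions));
// (managedBlock declares InterruptedException, so the caller must handle it.)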
After a bit of investigation of the source code of ForkJoinPool and ForkJoinTask, I assume that I found an answer:
It is a bug (in my opinion), and the bug is in doInvoke() of ForkJoinTask. The problem is actually related to the nesting of the two loops and presumably not to the use of the Semaphore; however, one needs the Semaphore (or something blocking in the outer loop) to make the problem become apparent and result in a deadlock (but I can imagine there are other issues implied by this bug - see Nested Java 8 parallel forEach loop perform poor. Is this behavior expected?).
The implementation of the doInvoke() method currently looks as follows:
/**
 * Implementation for invoke, quietlyInvoke.
 *
 * @return status upon completion
 */
private int doInvoke() {
    int s; Thread t; ForkJoinWorkerThread wt;
    return (s = doExec()) < 0 ? s :
        ((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) ?
        (wt = (ForkJoinWorkerThread)t).pool.awaitJoin(wt.workQueue, this) :
        externalAwaitDone();
}
(and maybe also in doJoin which looks similar). In the line
((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) ?
it is tested whether Thread.currentThread() is an instance of ForkJoinWorkerThread. The reason for this test is to check whether the ForkJoinTask is running on a worker thread of the pool or on the main thread. I believe that this line is OK for a non-nested parallel for, where it allows distinguishing whether the current task runs on the main thread or on a pool worker. However, for tasks of the inner loop this test is problematic: let us call the thread which runs the parallel().forEach the creator thread. For the outer loop the creator thread is the main thread and it is not an instanceof ForkJoinWorkerThread. However, for inner loops running from a ForkJoinWorkerThread, the creator thread is an instanceof ForkJoinWorkerThread too. Hence, in this situation, the test ((t = Thread.currentThread()) instanceof ForkJoinWorkerThread) IS ALWAYS TRUE!
Hence, we always call pool.awaitJoin(wt.workQueue, this).
Now, note that we call awaitJoin on the FULL workQueue of that thread (I believe that this is an additional flaw). It appears as if we are not only joining the inner loop's tasks, but also the task(s) of the outer loop, and we JOIN ALL THOSE tasks. Unfortunately, the outer task holds that Semaphore.
To prove that the bug is related to this, we may check a very simple workaround: I create a t = new Thread() which runs the inner loop, then perform t.start(); t.join();. Note that this will not introduce any additional parallelism (I am immediately joining). However, it will change the result of the instanceof ForkJoinWorkerThread test for the creator thread. (Note that the tasks will still be submitted to the common pool.)
If that wrapper thread is created, the problem does not occur anymore - at least in my current test situation.
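A sketch of that wrapper (my reconstruction of the description above, not the author's exact demo code):
// Instead of calling runCodeWhichUsesParallelStream(...) directly, run it on a
// fresh thread and join immediately. The creator thread of the inner stream is
// then not a ForkJoinWorkerThread, which changes the outcome of the test above.
Thread t = new Thread(() -> runCodeWhichUsesParallelStream(numberOfTasksInInnerLoop));
t.start();
try {
    t.join();  // no additional parallelism: we wait right away
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}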
I posted a full demo at
http://svn.finmath.net/finmath%20experiments/trunk/src/net/finmath/experiments/concurrency/ForkJoinPoolTest.java
In this test code the combination
final boolean isUseSemaphore = true;
final boolean isUseInnerStream = true;
final boolean isWrappedInnerLoopThread = false;
will result in a deadlock, while the combination
final boolean isUseSemaphore = true;
final boolean isUseInnerStream = true;
final boolean isWrappedInnerLoopThread = true;
(and actually all other combinations) will not.
Update: Since many are pointing out that the use of the Semaphore is dangerous, I tried to create a demo of the problem without the Semaphore. Now there is no more deadlock, but an - in my opinion - unexpected performance issue. I created a new post for that at Nested Java 8 parallel forEach loop perform poor. Is this behavior expected?. The demo code is here:
http://svn.finmath.net/finmath%20experiments/trunk/src/net/finmath/experiments/concurrency/NestedParallelForEachTest.java

Efficient cache synchronization

Consider this
public Object doGet() {
    return getResource();
}

private Object getResource() {
    synchronized (lock) {
        if (cachedResourceIsStale()) {
            downloadNewVersionOfResource();
        }
    }
    return resource;
}
Assuming that doGet will be executed concurrently, and a lot at that, and that downloading the new version of the resource takes a while, are there more efficient ways to do the synchronization in getResource? I know about read/write locks but I don't think they can be applied here.
Why synchronize at all? If the cache goes stale, all threads accessing the resource while it's still being refreshed by the first one will execute their own refresh. Among the other problems this causes, it's hardly efficient.
As BalusC mentions in the comments, I'm currently facing this issue in a servlet, but I'm happy with generic answers because who knows in what situation I'll run into it again.
Assumptions
efficient means that doGet() should complete as quickly as possible
cachedResourceIsStale() takes no time at all
downloadNewVersionOfResource() takes a little time
Answer
Synchronizing reduces network load, because only one thread fetches the resource when it expires. Also, it will not unduly delay processing of other threads - since the VM contains no current snapshot that the threads could return, they'll have to block anyway, and there is no reason an additional concurrent downloadNewVersionOfResource() would complete any faster (I'd expect the opposite, due to network bandwidth contention).
So synchronizing is good, and optimal in bandwidth consumption and response times. (The CPU overhead of synchronization is vanishingly small compared to I/O waits.) This assumes that a current version of the resource might not be available when doGet() is invoked; if your server always had a current version of the resource, it could send it back without delay. (You might have a background thread download the new version just before the old one expires.)
PS
You haven't shown any error handling. You'll have to decide whether to propagate exceptions thrown by downloadNewVersionOfResource() to your callers or continue to serve the old version of the resource.
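That background-refresh idea from the parenthetical above could look roughly like this (my sketch; the refresh interval and names are assumptions):
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Refresh the resource in the background before it expires, so doGet()
// can usually return the cached copy without blocking on a download.
class BackgroundRefreshingCache {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile Object resource;  // volatile: safe publication to readers

    BackgroundRefreshingCache(long refreshIntervalMillis) {
        resource = downloadNewVersionOfResource();
        scheduler.scheduleAtFixedRate(
                () -> resource = downloadNewVersionOfResource(),
                refreshIntervalMillis, refreshIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public Object doGet() {
        return resource;  // lock-free read path
    }

    private Object downloadNewVersionOfResource() {
        return new Object();  // placeholder for the real download
    }
}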
Edit
So? Let's assume you have 100 connection workers, the check whether the resource is stale takes one microsecond, the resource is not stale, and serving it takes one second. Then, on average, 100 * 10^-6 / 1 = 0.0001 threads are trying to get the lock. Barely any contention at all. And the overhead of acquiring an uncontended lock is on the order of 10^-8 seconds. There is no point optimizing things that already take microseconds when the network will cause delays of milliseconds. And if you don't believe me, do a microbenchmark for synchronization. It is true that frequent, needless synchronization adds significant overhead, and that the synchronized collection classes were deprecated for that reason. But that's because those methods do very little work per invocation, so the relative overhead of synchronization was a lot bigger. I just did a small microbenchmark for the following code:
synchronized (lock) {
c++;
}
On my notebook, this takes 50 nanoseconds (5*10^-8 seconds) averaged over 10 million executions in Sun's HotSpot VM. This is about 20 times as long as the naked increment operation, so if one does a lot of increments, synchronizing each of those will slow the program by an order of magnitude. If, however, that method did blocking I/O, waiting, say, 1 ms, adding those same 50 nanoseconds would reduce throughput by 0.005%. Surely you have better opportunities for performance tuning :-)
This is why you should always measure before starting to optimize. It prevents you from investing hours of your time to save a couple of nanoseconds of processor time.
There might be a possibility that you can reduce lock contention (thus improving throughput) by using "lock striping" -- essentially, splitting one lock into several, each lock protecting a particular group of users.
The tricky part is figuring out how to assign users to groups. The simplest case is when you can assign a request from any user to any group. If your data model requires that requests from one user must be processed sequentially, you must introduce some mapping between user requests and groups. Here's a sample implementation of StripedLock:
import java.util.concurrent.locks.ReentrantLock;

/**
 * Striped locks holder, contains array of {@link java.util.concurrent.locks.ReentrantLock},
 * on which lock/unlock operations are performed. Purpose of this is to decrease lock contention.
 * <p>When client requests lock, it gives an integer argument, from which target lock is derived
 * as follows: index of lock in array equals to <code>id & (locks.length - 1)</code>.
 * Since <code>locks.length</code> is the power of 2, <code>locks.length - 1</code> is string of '1' bits,
 * and this means that all lower bits of argument are taken into account.
 * <p>Number of locks it can hold is bounded: it can be from set {2, 4, 8, 16, 32, 64}.
 */
public class StripedLock {
    private final ReentrantLock[] locks;

    /**
     * Default ctor, creates 16 locks
     */
    public StripedLock() {
        this(4);
    }

    /**
     * Creates array of locks, size of array may be any from set {2, 4, 8, 16, 32, 64}
     * @param storagePower size of array will be equal to <code>Math.pow(2, storagePower)</code>
     */
    public StripedLock(int storagePower) {
        if (storagePower < 1 || storagePower > 6)
            throw new IllegalArgumentException("storage power must be in [1..6]");

        int lockSize = (int) Math.pow(2, storagePower);
        locks = new ReentrantLock[lockSize];
        for (int i = 0; i < locks.length; i++)
            locks[i] = new ReentrantLock();
    }

    /**
     * Locks lock associated with given id.
     * @param id value, from which lock is derived
     */
    public void lock(int id) {
        getLock(id).lock();
    }

    /**
     * Unlocks lock associated with given id.
     * @param id value, from which lock is derived
     */
    public void unlock(int id) {
        getLock(id).unlock();
    }

    /**
     * Map function between integer and lock from locks array
     * @param id argument
     * @return lock which is result of function
     */
    private ReentrantLock getLock(int id) {
        return locks[id & (locks.length - 1)];
    }
}
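A short hypothetical usage example (not from the answer): requests are striped by user ID, so requests for different users usually take different locks.
StripedLock stripedLock = new StripedLock(4); // 2^4 = 16 stripes

void refreshResourceForUser(int userId) {
    stripedLock.lock(userId);
    try {
        // check staleness and refresh the resource for this user's stripe
    } finally {
        stripedLock.unlock(userId); // always release, even on exceptions
    }
}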
