Race condition with Java executor service - java

Issue: I am iterating through files in a directory and scanning them for findings. I use an ExecutorService to create a thread pool with a fixed number of threads and invoke its submit method like this:
final List<Future<List<ObjectWithResults>>> futures = Files.walk(baseDirObj)
    .map(baseDirObj::relativize)
    .filter(pathMatcher::matches)
    .filter(filePathObj -> isScannableFile(baseDirObj, filePathObj))
    .map(filePathObj -> executorService.submit(
        () -> scanFileMethod(baseDirObj, filePathObj, resultMetricsObj, countDownLatchObj)))
    .collect(ImmutableList.toImmutableList());
The scanFile method runs 3 concurrent scans, each returning a list of results. The results are added using:
resultsListObj.addAll(scanMethod1());
resultsListObj.addAll(scanMethod2());
resultsListObj.addAll(scanMethod3());
followed by:
countDownLatch.countDown()
In the method that calls executorService.submit() while iteratively walking through the files, I call:
boolean completed = countDownLatch.await(200, TimeUnit.MILLISECONDS);
if (completed)
    executorService.shutdown();
I made the static members used in unsynchronized contexts volatile, so they are read from main memory rather than from a thread-local cache. Initially there were 5 to 10% failures (e.g. 22 out of 473); making those static variables volatile brought the failure rate down to less than 1%.
I changed to thread-safe data structures, such as ConcurrentHashMap and CopyOnWriteArrayList.
The elements added to these thread-safe lists and maps are bound to variables declared final, which I would expect to make them thread-safe.
I introduced a countdown-latch mechanism to decrement the outstanding-task count and wait a bit before calling the executor service's shutdown method.
I also added an if (!future.isDone()) check; when it returns true, some future tasks are taking longer, and in those cases I use the overloaded flavor of future.get with a timeout to wait longer. Still, I get 2-5 failures in 1000 iterations.
I want to know whether declaring the variables that hold elements added to thread-safe data structures as final or volatile is better. I have read a lot about them, but it is still unclear to me.
Result:
For test iterations greater than 500, I always see 0.4 to 0.7% failures.
Note: If I synchronize the main scanFile() method, it works without a single failure, but that negates the multi-threaded asynchronous performance benefit and takes 3 times longer.
What I tried:
Added the countdown-latch mechanism.
Declared the variables holding values added to thread-safe lists and maps volatile or final.
Expected 0 failures after switching to thread-safe data structures like ConcurrentHashMap and CopyOnWriteArrayList, but I still get 1-3 failures every 1000 runs.
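Since the futures are already collected into a list, one common alternative (a minimal sketch using the futures list and ObjectWithResults type from the snippet above; the 30-second timeout is an arbitrary choice) is to wait on the futures directly instead of tracking completion with a latch:

List<ObjectWithResults> allResults = new ArrayList<>();
for (Future<List<ObjectWithResults>> future : futures) {
    // get() blocks until that scan task finishes, or throws TimeoutException
    allResults.addAll(future.get(30, TimeUnit.SECONDS));
}
executorService.shutdown();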

Related

What is 'batchSize' in Java's asParallel() method

Java, more specifically the Eclipse Collections library, has a method called
asParallel(ExecutorService executorService, int batchSize)
It takes two arguments, as shown. What is the batchSize argument for?
Background
It is not the number of threads that are active; that is defined by the ExecutorService. To illustrate, this is the schematic flow of the code I was provided (can't share more).
ExecutorService executor = Executors.newWorkStealingPool(pThreads);
List<Object> mParallelOutput = objectsOfInterest.toList()
    .asParallel(executor, batchsize1)
    .flatCollect(myObject -> MySubComponent.toList()
        .asParallel(executor, batchsize2)
        .flatCollect(p -> p.performComputation(myObject)));
Just as an example, I have 'a bunch' of computations (the performComputation() method) that need to be performed for 1-32 objects in the list objectsOfInterest. All of the computations can be done in parallel, and we need to be as efficient as possible. In trying to understand the flow, I want to know what the batchSize argument of the asParallel() method is for. In addition, I am not sure whether the double call to asParallel in the sample code above has any added benefit.
From https://github.com/eclipse/eclipse-collections/blob/master/docs/guide.md :
The batch size determines the number of elements from the backing collection (FastList or UnifiedSet) that get processed by each task submitted to the thread pool. Appropriate batch sizes for CPU-bound tasks are usually larger, in the 10,000 to 100,000 range.
So I conclude that in your case a batchSize = 1 makes sense (fully parallel).
Or:
#tasks (submitted to the thread pool) = collection.size() / batchSize
And as I understand it, the executor's thread count is the number of tasks that can be processed simultaneously, so I would "align" both values: #threads ~= #tasks.
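To make the trade-off concrete, here is a minimal, self-contained sketch (assuming Eclipse Collections on the classpath; the squaring lambda is a stand-in for a real computation). A batchSize of size/#threads yields roughly one task per thread, while batchSize = 1 yields one task per element, maximum parallelism at maximum scheduling overhead:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.eclipse.collections.api.list.MutableList;
import org.eclipse.collections.impl.factory.Lists;

public class BatchSizeDemo {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        MutableList<Integer> source = Lists.mutable.empty();
        for (int i = 0; i < 32; i++) {
            source.add(i);
        }
        // size / #threads => one task per thread (low scheduling overhead);
        // batchSize = 1   => one task per element (maximum parallelism)
        int batchSize = Math.max(1, source.size() / 4);
        MutableList<Integer> result = source
                .asParallel(executor, batchSize)
                .collect(x -> x * x) // stand-in for an expensive computation
                .toList();
        System.out.println(result.size()); // 32 results
        executor.shutdown();
    }
}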

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads, my throughput is much lower (75% lower) than single-threaded, for obvious reasons (e.g. concurrent CAS).
Use case:
A counter variable which
is updated by multiple threads concurrently
has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
So I tried to replace AtomicLong with a LongAdder, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).
However I'm not sure I correctly understand the way LongAdder works.
The JavaDoc says:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
and for sum()
Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.
What is meant by fine-grained synchronization control ...
From looking at this SO question and the source of AtomicLong and Striped64, I think I understand it: if the update on the counter is blocked because of a CAS instruction issued by another thread, the update is stored thread-locally and accumulated later to get some eventual consistency. So, without further synchronization, and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:
private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)
If this is possible, LongAdder is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".
And then, with my write/read^n pattern, I probably can't do better than AtomicLong?
LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.
In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:
the incrementAndGet in LongAdder is not atomic
... but I don't see incrementAndGet at all in LongAdder. Where are you looking?
Your idea of:
usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read
from the counter (immediately after the writing) gets a unique
incremented value. It is not required that each retrieved counter
value is increasing in the same order as the different
threads(writers) increment the value.
Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.
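For reference, the compound operation on AtomicLong is a single atomic action, not a separate write and read:

AtomicLong counter = new AtomicLong();
// one atomic read-modify-write: no two threads can observe the same result
long unique = counter.incrementAndGet();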
So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().
Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transitions the object through a specific series of values; while there is no guarantee about which specific value a thread will see, all observed values come from the true set of transition values.
For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respectively. The final value will always be 4, and only two possible transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred, and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3, and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all; they may all see 0 or 4).
LongAdder doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a non-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2", for example.
Now that's a bit of a subtle concept, and the Javadoc doesn't get it across well. They go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions), since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongAdder.sum() have pretty much the same guarantees.
Something Useful
OK, so I'll give you something that might be useful. You can still implement what you want efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order in which they were read.
Re-purpose the LongAdder Idea
You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID [2], and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.
Now, the way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value, you could use the same code, except with the top or bottom N bits [3] set uniquely per counter.
Now the write can return the prior value, like getAndIncrement, and it will be unique because the fixed bits keep it unique among all the counters in that object.
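A minimal sketch of that idea (all names hypothetical; a real implementation would also pad the counters onto separate cache lines, which AtomicLongArray does not do):

import java.util.concurrent.atomic.AtomicLongArray;

public final class StripedIdGenerator {
    private static final int STRIPE_BITS = 4;                 // lg2(#counters), see [3]
    private static final int STRIPES = 1 << STRIPE_BITS;      // 16 striped counters
    private static final int SHIFT = Long.SIZE - STRIPE_BITS; // stripe tag lives in the top bits

    private final AtomicLongArray counters = new AtomicLongArray(STRIPES);

    public long nextId() {
        // thread ID as a cheap proxy for the CPU ID, see [2]
        int stripe = (int) (Thread.currentThread().getId() & (STRIPES - 1));
        long local = counters.getAndIncrement(stripe);
        // the fixed top bits keep values from different stripes disjoint
        return ((long) stripe << SHIFT) | local;
    }
}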
Thread-local Counters
A very fast and simple way is to use a unique value per thread, plus a thread-local counter. When the thread-local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24 bits for the ID and the top 40 bits for the local counter [1]. This should be very fast and, more importantly, essentially zero contention (sketched below).
The downside is that the values of the counters won't have any specific relationship among threads (although they may still be strictly increasing within a thread). For example, a thread which has recently requested a counter value may get a much smaller one than a long existing value. You haven't described how you'll use these so I don't know if it is a problem.
Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).
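A sketch of this thread-local scheme (hypothetical names; 24/40-bit split as described above):

import java.util.concurrent.atomic.AtomicLong;

public final class ThreadLocalIdGenerator {
    private static final AtomicLong NEXT_THREAD_TAG = new AtomicLong();

    // state[0] = unique per-thread tag (assigned once), state[1] = local counter
    private static final ThreadLocal<long[]> STATE = ThreadLocal.withInitial(
            () -> new long[] { NEXT_THREAD_TAG.getAndIncrement(), 0 });

    public static long nextId() {
        long[] state = STATE.get();
        // uncontended: only the owning thread ever touches its counter
        return (state[1]++ << 24) | (state[0] & 0xFFFFFF);
    }
}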
A different solution, if you want the numbers to be "generally increasing with time" across threads, and you know that every thread requests counter values reasonably frequently, is to use a single global counter from which threads request a local "allocation" of a block of IDs, out of which they then hand out individual IDs in a thread-local manner. For example, threads may request 10 IDs at a time, so that three threads would be allocated the ranges 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted, at which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-locally.
The example above (sketched below) will keep the IDs roughly in increasing order over time, and each thread's IDs will be strictly increasing as well. It doesn't offer any strict guarantees, though: a thread that is allocated the range 0-9 could very well sleep for hours after using 0, and then use "1" when the counters on other threads are much higher. It would reduce contention by a factor of 10.
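A sketch of the block-allocation idea (hypothetical names; block size of 10 as in the example):

import java.util.concurrent.atomic.AtomicLong;

public final class BlockIdGenerator {
    private static final int BLOCK = 10;
    private static final AtomicLong GLOBAL = new AtomicLong();

    // range[0] = next ID to hand out, range[1] = end of the claimed block
    private static final ThreadLocal<long[]> RANGE =
            ThreadLocal.withInitial(() -> new long[] { 0, 0 });

    public static long nextId() {
        long[] range = RANGE.get();
        if (range[0] == range[1]) {              // block exhausted (or first call)
            range[0] = GLOBAL.getAndAdd(BLOCK);  // one contended operation per 10 IDs
            range[1] = range[0] + BLOCK;
        }
        return range[0]++;
    }
}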
There are a variety of other approaches you could use, and most of them trade off contention reduction against the "accuracy" of the counter assignment versus real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to real time (assuming the OS is synchronizing the counters).
[1] The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits for the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).
[2] The optimal choice is the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.
[3] Where N is lg2(number of counters).
There's a subtle difference between the two implementations.
An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.
A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.
LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case, I wrote the following, which confirmed that around 1 in 10 sums was repeated (which renders LongAdder unusable for your use case).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderDuplicateTest {
    public static void main(String[] args) throws Exception {
        LongAdder adder = new LongAdder();
        ExecutorService executor = Executors.newFixedThreadPool(10);
        Map<Long, Integer> count = new ConcurrentHashMap<>();
        for (int i = 0; i < 10; i++) {
            executor.execute(() -> {
                for (int j = 0; j < 1000000; j++) {
                    adder.add(1);
                    // record how often each observed value occurs
                    count.merge(adder.longValue(), 1, Integer::sum);
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        // any entry with a count > 1 is a duplicated "unique" value
        count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
    }
}

CompletableFuture takes more time - Java 8

I have two snippets of code which are technically the same, but the second one consistently takes 1 second longer than the first. The first one executes in 6 seconds and the second in 7.
Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
    return currencyConv * yearlyEarnings;
});
The above one takes 6s and the next takes 7s
Here is the link to code
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
    Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
    return currencyConv * yearlyEarnings;
});
Please explain why the second snippet consistently takes about 1 second more than the first one.
Below is the signature of the method getYearlyEarningForUserWithEmployer. I'm just sharing it; it should not have any effect:
Double getYearlyEarningForUserWithEmployer(long userId, long employerId);
Your question is horribly incomplete, but from what we can guess, it’s entirely plausible that the second variant takes longer, if we assume that currencyConvCF represents an asynchronous operation which might be running concurrently while your code fragments are executed and you’re talking about the overall time it takes to complete all operations, including the one represented by the CompletableFuture returned by thenApplyAsync (earlyEarningsInHomeCountryCF).
In the first variant you are invoking getYearlyEarningForUserWithEmployer while the operation represented by currencyConvCF might still be running concurrently. The multiplication will happen when both operations have completed.
In the second variant, the getYearlyEarningForUserWithEmployer invocation is part of the operation passed to currencyConvCF.thenApplyAsync, thus it will not start before the operation represented by currencyConvCF has been completed, so no operation will run concurrently. If we assume that getYearlyEarningForUserWithEmployer takes a significant time, say one second, and has no internal dependencies to the other operation, it’s not surprising when the overall operation takes longer in that variant.
It seems, what you actually want to do is something like:
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenCombineAsync(
    CompletableFuture.supplyAsync(
        () -> employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())),
    (currencyConv, yearlyEarnings) -> currencyConv * yearlyEarnings);
so getYearlyEarningForUserWithEmployer is not executed sequentially in the initiating thread but both source operations can run asynchronously before the final multiplication applies.
However, when you invoke get right afterwards in the initiating thread, like in your linked code on GitHub, that asynchronous processing of the second operation has no benefit. Instead of waiting for completion, your initiating thread can just perform the independent operation itself, as the second code variant of your question already does, and you will likely be even faster when not spawning an asynchronous operation for something as simple as a single multiplication, i.e. use instead:
CompletableFuture<Double> currencyConvCF = /* a true asynchronous operation */
return employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())
    * currencyConvCF.join();
Whatever Holger said does make sense, but not for the problem I posted. I do agree that the question is not written in the best way.
The problem was that the order in which the futures were written was causing a consistent increase in time.
Ideally the order of the futures should not matter as long as the code is written in a correct reactive fashion.
The reason for the problem was Java's default ForkJoinPool, which Java uses by default to run all CompletableFutures. If I run all the CompletableFutures with a custom pool, I get almost the same time, irrespective of the order in which the future statements are written.
I still need to find what the limitations of ForkJoinPool are, and why my custom pool of 20 threads performs better.
I'll update my answer when I find the right reason.
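For reference, a minimal sketch of running CompletableFutures on a custom pool instead of the common ForkJoinPool (the fetchConversionRate and fetchYearlyEarnings calls are hypothetical stand-ins for the service calls in the question):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService customPool = Executors.newFixedThreadPool(20);

// every *Async stage accepts an optional executor argument; without it,
// the stage runs on ForkJoinPool.commonPool()
CompletableFuture<Double> currencyConvCF =
        CompletableFuture.supplyAsync(() -> fetchConversionRate(), customPool);
CompletableFuture<Double> earningsCF =
        CompletableFuture.supplyAsync(() -> fetchYearlyEarnings(), customPool);
CompletableFuture<Double> combined =
        currencyConvCF.thenCombineAsync(earningsCF, (rate, earnings) -> rate * earnings, customPool);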

Java - Synchronized methods causes program to slow down massively

I'm trying to learn about threads and synchronization. I made this test program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Test {
    static List<Thread> al = new ArrayList<>();

    public static void main(String[] args) throws IOException, InterruptedException {
        long startTime = System.currentTimeMillis();
        al.add(new Thread(() -> fib1(47)));
        al.add(new Thread(() -> fib2(47)));
        for (Thread t : al)
            t.start();
        for (Thread t : al)
            t.join();
        long totalTime = System.currentTimeMillis() - startTime;
        System.out.println(totalTime);
    }

    public static synchronized int fib1(int x) {
        return x <= 2 ? 1 : fib1(x - 2) + fib1(x - 1);
    }

    public static synchronized int fib2(int x) {
        return x <= 2 ? 1 : fib2(x - 2) + fib2(x - 1);
    }
}
This program takes around 273 seconds to finish, but if I remove both of the synchronized it runs in 7 seconds instead. What causes this massive difference?
EDIT:
I'm aware that I'm using a terribly slow algorithm for calculating Fibonacci numbers. I'm also aware that the threads don't share resources, and thus the methods don't need to be synchronized. However, this is just a test program where I'm trying to figure out how synchronized works, and I chose a slow algorithm on purpose so I could measure the time taken in milliseconds.
Your program does not get stuck - it's just terribly slow.
This is due to two reasons:
1. Algorithm Complexity
As others and you yourself have mentioned, the way you compute the Fibonacci numbers is really slow because it computes the same values over and over again. Using a smaller input will bring the runtime down to a reasonable value. But this is not what your question is about.
2. Synchronized
This slows down your program in 2 ways:
First of all, making the methods synchronized is not necessary, since they do not modify anything outside of the method itself. In fact, it prevents both threads from running at the same time: because the methods are static, they synchronize on the same lock (the Test class object), preventing two threads from being inside either of them at the same time.
So your code is effectively using only one thread, not two.
Also, synchronized adds a significant overhead to the methods, since it requires acquiring a lock when entering the method - or at least checking whether the current thread already possesses the lock.
These operations are quite expensive, and they have to be performed every single time one of the methods is entered. Since - due to the recursion - this happens a lot, it has an extreme impact on performance.
Interestingly the performance is much better when you run it with just a single thread - even with the methods being synchronized.
The reason is the runtime optimizations done by the JVM.
If you are using just one thread, the JVM can optimize the synchronized checks away, since there cannot be a conflict. This reduces the runtime significantly - but not exactly to the value it would have without synchronized, due to starting with 'cold code' and some remaining runtime checks.
When running with 2 threads on the other hand, the JVM cannot do this optimization, therefore leaving the expensive synchronized operations that cause the code to be so terribly slow.
Btw: fib1 and fib2 are identical; delete one of them.
When you put static synchronized on a method that means that, in order for a thread to execute that method, it first has to acquire the lock for the class (which here is Test). The two static fib methods use the same lock. One thread gets the lock, executes the fib method, and releases the lock, then the other thread gets to execute the method. Which thread gets the lock first is up to the OS.
It was already mentioned that the locks are re-entrant, and that there's no problem with calling a synchronized method recursively. The thread holds the lock from the time it first calls the fib method; that call doesn't complete until all the recursive calls have completed, so the method runs to completion before the thread releases the lock.
The main thread isn't doing anything but waiting, and only one of the threads calling a fib method can run at a time. It does make sense that removing the synchronized modifier would speed up things, without locking the two threads can run concurrently, possibly using different processors.
The methods do not modify any shared state so there's no reason to synchronize them. Even if they did need to be synchronized there would still be no reason to have two separate fib methods here, because in any case invoking either the fib1 or fib2 method requires acquiring the same lock.
Using synchronized without static means that the object instance, not the class, is used as the lock. The reason that all the synchronized methods use the same lock is that the point is to protect shared state, an object might have various methods that modify the object's internal state, and to protect that state from concurrent modifications no more than one thread should be executing any one of these methods at a time.
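As an illustration (a hypothetical class, not from the question): with instance methods, each object carries its own monitor, so threads working on different instances do not block each other:

class Counter {
    private int value;

    // locks on 'this': threads contend only when sharing the same instance
    synchronized void add(int amount) {
        value += amount;
    }

    synchronized int get() {
        return value;
    }
}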
Your program is not deadlocked, and it also isn't appreciably slower because of unnecessary synchronization. Your program appears "stuck" because of the branching factor of your recursive function.
Branching Factor of Recursion
When N >= 4, you recurse twice. In other words, on average your recursion has a branching factor of two, meaning that if you are computing the N-th Fibonacci number recursively, you will call your function about 2^N times. 2^47 is a HUGE number (in the hundreds of trillions). As others have suggested, you can cut this number WAY down by saving intermediate results and returning them instead of recomputing them, as sketched below.
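A minimal memoization sketch (a hypothetical helper, not from the question's code):

// caches each Fibonacci number the first time it is computed,
// turning roughly 2^N calls into about N
static long fibMemo(int x, long[] cache) {
    if (x <= 2)
        return 1;
    if (cache[x] == 0)
        cache[x] = fibMemo(x - 2, cache) + fibMemo(x - 1, cache);
    return cache[x];
}

// usage: fibMemo(47, new long[48])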
More on synchronization
Acquiring locks is expensive. However, in Java, if a thread has a lock and re-enters the same synchronized block that it already owns the lock for, it doesn't have to reacquire the lock. Since each thread already owns the respective lock for each function they enter, they only have to acquire one lock apiece for the duration of your program. The cost of acquiring one lock is weensy compared to recursing hundreds of trillions of times :)
@MartinS is correct that synchronized is not necessary here, because you have no shared state. That is, there is no data that you are trying to prevent from being accessed concurrently by multiple threads.
However, you are slowing your program down by the addition of the synchronized call. My guess is that without synchronized, you should see two cores spinning at 100% for however long it takes to compute this method. When you add synchronized, whichever thread grabs the lock first gets to spin at 100%. The other one sits there waiting for the lock. When the first thread finishes, the second one gets to go.
You can test this by timing your program (start with smaller values to keep it to a reasonable time). The program should run in approximately half the time without synchronized than it does with.
When the fib1 (or fib2) method recurs, it doesn't release the lock. Moreover, it acquires the lock again (which is faster than the initial locking).
The good news is that synchronized methods in Java are reentrant.
You would do better not to synchronize the recursion itself.
Split each recursive method into two (see the sketch after this list):
one recursive non-synchronized method (it should be private, as it is not thread-safe);
one public synchronized method without recursion per se, which calls the first method.
Try measuring such code; you should get about 14 seconds, because both threads synchronize on the same lock, Test.class.
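A sketch of that split (the fibRecursive helper name is hypothetical):

public static synchronized int fib1(int x) {
    // the lock is acquired exactly once per top-level call
    return fibRecursive(x);
}

// private and non-synchronized: the recursion runs lock-free
private static int fibRecursive(int x) {
    return x <= 2 ? 1 : fibRecursive(x - 2) + fibRecursive(x - 1);
}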
The issue you see is because a static synchronized method synchronizes on the Class. So your two Threads spend an extraordinary amount of time fighting over the single lock on Test.class.
For the purposes of this learning exercise, the best way to speed it up would be to create two explicit lock objects. In Test, add
static final Object LOCK1 = new Object();
static final Object LOCK2 = new Object();
and then, in fib1() and fib2(), use a synchronized block on those two objects. e.g.
public static int fib1(int x) {
    synchronized (LOCK1) {
        return x <= 2 ? 1 : fib1(x - 2) + fib1(x - 1);
    }
}

public static int fib2(int x) {
    synchronized (LOCK2) {
        return x <= 2 ? 1 : fib2(x - 2) + fib2(x - 1);
    }
}
Now the first thread only needs to grab LOCK1, with no contention, and the second thread only grabs LOCK2, again with no contention (so long as you only have those two threads). This should run only slightly slower than the completely unsynchronized code.

Synchronized code performs faster than unsynchronized one

I came up with this stunning result, which I absolutely cannot explain:
I have two methods which are shortened to:
private static final ConcurrentHashMap<Double, Boolean> mapBoolean =
        new ConcurrentHashMap<Double, Boolean>();
private static final ConcurrentHashMap<Double, LinkedBlockingQueue<Runnable>> map =
        new ConcurrentHashMap<Double, LinkedBlockingQueue<Runnable>>();

protected static <T> Future<T> execute(final Double id, Callable<T> call) {
    // id is the ID number of each thread
    synchronized (id) {
        mapBoolean.get(id); // then do something with the result
        map.get(id);        // then do something with the result
    }
    // ...
}

protected static <T> Future<T> executeLoosely(final Double id, Callable<T> call) {
    mapBoolean.get(id); // then do something with the result
    map.get(id);        // then do something with the result
    // ...
}
On profiling with over 500 threads, each thread calling each of the above methods 400 times, I found that execute(..) performs at least 500 times better than executeLoosely(..), which is weird because executeLoosely is not synchronized and hence more threads can process the code simultaneously.
Any reasons??
The overhead of using 500 threads on a machine which I assume doesn't have 500 cores, with tasks that take about 100-1000x as long as a Map lookup, to execute code which the JVM could detect doesn't do anything, is likely to produce a random outcome. ;)
Another problem you could have is that a test which is fastest with one thread can benefit from using synchronized, because synchronization biases access toward one thread; i.e. it turns your multi-threaded test back into a single-threaded one, which was the fastest in the first place.
You should compare the timings you get with a single thread doing a loop. If this is faster (which I believe it would be) then its not a useful multi-threaded test.
My guess is that you are running the synchronized code after the unsynchronized code, i.e. after the JVM has warmed up a little. Swap the order in which you perform these tests and run them many times, and you will get different results.
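A sketch of how one might interleave the measurements to control for warm-up (the harness method names are hypothetical):

// run both variants in alternating order over several rounds, so JIT warm-up
// doesn't systematically favor whichever variant happens to run second
for (int round = 0; round < 5; round++) {
    long t0 = System.nanoTime();
    runSynchronizedVariant();   // hypothetical harness call
    long t1 = System.nanoTime();
    runUnsynchronizedVariant(); // hypothetical harness call
    long t2 = System.nanoTime();
    System.out.printf("round %d: sync=%dms loose=%dms%n",
            round, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
}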
In the non-synchronized scenario:
1) wait to acquire the lock on a segment of the map, lock, perform the operation on the map, unlock; then wait to acquire the lock on a segment of the other map, lock, perform the operation on the other map, unlock.
The segment-level locking is performed only for concurrent writes to the segment, which doesn't look to be the case in your example.
In the synchronized scenario:
1) wait for the lock, perform both operations, unlock.
Could the time taken for context switching have an impact? How many cores does the machine running the test have?
How are the maps structured - the same sort of keys?
