Java - cache coherence between successive parallel streams?

Java - cache coherence between successive parallel streams? - java

Consider the following piece of code (which isn't quite what it seems at first glance).
static class NumberContainer {
int value = 0;
void increment() {
value++;
}
int getValue() {
return value;
}
}
public static void main(String[] args) {
List<NumberContainer> list = new ArrayList<>();
int numElements = 100000;
for (int i = 0; i < numElements; i++) {
list.add(new NumberContainer());
}
int numIterations = 10000;
for (int j = 0; j < numIterations; j++) {
list.parallelStream().forEach(NumberContainer::increment);
}
list.forEach(container -> {
if (container.getValue() != numIterations) {
System.out.println("Problem!!!");
}
});
}
My question is: In order to be absolutely certain that "Problem!!!" won't be printed, does the "value" variable in the NumberContainer class need to be marked volatile?
Let me explain how I currently understand this.
In the first parallel stream, NumberContainer-123 (say) is incremented by ForkJoinWorker-1 (say). So ForkJoinWorker-1 will have an up-to-date cache of NumberContainer-123.value, which is 1. (Other fork-join workers, however, will have out-of-date caches of NumberContainer-123.value - they will store the value 0. At some point, these other workers' caches will be updated, but this doesn't happen straight away.)
The first parallel stream finishes, but the common fork-join pool worker threads aren't killed. The second parallel stream then starts, using the very same common fork-join pool worker threads.
Suppose, now, that in the second parallel stream, the task of incrementing NumberContainer-123 is assigned to ForkJoinWorker-2 (say). ForkJoinWorker-2 will have its own cached value of NumberContainer-123.value. If a long period of time has elapsed between the first and second increments of NumberContainer-123, then presumably ForkJoinWorker-2's cache of NumberContainer-123.value will be up-to-date, i.e. the value 1 will be stored, and everything is good. But what if the time elapsed between first and second increments if NumberContainer-123 is extremely short? Then perhaps ForkJoinWorker-2's cache of NumberContainer-123.value might be out of date, storing the value 0, causing the code to fail!
Is my description above correct? If so, can anyone please tell me what kind of time delay between the two incrementing operations is required to guarantee cache consistency between the threads? Or if my understanding is wrong, then can someone please tell me what mechanism causes the thread-local caches to be "flushed" in between the first parallel stream and the second parallel stream?

It should not need any delay. By the time you're out of ParallelStream's forEach, all the tasks have finished. That establishes a happens-before relation between the increment and the end of forEach. All the forEach calls are ordered by being called from the same thread, and the check, similarly, happens-after all the forEach calls.
int numIterations = 10000;
for (int j = 0; j < numIterations; j++) {
list.parallelStream().forEach(NumberContainer::increment);
// here, everything is "flushed", i.e. the ForkJoinTask is finished
}
Back to your question about the threads, the trick here is, the threads are irrelevant. The memory model hinges on the happens-before relation, and the fork-join task ensures happens-before relation between the call to forEach and the operation body, and between the operation body and the return from forEach (even if the returned value is Void)
See also Memory visibility in Fork-join
As #erickson mentions in comments,
If you can't establish correctness through happens-before relationships,
no amount of time is "enough." It's not a wall-clock timing issue; you
need to apply the Java memory model correctly.
Moreover, thinking about it in terms of "flushing" the memory is wrong, as there are many more things that can affect you. Flushing, for instance, is trivial: I have not checked, but can bet that there's just a memory barrier on the task completion; but you can get wrong data because the compiler decided to optimise non-volatile reads away (the variable is not volatile, and is not changed in this thread, so it's not going to change, so we can allocate it to a register, et voila), reorder the code in any way allowed by the happens-before relation, etc.
Most importantly, all those optimizations can and will change over time, so even if you went to the generated assembly (which may vary depending on the load pattern) and checked all the memory barriers, it does not guarantee that your code will work unless you can prove that your reads happen-after your writes, in which case Java Memory Model is on your side (assuming there's no bug in JVM).
As for the great pain, it's the very goal of ForkJoinTask to make the synchronization trivial, so enjoy. It was (it seems) done by marking the java.util.concurrent.ForkJoinTask#status volatile, but that's an implementation detail you should not care about or rely upon.

Related

HashMap synchronized `put` but not `get`

I have the following code snippet that I'm trying to see if it can crash/misbehave at some point. The HashMap is being called from multiple threads in which put is inside a synchronized block and get is not. Is there any issue with this code? If so, what modification I need to make to see that happens given that I only use put and get this way, and there is no putAll, clear or any operations involved.
import java.util.HashMap;
import java.util.Map;
public class Main {
Map<Integer, String> instanceMap = new HashMap<>();
public static void main(String[] args) {
System.out.println("Hello");
Main main = new Main();
Thread thread1 = new Thread("Thread 1"){
public void run(){
System.out.println("Thread 1 running");
for (int i = 0; i <= 100; i++) {
System.out.println("Thread 1 " + i + "-" + main.getVal(i));
}
}
};
thread1.start();
Thread thread2 = new Thread("Thread 2"){
public void run(){
System.out.println("Thread 2 running");
for (int i = 0; i <= 100; i++) {
System.out.println("Thread 2 " + i + "-" + main.getVal(i));
}
}
};
thread2.start();
}
private String getVal(int key) {
check(key);
return instanceMap.get(key);
}
private void check(int key) {
if (!instanceMap.containsKey(key)) {
synchronized (instanceMap) {
if (!instanceMap.containsKey(key)) {
// System.out.println(Thread.currentThread().getName());
instanceMap.put(key, "" + key);
}
}
}
}
}
What I have checked out:
Are size(), put(), remove(), get() atomic in Java synchronized HashMap?
Extending HashMap<K,V> and synchronizing only puts
Why does HashMap.get(key) needs to be synchronized when change operations are synchronized?

I somewhat modified your code:
removed System.out.println() from the "hot" loop, it is internally synchronized
increased the number of iterations
changed printing to only print when there's an unexpected value
There's much more we can do and try, but this already fails, so I stopped there. The next step would we to rewrite the whole thing to jcsctress.
And voila, as expected, sometimes this happens on my Intel MacBook Pro with Temurin 17:
Exception in thread "Thread 2" java.lang.NullPointerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "java.util.Map.get(Object)" is null
at com.gitlab.janecekpetr.playground.Playground.getVal(Playground.java:35)
at com.gitlab.janecekpetr.playground.Playground.lambda$0(Playground.java:21)
at java.base/java.lang.Thread.run(Thread.java:833)
Code:
private record Val(int index, int value) {}
private static final int MAX = 100_000;
private final Map<Integer, Integer> instanceMap = new HashMap<>();
public static void main(String... args) {
Playground main = new Playground();
Runnable runnable = () -> {
System.out.println(Thread.currentThread().getName() + " running");
Val[] vals = new Val[MAX];
for (int i = 0; i < MAX; i++) {
vals[i] = new Val(i, main.getVal(i));
}
System.out.println(Stream.of(vals).filter(val -> val.index() != val.value()).toList());
};
Thread thread1 = new Thread(runnable, "Thread 1");
thread1.start();
Thread thread2 = new Thread(runnable, "Thread 2");
thread2.start();
}
private int getVal(int key) {
check(key);
return instanceMap.get(key);
}
private void check(int key) {
if (!instanceMap.containsKey(key)) {
synchronized (instanceMap) {
if (!instanceMap.containsKey(key)) {
instanceMap.put(key, key);
}
}
}
}

To specifically explain the excellent sleuthing work in the answer by #PetrJaneček :
Every field in java has an evil coin attached to it. Anytime any thread reads the field, it flips this coin. It is not a fair coin - it is evil. It will flip heads 10,000 times in a row if that's going to ruin your day (for example, you may have code that depends on coinflips landing a certain way, or it'll fail to work. The coin is evil: You may run into the situation that just to ruin your day, during all your extensive testing, the coin flips heads, and during the first week in production it's all heads flips. And then the big new potential customer demos your app and the coin starts flipping some tails on you).
The coinflip decides which variant of the field is used - because every thread may or may not have a local cache of that field. When you write to a field from any thread? Coin is flipped, on tails, the local cache is updated and nothing more happens. Read from any thread? Coin is flipped. On tails, the local cache is used.
That's not really what happens of course (your JVM does not actually have evil coins nor is it out to get you), but the JMM (Java Memory Model), along with the realities of modern hardware, means that this abstraction works very well: It will reliably lead to the right answer when writing concurrent code, namely, that any field that is touched by more than one thread must have guards around it, or must never change at all during the entire duration of the multi-thread access 'session'.
You can force the JVM to flip the coin the way you want, by establishing so-called Happens Before relationships. This is explicit terminology used by the JMM. If 2 lines of code have a Happens-Before relationship (one is defined as 'happening before' the other, as per the JMM's list of HB relationship establishing actions), then it is not possible (short of a bug in the JVM itself) to observe any side effect of the HA line whilst not also observing all side effects of the HB line. (That is to say: the 'happens before' line happens before the 'happens after' line as far as your code could ever tell, though it's a bit of schrodiner's cat situation. If your code doesn't actually look at these files in a way that you'd ever be able to tell, then the JVM is free to not do that. And it won't, you can rely on the evil coin being evil: If the JMM takes a 'right', there will be some combination of CPU, OS, JVM release, version, and phase of the moon that combine to use it).
A small selection of common HB/HA establishing conditions:
The first line inside a synchronized(lock) block is HA relative to the hitting of that block in any other thread.
Exiting a synchronized(lock) block is HB relative to any other thread entering any synchronized(lock) block, assuming the two locks are the same reference.
thread.start() is HB relative to the first line that thread will run.
The 'natural' HB/HA: line X is HB relative to line Y if X and Y are run by the same thread and X is 'before it' in your code. You can't write x = 5; y = x; and have y be set by a version of x that did not witness the x = 5 happening (of course, if another thread is also modifying x, all bets are off unless you have HB/HA with whatever line is doing that).
writes and reads to volatile establish HB/HA but you usually can't get any guarantees about which direction.
This explains the way your code may fail: The get() call establishes absolutely no HB/HA relationship with the other thread that is calling put(), and therefore the get() call may or may not use locally cached variants of the various fields that HashMap uses internally, depending on the evil coin (which is of course hitting some fields; it'll be private fields in the HashMap implementation someplace, so you don't know which ones, but HashMap obviously has long-lived state, which implies fields are involved).
So why haven't you actually managed to 'see' your code asplode like the JMM says it will? Because the coin is EVIL. You cannot rely on this line of reasoning: "I wrote some code that should fail if the synchronizations I need aren't happening the way I want. I ran it a whole bunch of times, and it never failed, therefore, apparently this code is concurrency-safe and I can use this in my production code". That is simply not ever actually reliable. That's why you need to be thinking: Evil! That coin is out to get me! Because if you don't, you may be tempted to write test code like this.
You should be scared of writing code where more than one thread interacts with the same field. You should be bending over backwards to avoid it. Use message queues. Do your chat between threads by using databases, which have much nicer primitives for this stuff (transactions and isolation levels). Rewrite the code so that it takes a bunch of params up front and then runs without interacting with other threads via fields at all, until it is all done, and it then returns a result (and then use e.g. fork/join framework to make it all work). Make your webserver performant and using all the cores simply by relying on the fact that every incoming request will be its own thread, so the only thing that needs to happen for you to use all the cores is for that many folks to hit your server at the same time. If you don't have enough requests, great! Your server isn't busy so it doesn't matter you aren't using all the cores.
If truly you decide that interacting with the same field from multiple threads is the right answer, you need to think NASA programming mars rovers on the lines that interact with those fields, because tests simply cannot be relied upon. It's not as hard as it sounds - especially if you keep the actual interacting with the relevant fields down to a minimum and keep thinking: "Have I established HB/HA"?
In this case, I think Petr figured it out correctly: System.out.println is hella slow and does various synchronizing actions. JMM is a package deal, and commutative: Once HB/HA establishes, everything the HB line changed is observable to the code in the HA line, and add in the natural rule, which means all code that follows the HA line cannot possibly observe a universe where something any line before the HB line did is not yet visible. In other words, the System.out.println statements HB/HA with each other in some order, but you can't rely on that (System.out is not specced to synchronize. But, just about every implementation does. You should not rely on implementation details, and I can trivially write you some java code that is legal, compiles, runs, and breaks no contracts, because you can set System.out with System.setOut - that does not synchronize when interacting with System.out!). The evil coin in this case took the shape of 'accidental' synchronization via intentionally unspecced behaviour of System.out.

The following explanation is more in line with the terminology used in the JMM. Could be useful if you want a more solid understanding of this topic.
2 Actions are conflicting when they access the same address and there is at least 1 write.
2 Actions are concurrent when they are not ordered by a happens-before relation (there is no happens-before edge between them).
2 Actions are in data race when they are conflicting and concurrent.
When there are data races in your program, weird problems can happen like unexpected reordering of instructions, visibility problems, or atomicity problems.
So what makes up the happens-before relation. If a volatile read observes a particular volatile write, then there is a happens-before edge between the write and the read. This means that read will not only see that write, but everything that happened before that write. There are other sources of happens-before edges like the release of a monitor and subsequent acquire of the same monitor. And there is a happens-before edge between A, B when A occurs before B in the program order. Note: the happens-before relation is transitive, so if A happens-before B and B happens-before C, then A happens-before C.
In your case, you have a get/put operations which are conflicting since they access the same address(es) and there is at least 1 write.
The put/get action are concurrent, since is no happens-before edge between writing and reading because even though the write releases the monitor, the get doesn't acquire it.
Since the put/get operations are concurrent and conflicting, they are in data race.
The simplest way to fix this problem, is to execute the map.get in a synchronized block (using the same monitor). This will introduce the desired happens-before edge and makes the actions sequential instead of concurrent and as consequence, the data-race disappears.
A better-performing solution would be to make use of a ConcurrentHashMap. Instead of a single central lock, there are many locks and they can be acquired concurrently to improve scalability and performance. I'm not going to dig into the optimizations of the ConcurrentHashMap because would create confusion.
[Edit]
Apart from a data-race, your code also suffers from race conditions.

Java volatile loop

I am working on someone's code and came across the equivalent of this:
for (int i = 0; i < someVolatileMember; i++) {
// Removed for SO
}
Where someVolatileMember is defined like this:
private volatile int someVolatileMember;
If some thread, A, is running the for loop and another thread, B, writes to someVolatileMember then I assume the number of iterations to do would change while thread A is running the loop which is not great. I assume this would fix it:
final int someLocalVar = someVolatileMember;
for (int i = 0; i < someLocalVar; i++) {
// Removed for SO
}
My questions are:
Just to confirm that the number of iterations thread A does can be
changed while the for loop is active if thread B modifies
someVolatileMember
That the local non-volatile copy is sufficient to make sure that when
thread A runs the loop thread B cannot change the number of
iterations

Your understanding is correct:
Per the Java Language Specification, the semantics of a volatile field ensure consistency between values seen after updates done between different threads:
The Java programming language provides a second mechanism, volatile fields, that is more convenient than locking for some purposes.
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
Note that even without the volatile modifier, the loop count is likely to change depending on many factors.
Once a final variable is assigned, its value is never changed so the loop count will not change.

Well first of all that field is private (unless you omitted some methods that actually might alter it)...
That loop is a bit on non-sense, the way it is written and assuming there are methods that actually might alter someVolatileMember; it is so because you might never know when if finishes, or if does at all. That might even turn out to be a much more expensive loop as having a non-volatile field, because volatile means invalidating caches and draining buffers at the CPU level much more often than usual variables.
Your solution to first read a volatile and use that is actually a very common pattern; it's also given birth to a very common anti-pattern too : "check then act"... You read it into a local variable because if it later changes, you don't care - you are working with the freshest copy you had at the moment. So yes, your solution to copy it locally is fine.

There are also performance implications, since the value of volatile is never fetched from the most local cache but additional steps are being taken by the CPU to ensure that modifications are propagated (it could be cache coherence protocols, deferring reads to L3 cache, or reading from RAM). There are also implications to other variables in scope where volatile variable is used (these get synced with main memory too, however i am not demonstrating it here).
Regarding performance, following code:
private static volatile int limit = 1_000_000_000;
public static void main(String[] args) {
long start = System.nanoTime();
for (int i = 0; i < limit; i++ ) {
limit--; //modifying and reading, otherwise compiler will optimise volatile out
}
System.out.println(limit + " took " + (System.nanoTime() - start) / 1_000_000 + "ms");
}
... prints 500000000 took 4384ms
Removing volatile keyword from above will result in output 500000000 took 275ms.

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads my throughput is much lower (75% lower) then single-threaded for obvious reasons (e.g. concurrent CAS).
Use case:
A counter variable which
is updated by multiple threads concurrently
has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
So I tried to replace AtomicLong with a LongAdder, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).
However I'm not sure I correctly understand the way LongAdder works.
The JavaDoc says:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
and for sum()
Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.
What is meant by fine-grained synchronization control ...
From looking at this so question and the source of AtomicLong and Striped64, I think I understand that if the update on an AtomicLong is blocked because of a CAS instruction issued by another thread, the update is stored thread-local and accumulated later to get some eventual consistency. So without further synchronization and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:
private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)
If this is possible, AtomicLong is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".
And then with my write/read^n pattern I probably can't do better then AtomicLong?

LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.
In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:
the incrementAndGet in LongAdder is not atomic
... but I don't see incrementAndGet at all in LongAdder. Where are you looking?
Your idea of:
usage in a thread will consist of a w rite with an immediate read afterwards
Requirement is that each read
from the counter (immediately after the writing) gets a unique
incremented value. It is not required that each retrieved counter
value is increasing in the same order as the different
threads(writers) increment the value.
Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.
So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().
Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transition the object though a specific series of values, and where there is no guarantee on what specific value a thread will see, all the values should come from the true set of transition values.
For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respetively. The final value will always be 4, and only two possible transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3 and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all, they may all see 0 or 4.
LongCounter doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a not-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2" for example.
Now that's a bit of a subtle concept and the Javadoc doesn't get it across well. They go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions) since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongCounter.sum() have pretty much the same guarantees.
Something Useful
OK, so I'll give you something that might be useful. You can still implement what you want for efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order they were read.
Re-purpose the LongAdder Idea
You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID2, and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.
Now way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value you could use the same code except with the top or bottom N bits3 set uniquely per counter.
Now the write can return the prior value, like getAndIncrement and it will be unique because the fixed bits keep it unique among all counters in that object.
Thread-local Counters
A very fast and simple way is to use a unique value per thread, and a thread-local counter. When the thread local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24-bits for the ID, and the top 40-bits for the local counter1. This should be very fast, and more importantly essentially zero contention.
The downside is that the values of the counters won't have any specific relationship among threads (although they may still be strictly increasing within a thread). For example, a thread which has recently requested a counter value may get a much smaller one than a long existing value. You haven't described how you'll use these so I don't know if it is a problem.
Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).
A different solution, if you want the numbers to be "generally increasing with time" across threads, and know that every thread requests counter values reasonably frequently, is to use a single global counter, which threads request a local "allocation" of a number of IDs, from which it will then allocate individual IDs in a thread-local manner. For example, threads may request 10 IDs, so that three threads will be allocated the range 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted and which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-local.
The example above will keep the IDs roughly in increasing order over time, and each threads IDs will be strictly increasing as well. It doesn't offer any strict guarantees though: a thread that is allocated the range 0-9, could very well sleep for hours after using 0, and then use "1" when the counters on other threads are much higher. It would reduce contention by a factor of 10.
There are a variety of other approaches you could use and mostof them trade-off contention reduction versus the "accuracy" of the counter assignment versus real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to realtime (assuming the OS is synchronizing the counters).
1 The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits to the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).
2 The optimal is to use the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.
3 Where N is lg2(number of counters).

There's a subtle difference between the two implementations.
An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.
A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.
LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case, I wrote the following which confirmed that around 1 in 10 sums were be repeated (which renders LongAdder unusable for your use case).
public static void main (String[] args) throws Exception
{
LongAdder adder = new LongAdder();
ExecutorService executor = Executors.newFixedThreadPool(10);
Map<Long, Integer> count = new ConcurrentHashMap<>();
for (int i = 0; i < 10; i++)
{
executor.execute(() -> {
for (int j = 0; j < 1000000; j++)
{
adder.add(1);
count.merge(adder.longValue(), 1, Integer::sum);
}
});
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.HOURS);
count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
}

The reorder explanation of modified String.hashCode()

Refer to this blog and this topic.
It seems the code will be reorder even in single thread ?
public int hashCode() {
if (hash == 0) { // (1)
int off = offset;
char val[] = value;
int len = count;
int h = 0;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return hash; // (2)
}
But its really confusing to me, why (2) could return 0 and (1) could be non-zero ?
If i use the code in single thread, this will even doen't work, how could it happens ?

The first point of java memory model is:
Each action in a thread happens before every action in that thread
that comes later in the program's order.
That's why reordering in single thread is impossible. As long as code is not synchronized such guaranties are not provided for multiple threads.
Have a look at String hashCode implementation. It first loads hash to a local variable and only then performs check and return. That's how such reorderings are prevented. But this does not save us from multiple hashCode calculations.

First question:
Will reordering of instructions happen in single threaded execution?
Answer:
Reordering of instructions is a compiler optimization. The order of instructions in one thread will be the same no matter how many threads involved. Or: Yes in single threaded, too.
Second question:
Why could this lead to a problem in multi-threading but not with one thread?
Answer:
The rules for this reordering are designed to gurantee that there are no strange effects in single threaded or correctly synchronized code. That means: If we write code that's neither single threaded nor correctly synchronized there might be strange effects and we have to understand the rules and take care to avoid those effects.
So again as the auhor of the orginal blog said: Don't try if you're not really sure to understand those rules. And every compiler will be tested not to break String.hashCode() but compliers won't be tested with your code.
Edit:
Third question:
And again what is really happening?
Answer:
As we look at the code it will deal fine with not seeing changes of another thread.So the first thing we have to understand is: A method doesn't return a variable nor a constanst nor a literal. No a method return what's on top of the stack when the programm counter is reset. This has to be initialized at some point in time and it can be overwritten later on. This means it can be initialized first with the content of hash (0 now) then another thread finishes calculation and set hash to something and then the check hash == 0 happens. In turn the return value is not overwritten anymore and 0 is returned.
So the point is: The return value can change independently of the returned variable as it is not the same. Modern programming language make it look the same to make our lives easier. But this abstraction as wholes when you don't adhere to the rules.

Measuring time interval and out of order execution

I've been reading about Java memory model and I'm aware that it's possible for compiler to reorganize statements to optimize code.
Suppose I had a the following code:
long tick = System.nanoTime();
function_or_block_whose_time_i_intend_to_measure();
long tock = System.nanoTime();
would the compiler ever reorganize the code in a way that what I intend to measure is not executed between tick and tock? For example,
long tick = System.nanoTime();
long tock = System.nanoTime();
function_or_block_whose_time_i_intend_to_measure();
If so, what's the right way to preserve execution order?
EDIT:
Example illustrating out-of-order execution with nanoTime :
public class Foo {
public static void main(String[] args) {
while (true) {
long x = 0;
long tick = System.nanoTime();
for (int i = 0; i < 10000; i++) { // This for block takes ~15sec on my machine
for (int j = 0; j < 600000; j++) {
x = x + x * x;
}
}
long tock = System.nanoTime();
System.out.println("time=" + (tock - tick));
x = 0;
}
}
}
Output of above code:
time=3185600
time=16176066510
time=16072426522
time=16297989268
time=16063363358
time=16101897865
time=16133391254
time=16170513289
time=16249963612
time=16263027561
time=16239506975
In the above example, the time measured in first iteration is significantly lower than the measured time in the subsequent runs. I thought this was due to out of order execution. What am I doing wrong with the first iteration?

would the compiler ever reorganize the code in a way that what I intend to measure is not executed between tick and tock?
Nope. That would never happen. If that compiler optimization ever messed up, it would be a very serious bug. Quoting a statement from wiki.
The runtime (which, in this case, usually refers to the dynamic compiler, the processor and the memory subsystem) is free to introduce any useful execution optimizations as long as the result of the thread in isolation is guaranteed to be exactly the same as it would have been had all the statements been executed in the order the statements occurred in the program (also called program order).
So the optimization may be done as long as the result is the same as when executed in program order. In the case that you cited I would assume the optimization is local and that there are no other threads that would be interested in this data. These optimizations are done to reduce the number of trips made to main memory which can be costly. You will only run into trouble with these optimizations when multiple threads are involved and they need to know each other's state.
Now if 2 threads need to see each other's state consistently, they can use volatile variables or a memory barrier (synchronized) to force serialization of writes / reads to main memory. Infoq ran a nice article on this that might interest you.

Java Memory Model (JMM) defines a partial ordering called happens-before on all actions with the program. There are seven rules defined to ensure happens-before ordering. One of them is called Program order rule:
Program order rule. Each action in a thread happens-before every action in that thread that comes later in the program order.
According to this rule, your code will not be re-ordered by the compiler.
The book Java Concurrency in Practice gives an excellent explanation on this topic.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.