Could the JIT collapse two volatile reads as one in certain expressions?

Suppose we have a volatile int a. One thread does
while (true) {
a = 1;
a = 0;
}
and another thread does
while (true) {
System.out.println(a+a);
}
Now, would it be illegal for a JIT compiler to emit assembly corresponding to 2*a instead of a+a?
On one hand the very purpose of a volatile read is that it should always be fresh from memory.
On the other hand, there's no synchronization point between the two reads, so I can't see that it would be illegal to treat a+a atomically, in which case I don't see how an optimization such as 2*a would break the spec.
References to JLS would be appreciated.

Short answer:
Yes, this optimization is allowed. Collapsing two sequential read operations produces the observable behavior of the sequence being atomic, but does not appear as a reordering of operations. Any sequence of actions performed on a single thread of execution can be executed as an atomic unit. In general, it is difficult to ensure that a sequence of operations executes atomically, and doing so rarely results in a performance gain because most execution environments introduce overhead to execute items atomically.
In the example given by the original question, the sequence of operations in question is the following:
read(a)
read(a)
Performing these operations atomically guarantees that the value read on the first line is equal to the value read on the second line. Furthermore, it means the value read on the second line is the value contained in a at the time the first read was executed (and vice versa, because with atomic execution both read operations occur at the same time according to the observable execution state of the program). The optimization in question, which is reusing the value of the first read for the second read, is equivalent to the compiler and/or JIT executing the sequence atomically, and is thus valid.
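To make the two forms concrete, here is a minimal sketch of what the source expression and the hypothetical collapsed form look like when written by hand. The class and method names (CollapsedVolatileRead, sumTwoReads, sumCollapsed) are invented for illustration; nothing here claims that any particular JIT performs this rewrite.

```java
final class CollapsedVolatileRead {
    static volatile int a;

    // Two separate volatile reads of a, as written in the question.
    static int sumTwoReads() {
        return a + a;
    }

    // Hand-written equivalent of the hypothetical optimization:
    // a single volatile read whose value is reused.
    static int sumCollapsed() {
        int t = a;
        return 2 * t;
    }

    public static void main(String[] args) {
        a = 1;
        System.out.println(sumTwoReads() + " " + sumCollapsed());
    }
}
```

With a concurrent writer toggling a between 0 and 1, sumTwoReads() may return 0, 1 or 2, whereas sumCollapsed() can only return 0 or 2. Both sets of results are allowed for a + a under the argument above.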
Original longer answer:
The Java Memory Model describes operations using a happens-before partial ordering. In order to express the restriction that the first read r1 and second read r2 of a cannot be collapsed, you need to show that some operation is semantically required to appear between them.
The operations on the thread with r1 and r2 are the following:
--> r(a) --> r(a) --> add -->
To express the requirement that something (say y) lie between r1 and r2, you need to require that r1 happens-before y and y happens-before r2. As it happens, there is no rule where a read operation appears on the left side of a happens-before relationship. The closest you could get is saying y happens-before r2, but the partial order would allow y to also occur before r1, thus collapsing the read operations.
If no scenario exists which requires an operation to fall between r1 and r2, then you can declare that no operation ever appears between r1 and r2 and not violate the required semantics of the language. Using a single read operation would be equivalent to this claim.
Edit: My answer is getting voted down, so I'm going to go into additional details.
Here are some related questions:
Is the Java compiler or JVM required to collapse these read operations?
No. The expressions a and a used in the add expression are not constant expressions, so there is no requirement that they be collapsed.
Does the JVM collapse these read operations?
To this, I'm not sure of the answer. By compiling a program and using javap -c, it's easy to see that the Java compiler does not collapse these read operations. Unfortunately it's not as easy to prove the JVM does not collapse the operations (or even tougher, the processor itself).
Should the JVM collapse these read operations?
Probably not. Each optimization takes time to execute, so there is a balance between the time it takes to analyze the code and the benefit you expect to gain. Some optimizations, such as array bounds check elimination or checking for null references, have proven to have extensive benefits for real-world applications. The only case where this particular optimization has the possibility of improving performance is cases where two identical read operations appear sequentially.
Furthermore, as shown by the response to this answer along with the other answers, this particular change would result in an unexpected behavior change for certain applications which users may not desire.
Edit 2: Regarding Rafael's description of a claim that two read operations cannot be reordered: this statement is designed to highlight the fact that caching the read of a in the following sequence could produce an incorrect result:
a1 = read(a)
b1 = read(b)
a2 = read(a)
result = op(a1, b1, a2)
Suppose initially a and b have their default value 0. Then you execute just the first read(a).
Now suppose another thread executes the following sequence:
a = 1
b = 1
Finally, suppose the first thread executes the line read(b). If you were to cache the originally read value of a, you would end up with the following call:
op(0, 1, 0)
This is not correct. Since the updated value of a was stored before writing to b, there is no way to read the value b1 = 1 and then read the value a2 = 0. Without caching, the correct sequence of events leads to the following call.
op(0, 1, 1)
However, if you were to ask the question "Is there any way to allow the read of a to be cached?", the answer is yes. If you can execute all three read operations in the first thread sequence as an atomic unit, then caching the value is allowed. While synchronizing across multiple variables is difficult and rarely provides an opportunistic optimization advantage, it is certainly conceivable to encounter an exception. For example, suppose a and b are each 4 bytes, and they appear sequentially in memory with a aligned on an 8-byte boundary. A 64-bit process could implement the sequence read(a) read(b) as an atomic 64-bit load operation, which would allow the value of a to be cached (effectively treating all three read operations as an atomic operation instead of just the first two).
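The claim that op(0, 1, 0) cannot arise from fresh reads can be checked by brute force. The sketch below is an illustration only: it models the two threads as a simple step-by-step interleaving (an assumption that matches the informal argument above, not the full JMM formalism), and the names ReadCachingInterleavings and outcomes are made up for this example.

```java
import java.util.Set;
import java.util.TreeSet;

final class ReadCachingInterleavings {
    // Enumerates every interleaving of the reader (read a, read b, read a)
    // with the writer (a = 1, then b = 1) and records the triple (a1, b1, a2).
    static Set<String> outcomes(boolean cacheFirstRead) {
        Set<String> results = new TreeSet<>();
        // i and j are the reader steps before which the writer's two stores
        // take effect; position 3 means "after the reader has finished".
        for (int i = 0; i <= 3; i++) {
            for (int j = i; j <= 3; j++) {
                int a = 0, b = 0;
                int[] seen = new int[3];
                for (int step = 0; step <= 3; step++) {
                    if (step == i) a = 1;   // writer: a = 1
                    if (step == j) b = 1;   // writer: b = 1 (never before a = 1)
                    if (step == 3) break;
                    if (step == 0) seen[0] = a;
                    if (step == 1) seen[1] = b;
                    // Either reuse the first read of a or read it again.
                    if (step == 2) seen[2] = cacheFirstRead ? seen[0] : a;
                }
                results.add(seen[0] + "," + seen[1] + "," + seen[2]);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println("fresh reads: " + outcomes(false));
        System.out.println("cached read: " + outcomes(true));
    }
}
```

In this model the triple 0,1,0 appears only when the first read of a is reused, which is the incorrect result described above.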

In my original answer, I argued against the legality of the suggested optimization. I based this mainly on the JSR-133 cookbook, which states that a volatile read must not be reordered with another volatile read and further states that a cached read is to be treated as a reordering. The latter statement is however formulated with some ambiguity, which is why I went through the formal definition of the JMM, where I did not find such an indication. Therefore, I would now argue that the optimization is allowed. However, the JMM is quite complex and the discussion on this page indicates that this corner case might be decided differently by someone with a more thorough understanding of the formalism.
Denoting thread 1 to execute
while (true) {
System.out.println(a // r_1
+ a); // r_2
}
and thread 2 to execute:
while (true) {
a = 0; // w_1
a = 1; // w_2
}
The two reads r_i and two writes w_i of a are synchronization actions as a is volatile (JLS 17.4.2). They are external actions as variable a is used in several threads. These actions are contained in the set of all actions A. There exists a total order of all synchronization actions, the synchronization order, which is consistent with program order for thread 1 and thread 2 (JLS 17.4.4). From the definition of the synchronizes-with partial order, there is no edge defined for this order in the above code. As a consequence, the happens-before order only reflects the intra-thread semantics of each thread (JLS 17.4.5).
With this, we define W as a write-seen function where W(r_i) = w_2 and a value-written function V(w_i) = w_2 (JLS 17.4.6). I took some freedom and eliminated w_1 as it makes this outline of a formal proof even simpler. The question is whether this proposed execution E is well-formed (JLS 17.4.7). The proposed execution E obeys intra-thread semantics, is happens-before consistent, is synchronization-order consistent and each read observes a consistent write. Checking the causality requirements is trivial (JLS 17.4.8). Nor do I see why the rules for non-terminating executions would be relevant, as the loop covers the entire discussed code (JLS 17.4.9) and we do not need to distinguish observable actions.
For all this, I cannot find any indication of why this optimization would be forbidden. Nevertheless, it is not applied for volatile reads by the HotSpot VM as one can observe using -XX:+PrintAssembly. I assume that the performance benefits are however minor and this pattern is not normally observed.
Remark: After watching the Java memory model pragmatics (multiple times), I am pretty sure, this reasoning is correct.

On one hand the very purpose of a volatile read is that it should always be fresh from memory.
That is not how the Java Language Specification defines volatile. The JLS simply says:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order).
Therefore, a write to a volatile variable happens-before (and is visible to) any subsequent reads of that same variable.
This constraint is trivially satisfied for a read that is not subsequent. That is, volatile only ensures visibility of a write if the read is known to occur after the write.
This is not the case in your program. For every well-formed execution that observes a to be 1, I can construct another well-formed execution where a is observed to be 0, simply by moving the read after the write. This is possible because the happens-before relation looks as follows:
write 1   -->   read 1                    write 1   -->   read 1
   |              |                          |              |
   |              v                          v              |
   v      -->   read 1                    write 0           v
write 0           |             vs.          |      -->   read 0
   |              |                          |              |
   v              v                          v              v
write 1   -->   read 1                    write 1   -->   read 1
That is, all the JMM guarantees for your program is that a+a will yield 0, 1 or 2. That is satisfied if a+a always yields 0. Just as the operating system is permitted to execute this program on a single core, and always interrupt thread 1 before the same instruction of the loop, the JVM is permitted to reuse the value - after all, the observable behavior remains the same.
In general, moving the read across the write violates happens-before consistency, because some other synchronization action is "in the way". In the absence of such intermediary synchronization actions, a volatile read can be satisfied from a cache.

I modified the OP's problem a little:
volatile int a
//thread 1
while (true) {
a = some_oddNumber;
a = some_evenNumber;
}
// Thread 2
while (true) {
if(isOdd(a+a)) {
break;
}
}
If the above code is executed sequentially, then there exists a valid sequentially consistent execution which will break out of the thread 2 while loop.
Whereas if the compiler optimizes a+a to 2a, then the thread 2 while loop will never exit.
So the above optimization prohibits one particular execution that would be possible if the code were executed sequentially.
The main question: is this optimization a problem?
Q. Is the transformed code sequentially consistent?
Ans. A program is correctly synchronized if, when it is executed in a sequentially consistent manner, there are no data races. Refer Example 17.4.8-1 from JLS chapter 17
Sequential consistency: the result of any execution is the same as
if the read and write operations by all processes were executed in
some sequential order and the operations of each individual
process appear in this sequence in the order specified by its
program [Lamport, 1979].
Also see http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4.3
Sequential consistency is a strong guarantee. The execution path where the compiler optimizes a+a as 2a is also a valid sequentially consistent execution.
So the answer is yes.
Q. Does the code satisfy the happens-before guarantees?
Ans. Sequential consistency implies that the happens-before guarantee is valid here.
So the answer is yes. JLS ref
So I don't think the optimization is invalid, at least in the OP's case.
The case where the thread 2 while loop gets stuck in an infinite loop is also quite possible without the compiler transformation.

As laid out in other answers there are two reads and two writes. Imagine the following execution (T1 and T2 denote two threads), using annotations that match the JLS statement below:
T1: a = 0 //W(r)
T2: read temp1 = a //r_initial
T1: a = 1 //w
T2: read temp2 = a //r
T2: print temp1+temp2
In a concurrent environment this is definitely a possible thread interleaving. Your question is then: would the JVM be allowed to make r observe W(r) and read 0 instead of 1?
JLS #17.4.5 states:
A set of actions A is happens-before consistent if for all reads r in A, where W(r) is the write action seen by r, it is not the case that either hb(r, W(r)) or that there exists a write w in A such that w.v = r.v and hb(W(r), w) and hb(w, r).
The optimisation you propose (temp = a; print (2 * temp);) would violate that requirement. So your optimisation can only work if there is no intervening write between r_initial and r, which can't be guaranteed in a typical multi threaded framework.
As a side comment, note however that there is no guarantee as to how long it will take for the writes to become visible from the reading thread. See for example: Detailed semantics of volatile regarding timeliness of visibility.

Related

Is Java Memory Model Sequential Consistency different from Leslie Lamport Definition?

JLS-17.4.3 defines the program order per thread where any reordering within the program for the thread would preserve intra-thread semantics. And then it defines sequential consistency in terms of the program-order. It says that the sequential consistency is a total order of all actions consistent with the program order.
Now I have this question: the program order is defined per thread, while sequential consistency is defined across all threads. Doesn't this violate sequential consistency as defined by Leslie Lamport?
the result of any execution is the same as
if the operations of all the processors were
executed in some sequential order, and the operations
of each individual processor appear
in this sequence in the order specified by its
program.
For example, what if the compiler reorders stores and loads looking only at the code for a particular thread (the program order of that thread)?
Edited: The following section pertains to this session from YouTube.
Java Memory Model Pragmatics
- 48th min
The first sample shows two executions by two threads. The order of each execution preserves intra-thread semantics. The next sample shows the same set of executions, but the execution on the left side has reordered its actions for Thread-1.
Now if we just think of the program order of Thread-1, the reordering is legal. But what the presenter says is that it violates the sequential consistency. But when I read the JLS, I get the impression that the reordered execution is valid and preserves sequential consistency because of having two legal program orders. Am I wrong on this and if so could you please explain what's wrong with that reasoning ?
int a = 0, b = 0;
Thread - 1        Thread - 2
----------        ----------
r1 = a;           b = 2;
r2 = b;           a = 1;
After reordering
int a = 0, b = 0;
Thread - 1        Thread - 2
----------        ----------
r2 = b;
                  b = 2;
                  a = 1;
r1 = a;

I don't think the definitions are inconsistent. I think they are (just) stated in different ways.
But this is moot for the Java Memory Model because of the following caveat at the end of JLS 17.4.3:
"If we were to use sequential consistency as our memory model, many of the compiler and processor optimizations that we have discussed would be illegal. For example, in the trace in Table 17.3, as soon as the write of 3 to p.x occurred, subsequent reads of that location would be required to see that value."
In other words, the JMM does not use Sequential Consistency as its basis.
Regarding that example in the video. What he is saying is the following (my comments in parentheses):
SC is easier for the programmer to understand. (His opinion, but probably true.)
The example violates SC. (True, but the JMM doesn't guarantee SC anyway. Indeed, the JLS itself has an example of a "surprising result" that is due to the JMM not guaranteeing SC!)
Someone should submit a JEP for SC in the JMM. (Debatable whether they should, but they certainly could.)
Actually analyzing potential optimizations to see if they violate SC is hard. (Which is maybe a good reason for the JMM to not guarantee SC. If fewer optimizations are sound with SC versus the existing JMM, then SC is liable to make JIT compiled code slower in some cases.)
AFAIK, he is saying nothing that is contentious from a technical perspective.

What does "subsequent read" mean in the context of volatile variables?

Java memory visibility documentation says that:
A write to a volatile field happens-before every subsequent read of that same field.
I'm confused about what subsequent means in the context of multithreading. Does this sentence imply some global clock for all processors and cores? So, for example, I assign a value to a variable in cycle c1 in some thread, and then a second thread is able to see this value in the subsequent cycle c1 + 1?

It sounds to me like it's saying that it provides lockless acquire/release memory-ordering semantics between threads. See Jeff Preshing's article explaining the concept (mostly for C++, but the main point of the article is language neutral, about the concept of lock-free acquire/release synchronization.)
(In fact Java volatile provides sequential consistency, not just acq/rel. There's no actual locking, though. See Jeff Preshing's article for an explanation of why the naming matches what you'd do with a lock.)
If a reader sees the value you wrote, then it knows that everything in the producer thread before that write has also already happened.
This ordering guarantee is only useful in combination with other guarantees about ordering within a single thread.
e.g.
int data[100];
volatile bool data_ready = false;
Producer:
data[0..99] = stuff;
// release store keeps previous ops above this line
data_ready = true;
Consumer:
while(!data_ready){} // spin until we see the write
// acquire-load keeps later ops below this line
int tmp = data[99]; // gets the value from the producer
If data_ready was not volatile, reading it wouldn't establish a happens-before relationship between two threads.
You don't have to have a spinloop, you could be reading a sequence number, or an array index from a volatile int, and then reading data[i].
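The producer/consumer pseudocode above could look roughly like this in Java. This is a sketch, not the answer author's code: the class name VolatilePublication and the method runOnce are invented here, and the array is filled with i + 1 purely so that the result is easy to check.

```java
final class VolatilePublication {
    static final int[] data = new int[100];       // plain, non-volatile payload
    static volatile boolean dataReady = false;    // the volatile flag that publishes it

    // Starts a consumer that spins on the flag, fills the array, sets the flag,
    // and returns what the consumer read from data[99].
    static int runOnce() {
        dataReady = false;
        int[] result = new int[1];
        Thread consumer = new Thread(() -> {
            while (!dataReady) {
                Thread.onSpinWait();              // spin until the volatile write is seen
            }
            result[0] = data[99];                 // ordered after the volatile read
        });
        consumer.start();

        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;                      // ordinary writes before the volatile write
        }
        dataReady = true;                         // volatile write publishes the array

        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Because the consumer only reads data[99] after it has seen dataReady == true, the write to the flag happens-before that read, so the consumer is guaranteed to see the value stored by the producer.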
I don't know Java well. I think volatile actually gives you sequential-consistency, not just release/acquire. A sequential-release store isn't allowed to reorder with later loads, so on typical hardware it needs an expensive memory barrier to make sure the local core's store buffer is flushed before any later loads are allowed to execute.
Volatile Vs Atomic explains more about the ordering volatile gives you.
Java volatile is just an ordering keyword; it's not equivalent to C11 _Atomic or C++11 std::atomic<T>, which also give you atomic RMW operations. In Java, volatile_var++ is not an atomic increment, it is a separate load and store, like volatile_var = volatile_var + 1. In Java, you need a class like AtomicInteger to get an atomic RMW.
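The difference between a volatile increment and an atomic read-modify-write can be sketched as follows. The class name VolatileIncrement, the method run and the thread counts are illustrative assumptions; AtomicInteger.incrementAndGet() is the standard library call.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class VolatileIncrement {
    static volatile int plainCounter;
    static final AtomicInteger atomicCounter = new AtomicInteger();

    // Runs the given number of threads, each incrementing both counters,
    // and returns { volatile counter, atomic counter }.
    static int[] run(int threads, int perThread) {
        plainCounter = 0;
        atomicCounter.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    plainCounter++;                   // separate load and store: updates can be lost
                    atomicCounter.incrementAndGet();  // one atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { plainCounter, atomicCounter.get() };
    }

    public static void main(String[] args) {
        int[] r = run(4, 100_000);
        System.out.println("volatile: " + r[0] + ", atomic: " + r[1]);
    }
}
```

The atomic counter always ends at threads * perThread. The volatile counter can end lower, because two threads may load the same value and both store that value plus one.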
And note that C/C++ volatile doesn't imply atomicity or ordering at all; it only tells the compiler to assume that the value can be modified asynchronously. This is only a small part of what you need to write lockless for anything except the simplest cases.

It means that once a certain Thread writes to a volatile field, all other Thread(s) will observe (on the next read) that written value; but this does not protect you against races though.
Threads have their caches, and those caches will be invalidated and updated with that newly written value via cache coherency protocol.
EDIT
Subsequent means whenever that happens after the write itself. Since you don't know the exact cycle/timing when that will happen, you usually say that when some other thread observes the write, it will observe all the actions done before that write; thus a volatile establishes the happens-before guarantees.
Sort of like in an example:
// Actions done in Thread A
int a = 2;
volatile int b = 3;
// Actions done in Thread B
if(b == 3) { // observe the volatile write
// Thread B is guaranteed to see a = 2 here
}
You could also loop (spin wait) until you see 3 for example.

Peter's answer gives the rationale behind the design of the Java memory model.
In this answer I'm attempting to give an explanation using only the concepts defined in the JLS.
In Java every thread is composed of a set of actions.
Some of these actions have the potential to be observable by other threads (e.g. writing a shared variable), these
are called synchronization actions.
The order in which the actions of a thread are written in the source code is called the program order.
An order defines what is before and what is after (or better, not before).
Within a thread, each action has a happens-before relationship (denoted by <) with the next (in program order) action.
This relationship is important, yet hard to understand, because it's very fundamental: it guarantees that if A < B then
the "effects" of A are visible to B.
This is indeed what we expect when writing the code of a function.
Consider
Thread 1 Thread 2
A0 A'0
A1 A'1
A2 A'2
A3 A'3
Then by the program order we know A0 < A1 < A2 < A3 and that A'0 < A'1 < A'2 < A'3.
We don't know how to order all the actions.
It could be A0 < A'0 < A'1 < A'2 < A1 < A2 < A3 < A'3 or the sequence with the primes swapped.
However, every such sequence must have that the single actions of each thread are ordered according to the thread's program order.
The two program orders are not sufficient to order every action, they are partial orders, as opposed to the
total order we are looking for.
The total order that puts the actions in a row according to the measurable time (like a clock) at which they happened is called the execution order.
It is the order in which the actions actually happened (it is only required that the actions appear to have happened in
this order, but that's just an optimization detail).
Up until now, the actions are not ordered inter-thread (between two different threads).
The synchronization actions serve this purpose.
Each synchronization action synchronizes-with at least one other synchronization action (they usually come in pairs, like
a write and a read of a volatile variable, a lock and the unlock of a mutex).
The synchronizes-with relationship is the happens-before relationship between threads (the former implies the latter). It is exposed as
a different concept because 1) it is slightly different and 2) happens-before is enforced naturally by the hardware, while synchronizes-with
may require software intervention.
happens-before is derived from the program order, synchronize-with from the synchronization order (denoted by <<).
The synchronization order is defined in terms of two properties: 1) it is a total order 2) it is consistent with each thread's
program order.
Let's add some synchronization action to our threads:
Thread 1 Thread 2
A0 A'0
S1 A'1
A1 S'1
A2 S'2
S2 A'3
The program orders are trivial.
What is the synchronization order?
We are looking for something that by 1) includes all of S1, S2, S'1 and S'2 and by 2) must have S1 < S2 and S'1 < S'2.
Possible outcomes:
S1 < S2 < S'1 < S'2
S1 < S'1 < S'2 < S2
S'1 < S1 < S'2 < S2
All are synchronization orders, there is not one synchronization order but many, the question above is wrong, it
should be "What are the synchronization orders?".
If S1 and S'1 are such that S1 << S'1 then we are restricting the possible outcomes to the ones where S1 < S'1, so the
outcome S'1 < S1 < S'2 < S2 above is now forbidden.
If S2 << S'1 then the only possible outcome is S1 < S2 < S'1 < S'2, when there is only a single outcome I believe we have
sequential consistency (the converse is not true).
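The candidate orders can also be enumerated mechanically. The sketch below merges the two per-thread sequences in every way that respects each thread's program order, then counts how many merges survive an extra "X before Y" constraint. The class and method names are made up for illustration, and the model is only the two properties stated above (total order, consistent with program order).

```java
import java.util.ArrayList;
import java.util.List;

final class SynchronizationOrders {
    // Collects every merge of the two sequences that keeps each thread's own order.
    static List<List<String>> interleavings(List<String> t1, List<String> t2) {
        List<List<String>> out = new ArrayList<>();
        merge(t1, 0, t2, 0, new ArrayList<>(), out);
        return out;
    }

    private static void merge(List<String> t1, int i, List<String> t2, int j,
                              List<String> prefix, List<List<String>> out) {
        if (i == t1.size() && j == t2.size()) {
            out.add(new ArrayList<>(prefix));     // one complete total order
            return;
        }
        if (i < t1.size()) {                      // next action comes from thread 1
            prefix.add(t1.get(i));
            merge(t1, i + 1, t2, j, prefix, out);
            prefix.remove(prefix.size() - 1);
        }
        if (j < t2.size()) {                      // next action comes from thread 2
            prefix.add(t2.get(j));
            merge(t1, i, t2, j + 1, prefix, out);
            prefix.remove(prefix.size() - 1);
        }
    }

    // Counts the orders in which action 'first' precedes action 'second'.
    static long countWhere(List<List<String>> orders, String first, String second) {
        return orders.stream()
                     .filter(o -> o.indexOf(first) < o.indexOf(second))
                     .count();
    }

    public static void main(String[] args) {
        List<List<String>> all = interleavings(List.of("S1", "S2"), List.of("S'1", "S'2"));
        all.forEach(System.out::println);
        System.out.println("S1 before S'1: " + countWhere(all, "S1", "S'1"));
        System.out.println("S2 before S'1: " + countWhere(all, "S2", "S'1"));
    }
}
```

For two actions per thread this yields six candidate orders in total (the list above shows three of them), three of which place S1 before S'1 and exactly one of which places S2 before S'1.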
Note that if A << B this doesn't mean that there is a mechanism in the code to force an execution order where A < B.
Synchronization actions are affected by the synchronization order; they do not impose any materialization of it.
Some synchronization actions (e.g. locks) impose a particular execution order (and thereby a synchronization order) but some don't (e.g. reads/writes of volatiles).
It is the execution order that creates the synchronization order; this is completely orthogonal to the synchronizes-with relationship.
Long story short, the "subsequent" adjective refers to any synchronization order, that is any valid (according to each thread
program order) order that encompasses all the synchronization actions.
The JLS then continues defining when a data race happens (when two conflicting accesses are not ordered by happens-before)
and what it means to be happens-before consistent.
Those are out of scope.

I'm confused about what subsequent means in the context of multithreading. Does this sentence imply some global clock for all processors and cores...?
Subsequent means (according to the dictionary) coming after in time. There certainly is a global clock across all CPUs in a computer (think X GHz) and the document is trying to say that if thread-1 did something at clock tick 1 and then thread-2 does something on another CPU at clock tick 2, its actions are considered subsequent.
A write to a volatile field happens-before every subsequent read of that same field.
The key phrase that could be added to this sentence to make it more clear is "in another thread". It might make more sense to understand it as:
A write to a volatile field happens-before every subsequent read of that same field in another thread.
What this is saying is that if a read of a volatile field happens in Thread-2 after (in time) the write in Thread-1, then Thread-2 will be guaranteed to see the updated value. Further up in the documentation you point to is the section (emphasis mine):
... The results of a write by one thread are guaranteed to be visible to a read by another thread only if the write operation happens-before the read operation. The synchronized and volatile constructs, as well as the Thread.start() and Thread.join() methods, can form happens-before relationships. In particular.
Notice the highlighted phrase. The Java compiler is free to reorder instructions in any one thread's execution for optimization purposes as long as the reordering doesn't violate the definition of the language – this is called execution order and is critically different than program order.
Let's look at the following example with variables a and b that are non-volatile ints initialized to 0 with no synchronized clauses. What is shown is program order and the time in which the threads are encountering the lines of code.
Time    Thread-1        Thread-2
1       a = 1;
2       b = 2;
3                       x = a;
4                       y = b;
5       c = a + b;      z = x + y;
If Thread-1 adds a + b at Time 5, it is guaranteed to be 3. However, if Thread-2 adds x + y at Time 5, it might get 0, 1, 2, or 3 depending on race conditions. Why? Because the compiler might have reordered the instructions in Thread-1 to set a after b for efficiency reasons. Also, Thread-1 may not have appropriately published the values of a and b, so Thread-2 might get out-of-date values. Even if Thread-1 gets context-switched out or crosses a write memory barrier and a and b are published, Thread-2 needs to cross a read barrier to update any cached values of a and b.
If a and b were marked as volatile then the write to a must happen-before (in terms of visibility guarantees) the subsequent read of a on line 3 and the write to b must happen-before the subsequent read of b on line 4. Both threads would get 3.
We use volatile and synchronized keywords in Java to ensure happens-before guarantees. A write memory barrier is crossed when assigning a volatile or exiting a synchronized block and a read barrier is crossed when reading a volatile or entering a synchronized block. The Java compiler cannot reorder write instructions past these memory barriers so the order of updates is assured. These keywords control instruction reordering and ensure proper memory synchronization.
NOTE: volatile is unnecessary in a single-threaded application because program order assures the reads and writes will be consistent. A single-threaded application might see any value of (non-volatile) a and b at times 3 and 4 but it always sees 3 at Time 5 because of language guarantees. So although use of volatile changes the reordering behavior in a single-threaded application, it is only required when you share data between threads.
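The quoted documentation also names Thread.start() and Thread.join() as forming happens-before relationships. A minimal sketch of that case, reusing the a, b, x, y, z names from the table above with deliberately non-volatile fields (the class name StartJoinVisibility and the method run are invented for illustration):

```java
final class StartJoinVisibility {
    static int a, b, z;   // deliberately non-volatile

    static int run() {
        a = 1;
        b = 2;
        Thread reader = new Thread(() -> {
            // start() happens-before every action in the started thread,
            // so the writes to a and b above are visible here.
            int x = a;
            int y = b;
            z = x + y;
        });
        reader.start();
        try {
            // All actions in the reader happen-before join() returns,
            // so the write to z is visible afterwards.
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Here no volatile is needed because the two threads never run the relevant code concurrently. The race in the table above arises only when the reads in Thread-2 are not ordered after the writes in Thread-1 by some happens-before edge.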

This is more a definition of what will not happen rather than what will happen.
Essentially it is saying that once a write to an atomic variable has happened there cannot be any other thread that, on reading the variable, will read a stale value.
Consider the following situation.
Thread A is continuously incrementing an atomic value a.
Thread B occasionally reads A.a and exposes that value as a
non-atomic b variable.
Thread C occasionally reads both A.a and B.b.
Given that a is atomic it is possible to reason that from the point of view of C, b may occasionally be less than a but will never be greater than a.
If a was not atomic no such guarantee could be given. Under certain caching situations it would be quite possible for C to see b progress beyond a at any time.
This is a simplistic demonstration of how the Java memory model allows you to reason about what can and cannot happen in a multi-threaded environment. In real life the potential race conditions between reading and writing to data structures can be much more complex but the reasoning process is the same.

Does Java volatile read flush writes, and does volatile write update reads

I understand read-acquire (does not reorder with subsequent read/write operations after it) and write-release (does not reorder with read/write operations preceding it).
My questions are:
In case of read-acquire, do the writes preceding it get flushed?
In case of write-release, do the previous reads get updated?
Also, is read-acquire same as volatile read, and write release same as volatile write in Java?
Why this is important is that, let's take case of write-release..
y = x; // a read.. let's say x is 1 at this point
System.out.println(y);// 1 printed
//or you can also consider System.out.println(x);
write_release_barrier();
//somewhere here, some thread sets x = 2
ready = true;// this is volatile
System.out.println(y);// or maybe, println(x).. what will be printed?
At this point, is x 2 or 1?
Here, consider ready to be volatile.
I understand that all stores before volatile will first be made visible.. and then only the volatile will be made visible. Thanks.
Ref:- http://preshing.com/20120913/acquire-and-release-semantics/

No: not all writes are flushed, nor are all reads updated.
Java works on a "happens-before" basis for multithreading. Basically, if A happens-before B, and B happens-before C, then A happens-before C. So your question amounts to whether x=2 formally happens-before some action that reads x.
Happens-before edges are basically established by synchronizes-with relationships, which are defined in JLS 17.4.4. There are a few different ways to do this, but for volatiles, it basically amounts to a write to a volatile happening-before a read of that same volatile:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order).
Given that, if your thread writes ready = true, then that write alone doesn't mean anything happens-before it (as far as that write is concerned). It's actually the opposite; that write to ready happens-before things on other threads, iff they read ready.
So, if the other thread (that sets x = 2) had written to ready after it set x = 2, and this thread (that you posted above) then read ready, then it would see x = 2. That is because the write happens-before the read, and the reading thread therefore sees everything that the writing thread had done (up to and including the write). Otherwise, you have a data race, and basically all bets are off.
A couple additional notes:
If you don't have a happens-before edge, you may still see the update; it's just that you're not guaranteed to. So, don't assume that if you don't read a write to ready, then you'll still see x=1. You might see x=1, or x=2, or possibly some other write (up to and including the default value of x=0)
In your example, y is always going to be 1, because you don't re-read x after the "somewhere here" comment. For purposes of this answer, I've assumed that there's a second y=x line immediately before or after ready = true. If there's not, then y's value will be unchanged from what it was in the first println, (assuming no other thread directly changes it -- which is guaranteed if it's a local variable), because actions within a thread always appear as if they are not reordered.
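To make the guaranteed case concrete, here is a minimal sketch in which the other thread writes ready after setting x = 2, and this thread reads ready before reading x. The class name VolatilePublication and the helper runOnce are made up for illustration; x and ready mirror the fields in the question.

```java
// Hypothetical class; x and ready mirror the fields from the question.
class VolatilePublication {
    static int x = 1;                 // plain (non-volatile) field
    static volatile boolean ready;    // volatile flag used to publish x

    // Runs one writer and one reader; returns the value of x the reader saw.
    static int runOnce() {
        x = 1;
        ready = false;
        int[] seen = new int[1];

        Thread writer = new Thread(() -> {
            x = 2;          // ordinary write
            ready = true;   // volatile write, after the write to x
        });
        Thread reader = new Thread(() -> {
            while (!ready) {            // volatile read, repeated until true is seen
                Thread.onSpinWait();
            }
            seen[0] = x;    // the write x = 2 happens-before this read
        });

        reader.start();
        writer.start();
        join(writer);
        join(reader);       // join() makes seen[0] visible to the caller
        return seen[0];
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Once the reader has seen ready == true, the read of x is guaranteed to return 2. If ready were not volatile, there would be no happens-before edge: the loop might never terminate, and the value read from x would not be guaranteed.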

The Java memory model is not specified in terms of "read-acquire" and "write-release". These terms / concepts come from other contexts, and as the article you referenced makes abundantly clear, they are often used (by different experts) to mean different things.
If you want to understand how volatiles work in Java, you need to understand the Java memory model and the Java terminology ... which is (fortunately) well-founded and precisely specified [1]. Trying to map the Java memory model onto "read-acquire" and "write-release" semantics is a bad idea because:
"read-acquire" and "write-release" terminology and semantics are not well specified, and
a hypothetical JMM -> "read-acquire" / "write-release" semantic mapping is only one possible implementation of the JMM. Other mappings may exist with different, and equally valid, semantics.
[1] - ... modulo that experts have noted flaws in some versions of the JMM. But the point is that a serious attempt has been made to provide a theoretically sound specification ... as part of the Java Language Specification.

No, reading a volatile variable will not flush preceding writes. Visible actions will ensure that preceding actions are visible, but reading a volatile variable is not visible to other threads.
No, writing to a volatile variable will not clear the cache of previously read values. It is only guaranteed to flush previous writes.
In your example, clearly y will still be 1 on the last line. Only one assignment has been made to y, and that was 1, according to the preceding output. Perhaps that was a typo, and you meant to write println(x), but even then, the value of 2 is not guaranteed to be visible.

For your first question, the answer is FIFO order.
For your second question, please check the question "Volatile Vs Static in java".

The volatile key word and memory consistency errors

In the Oracle Java documentation, the following is said:
Atomic actions cannot be interleaved, so they can be used without fear of thread interference. However, this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible. Using volatile variables reduces the risk of memory consistency errors, because any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
It also says:
Reads and writes are atomic for reference variables and for most
primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long
and double variables).
I have two questions regarding these statements:
"Using volatile variables reduces the risk of memory consistency errors" - What do they mean by "reduces the risk", and how is a memory consistency error still possible when using volatile?
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads? I ask this since it seems that those variables already have atomic reads.

What do they mean by "reduces the risk"?
Atomicity is one issue addressed by the Java Memory Model. However, more important than Atomicity are the following issues:
memory architecture, e.g. impact of CPU caches on read and write operations
CPU optimizations, e.g. reordering of loads and stores
compiler optimizations, e.g. added and removed loads and stores
The following listing contains a frequently used example. The operations on x and y are atomic. Still, the program can print both lines.
int x = 0, y = 0;
// thread 1
x = 1;
if (y == 0) System.out.println("foo");
// thread 2
y = 1;
if (x == 0) System.out.println("bar");
However, if you declare x and y as volatile, at most one of the two lines can be printed.
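Here is a small harness for trying the volatile variant of this listing. It is only a sketch: the class and method names are made up, and instead of printing, each thread records the value it read, so that a read of 0 stands for one printed line.

```java
// Hypothetical harness; x and y mirror the listing above, now declared volatile.
class VolatileStoreLoadDemo {
    static volatile int x, y;

    // Runs both threads once and returns how many of them read 0,
    // i.e. how many of the two lines would have been printed.
    static int linesPrinted() {
        x = 0;
        y = 0;
        int[] read = new int[2];
        Thread t1 = new Thread(() -> { x = 1; read[0] = y; });
        Thread t2 = new Thread(() -> { y = 1; read[1] = x; });
        t1.start();
        t2.start();
        join(t1);
        join(t2);
        return (read[0] == 0 ? 1 : 0) + (read[1] == 0 ? 1 : 0);
    }

    // Repeats the experiment and returns the largest count seen in a single run.
    static int maxLinesPrinted(int trials) {
        int max = 0;
        for (int i = 0; i < trials; i++) {
            max = Math.max(max, linesPrinted());
        }
        return max;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(maxLinesPrinted(1000));
    }
}
```

With volatile fields the result never exceeds 1, because the two reads cannot both return 0. Dropping the volatile modifier makes a result of 2 legal, although whether you actually observe it depends on the JIT and the hardware.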
How is a memory consistency error still possible when using volatile?
The following example uses volatile. However, updates might still get lost.
volatile int x = 0;
// thread 1
x += 1;
// thread 2
x += 1;
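A minimal sketch of the lost-update problem and one common remedy, java.util.concurrent.atomic.AtomicInteger. The class name, field names and iteration count are illustrative assumptions, not part of the original example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class comparing a volatile counter with an AtomicInteger.
class LostUpdateDemo {
    static volatile int volatileCounter;
    static final AtomicInteger atomicCounter = new AtomicInteger();

    // Two threads each increment both counters perThread times.
    // Returns { final volatile counter, final atomic counter }.
    static int[] run(int perThread) {
        volatileCounter = 0;
        atomicCounter.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                volatileCounter += 1;             // read, add, write: three separate steps
                atomicCounter.incrementAndGet();  // one atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        join(t1);
        join(t2);
        return new int[] { volatileCounter, atomicCounter.get() };
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = run(100_000);
        System.out.println("volatile: " + result[0] + ", atomic: " + result[1]);
    }
}
```

The atomic counter always ends at exactly 2 * perThread. The volatile counter can end anywhere at or below that value, because two threads may read the same old value and both write back old + 1.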
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads?
Happens-before is often misunderstood. The consistency model defined by happens-before is weak and difficult to use correctly. This can be demonstrated with the following example, which is known as Independent Reads of Independent Writes (IRIW):
volatile int x = 0, y = 0;
// thread 1
x = 1;
// thread 2
y = 1;
// thread 3
if (x == 1) System.out.println(y);
// thread 4
if (y == 1) System.out.println(x);
With happens-before alone, two 0s would be a valid result. However, that is clearly counter-intuitive. For that reason, Java provides a stricter consistency model that forbids this relativity issue, known as sequential consistency. You can find it in sections §17.4.3 and §17.4.5 of the Java Language Specification. The most important part is:
A program is correctly synchronized if and only if all sequentially consistent executions are free of data races. If a program is correctly synchronized, then all executions of the program will appear to be sequentially consistent (§17.4.3).
That means, volatile gives you more than happens-before. It gives you sequential consistency if used for all conflicting accesses (§17.4.3).

The usual example:
while(!condition)
    sleep(10);
If condition is volatile, this behaves as expected. If it is not, the compiler is allowed to optimize this to
if(!condition)
    for(;;)
        sleep(10);
This is completely orthogonal to atomicity: if condition is of a hypothetical integer type that is not atomic, then the sequence
thread 1 writes upper half to 0
thread 2 reads upper half (0)
thread 2 reads lower half (0)
thread 1 writes lower half (1)
can happen while the variable is updated from a nonzero value that just happens to have a lower half of zero to a nonzero value that has an upper half of zero; in this case, thread 2 reads the variable as zero. The volatile keyword in this case makes sure that thread 2 really reads the variable instead of using its local copy, but it does not affect timing.
Also, atomicity does not protect against
thread 1 reads value (0)
thread 2 reads value (0)
thread 1 writes incremented value (1)
thread 2 writes incremented value (1)
One of the best ways to use atomic volatile variables is as the read and write counters of a ring buffer:
thread 1 looks at read pointer, calculates free space
thread 1 fills free space with data
thread 1 updates write pointer (which is `volatile`, so the side effects of filling the free space are also committed before)
thread 2 looks at write pointer, calculates amount of data received
...
Here, no lock is needed to synchronize the threads: atomicity guarantees that the read and write pointers will always be accessed consistently, and volatile enforces the necessary ordering.
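The ring buffer steps above can be sketched roughly as follows, for exactly one producer thread and one consumer thread. The class name, the int payload and the one-empty-slot convention for telling "full" from "empty" are my own assumptions, not something prescribed by the description.

```java
// Hypothetical single-producer / single-consumer ring buffer.
class SpscRingBuffer {
    private final int[] slots;
    private volatile int writeIndex; // written only by the producer
    private volatile int readIndex;  // written only by the consumer

    SpscRingBuffer(int capacity) {
        slots = new int[capacity + 1]; // one slot stays empty to tell full from empty
    }

    // Producer side: returns false if the buffer is full.
    boolean offer(int value) {
        int w = writeIndex;
        int next = (w + 1) % slots.length;
        if (next == readIndex) {
            return false;           // no free space
        }
        slots[w] = value;           // plain write to the free slot
        writeIndex = next;          // volatile write publishes the slot content
        return true;
    }

    // Consumer side: returns null if the buffer is empty.
    Integer poll() {
        int r = readIndex;
        if (r == writeIndex) {      // volatile read of the write pointer
            return null;
        }
        int value = slots[r];
        readIndex = (r + 1) % slots.length; // volatile write frees the slot
        return value;
    }

    // Sends 1..count through a small buffer and returns the sum the consumer received.
    static long transfer(int count) {
        SpscRingBuffer buf = new SpscRingBuffer(8);
        long[] sum = new long[1];
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                while (!buf.offer(i)) {
                    Thread.onSpinWait();
                }
            }
        });
        Thread consumer = new Thread(() -> {
            int received = 0;
            while (received < count) {
                Integer v = buf.poll();
                if (v == null) {
                    Thread.onSpinWait();
                } else {
                    sum[0] += v;
                    received++;
                }
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(transfer(1000));
    }
}
```

Each index is written by only one thread, so no read-modify-write race exists, and the volatile write of an index happens after the slot access it is meant to publish. With more than one producer or consumer this sketch is no longer safe.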

For question 1, the risk is only reduced (and not eliminated) because volatile only applies to a single read/write operation and not more complex operations such as increment, decrement, etc.
For question 2, the effect of volatile is to make changes immediately visible to other threads. As the quoted passage states "this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible." Simply because reads are atomic does not mean that they are thread safe. So establishing a happens before relationship is almost a (necessary) side-effect of guaranteeing memory consistency across threads.

Ad 1: With a volatile variable, the variable is always checked against a master copy and all threads see a consistent state. But if you use that volatile variable in a non-atomic operation that writes back the result (say a = f(a)), then you might still create a memory inconsistency. That's how I would understand the remark "reduces the risk". A volatile variable is consistent at the time of the read, but you still might need to use synchronized.
Ad 2: I don't know. But: If your definition of "happens before" includes the remark
This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
I would not dare to rely on any other property except that volatile ensures this. What else do you expect from it?!

Assume that you have a CPU with a CPU cache or CPU registers. Regardless of how many cores your CPU architecture has, volatile does NOT guarantee you perfect consistency. The only way to achieve this is to use synchronized or atomic references, at a performance price.
For example, you have multiple threads (Thread A & Thread B) working on shared data. Assume that Thread A wants to update the shared data and it has been started. For performance reasons, Thread A's stack was moved to the CPU cache or registers. Then Thread A updated the shared data. But the problem with those places is that they don't actually flush the updated value back to main memory immediately. This is where the inconsistency arises, because until the flush-back operation, Thread B might have wanted to play with the same data, and would have taken it from main memory, where the value is not yet updated.
If you use volatile, all the operations will be performed on main memory, so you don't have a flush-back latency. But this time you may suffer from thread interleaving. In the middle of a write operation (composed of a number of atomic operations), Thread B may have been scheduled by the OS to perform a read operation, and that's it! Thread B will read the not-yet-updated value again. That's why it's said that it reduces the risk.
Hope you got it.

When it comes to concurrency, you might want to ensure two things:
atomic operations: a set of operations is atomic - this is usually achieved with "synchronized" (higher-level constructs), and also with volatile, for instance for reads and writes of long and double.
visibility: a thread B sees a modification made by a thread A. Even if an operation is atomic, like a write to an int variable, a second thread can still see a non-up-to-date value of the variable, due to processor caches. Putting a variable as volatile ensures that the second thread does see the up-to-date value of that variable. More than that, it ensures that the second thread sees an up-to-date value of ALL the variables written by the first thread before the write to the volatile variable.

Interpretation of "program order rule" in Java concurrency

Program order rule states "Each action in a thread happens-before every action in that thread that comes later in the program order"
1. I read in another thread that an action is
reads and writes to variables
locks and unlocks of monitors
starting and joining with threads
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
2. What does "program order" mean?
An explanation with examples would be really helpful.
Additional related question
Suppose I have the following code:
long tick = System.nanoTime(); //Line1: Note the time
//Block1: some code whose time I wish to measure goes here
long tock = System.nanoTime(); //Line2: Note the time
Firstly, it's a single threaded application to keep things simple. Compiler notices that it needs to check the time twice and also notices a block of code that has no dependency with surrounding time-noting lines, so it sees a potential to reorganize the code, which could result in Block1 not being surrounded by the timing calls during actual execution (for instance, consider this order Line1->Line2->Block1). But, I as a programmer can see the dependency between Line1,2 and Block1. Line1 should immediately precede Block1, Block1 takes a finite amount of time to complete, and immediately succeeded by Line2.
So my question is: Am I measuring the block correctly?
If yes, what is preventing the compiler from rearranging the order.
If no (which I think is correct after going through Enno's answer), what can I do to prevent it?
P.S.: I stole this code from another question I asked in SO recently.

It probably helps to explain why such a rule exists in the first place.
Java is a procedural language. I.e. you tell Java how to do something for you. If Java executes your instructions not in the order you wrote, it would obviously not work. E.g. in the below example, if Java would do 2 -> 1 -> 3 then the stew would be ruined.
1. Take lid off
2. Pour salt in
3. Cook for 3 hours
So, why does the rule not simply say "Java executes what you wrote in the order you wrote"? In a nutshell, because Java is clever. Take the following example:
1. Take eggs out of the freezer
2. Take lid off
3. Take milk out of the freezer
4. Pour egg and milk in
5. Cook for 3 hours
If Java were like me, it would just execute it in order. However, Java is clever enough to understand that it's more efficient AND that the end result would be the same should it do 1 -> 3 -> 2 -> 4 -> 5 (you don't have to walk to the freezer again, and that doesn't change the recipe).
So what the rule "Each action in a thread happens-before every action in that thread that comes later in the program order" is trying to say is, "In a single thread, your program will run as if it was executed in the exact order you wrote it. We might change the ordering behind the scenes, but we make sure that none of that would change the output."
So far so good. Why does it not do the same across multiple threads? In multi-thread programming, Java isn't clever enough to do it automatically. It will for some operations (e.g. joining threads, starting threads, when a lock (monitor) is used etc.) but for other stuff you need to explicitly tell it to not do reordering that would change the program output (e.g. volatile marker on fields, use of locks etc.).
Note:
Quick addendum about the "happens-before relationship". This is a fancy way of saying that no matter what reordering Java might do, stuff A will happen before stuff B. In our later, weirder stew example, "Steps 1 & 3 happen-before step 4 (Pour egg and milk in)". Also, for example, "Steps 1 & 3 do not need a happens-before relationship because they don't depend on each other in any way".
On the additional question & response to the comment
First, let us establish what "time" means in the programming world. In programming, we have the notion of "absolute time" (what's the time in the world now?) and the notion of "relative time" (how much time has passed since x?). In an ideal world, time is time, but unless we have an atomic clock built in, the absolute time has to be corrected from time to time. On the other hand, for relative time we don't want corrections, as we are only interested in the differences between events.
In Java, System.currentTimeMillis() deals with absolute time and System.nanoTime() deals with relative time. This is why the Javadoc of nanoTime states, "This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time".
In practice, both currentTimeMillis and nanoTime are native calls, and thus the compiler cannot practically prove that a reordering would not affect correctness, which means it will not reorder the execution.
But let us imagine we want to write a compiler implementation that actually looks into native code and reorders everything as long as it's legal. When we look at the JLS, all that it tells us is that "You can reorder anything as long as it cannot be detected". Now as the compiler writer, we have to decide if the reordering would violate the semantics. For relative time (nanoTime), it would clearly be useless (i.e. violates the semantics) if we'd reorder the execution. Now, would it violate the semantics if we'd reorder for absolute time (currentTimeMillis)? As long as we can limit the difference from the source of the world's time (let's say the system clock) to whatever we decide (like "50ms")*, I say no. For the below example:
long tick = System.currentTimeMillis();
result = compute();
long tock = System.currentTimeMillis();
print(result + ":" + (tock - tick));
If the compiler can prove that compute() takes less than whatever maximum divergence from the system clock we can permit, then it would be legal to reorder this as follows:
long tick = System.currentTimeMillis();
long tock = System.currentTimeMillis();
result = compute();
print(result + ":" + (tock - tick));
Doing that won't violate the spec we defined, and thus won't violate the semantics.
You also asked why this is not included in the JLS. I think the answer would be "to keep the JLS short". But I don't know much about this realm so you might want to ask a separate question for that.
*: In actual implementations, this difference is platform dependent.

The program order rule guarantees that, within individual threads, reordering optimizations introduced by the compiler cannot produce different results from what would have happened if the program had been executed in serial fashion. It makes no guarantees about what order the thread's actions may appear to occur in to any other threads if its state is observed by those threads without synchronization.
Note that this rule speaks only to the ultimate results of the program, and not to the order of individual executions within that program. For instance, if we have a method which makes the following changes to some local variables:
x = 1;
z = z + 1;
y = 1;
The compiler remains free to reorder these operations however it sees best fit to improve performance. One way to think of this is: if you could reorder these ops in your source code and still obtain the same results, the compiler is free to do the same. (And in fact, it can go even further and completely discard operations which are shown to have no results, such as invocations of empty methods.)
With your second bullet point, the monitor lock rule comes into play: "An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock." (Java Concurrency in Practice p. 341) This means that a thread acquiring a given lock will have a consistent view of the actions which occurred in other threads before they released that lock. However, note that this guarantee only applies when two different threads release and acquire the same lock. If Thread A does a bunch of stuff before releasing Lock X, and then Thread B acquires Lock Y, Thread B is not assured to have a consistent view of A's pre-X actions.
It is possible for reads and writes to variables to be reordered with start and join if a.) doing so doesn't break within-thread program order, and b.) the variables have not had other "happens-before" thread synchronization semantics applied to them, say by storing them in volatile fields.
A simple example:
class ThreadStarter {
    Object a = null;
    Object b = null;
    Thread thread;
    ThreadStarter(Thread threadToStart) {
        this.thread = threadToStart;
    }
    public void aMethod() {
        a = new BeforeStartObject();
        b = new BeforeStartObject();
        thread.start();
        a = new AfterStartObject();
        b = new AfterStartObject();
        a.doSomeStuff();
        b.doSomeStuff();
    }
}
Since the fields a and b and the method aMethod() are not synchronized in any way, and the action of starting thread does not change the results of the writes to the fields (or the doing of stuff with those fields), the compiler is free to reorder thread.start() to anywhere in the method. The only thing it could not do with the order of aMethod() would be to move the order of writing one of the BeforeStartObjects to a field after writing an AfterStartObject to that field, or to move one of the doSomeStuff() invocations on a field before the AfterStartObject is written to it. (That is, assuming that such reordering would change the results of the doSomeStuff() invocation in some way.)
The critical thing to bear in mind here is that, in the absence of synchronization, the thread started in aMethod() could theoretically observe either or both of the fields a and b in any of the states which they take on during the execution of aMethod() (including null).
Additional question answer
The assignments to tick and tock cannot be reordered with respect to the code in Block1 if they are to be actually used in any measurements, for example by calculating the difference between them and printing the result as output. Such reordering would clearly break Java's within-thread as-if-serial semantics. It changes the results from what would have been obtained by executing instructions in the specified program order. If the assignments aren't used for any measurements and have no side-effects of any kind on the program result, they'll likely be optimized away as no-ops by the compiler rather than being reordered.
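As a rough sketch of a measurement where both the result of the block and the two timestamps are actually used: the class name and the block1 workload are hypothetical stand-ins for "Block1" in the question.

```java
// Hypothetical demo; block1() stands in for the code being measured.
class ElapsedTimeDemo {
    // Some work whose result is returned, so it cannot be discarded as unused.
    static long block1() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    // Returns { result of block1, elapsed nanoseconds }.
    static long[] measure() {
        long tick = System.nanoTime();   // Line1: note the time
        long result = block1();          // Block1
        long tock = System.nanoTime();   // Line2: note the time
        return new long[] { result, tock - tick };
    }

    public static void main(String[] args) {
        long[] m = measure();
        System.out.println("result=" + m[0] + ", elapsed ns=" + m[1]);
    }
}
```

Both the result and the difference tock - tick flow into the output, which is the situation described above. For serious benchmarking, warm-up and JIT effects still distort a single measurement like this, so a harness such as JMH is usually a better fit.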

Before I answer the question,
reads and writes to variables
Should be
volatile reads and volatile writes (of the same field)
Program order doesn't guarantee this happens before relationship, rather the happens-before relationship guarantees program order
To your questions:
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
The answer actually depends on what action happens first and what action happens second. Take a look at the JSR 133 Cookbook for Compiler Writers. There is a Can Reorder grid that lists the allowed compiler reordering that can occur.
For instance, a Volatile Store can be reordered above or below a Normal Store, but a Volatile Store cannot be reordered above or below a Volatile Load. This is all assuming intra-thread semantics still hold.
What does "program order" mean?
This is from the JLS
Among all the inter-thread actions performed by each thread t, the
program order of t is a total order that reflects the order in which
these actions would be performed according to the intra-thread
semantics of t.
In other words, if you can change the writes and loads of a variable in such a way that it will perform exactly the same way as you wrote it, then it maintains program order.
For instance
public static Object getInstance(){
    if(instance == null){
        instance = new Object();
    }
    return instance;
}
Can be reordered to
public static Object getInstance(){
    Object temp = instance;
    if(instance == null){
        temp = instance = new Object();
    }
    return temp;
}

It simply means that although the threads may be multiplexed, the internal order of each thread's actions/operations/instructions remains constant (relative to one another).
thread1: T1op1, T1op2, T1op3...
thread2: T2op1, T2op2, T2op3...
Though the order of operations (TnopM) among threads may vary, the operations T1op1, T1op2, T1op3 within a thread will always be in this order, and so will T2op1, T2op2, T2op3.
for ex:
T2op1, T1op1, T1op2, T2op2, T2op3, T1op3

Java tutorial http://docs.oracle.com/javase/tutorial/essential/concurrency/memconsist.html says that happens-before relationship is simply a guarantee that memory writes by one specific statement are visible to another specific statement. Here is an illustration
int x;
synchronized void x() {
    x += 1;
}
synchronized void y() {
    System.out.println(x);
}
synchronized creates a happens-before relationship. If we remove it, there is no guarantee that after thread A increments x, thread B will print 1; it may print 0.
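A runnable variant of the illustration, as a sketch only: the class name, the method names increment/get and the demo helper are made up here. One thread increments through a synchronized method while the calling thread polls through a synchronized getter on the same object.

```java
// Hypothetical counter guarded by its own monitor.
class SyncCounter {
    private int x;

    synchronized void increment() {
        x += 1;
    }

    synchronized int get() {
        return x;
    }

    // Starts a thread that increments once, then polls until the change is seen.
    static int demo() {
        SyncCounter counter = new SyncCounter();
        Thread a = new Thread(counter::increment);
        a.start();
        // Each get() locks the same monitor that increment() unlocks, so once
        // thread A has left increment(), the next get() must observe x == 1.
        while (counter.get() == 0) {
            Thread.onSpinWait();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because both methods synchronize on the same object, the unlock at the end of increment() happens-before the next lock in get(), so the loop terminates and demo() returns 1. Without synchronized on get(), the loop would have no such guarantee.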
