Does Java volatile read flush writes, and does volatile write update reads


I understand read-acquire (it is not reordered with the read/write operations that follow it) and write-release (it is not reordered with the read/write operations that precede it).
My questions are:
In case of a read-acquire, do the writes preceding it get flushed?
In case of a write-release, do the previous reads get updated?
Also, is read-acquire the same as a volatile read, and write-release the same as a volatile write, in Java?
This matters because of cases like the following write-release example:
y = x; // a read.. let's say x is 1 at this point
System.out.println(y);// 1 printed
//or you can also consider System.out.println(x);
write_release_barrier();
//somewhere here, some thread sets x = 2
ready = true;// this is volatile
System.out.println(y);// or maybe, println(x).. what will be printed?
At this point, is x 2 or 1?
Here, consider ready to be volatile.
I understand that all stores before the volatile write are made visible first, and only then is the volatile write itself made visible. Thanks.
Ref:- http://preshing.com/20120913/acquire-and-release-semantics/

No: not all writes are flushed, nor are all reads updated.
Java works on a "happens-before" basis for multithreading. Basically, if A happens-before B, and B happens-before C, then A happens-before C. So your question amounts to whether x=2 formally happens-before some action that reads x.
Happens-before edges are basically established by synchronizes-with relationships, which are defined in JLS 17.4.4. There are a few different ways to do this, but for volatiles, it basically amounts to a write to a volatile happening-before a read of that same volatile:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order).
Given that, if your thread writes ready = true, then that write alone doesn't mean anything happens-before it (as far as that write is concerned). It's actually the opposite; that write to ready happens-before things on other threads, iff they read ready.
So, if the other thread (that sets x = 2) had written to ready after it set x = 2, and this thread (that you posted above) then read ready, then it would see x = 2. That is because the write happens-before the read, and the reading thread therefore sees everything that the writing thread had done (up to and including the write). Otherwise, you have a data race, and basically all bets are off.
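The scenario described above can be sketched as follows. This is only an illustration under stated assumptions: the class name VolatileHandoff and the method runOnce are hypothetical, x is a plain field, and ready is the volatile flag from the question. The writer sets x = 2 and then writes ready; the reader waits until it reads ready as true and only then reads x.

```java
class VolatileHandoff {
    static int x = 1;                 // plain field, as in the question
    static volatile boolean ready;    // volatile flag

    // Runs one writer and one reader; returns the value of x the reader observed.
    static int runOnce() {
        x = 1;
        ready = false;
        int[] seen = new int[1];
        Thread writer = new Thread(() -> {
            x = 2;          // ordinary write
            ready = true;   // volatile write, after x = 2 in program order
        });
        Thread reader = new Thread(() -> {
            while (!ready) {        // volatile read; loop until the write is observed
                Thread.yield();
            }
            seen[0] = x;            // ordered after x = 2 by happens-before, so reads 2
        });
        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();          // join() makes seen[0] visible to the caller
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // prints 2
    }
}
```

If the reader skipped the loop on ready and read x directly, there would be no happens-before edge, and either 1 or 2 could be observed.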
A couple additional notes:
If you don't have a happens-before edge, you may still see the update; it's just that you're not guaranteed to. So, don't assume that if you don't read a write to ready, then you'll still see x=1. You might see x=1, or x=2, or possibly some other write (up to and including the default value of x=0)
In your example, y is always going to be 1, because you don't re-read x after the "somewhere here" comment. For purposes of this answer, I've assumed that there's a second y=x line immediately before or after ready = true. If there's not, then y's value will be unchanged from what it was in the first println, (assuming no other thread directly changes it -- which is guaranteed if it's a local variable), because actions within a thread always appear as if they are not reordered.

The Java memory model is not specified in terms of "read-acquire" and "write-release". These terms / concepts come from other contexts, and as the article you referenced makes abundantly clear, they are often used (by different experts) to mean different things.
If you want to understand how volatiles work in Java, you need to understand the Java memory model and the Java terminology ... which is (fortunately) well-founded and precisely specified [1]. Trying to map the Java memory model onto "read-acquire" and "write-release" semantics is a bad idea because:
"read-acquire" and "write-release" terminology and semantics are not well specified, and
a hypothetical JMM -> "read-acquire" / "write-release" semantic mapping is only one possible implementation of the JMM. Other mappings may exist with different, and equally valid, semantics.
[1] ... modulo that experts have noted flaws in some versions of the JMM. But the point is that a serious attempt has been made to provide a theoretically sound specification ... as part of the Java Language Specification.

No, reading a volatile variable will not flush preceding writes. Visible actions will ensure that preceding actions are visible, but reading a volatile variable is not visible to other threads.
No, writing to a volatile variable will not clear the cache of previously read values. It is only guaranteed to flush previous writes.
In your example, clearly y will still be 1 on the last line. Only one assignment has been made to y, and that was 1, according to the preceding output. Perhaps that was a typo, and you meant to write println(x), but even then, the value of 2 is not guaranteed to be visible.

For your first question, the answer is FIFO order.
For your second question, please check "Volatile Vs Static in java".

Related

volatile vs not volatile

Let's consider the following piece of code in Java
int x = 0;
int who = 1;
Thread #1:
(1) x++;
(2) who = 2;
Thread #2
while(who == 1);
x++;
print x; ( the value should be equal to 2 but, perhaps, it is not* )
(I don't know the Java memory model; let's assume it is a strong memory model, meaning that (1) and (2) will not be swapped.)
The Java memory model guarantees that loads and stores of 32-bit variables are atomic, so our program is safe. Nevertheless, we should use the volatile attribute because of *: the value of x may be equal to 1, because x can be kept in a register when Thread #2 reads it. To resolve this we should make the x variable volatile. That is clear.
But, what about that situation:
int x = 0;
mutex m; ( just any mutex)
Thread #1:
mutex.lock()
x++;
mutex.unlock()
Thread #2
mutex.lock()
x++;
print x; // the value is always 2, why**?
mutex.unlock()
The value of x is always 2 though we don't make it volatile. Do I correctly understand that locking/unlocking mutex is connected with inserting memory barriers?

I'll try to tackle this. The Java memory model is kind of involved and hard to contain in a single StackOverflow post. Please refer to Brian Goetz's Java Concurrency in Practice for the full story.
The value of x is always 2 though we don't make it volatile. Do I correctly understand that locking/unlocking mutex is connected with inserting memory barriers?
First, if you want to understand the Java memory model, Chapter 17 of the spec is always what you want to read through.
That spec says:
An unlock on a monitor happens-before every subsequent lock on that monitor.
So yes, there's a memory visibility event at the unlock of your monitor. (I assume by "mutex" you mean monitor. Most of the locks and other classes in the java.util.concurrent package also have happens-before semantics; check the documentation.)
Happens-before is what Java means when it guarantees not just that the events are ordered, but also that memory visibility is guaranteed.
We say that a read r of a variable v is allowed to observe a write w
to v if, in the happens-before partial order of the execution trace:
r is not ordered before w (i.e., it is not the case that
hb(r, w)), and
there is no intervening write w' to v (i.e. no write w' to v such
that hb(w, w') and hb(w', r)).
Informally, a read r is allowed to see the result of a write w if there
is no happens-before ordering to prevent that read.
This is all from 17.4.5. It's a little confusing to read through, but the info is all there if you do read through it.
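As a rough sketch of the unlock/lock rule quoted above, the following uses a plain Java monitor via synchronized. The class and method names (MonitorCounter, incrementAndRead, runTwoThreads) are made up for illustration. Note that x is not volatile; whichever thread acquires the monitor second is guaranteed to see the first thread's increment and therefore reads 2.

```java
import java.util.Arrays;

class MonitorCounter {
    private final Object m = new Object(); // the monitor ("mutex" in the question)
    private int x = 0;                     // not volatile; only accessed while holding m

    int incrementAndRead() {
        synchronized (m) {   // lock: ordered after the previous unlock of m
            x++;
            return x;
        }                    // unlock: happens-before the next lock of m
    }

    // Two threads each increment once; returns what each thread read, sorted ascending.
    static int[] runTwoThreads() {
        MonitorCounter c = new MonitorCounter();
        int[] seen = new int[2];
        Thread t1 = new Thread(() -> seen[0] = c.incrementAndRead());
        Thread t2 = new Thread(() -> seen[1] = c.incrementAndRead());
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();       // join() makes both array slots visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        Arrays.sort(seen);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runTwoThreads())); // prints [1, 2]
    }
}
```

The thread that gets the monitor first reads 1, so "always 2" in the question only holds for the thread that locks second.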

Let's go over some things. The following statement is true: the Java memory model guarantees that access/store to 32-bit variables is atomic. However, it does not follow that the first pseudoprogram you listed is safe. Simply because two statements are ordered syntactically does not mean that the visibility of their updates is also so ordered as viewed by other threads. Thread #2 may see the update caused by who=2 before the increment of x is visible. Making x volatile would still not make the program correct. Instead, making the variable 'who' volatile would make the program correct. That is because volatile interacts with the Java memory model in specific ways.
I feel like there is some notion of 'writing back to main memory' at the core of a common sense understanding of volatile which is incorrect. Volatile does not write back the value to main memory in Java. What reading from and writing to a volatile variable does is create what's called a happens-before relationship. When thread #1 writes to a volatile variable you're creating a relationship that ensures that any other threads #2 viewing that volatile variable will also be able to 'view' all the actions thread #1 has taken before that. In your example that means making 'who' volatile. By writing the value 2 to 'who' you are creating a happens-before relationship so that when thread #2 views who=2 it will similarly see an updated version of x.
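A minimal sketch of the questioner's first program with 'who' (rather than x) made volatile might look like this. The wrapper class WhoFlag and the method run are assumptions added for the example; x and who come from the question.

```java
class WhoFlag {
    static int x;               // deliberately NOT volatile
    static volatile int who;    // the volatile write/read pair creates the happens-before edge

    // Runs the two threads from the question; returns the value thread #2 would print.
    static int run() {
        x = 0;
        who = 1;
        int[] printed = new int[1];
        Thread t1 = new Thread(() -> {
            x++;        // (1)
            who = 2;    // (2) volatile write publishes the increment of x
        });
        Thread t2 = new Thread(() -> {
            while (who == 1) {   // volatile read; exits once who == 2 is observed
                Thread.yield();
            }
            x++;                 // sees x == 1, so x becomes 2
            printed[0] = x;
        });
        t2.start();
        t1.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return printed[0];
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 2
    }
}
```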
In your second example (assuming you meant to have the 'who' variable too), unlocking the mutex creates a happens-before relationship as described above. That means other threads that observe the unlock (i.e. that are able to lock the mutex themselves) will see the updated version of x.

Could the JIT collapse two volatile reads as one in certain expressions?

Suppose we have a volatile int a. One thread does
while (true) {
a = 1;
a = 0;
}
and another thread does
while (true) {
System.out.println(a+a);
}
Now, would it be illegal for a JIT compiler to emit assembly corresponding to 2*a instead of a+a?
On one hand the very purpose of a volatile read is that it should always be fresh from memory.
On the other hand, there's no synchronization point between the two reads, so I can't see that it would be illegal to treat a+a atomically, in which case I don't see how an optimization such as 2*a would break the spec.
References to JLS would be appreciated.

Short answer:
Yes, this optimization is allowed. Collapsing two sequential read operations produces the observable behavior of the sequence being atomic, but does not appear as a reordering of operations. Any sequence of actions performed on a single thread of execution can be executed as an atomic unit. In general, it is difficult to ensure a sequence of operations executes atomically, and it rarely results in a performance gain because most execution environments introduce overhead to execute items atomically.
In the example given by the original question, the sequence of operations in question is the following:
read(a)
read(a)
Performing these operations atomically guarantees that the value read on the first line is equal to the value read on the second line. Furthermore, it means the value read on the second line is the value contained in a at the time the first read was executed (and vice versa, because, being atomic, both read operations occurred at the same time according to the observable execution state of the program). The optimization in question, which is reusing the value of the first read for the second read, is equivalent to the compiler and/or JIT executing the sequence atomically, and is thus valid.
Original longer answer:
The Java Memory Model describes operations using a happens-before partial ordering. In order to express the restriction that the first read r1 and second read r2 of a cannot be collapsed, you need to show that some operation is semantically required to appear between them.
The operations on the thread with r1 and r2 are the following:
--> r(a) --> r(a) --> add -->
To express the requirement that something (say y) lie between r1 and r2, you need to require that r1 happens-before y and y happens-before r2. As it happens, there is no rule where a read operation appears on the left side of a happens-before relationship. The closest you could get is saying y happens-before r2, but the partial order would allow y to also occur before r1, thus collapsing the read operations.
If no scenario exists which requires an operation to fall between r1 and r2, then you can declare that no operation ever appears between r1 and r2 and not violate the required semantics of the language. Using a single read operation would be equivalent to this claim.
Edit My answer is getting voted down, so I'm going to go into additional details.
Here are some related questions:
Is the Java compiler or JVM required to collapse these read operations?
No. The expressions a and a used in the add expression are not constant expressions, so there is no requirement that they be collapsed.
Does the JVM collapse these read operations?
To this, I'm not sure of the answer. By compiling a program and using javap -c, it's easy to see that the Java compiler does not collapse these read operations. Unfortunately it's not as easy to prove the JVM does not collapse the operations (or even tougher, the processor itself).
Should the JVM collapse these read operations?
Probably not. Each optimization takes time to execute, so there is a balance between the time it takes to analyze the code and the benefit you expect to gain. Some optimizations, such as array bounds check elimination or checking for null references, have proven to have extensive benefits for real-world applications. The only case where this particular optimization has the possibility of improving performance is cases where two identical read operations appear sequentially.
Furthermore, as shown by the response to this answer along with the other answers, this particular change would result in an unexpected behavior change for certain applications which users may not desire.
Edit 2: Regarding Rafael's description of the claim that two read operations cannot be reordered: this statement is designed to highlight the fact that caching the read operation of a in the following sequence could produce an incorrect result:
a1 = read(a)
b1 = read(b)
a2 = read(a)
result = op(a1, b1, a2)
Suppose initially a and b have their default value 0. Then you execute just the first read(a).
Now suppose another thread executes the following sequence:
a = 1
b = 1
Finally, suppose the first thread executes the line read(b). If you were to cache the originally read value of a, you would end up with the following call:
op(0, 1, 0)
This is not correct. Since the updated value of a was stored before writing to b, there is no way to read the value b1 = 1 and then read the value a2 = 0. Without caching, the correct sequence of events leads to the following call.
op(0, 1, 1)
However, if you were to ask the question "Is there any way to allow the read of a to be cached?", the answer is yes. If you can execute all three read operations in the first thread sequence as an atomic unit, then caching the value is allowed. While synchronizing across multiple variables is difficult and rarely provides an opportunistic optimization advantage, it is certainly conceivable to encounter an exception. For example, suppose a and b are each 4 bytes, and they appear sequentially in memory with a aligned on an 8-byte boundary. A 64-bit process could implement the sequence read(a) read(b) as an atomic 64-bit load operation, which would allow the value of a to be cached (effectively treating all three read operations as an atomic operation instead of just the first two).
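The a1/b1/a2 argument above can be checked with a small stress sketch, assuming a and b are both volatile (the class ReadOrdering and the method countViolations are hypothetical names). The writer stores a = 1 and then b = 1; the reader performs the three reads in order. Under the JMM, a trial in which b1 == 1 but a2 == 0 (the op(0, 1, 0) case) must never be observed.

```java
class ReadOrdering {
    static volatile int a;
    static volatile int b;

    // Returns the number of trials in which the reads contradicted the write order.
    static int countViolations(int trials) {
        int violations = 0;
        for (int i = 0; i < trials; i++) {
            a = 0;
            b = 0;
            Thread writer = new Thread(() -> {
                a = 1;   // written first
                b = 1;   // written second
            });
            writer.start();
            int a1 = a;  // read(a)
            int b1 = b;  // read(b)
            int a2 = a;  // read(a) again, not reusing a1
            // a only ever goes from 0 to 1, and seeing b == 1 implies a == 1 afterwards.
            if (a1 > a2 || (b1 == 1 && a2 == 0)) {
                violations++;
            }
            try {
                writer.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(countViolations(1000)); // prints 0
    }
}
```

Replacing a2 with the cached a1 in the check would allow the forbidden combination, which is the point the paragraph above makes.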

In my original answer, I argued against the legality of the suggested optimization. I based this mainly on information from the JSR-133 cookbook, where it states that a volatile read must not be reordered with another volatile read and where it further states that a cached read is to be treated as a reordering. The latter statement is, however, formulated with some ambiguity, which is why I went through the formal definition of the JMM, where I did not find such an indication. Therefore, I would now argue that the optimization is allowed. However, the JMM is quite complex and the discussion on this page indicates that this corner case might be decided differently by someone with a more thorough understanding of the formalism.
Let thread 1 execute
while (true) {
System.out.println(a // r_1
+ a); // r_2
}
and thread 2 to execute:
while (true) {
a = 0; // w_1
a = 1; // w_2
}
The two reads r_i and two writes w_i of a are synchronization actions as a is volatile (JLS 17.4.2). They are external actions as variable a is used in several threads. These actions are contained in the set of all actions A. There exists a total order of all synchronization actions, the synchronization order, which is consistent with program order for thread 1 and thread 2 (JLS 17.4.4). From the definition of the synchronizes-with partial order, there is no edge defined for this order in the above code. As a consequence, the happens-before order only reflects the intra-thread semantics of each thread (JLS 17.4.5).
With this, we define W as a write-seen function where W(r_i) = w_2 and a value-written function V(w_i) = w_2 (JLS 17.4.6). I took some freedom and eliminated w_1 as it makes this outline of a formal proof even simpler. The question is whether this proposed execution E is well-formed (JLS 17.4.7). The proposed execution E obeys intra-thread semantics, is happens-before consistent, obeys the synchronizes-with order, and each read observes a consistent write. Checking the causality requirements is trivial (JLS 17.4.8). Nor do I see why the rules for non-terminating executions would be relevant, as the loop covers the entire discussed code (JLS 17.4.9) and we do not need to distinguish observable actions.
For all this, I cannot find any indication of why this optimization would be forbidden. Nevertheless, it is not applied for volatile reads by the HotSpot VM, as one can observe using -XX:+PrintAssembly. I assume, however, that the performance benefits are minor and that this pattern is not normally observed.
Remark: After watching the Java memory model pragmatics (multiple times), I am pretty sure, this reasoning is correct.

On one hand the very purpose of a volatile read is that it should always be fresh from memory.
That is not how the Java Language Specification defines volatile. The JLS simply says:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order).
Therefore, a write to a volatile variable happens-before (and is visible to) any subsequent reads of that same variable.
This constraint is trivially satisfied for a read that is not subsequent. That is, volatile only ensures visibility of a write if the read is known to occur after the write.
This is not the case in your program. For every well-formed execution that observes a to be 1, I can construct another well-formed execution where a is observed to be 0, simply by moving the read after the write. This is possible because the happens-before relation looks as follows:
write 1   -->   read 1                    write 1   -->   read 1
   |              |                          |              |
   |              v                          v              |
   v      -->   read 1                    write 0           v
write 0           |             vs.          |      -->   read 0
   |              |                          |              |
   v              v                          v              v
write 1   -->   read 1                    write 1   -->   read 1
That is, all the JMM guarantees for your program is that a+a will yield 0, 1 or 2. That is satisfied if a+a always yields 0. Just as the operating system is permitted to execute this program on a single core, and always interrupt thread 1 before the same instruction of the loop, the JVM is permitted to reuse the value - after all, the observable behavior remains the same.
In general, moving the read across the write violates happens-before consistency, because some other synchronization action is "in the way". In the absence of such intermediary synchronization actions, a volatile read can be satisfied from a cache.
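To make the "0, 1 or 2" claim concrete, here is a small sketch that compares a + a (two volatile reads in the source) with 2 * tmp (one read reused). The class DoubleRead, the stop flag, and the method allSumsInRange are assumptions added for the example. Every value the single-read form can produce (0 or 2) is also a value the two-read form may legally produce, which is why reusing the value is not observably wrong here.

```java
class DoubleRead {
    static volatile int a;
    static volatile boolean stop;

    // Returns true if every observed sum stayed within what the JMM allows.
    static boolean allSumsInRange(int reads) {
        a = 0;
        stop = false;
        Thread writer = new Thread(() -> {
            while (!stop) {   // keep flipping a between 1 and 0
                a = 1;
                a = 0;
            }
        });
        writer.start();
        boolean ok = true;
        for (int i = 0; i < reads; i++) {
            int sum = a + a;        // two reads: 0, 1 or 2 are all possible
            int tmp = a;            // one read ...
            int doubled = 2 * tmp;  // ... reused: only 0 or 2 are possible
            if (sum < 0 || sum > 2) ok = false;
            if (doubled != 0 && doubled != 2) ok = false;
        }
        stop = true;
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(allSumsInRange(1_000_000)); // prints true
    }
}
```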

Modified the OP Problem a little
volatile int a
//thread 1
while (true) {
a = some_oddNumber;
a = some_evenNumber;
}
// Thread 2
while (true) {
if(isOdd(a+a)) {
break;
}
}
If the above code were executed sequentially, there would exist a valid sequentially consistent execution that breaks out of the thread 2 while loop.
Whereas if the compiler optimizes a+a to 2*a, then the thread 2 while loop will never exit.
So the above optimization prohibits one particular execution that would be possible if the code were executed sequentially.
The main question: is this optimization a problem?
Q. Is the transformed code sequentially consistent?
Ans. A program is correctly synchronized if, when it is executed in a sequentially consistent manner, there are no data races. Refer to Example 17.4.8-1 from JLS chapter 17.
Sequential consistency: the result of any execution is the same as
if the read and write operations by all processes were executed in
some sequential order and the operations of each individual
process appear in this sequence in the order specified by its
program [Lamport, 1979].
Also see http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4.3
Sequential consistency is a strong guarantee. The execution path where the compiler optimizes a+a to 2*a is also a valid sequentially consistent execution.
So the answer is yes.
Q. Does the code still honor the happens-before guarantees?
Ans. Sequential consistency implies that the happens-before guarantee holds here.
So the answer is yes. JLS ref
So I don't think the optimization is illegal, at least in the OP's case.
The case where the thread 2 while loop gets stuck in an infinite loop is also quite possible without the compiler transformation.

As laid out in other answers there are two reads and two writes. Imagine the following execution (T1 and T2 denote two threads), using annotations that match the JLS statement below:
T1: a = 0 //W(r)
T2: read temp1 = a //r_initial
T1: a = 1 //w
T2: read temp2 = a //r
T2: print temp1+temp2
In a concurrent environment this is definitely a possible thread interleaving. Your question is then: would the JVM be allowed to make r observe W(r) and read 0 instead of 1?
JLS #17.4.5 states:
A set of actions A is happens-before consistent if for all reads r in A, where W(r) is the write action seen by r, it is not the case that either hb(r, W(r)) or that there exists a write w in A such that w.v = r.v and hb(W(r), w) and hb(w, r).
The optimisation you propose (temp = a; print (2 * temp);) would violate that requirement. So your optimisation can only work if there is no intervening write between r_initial and r, which can't be guaranteed in a typical multi threaded framework.
As a side comment, note however that there is no guarantee as to how long it will take for the writes to become visible from the reading thread. See for example: Detailed semantics of volatile regarding timeliness of visibility.

The volatile key word and memory consistency errors

In the oracle Java documentation located here, the following is said:
Atomic actions cannot be interleaved, so they can be used without fear of thread interference. However, this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible. Using volatile variables reduces the risk of memory consistency errors, because any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
It also says:
Reads and writes are atomic for reference variables and for most
primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long
and double variables).
I have two questions regarding these statements:
"Using volatile variables reduces the risk of memory consistency errors" - What do they mean by "reduces the risk", and how is a memory consistency error still possible when using volatile?
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads? I ask this since it seems that those variables already have atomic reads.

What do they mean by "reduces the risk"?
Atomicity is one issue addressed by the Java Memory Model. However, more important than Atomicity are the following issues:
memory architecture, e.g. impact of CPU caches on read and write operations
CPU optimizations, e.g. reordering of loads and stores
compiler optimizations, e.g. added and removed loads and stores
The following listing contains a frequently used example. The operations on x and y are atomic. Still, the program can print both lines.
int x = 0, y = 0;
// thread 1
x = 1;
if (y == 0) System.out.println("foo");
// thread 2
y = 1;
if (x == 0) System.out.println("bar");
However, if you declare x and y as volatile, at most one of the two lines can be printed.
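A hedged sketch of the volatile variant of this listing follows. The class StoreBuffering and the method countBothZero are hypothetical; instead of printing, each thread records what it read so the outcome can be checked. With x and y volatile, the outcome where both threads read 0 (both lines printed) should never occur.

```java
class StoreBuffering {
    static volatile int x;
    static volatile int y;

    // Returns how many trials ended with both threads reading 0.
    static int countBothZero(int trials) {
        int bothZero = 0;
        for (int i = 0; i < trials; i++) {
            x = 0;
            y = 0;
            int[] r = new int[2];
            Thread t1 = new Thread(() -> {
                x = 1;
                r[0] = y;   // "foo" would be printed if this is 0
            });
            Thread t2 = new Thread(() -> {
                y = 1;
                r[1] = x;   // "bar" would be printed if this is 0
            });
            t1.start();
            t2.start();
            try {
                t1.join();
                t2.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            if (r[0] == 0 && r[1] == 0) {
                bothZero++;
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println(countBothZero(1000)); // prints 0
    }
}
```

Removing volatile from x and y makes a non-zero count legal, although whether it actually shows up depends on the JIT and the hardware.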
How is a memory consistency error still possible when using volatile?
The following example uses volatile. However, updates might still get lost.
volatile int x = 0;
// thread 1
x += 1;
// thread 2
x += 1;
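The lost-update problem in the listing above can be contrasted with java.util.concurrent.atomic.AtomicInteger, which performs the read-modify-write as one atomic step. This is a sketch with made-up names (LostUpdate, run); the volatile counter may end up below the expected total, while the atomic counter may not.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LostUpdate {
    static volatile int plain;                            // volatile, but += is read-then-write
    static final AtomicInteger atomic = new AtomicInteger();

    // Two threads each add 1 perThread times; returns {volatile total, atomic total}.
    static int[] run(int perThread) {
        plain = 0;
        atomic.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                plain += 1;                // two separate volatile accesses: updates can be lost
                atomic.incrementAndGet();  // single atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new int[] { plain, atomic.get() };
    }

    public static void main(String[] args) {
        int[] r = run(100_000);
        // The second number is always 200000; the first may be smaller.
        System.out.println(r[0] + " " + r[1]);
    }
}
```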
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads?
Happens-before is often misunderstood. The consistency model defined by happens-before is weak and difficult to use correctly. This can be demonstrated with the following example, that is known as Independent Reads of Independent Writes (IRIW):
volatile int x = 0, y = 0;
// thread 1
x = 1;
// thread 2
y = 1;
// thread 3
if (x == 1) System.out.println(y);
// thread 4
if (y == 1) System.out.println(x);
With happens-before alone, two 0s would be a valid result. However, that is clearly counter-intuitive. For that reason, Java provides a stricter consistency model that forbids this relativity issue, known as sequential consistency. You can find it in sections §17.4.3 and §17.4.5 of the Java Language Specification. The most important part is:
A program is correctly synchronized if and only if all sequentially consistent executions are free of data races. If a program is correctly synchronized, then all executions of the program will appear to be sequentially consistent (§17.4.3).
That means, volatile gives you more than happens-before. It gives you sequential consistency if used for all conflicting accesses (§17.4.3).

The usual example:
while(!condition)
sleep(10);
If condition is volatile, this behaves as expected. If it is not, the compiler is allowed to optimize this to
if(!condition)
for(;;)
sleep(10);
This is completely orthogonal to atomicity: if condition is of a hypothetical integer type that is not atomic, then the sequence
thread 1 writes upper half to 0
thread 2 reads upper half (0)
thread 2 reads lower half (0)
thread 1 writes lower half (1)
can happen while the variable is updated from a nonzero value that just happens to have a lower half of zero to a nonzero value that has an upper half of zero; in this case, thread 2 reads the variable as zero. The volatile keyword in this case makes sure that thread 2 really reads the variable instead of using its local copy, but it does not affect timing.
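The polling loop from the start of this answer might be written as follows, with condition declared volatile so the loop cannot be rewritten into a single check followed by an endless loop. The class StopFlag, the method workerStops, and the timeout are assumptions for the sake of a runnable sketch.

```java
class StopFlag {
    static volatile boolean condition;   // volatile: each loop iteration really re-reads it

    // Starts a polling worker, sets the flag, and reports whether the worker ended in time.
    static boolean workerStops(long timeoutMillis) {
        condition = false;
        Thread worker = new Thread(() -> {
            while (!condition) {
                try {
                    Thread.sleep(10);    // same shape as while(!condition) sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        worker.start();
        condition = true;                // volatile write seen by the worker's next read
        try {
            worker.join(timeoutMillis);  // wait a bounded time for the worker to finish
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(workerStops(5000));
    }
}
```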
Also, atomicity does not protect against
thread 1 reads value (0)
thread 2 reads value (0)
thread 1 writes incremented value (1)
thread 2 writes incremented value (1)
One of the best ways to use atomic volatile variables is for the read and write counters of a ring buffer:
thread 1 looks at read pointer, calculates free space
thread 1 fills free space with data
thread 1 updates write pointer (which is `volatile`, so the side effects of filling the free space are also committed before)
thread 2 looks at write pointer, calculates amount of data received
...
Here, no lock is needed to synchronize the threads: atomicity guarantees that the read and write pointers will always be accessed consistently, and volatile enforces the necessary ordering.
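The ring-buffer steps above can be sketched for exactly one producer thread and one consumer thread. Everything here (the class SpscRing, its methods offer, poll and transfer, the capacity of 16) is an illustrative assumption, not a production queue. The data array is plain; only the two indices are volatile, and each index is written by one thread only.

```java
class SpscRing {
    private final int[] buf;
    private volatile int writeIdx;   // written only by the producer
    private volatile int readIdx;    // written only by the consumer

    SpscRing(int capacity) {
        buf = new int[capacity + 1]; // one slot stays empty to tell "full" from "empty"
    }

    boolean offer(int v) {
        int w = writeIdx;
        int next = (w + 1) % buf.length;
        if (next == readIdx) return false; // full
        buf[w] = v;                        // plain write of the data ...
        writeIdx = next;                   // ... published by the volatile index write
        return true;
    }

    Integer poll() {
        int r = readIdx;
        if (r == writeIdx) return null;    // empty
        int v = buf[r];                    // visible because writeIdx was read first
        readIdx = (r + 1) % buf.length;    // volatile write hands the slot back
        return v;
    }

    // Sends 1..count through the ring and returns the sum the consumer computed.
    static long transfer(int count) {
        SpscRing ring = new SpscRing(16);
        long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            int received = 0;
            while (received < count) {
                Integer v = ring.poll();
                if (v == null) { Thread.yield(); continue; }
                sum[0] += v;
                received++;
            }
        });
        consumer.start();
        for (int i = 1; i <= count; i++) {
            while (!ring.offer(i)) Thread.yield();
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(transfer(10_000)); // prints 50005000
    }
}
```

With more than one producer or more than one consumer, the index updates would become read-modify-write races and this sketch would no longer be safe.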

For question 1, the risk is only reduced (and not eliminated) because volatile only applies to a single read/write operation and not more complex operations such as increment, decrement, etc.
For question 2, the effect of volatile is to make changes immediately visible to other threads. As the quoted passage states "this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible." Simply because reads are atomic does not mean that they are thread safe. So establishing a happens before relationship is almost a (necessary) side-effect of guaranteeing memory consistency across threads.

Ad 1: With a volatile variable, the variable is always checked against a master copy and all threads see a consistent state. But if you use that volatile variable in a non-atomic operation that writes back the result (say a = f(a)), then you might still create a memory inconsistency. That's how I would understand the remark "reduces the risk". A volatile variable is consistent at the time of the read, but you still might need to use synchronized.
Ad 2: I don't know. But: If your definition of "happens before" includes the remark
This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
I would not dare to rely on any other property except that volatile ensures this. What else do you expect from it?!

Assume that you have a CPU with a CPU cache or CPU registers. Regardless of how many cores your CPU architecture has, volatile does NOT guarantee you perfect consistency. The only way to achieve this is to use synchronized or atomic references, at a performance price.
For example, you have multiple threads (Thread A and Thread B) working on shared data. Assume that Thread A wants to update the shared data and has been started. For performance reasons, Thread A's stack was moved to the CPU cache or registers. Then Thread A updated the shared data. But the problem with those places is that they don't flush the updated value back to main memory immediately. This is where inconsistency arises, because until the flush-back operation happens, Thread B might have wanted to work with the same data, and it would have taken it from main memory: the not-yet-updated value.
If you use volatile, all the operations will be performed on main memory, so you don't have flush-back latency. But this time you may suffer from the thread pipeline. In the middle of a write operation (composed of a number of atomic operations), Thread B may be scheduled by the OS to perform a read operation, and that's it! Thread B will read the not-yet-updated value again. That's why it is said to reduce the risk.
Hope you got it.

When it comes to concurrency, you might want to ensure two things:
atomic operations: a set of operations is atomic. This is usually achieved with "synchronized" (higher-level constructs), and also with volatile, for instance for reads/writes of long and double.
visibility: a thread B sees a modification made by a thread A. Even if an operation is atomic, like a write to an int variable, a second thread can still see a non-up-to-date value of the variable, due to processor caches. Putting a variable as volatile ensures that the second thread does see the up-to-date value of that variable. More than that, it ensures that the second thread sees an up-to-date value of ALL the variables written by the first thread before the write to the volatile variable.

Interpretation of "program order rule" in Java concurrency

Program order rule states "Each action in a thread happens-before every action in that thread that comes later in the program order"
1. I read in another thread that an action is one of:
reads and writes to variables
locks and unlocks of monitors
starting and joining with threads
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
2.What does "program order" mean?
An explanation with examples would be really helpful.
Additional related question
Suppose I have the following code:
long tick = System.nanoTime(); //Line1: Note the time
//Block1: some code whose time I wish to measure goes here
long tock = System.nanoTime(); //Line2: Note the time
Firstly, it's a single threaded application to keep things simple. Compiler notices that it needs to check the time twice and also notices a block of code that has no dependency with surrounding time-noting lines, so it sees a potential to reorganize the code, which could result in Block1 not being surrounded by the timing calls during actual execution (for instance, consider this order Line1->Line2->Block1). But, I as a programmer can see the dependency between Line1,2 and Block1. Line1 should immediately precede Block1, Block1 takes a finite amount of time to complete, and immediately succeeded by Line2.
So my question is: Am I measuring the block correctly?
If yes, what is preventing the compiler from rearranging the order?
If no (which I think is correct after going through Enno's answer), what can I do to prevent it?
P.S.: I stole this code from another question I asked in SO recently.

It probably helps to explain why such rule exist in the first place.
Java is a procedural language. I.e. you tell Java how to do something for you. If Java executes your instructions not in the order you wrote, it would obviously not work. E.g. in the below example, if Java would do 2 -> 1 -> 3 then the stew would be ruined.
1. Take lid off
2. Pour salt in
3. Cook for 3 hours
So, why does the rule not simply say "Java executes what you wrote in the order you wrote"? In a nutshell, because Java is clever. Take the following example:
1. Take eggs out of the freezer
2. Take lid off
3. Take milk out of the freezer
4. Pour egg and milk in
5. Cook for 3 hours
If Java were like me, it would just execute it in order. However, Java is clever enough to understand that it's more efficient AND that the end result would be the same should it do 1 -> 3 -> 2 -> 4 -> 5 (you don't have to walk to the freezer again, and that doesn't change the recipe).
So what the rule "Each action in a thread happens-before every action in that thread that comes later in the program order" is trying to say is, "In a single thread, your program will run as if it was executed in the exact order you wrote it. We might change the ordering behind the scenes, but we make sure that none of that would change the output."
So far so good. Why does it not do the same across multiple threads? In multi-thread programming, Java isn't clever enough to do it automatically. It will for some operations (e.g. joining threads, starting threads, when a lock (monitor) is used etc.) but for other stuff you need to explicitly tell it to not do reordering that would change the program output (e.g. volatile marker on fields, use of locks etc.).
Note:
Quick addendum about the "happens-before relationship". This is a fancy way of saying that no matter what reordering Java might do, stuff A will happen before stuff B. In our weird stew example above, "Steps 1 & 3 happen-before step 4 'Pour egg and milk in'". Also, for example, "Steps 1 & 3 do not need a happens-before relationship because they don't depend on each other in any way".
On the additional question & response to the comment
First, let us establish what "time" means in the programming world. In programming, we have the notion of "absolute time" (what's the time in the world now?) and the notion of "relative time" (how much time has passed since x?). In an ideal world, time is time, but unless we have an atomic clock built in, the absolute time has to be corrected from time to time. On the other hand, for relative time we don't want corrections, as we are only interested in the differences between events.
In Java, System.currentTimeMillis() deals with absolute time and System.nanoTime() deals with relative time. This is why the Javadoc of nanoTime states, "This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time".
In practice, both currentTimeMillis and nanoTime are native calls and thus the compiler can't practically prove if a reordering won't affect the correctness, which means it will not reorder the execution.
But let us imagine we want to write a compiler implementation that actually looks into native code and reorders everything as long as it's legal. When we look at the JLS, all that it tells us is that "You can reorder anything as long as it cannot be detected". Now as the compiler writer, we have to decide if the reordering would violate the semantics. For relative time (nanoTime), it would clearly be useless (i.e. violates the semantics) if we'd reorder the execution. Now, would it violate the semantics if we'd reorder for absolute time (currentTimeMillis)? As long as we can limit the difference from the source of the world's time (let's say the system clock) to whatever we decide (like "50ms")*, I say no. For the below example:
long tick = System.currentTimeMillis();
result = compute();
long tock = System.currentTimeMillis();
print(result + ":" + (tock - tick));
If the compiler can prove that compute() takes less than whatever maximum divergence from the system clock we can permit, then it would be legal to reorder this as follows:
long tick = System.currentTimeMillis();
long tock = System.currentTimeMillis();
result = compute();
print(result + ":" + (tock - tick));
Since doing that won't violate the spec we defined, and thus won't violate the semantics.
You also asked why this is not included in the JLS. I think the answer would be "to keep the JLS short". But I don't know much about this realm so you might want to ask a separate question for that.
*: In actual implementations, this difference is platform dependent.

The program order rule guarantees that, within individual threads, reordering optimizations introduced by the compiler cannot produce different results from what would have happened if the program had been executed in serial fashion. It makes no guarantees about what order the thread's actions may appear to occur in to any other threads if its state is observed by those threads without synchronization.
Note that this rule speaks only to the ultimate results of the program, and not to the order of individual executions within that program. For instance, if we have a method which makes the following changes to some local variables:
x = 1;
z = z + 1;
y = 1;
The compiler remains free to reorder these operations however it sees fit to improve performance. One way to think of this is: if you could reorder these ops in your source code and still obtain the same results, the compiler is free to do the same. (And in fact, it can go even further and completely discard operations which are shown to have no results, such as invocations of empty methods.)
With your second bullet point, the monitor lock rule comes into play: "An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock." (Java Concurrency in Practice p. 341) This means that a thread acquiring a given lock will have a consistent view of the actions which occurred in other threads before they released that lock. However, note that this guarantee only applies when the two threads release and acquire the same lock. If Thread A does a bunch of stuff before releasing Lock X, and then Thread B acquires Lock Y, Thread B is not assured to have a consistent view of A's pre-X actions.
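
As a rough illustration of the monitor lock rule, here is a small sketch, assuming a made-up class name (SameLockCounter) and an arbitrary iteration count. Two threads update a plain int field, and every access goes through the same lock object. Each unlock happens-before the next lock on that object, so no increment is lost.

```java
// Hypothetical example class: every access to count uses the same monitor.
class SameLockCounter {
    private final Object lock = new Object();
    private int count; // guarded by lock

    void increment() {
        synchronized (lock) {
            count++;
        }
    }

    int get() {
        synchronized (lock) {
            return count;
        }
    }

    // Two threads each perform 10,000 increments under the shared lock.
    static int demo() {
        SameLockCounter counter = new SameLockCounter();
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                counter.increment();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If increment() and get() synchronized on different objects, the rule would no longer connect the two threads and the final count could come out lower.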
It is possible for reads and writes to variables to be reordered with start and join if a.) doing so doesn't break within-thread program order, and b.) the variables have not had other "happens-before" thread synchronization semantics applied to them, say by storing them in volatile fields.
A simple example:
class ThreadStarter {
Object a = null;
Object b = null;
Thread thread;
ThreadStarter(Thread threadToStart) {
this.thread = threadToStart;
}
public void aMethod() {
a = new BeforeStartObject();
b = new BeforeStartObject();
thread.start();
a = new AfterStartObject();
b = new AfterStartObject();
a.doSomeStuff();
b.doSomeStuff();
}
}
Since the fields a and b and the method aMethod() are not synchronized in any way, and the action of starting thread does not change the results of the writes to the fields (or the doing of stuff with those fields), the compiler is free to reorder thread.start() to anywhere in the method. The only thing it could not do with the order of aMethod() would be to move the order of writing one of the BeforeStartObjects to a field after writing an AfterStartObject to that field, or to move one of the doSomeStuff() invocations on a field before the AfterStartObject is written to it. (That is, assuming that such reordering would change the results of the doSomeStuff() invocation in some way.)
The critical thing to bear in mind here is that, in the absence of synchronization, the thread started in aMethod() could theoretically observe either or both of the fields a and b in any of the states which they take on during the execution of aMethod() (including null).
Additional question answer
The assignments to tick and tock cannot be reordered with respect to the code in Block1 if they are to be actually used in any measurements, for example by calculating the difference between them and printing the result as output. Such reordering would clearly break Java's within-thread as-if-serial semantics. It changes the results from what would have been obtained by executing instructions in the specified program order. If the assignments aren't used for any measurements and have no side-effects of any kind on the program result, they'll likely be optimized away as no-ops by the compiler rather than being reordered.
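
A minimal sketch of that situation, assuming a made-up class name (ElapsedTime) and a placeholder work() method standing in for Block1. Both the computed result and the difference tock - tick are returned to the caller, so neither the work nor the timing calls are dead code.

```java
// Hypothetical example class: timing a block with System.nanoTime().
class ElapsedTime {
    // Placeholder for Block1: sums the integers 1..1,000,000.
    static long work() {
        long sum = 0;
        for (int i = 1; i <= 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    // Returns { result of the work, elapsed nanoseconds }.
    static long[] measure() {
        long tick = System.nanoTime();
        long result = work();
        long tock = System.nanoTime();
        return new long[] { result, tock - tick };
    }

    public static void main(String[] args) {
        long[] r = measure();
        System.out.println("result=" + r[0] + " elapsedNanos=" + r[1]);
    }
}
```

The elapsed value from a single run like this still includes effects such as JIT warm-up, so treat it as a rough figure rather than a benchmark.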

Before I answer the question,
reads and writes to variables
Should be
volatile reads and volatile writes (of the same field)
Program order doesn't guarantee this happens before relationship, rather the happens-before relationship guarantees program order
To your questions:
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
The answer actually depends on what action happens first and what action happens second. Take a look at the JSR 133 Cookbook for Compiler Writers. There is a Can Reorder grid that lists the allowed compiler reordering that can occur.
For instance, a Volatile Store can be reordered above or below a Normal Store, but a Volatile Store cannot be reordered above or below a Volatile Load. This is all assuming intra-thread semantics still hold.
What does "program order" mean?
This is from the JLS
Among all the inter-thread actions performed by each thread t, the
program order of t is a total order that reflects the order in which
these actions would be performed according to the intra-thread
semantics of t.
In other words, if you can change the writes and loads of a variable in such a way that the code performs exactly the same way as you wrote it, then it maintains program order.
For instance
public static Object getInstance(){
if(instance == null){
instance = new Object();
}
return instance;
}
Can be reordered to
public static Object getInstance(){
Object temp = instance;
if(instance == null){
temp = instance = new Object();
}
return temp;
}

It simply means that although the threads may be multiplexed, the internal order of each thread's actions/operations/instructions remains constant (relatively).
thread1: T1op1, T1op2, T1op3...
thread2: T2op1, T2op2, T2op3...
Although the order of operations (TnopM) across threads may vary, the operations T1op1, T1op2, T1op3 within a thread will always be in this order, and so will T2op1, T2op2, T2op3.
For example:
T2op1, T1op1, T1op2, T2op2, T2op3, T1op3

The Java tutorial http://docs.oracle.com/javase/tutorial/essential/concurrency/memconsist.html says that a happens-before relationship is simply a guarantee that memory writes by one specific statement are visible to another specific statement. Here is an illustration:
int x;
synchronized void x() {
x += 1;
}
synchronized void y() {
System.out.println(x);
}
synchronized creates a happens-before relationship. If we remove it, there is no guarantee that after thread A increments x, thread B will print 1; it may print 0.

What operations are atomic operations

I am little confused...
Is it true that reading and writing from several threads are atomic operations for all types except long and double, and that volatile is needed only for long and double?

It sounds like you're referring to this section of the JLS. It is guaranteed for all primitive types -- except double and long -- that all threads will see some value that was actually written to that variable. (With double and long, the first four bytes might have been written by one thread, and the last four bytes by another thread, as specified in that section of the JLS.) But they won't necessarily see the same value at the same time unless the variable is marked volatile.
Even using volatile, x += 3 is not atomic, because it's x = x + 3, which does a read and a write, and there might be writes to x between the read and the write. That's why we have things like AtomicInteger and the other utilities in java.util.concurrent.
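
For instance, here is a small sketch, assuming a made-up class name (AtomicAdd) and arbitrary thread and iteration counts, that uses AtomicInteger.addAndGet to perform the "add 3" step as one atomic read-modify-write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: an atomic replacement for "x += 3".
class AtomicAdd {
    // Four threads each add 3 to the shared counter 1,000 times.
    static int demo() {
        AtomicInteger x = new AtomicInteger();
        Runnable work = () -> {
            for (int i = 0; i < 1_000; i++) {
                x.addAndGet(3); // atomic read-modify-write
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(work);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return x.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a volatile int and x += 3 in place of the AtomicInteger, some of the additions could be lost when two threads read the same old value.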

Let's not confuse atomic with thread-safe. Long and double writes are not necessarily atomic underneath because each may be performed as two separate 32-bit stores. Storing and loading non-long/double fields is perfectly atomic, assuming it is not a compound write (i++ for example).
By atomic I mean you will not read some garbled object as a result of many threads writing different objects to the same field.
From Java Concurrency In Practice 3.1.2
Out-of-thin-air safety: When a thread reads a variable without
synchronization, it may see a stale value, but at least it sees a
value that was actually placed there by some thread rather than some
random value. This is true for all variables, except 64-bit long and
double, which are not volatile. The JVM is permitted to treat 64-bit
read or write as two separate 32-bit operations which are not atomic.

That doesn't sound right.
An atomic operation is one that forces all threads to wait to access a resource until another thread is done with it. I don't see why other data types would be atomic, and others not.

volatile has other semantics than just writing the value atomically
it means that other threads can see the updated value immediately (and that it can't be optimized out)
