In the Oracle Java documentation located here, the following is said:
Atomic actions cannot be interleaved, so they can be used without fear of thread interference. However, this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible. Using volatile variables reduces the risk of memory consistency errors, because any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
It also says:
Reads and writes are atomic for reference variables and for most
primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long
and double variables).
I have two questions regarding these statements:
"Using volatile variables reduces the risk of memory consistency errors" - What do they mean by "reduces the risk", and how is a memory consistency error still possible when using volatile?
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads? I ask this since it seems that those variables already have atomic reads.
What do they mean by "reduces the risk"?
Atomicity is one issue addressed by the Java Memory Model. However, more important than Atomicity are the following issues:
memory architecture, e.g. impact of CPU caches on read and write operations
CPU optimizations, e.g. reordering of loads and stores
compiler optimizations, e.g. added and removed loads and stores
The following listing contains a frequently used example. The operations on x and y are atomic. Still, the program can print both lines.
int x = 0, y = 0;
// thread 1
x = 1;
if (y == 0) System.out.println("foo");
// thread 2
y = 1;
if (x == 0) System.out.println("bar");
However, if you declare x and y as volatile, only one of the two lines can be printed.
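The listing above can be turned into a runnable experiment. This is a minimal sketch (the class and counter names are my own); the "both printed" outcome typically shows up only rarely in a given run, and possibly not at all:

```java
// Counts how often both threads observe the other's variable as 0,
// i.e. the case where both "foo" and "bar" would be printed.
public class ReorderDemo {
    static int x, y, r1, r2; // deliberately NOT volatile

    public static void main(String[] args) throws InterruptedException {
        int bothZero = 0;
        for (int i = 0; i < 10_000; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();
            if (r1 == 0 && r2 == 0) bothZero++; // both "foo" and "bar" case
        }
        System.out.println("both saw 0 in " + bothZero + " runs");
    }
}
```

Declaring x and y volatile makes the bothZero count provably zero, because the volatile accesses are totally ordered by the synchronization order.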
How is a memory consistency error still possible when using volatile?
The following example uses volatile. However, updates might still get lost.
volatile int x = 0;
// thread 1
x += 1;
// thread 2
x += 1;
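A common remedy (a sketch using java.util.concurrent.atomic; the class name is my own) replaces the volatile int with an AtomicInteger, whose incrementAndGet performs the read-modify-write as one atomic step, so no update can be lost:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger x = new AtomicInteger(0);
        // Two threads each perform 10,000 atomic increments.
        Runnable inc = () -> { for (int i = 0; i < 10_000; i++) x.incrementAndGet(); };
        Thread t1 = new Thread(inc), t2 = new Thread(inc);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(x.get()); // always 20000; "volatile int x; x += 1;" could lose updates
    }
}
```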
Would it be true to say that the only effect of placing volatile on a non-double, non-long primitive is to enable the "happens-before" relationship with subsequent reads from other threads?
Happens-before is often misunderstood. The consistency model defined by happens-before is weak and difficult to use correctly. This can be demonstrated with the following example, known as Independent Reads of Independent Writes (IRIW):
volatile int x = 0, y = 0;
// thread 1
x = 1;
// thread 2
y = 1;
// thread 3
if (x == 1) System.out.println(y);
// thread 4
if (y == 1) System.out.println(x);
With happens-before alone, two 0s would be a valid result. However, that is counter-intuitive. For that reason, Java provides a stricter consistency model that forbids this relativity issue, known as sequential consistency. You can find it in sections §17.4.3 and §17.4.5 of the Java Language Specification. The most important part is:
A program is correctly synchronized if and only if all sequentially consistent executions are free of data races. If a program is correctly synchronized, then all executions of the program will appear to be sequentially consistent (§17.4.3).
That means, volatile gives you more than happens-before. It gives you sequential consistency if used for all conflicting accesses (§17.4.3).
The usual example:
while(!condition)
sleep(10);
if condition is volatile, this behaves as expected. If it is not, the compiler is allowed to optimize this to
if(!condition)
for(;;)
sleep(10);
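A runnable sketch of the volatile case (the class name and timing are my own): because condition is volatile, the compiler may not hoist the test out of the loop, so the waiter is guaranteed to eventually see the write and terminate:

```java
public class FlagDemo {
    static volatile boolean condition = false;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            while (!condition) { // re-read on every iteration: cannot be hoisted
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
            System.out.println("condition seen");
        });
        waiter.start();
        Thread.sleep(50);
        condition = true; // volatile write: guaranteed to become visible to the waiter
        waiter.join();    // terminates only because the write is visible
    }
}
```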
This is completely orthogonal to atomicity: if condition is of a hypothetical integer type that is not atomic, then the sequence
thread 1 writes upper half to 0
thread 2 reads upper half (0)
thread 2 reads lower half (0)
thread 1 writes lower half (1)
can happen while the variable is updated from a nonzero value that just happens to have a lower half of zero to a nonzero value that has an upper half of zero; in this case, thread 2 reads the variable as zero. The volatile keyword in this case makes sure that thread 2 really reads the variable instead of using its local copy, but it does not affect timing.
Third, atomicity does not protect against
thread 1 reads value (0)
thread 2 reads value (0)
thread 1 writes incremented value (1)
thread 2 writes incremented value (1)
One of the best ways to use atomic volatile variables is for the read and write counters of a ring buffer:
thread 1 looks at read pointer, calculates free space
thread 1 fills free space with data
thread 1 updates write pointer (which is `volatile`, so the side effects of filling the free space are also committed before)
thread 2 looks at write pointer, calculates amount of data received
...
Here, no lock is needed to synchronize the threads, atomicity guarantees that the read and write pointers will always be accessed consistently and volatile enforces the necessary ordering.
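The ring-buffer steps above can be sketched as a minimal single-producer/single-consumer queue (all names are my own; the capacity must be a power of two for the index mask to work):

```java
public class SpscRing {
    private final int[] buf = new int[16];
    private volatile int write; // advanced only by the producer
    private volatile int read;  // advanced only by the consumer

    boolean offer(int v) {
        if (write - read == buf.length) return false; // full
        buf[write & (buf.length - 1)] = v;            // fill the free slot
        write = write + 1;                            // volatile write publishes the slot
        return true;
    }

    Integer poll() {
        if (read == write) return null;               // empty
        int v = buf[read & (buf.length - 1)];         // safe: published by 'write'
        read = read + 1;                              // volatile write frees the slot
        return v;
    }

    public static void main(String[] args) throws InterruptedException {
        SpscRing q = new SpscRing();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 1000; ) if (q.offer(i)) i++;
        });
        long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            for (int n = 0; n < 1000; ) {
                Integer v = q.poll();
                if (v != null) { sum[0] += v; n++; }
            }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        System.out.println(sum[0]); // 0 + 1 + ... + 999 = 499500
    }
}
```

The volatile write of `write` publishes the slot contents (a release), and the consumer's volatile read of `write` acquires them; since each index has a single writer, the non-atomic increments are safe and no lock is needed.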
For question 1, the risk is only reduced (and not eliminated) because volatile only applies to a single read/write operation and not more complex operations such as increment, decrement, etc.
For question 2, the effect of volatile is to make changes immediately visible to other threads. As the quoted passage states, "this does not eliminate all need to synchronize atomic actions, because memory consistency errors are still possible." Simply because reads are atomic does not mean that they are thread-safe. So establishing a happens-before relationship is almost a (necessary) side effect of guaranteeing memory consistency across threads.
Ad 1: With a volatile variable, the variable is always checked against a master copy and all threads see a consistent state. But if you use that volatile variable in a non-atomic operation writing back the result (say a = f(a)), then you might still create a memory inconsistency. That's how I would understand the remark "reduces the risk". A volatile variable is consistent at the time of read, but you still might need to synchronize.
Ad 2: I don't know. But: If your definition of "happens before" includes the remark
This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.
I would not dare to rely on any other property except that volatile ensures this. What else do you expect from it?!
Assume that you have a CPU with a CPU cache or CPU registers. Independent of how many cores your CPU has, volatile does NOT guarantee you perfect consistency. The only way to achieve this is to use synchronized or atomic references, at a performance price.
For example, you have multiple threads (Thread A & Thread B) working on shared data. Assume that Thread A wants to update the shared data and is started. For performance reasons, Thread A's working set was moved to a CPU cache or registers. Then Thread A updated the shared data. But the problem with those places is that they don't flush the updated value back to main memory immediately. This is where inconsistency arises: until the flush-back happens, Thread B might want to work with the same data, which it would take from main memory - still the old, not-yet-updated value.
If you use volatile, all the operations will be performed against main memory, so you don't have a flush-back latency. But this time you may suffer from interleaving: in the middle of a compound operation (composed of a number of atomic operations), Thread B may be scheduled by the OS to perform a read, and Thread B will read the not-yet-updated value. That's why it's said that volatile only reduces the risk.
Hope you got it.
When it comes to concurrency, you might want to ensure two things:
atomic operations: a set of operations is atomic - this is usually achieved with
synchronized (or higher-level constructs), and with volatile for single reads/writes of long and double.
visibility: a thread B sees a modification made by a thread A. Even if an operation is atomic, like a write to an int variable, a second thread can still see a non-up-to-date value of the variable, due to processor caches. Putting a variable as volatile ensures that the second thread does see the up-to-date value of that variable. More than that, it ensures that the second thread sees an up-to-date value of ALL the variables written by the first thread before the write to the volatile variable.
Related
Java memory visibility documentation says that:
A write to a volatile field happens-before every subsequent read of that same field.
I'm confused about what subsequent means in the context of multithreading. Does this sentence imply some global clock for all processors and cores? So, for example, I assign a value to a variable in cycle c1 in some thread, and then a second thread is able to see this value in subsequent cycle c1 + 1?
It sounds to me like it's saying that it provides lockless acquire/release memory-ordering semantics between threads. See Jeff Preshing's article explaining the concept (mostly for C++, but the main point of the article is language neutral, about the concept of lock-free acquire/release synchronization.)
In fact Java volatile provides sequential consistency, not just acq/rel. There's no actual locking, though. See Jeff Preshing's article for an explanation of why the naming matches what you'd do with a lock.
If a reader sees the value you wrote, then it knows that everything in the producer thread before that write has also already happened.
This ordering guarantee is only useful in combination with other guarantees about ordering within a single thread.
e.g.
int[] data = new int[100];
volatile boolean data_ready = false;
Producer:
data[0..99] = stuff;
// release store keeps previous ops above this line
data_ready = true;
Consumer:
while(!data_ready){} // spin until we see the write
// acquire-load keeps later ops below this line
int tmp = data[99]; // gets the value from the producer
If data_ready was not volatile, reading it wouldn't establish a happens-before relationship between two threads.
You don't have to have a spinloop, you could be reading a sequence number, or an array index from a volatile int, and then reading data[i].
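A runnable Java version of the sketch above (identifiers adapted to Java; the spin-wait is intentionally crude):

```java
public class Publish {
    static final int[] data = new int[100];
    static volatile boolean dataReady = false;

    public static void main(String[] args) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            while (!dataReady) { }        // spin until the release store is seen
            System.out.println(data[99]); // acquire-load: guaranteed to see the producer's writes
        });
        consumer.start();
        for (int i = 0; i < 100; i++) data[i] = i; // plain (non-volatile) writes
        dataReady = true;                          // volatile (release) store publishes them
        consumer.join();
    }
}
```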
I don't know Java well. I think volatile actually gives you sequential consistency, not just release/acquire. A sequentially consistent store isn't allowed to reorder with later loads, so on typical hardware it needs an expensive memory barrier to make sure the local core's store buffer is drained before any later loads are allowed to execute.
Volatile Vs Atomic explains more about the ordering volatile gives you.
Java volatile is just an ordering keyword; it's not equivalent to C11 _Atomic or C++11 std::atomic<T>, which also give you atomic RMW operations. In Java, volatile_var++ is not an atomic increment; it's a separate load and store, like volatile_var = volatile_var + 1. In Java, you need a class like AtomicInteger to get an atomic RMW.
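A sketch contrasting the two (the class name is my own): the volatile increment can lose updates under contention, while AtomicInteger's atomic RMW cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class IncrementContrast {
    static volatile int plain = 0;
    static final AtomicInteger atomic = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable r = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                  // load, add, store: increments can be lost
                atomic.incrementAndGet(); // single atomic read-modify-write
            }
        };
        Thread t1 = new Thread(r), t2 = new Thread(r);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("volatile++ : " + plain);        // often less than 200000
        System.out.println("atomic RMW : " + atomic.get()); // always 200000
    }
}
```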
And note that C/C++ volatile doesn't imply atomicity or ordering at all; it only tells the compiler to assume that the value can be modified asynchronously. This is only a small part of what you need to write lockless for anything except the simplest cases.
It means that once a certain Thread writes to a volatile field, all other Thread(s) will observe (on the next read) that written value; but this does not protect you against races though.
Threads have their caches, and those caches will be invalidated and updated with that newly written value via cache coherency protocol.
EDIT
Subsequent means whenever that happens after the write itself. Since you don't know the exact cycle/timing when that will happen, you usually say that when some other thread observes the write, it will observe all the actions done before that write; thus a volatile write establishes the happens-before guarantees.
Sort of like in an example:
// Actions done in Thread A
int a = 2;
volatile int b = 3;
// Actions done in Thread B
if(b == 3) { // observe the volatile write
// Thread B is guaranteed to see a = 2 here
}
You could also loop (spin wait) until you see 3 for example.
Peter's answer gives the rationale behind the design of the Java memory model.
In this answer I'm attempting to give an explanation using only the concepts defined in the JLS.
In Java every thread is composed by a set of actions.
Some of these actions have the potential to be observed by, and to order actions between, other threads (e.g. writing a volatile variable, unlocking a monitor); these
are called synchronization actions.
The order in which the actions of a thread are written in the source code is called the program order.
An order defines what is before and what is after (or better, not before).
Within a thread, each action has a happens-before relationship (denoted by <) with the next (in program order) action.
This relationship is important, yet hard to understand, because it's very fundamental: it guarantees that if A < B then
the "effects" of A are visible to B.
This is indeed what we expect when writing the code of a function.
Consider
Thread 1 Thread 2
A0 A'0
A1 A'1
A2 A'2
A3 A'3
Then by the program order we know A0 < A1 < A2 < A3 and that A'0 < A'1 < A'2 < A'3.
We don't know how to order all the actions.
It could be A0 < A'0 < A'1 < A'2 < A1 < A2 < A3 < A'3 or the sequence with the primes swapped.
However, every such sequence must have that the single actions of each thread are ordered according to the thread's program order.
The two program orders are not sufficient to order every action: they are partial orders, as opposed to the
total order we are looking for.
The total order that puts the actions in a row according to a measurable time (like a clock) at which they happened is called the execution order.
It is the order in which the actions actually happened (it is only required that the actions appear to have happened in
this order, but that's just an optimization detail).
Up until now, the actions are not ordered inter-thread (between two different threads).
The synchronization actions serve this purpose.
Each synchronization action synchronizes-with at least one other synchronization action (they usually come in pairs, like
a write and a read of a volatile variable, or a lock and an unlock of a monitor).
The synchronizes-with relationship is happens-before between threads (the former implies the latter); it is exposed as
a different concept because 1) it is slightly different and 2) intra-thread happens-before is enforced naturally by the hardware, while synchronizes-with
may require software intervention.
happens-before is derived from the program order, synchronize-with from the synchronization order (denoted by <<).
The synchronization order is defined in terms of two properties: 1) it is a total order 2) it is consistent with each thread's
program order.
Let's add some synchronization action to our threads:
Thread 1 Thread 2
A0 A'0
S1 A'1
A1 S'1
A2 S'2
S2 A'3
The program orders are trivial.
What is the synchronization order?
We are looking for something that by 1) includes all of S1, S2, S'1 and S'2 and by 2) must have S1 < S2 and S'1 < S'2.
Possible outcomes:
S1 < S2 < S'1 < S'2
S1 < S'1 < S'2 < S2
S'1 < S1 < S2 < S'2
All are synchronization orders; there is not one synchronization order but many, so the question above is wrong: it
should be "What are the synchronization orders?".
If S1 and S'1 are such that S1 << S'1, then we are restricting the possible outcomes to the ones where S1 < S'1, so the
outcome S'1 < S1 < S2 < S'2 above is now forbidden.
If S2 << S'1 then the only possible outcome is S1 < S2 < S'1 < S'2; when there is only a single outcome I believe we have
sequential consistency (the converse is not true).
Note that if A << B this doesn't mean that there is a mechanism in the code to force an execution order where A < B.
Synchronization actions are constrained by the synchronization order; they do not impose any materialization of it.
Some synchronization actions (e.g. locks) impose a particular execution order (and thereby a synchronization order) but some don't (e.g. reads/writes of volatiles).
It is the execution order that creates the synchronization order; this is completely orthogonal to the synchronizes-with relationship.
Long story short, the "subsequent" adjective refers to any synchronization order, that is any valid (according to each thread
program order) order that encompasses all the synchronization actions.
The JLS then continues defining when a data race happens (when two conflicting accesses are not ordered by happens-before)
and what it means to be happens-before consistent.
Those are out of scope.
I'm confused about what subsequent means in the context of multithreading. Does this sentence imply some global clock for all processors and cores...?
Subsequent means (according to the dictionary) coming after in time. There certainly is a global clock across all CPUs in a computer (think X GHz), and the document is trying to say that if thread-1 did something at clock tick 1 and then thread-2 did something on another CPU at clock tick 2, its actions are considered subsequent.
A write to a volatile field happens-before every subsequent read of that same field.
The key phrase that could be added to this sentence to make it more clear is "in another thread". It might make more sense to understand it as:
A write to a volatile field happens-before every subsequent read of that same field in another thread.
What this is saying is that if a read of a volatile field happens in Thread-2 after (in time) the write in Thread-1, then Thread-2 will be guaranteed to see the updated value. Further up in the documentation you point to is the section (emphasis mine):
... The results of a write by one thread are guaranteed to be visible to a read by another thread only if the write operation happens-before the read operation. The synchronized and volatile constructs, as well as the Thread.start() and Thread.join() methods, can form happens-before relationships. In particular.
Notice the highlighted phrase. The Java compiler is free to reorder instructions in any one thread's execution for optimization purposes as long as the reordering doesn't violate the definition of the language – this is called execution order and is critically different than program order.
Let's look at the following example with variables a and b that are non-volatile ints initialized to 0 with no synchronized clauses. What is shown is program order and the time in which the threads are encountering the lines of code.
Time Thread-1 Thread-2
1 a = 1;
2 b = 2;
3 x = a;
4 y = b;
5 c = a + b; z = x + y;
If Thread-1 adds a + b at Time 5, it is guaranteed to be 3. However, if Thread-2 adds x + y at Time 5, it might get 0, 1, 2, or 3, depending on race conditions. Why? Because the compiler might have reordered the instructions in Thread-1 to set a after b for efficiency reasons. Also, Thread-1 may not have appropriately published the values of a and b, so Thread-2 might get out-of-date values. Even if Thread-1 gets context-switched out or crosses a write memory barrier and a and b are published, Thread-2 needs to cross a read barrier to update any cached values of a and b.
If a and b were marked as volatile then the write to a must happen-before (in terms of visibility guarantees) the subsequent read of a on line 3 and the write to b must happen-before the subsequent read of b on line 4. Both threads would get 3.
We use volatile and synchronized keywords in Java to ensure happens-before guarantees. A write memory barrier is crossed when assigning a volatile or exiting a synchronized block, and a read barrier is crossed when reading a volatile or entering a synchronized block. The Java compiler cannot reorder write instructions past these memory barriers, so the order of updates is assured. These keywords control instruction reordering and ensure proper memory synchronization.
NOTE: volatile is unnecessary in a single-threaded application because program order assures the reads and writes will be consistent. A single-threaded application might see any value of (non-volatile) a and b at times 3 and 4 but it always sees 3 at Time 5 because of language guarantees. So although use of volatile changes the reordering behavior in a single-threaded application, it is only required when you share data between threads.
This is more a definition of what will not happen rather than what will happen.
Essentially it is saying that once a write to an atomic variable has happened there cannot be any other thread that, on reading the variable, will read a stale value.
Consider the following situation.
Thread A is continuously incrementing an atomic value a.
Thread B occasionally reads A.a and exposes that value as a
non-atomic b variable.
Thread C occasionally reads both A.a and B.b.
Given that a is atomic it is possible to reason that from the point of view of C, b may occasionally be less than a but will never be greater than a.
If a was not atomic no such guarantee could be given. Under certain caching situations it would be quite possible for C to see b progress beyond a at any time.
This is a simplistic demonstration of how the Java memory model allows you to reason about what can and cannot happen in a multi-threaded environment. In real life the potential race conditions between reading and writing to data structures can be much more complex but the reasoning process is the same.
Let's consider the following piece of code in Java
int x = 0;
int who = 1;
Thread #1:
(1) x++;
(2) who = 2;
Thread #2
while(who == 1);
x++;
print x; ( the value should be equal to 2 but, perhaps, it is not* )
(I don't know the Java memory model - let's assume it is a strong memory model; I mean: (1) and (2) will not be swapped)
The Java memory model guarantees that reads and stores of 32-bit variables are atomic, so our program is safe in that respect. But nevertheless we should use the volatile modifier because of *: the value of x may be equal to 1, because x can be kept in a register when Thread #2 reads it. To resolve it we should make the x variable volatile. That much is clear.
But, what about that situation:
int x = 0;
mutex m; ( just any mutex)
Thread #1:
mutex.lock()
x++;
mutex.unlock()
Thread #2
mutex.lock()
x++;
print x; // the value is always 2, why**?
mutex.unlock()
The value of x is always 2 though we don't make it volatile. Do I correctly understand that locking/unlocking mutex is connected with inserting memory barriers?
I'll try to tackle this. The Java memory model is kind of involved and hard to contain in a single StackOverflow post. Please refer to Brian Goetz's Java Concurrency in Practice for the full story.
The value of x is always 2 though we don't make it volatile. Do I correctly understand that locking/unlocking mutex is connected with inserting memory barriers?
First if you want to understand the Java memory model, it's always Chapter 17 of the spec you want to read through.
That spec says:
An unlock on a monitor happens-before every subsequent lock on that monitor.
So yes, there's a memory visibility event at the unlock of your monitor. (I assume by "mutex" you mean monitor. Most of the locks and other classes in the java.util.concurrent package also have happens-before semantics, check the documentation.)
Happens-before is what Java means when it guarantees not just that the events are ordered, but also that memory visibility is guaranteed.
We say that a read r of a variable v is allowed to observe a write w
to v if, in the happens-before partial order of the execution trace:
r is not ordered before w (i.e., it is not the case that
hb(r, w)), and
there is no intervening write w' to v (i.e. no write w' to v such
that hb(w, w') and hb(w', r)).
Informally, a read r is allowed to see the result of a write w if there
is no happens-before ordering to prevent that read.
This is all from 17.4.5. It's a little confusing to read through, but the info is all there if you do read through it.
Let's go over some things. The following statement is true: the Java memory model guarantees that reads and stores of 32-bit variables are atomic. However, it does not follow that the first pseudoprogram you listed is safe. Simply because two statements are ordered syntactically does not mean that the visibility of their updates is also so ordered as viewed by other threads. Thread #2 may see the update caused by who = 2 before the increment of x is visible. Making x volatile would still not make the program correct. Instead, making the variable who volatile would make the program correct. That is because volatile interacts with the Java memory model in specific ways.
I feel like there is some notion of 'writing back to main memory' at the core of a common-sense understanding of volatile which is incorrect. Volatile does not write the value back to main memory in Java. What reading from and writing to a volatile variable does is create what's called a happens-before relationship. When thread #1 writes to a volatile variable, you're creating a relationship that ensures that any other thread #2 viewing that volatile variable will also be able to 'view' all the actions thread #1 has taken before that. In your example that means making who volatile. By writing the value 2 to who, you are creating a happens-before relationship, so that when thread #2 views who == 2 it will similarly see an updated version of x.
In your second example (assuming you meant to have the 'who' variable too) the mutex unlocking creates a happens-before relationship as I specified above. Since that means other threads viewing the unlock of the mutex (ie. they are able to lock it themselves) they will see the updated version of x.
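A compilable sketch of the second example (the names are my own; join is used so the ordering of the two critical sections is deterministic). The unlock at the end of thread 1's synchronized block happens-before the later lock in the main thread, so the first increment is visible:

```java
public class MutexDemo {
    static int x = 0;                  // deliberately not volatile
    static final Object m = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            synchronized (m) { x++; }  // unlock happens-before the next lock of m
        });
        t1.start();
        t1.join();                     // makes the ordering of the two critical sections deterministic
        synchronized (m) {
            x++;
            System.out.println(x);     // 2: the first increment is visible via the monitor
        }
    }
}
```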
I understand read-acquire(does not reorder with subsequent read/write operations after it), and write-release(does not reorder with read/write operations preceding it).
My q is:-
In case of read-acquire, do the writes preceding it get flushed?
In case of write-release, do the previous reads get updated?
Also, is read-acquire same as volatile read, and write release same as volatile write in Java?
Why this is important is that, let's take case of write-release..
y = x; // a read.. let's say x is 1 at this point
System.out.println(y);// 1 printed
//or you can also consider System.out.println(x);
write_release_barrier();
//somewhere here, some thread sets x = 2
ready = true;// this is volatile
System.out.println(y);// or maybe, println(x).. what will be printed?
At this point, is x 2 or 1?
Here, consider ready to be volatile.
I understand that all stores before volatile will first be made visible.. and then only the volatile will be made visible. Thanks.
Ref:- http://preshing.com/20120913/acquire-and-release-semantics/
No: not all writes are flushed, nor are all reads updated.
Java works on a "happens-before" basis for multithreading. Basically, if A happens-before B, and B happens-before C, then A happens-before C. So your question amounts to whether x=2 formally happens-before some action that reads x.
Happens-before edges are basically established by synchronizes-with relationships, which are defined in JLS 17.4.4. There are a few different ways to do this, but for volatiles, it basically amounts to a write to a volatile happening-before a read of that same volatile:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order).
Given that, if your thread writes ready = true, then that write alone doesn't mean anything happens-before it (as far as that write is concerned). It's actually the opposite; that write to ready happens-before things on other threads, iff they read ready.
So, if the other thread (that sets x = 2) had written to ready after it set x = 2, and this thread (that you posted above) then read ready, then it would see x = 2. That is because the write happens-before the read, and the reading thread therefore sees everything that the writing thread had done (up to and including the write). Otherwise, you have a data race, and basically all bets are off.
A couple additional notes:
If you don't have a happens-before edge, you may still see the update; it's just that you're not guaranteed to. So, don't assume that if you don't read a write to ready, then you'll still see x=1. You might see x=1, or x=2, or possibly some other write (up to and including the default value of x=0)
In your example, y is always going to be 1, because you don't re-read x after the "somewhere here" comment. For purposes of this answer, I've assumed that there's a second y=x line immediately before or after ready = true. If there's not, then y's value will be unchanged from what it was in the first println, (assuming no other thread directly changes it -- which is guaranteed if it's a local variable), because actions within a thread always appear as if they are not reordered.
The Java memory model is not specified in terms of "read-acquire" and "write-release". These terms / concepts come from other contexts, and as the article you referenced makes abundantly clear, they are often used (by different experts) to mean different things.
If you want to understand how volatiles work in Java, you need to understand the Java memory model and the Java terminology ... which is (fortunately) well-founded and precisely specified1. Trying to map the Java memory model onto "read-acquire" and "write-release" semantics is a bad idea because:
"read-acquire" and "write-release" terminology and semantics are not well specified, and
a hypothetical JMM -> "read-acquire" / "write-release" semantic mapping is only one possible implementation of the JMM. Others mappings may exist with different, and equally valid semantics.
1 - ... modulo that experts have noted flaws in some versions of the JMM. But the point is that a serious attempt has been made to provide a theoretically sound specification ... as part of the Java Language Specification.
No, reading a volatile variable will not flush preceding writes. Visible actions will ensure that preceding actions are visible, but reading a volatile variable is not visible to other threads.
No, writing to a volatile variable will not clear the cache of previously read values. It is only guaranteed to flush previous writes.
In your example, clearly y will still be 1 on the last line. Only one assignment has been made to y, and that was 1, according to the preceding output. Perhaps that was a typo, and you meant to write println(x), but even then, the value of 2 is not guaranteed to be visible.
For your 1st question, the answer is that it follows FIFO order.
For your 2nd question: please check Volatile Vs Static in Java.
First off, I'm aware that volatile does not make multiple operations (as i++) atomic. This question is about a single read or write operation.
My initial understanding was that volatile only enforces a memory barrier (i.e. other threads will be able to see updated values).
Now I've noticed that JLS section 17.7 says that volatile additionally makes a single read or write atomic. For instance, given two threads, both writing a different value to a volatile long x, then x will finally represent exactly one of the values.
I'm curious how this is possible. On a 32 bit system, if two threads write to a 64 bit location in parallel and without "proper" synchronization (i.e. some kind of lock), it should be possible for the result to be a mixup. For clarity, let's use an example in which thread 1 writes 0L while thread 2 writes -1L to the same 64 bit memory location.
T1 writes lower 32 bit
T2 writes lower 32 bit
T2 writes upper 32 bit
T1 writes upper 32 bit
The result could then be 0x00000000FFFFFFFFL, a value neither thread intended to write, which is undesirable. How does volatile prevent this scenario?
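The interleaving above can be replayed deterministically by modeling the 64-bit slot as two 32-bit halves, the way a 32-bit JVM may store a non-volatile long (the class and method names here are mine, for illustration only):

```java
public class TearingDemo {
    // Replay the four-step T1/T2 interleaving from the question on a
    // 64-bit slot stored as two 32-bit halves.
    static long interleavedWrite(long t1Value, long t2Value) {
        int lower, upper;
        lower = (int) t1Value;           // T1 writes lower 32 bits
        lower = (int) t2Value;           // T2 writes lower 32 bits
        upper = (int) (t2Value >>> 32);  // T2 writes upper 32 bits
        upper = (int) (t1Value >>> 32);  // T1 writes upper 32 bits
        return ((long) upper << 32) | (lower & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long torn = interleavedWrite(0L, -1L); // T1 stores 0L, T2 stores -1L
        // The torn result is neither 0L nor -1L:
        System.out.println("0x" + Long.toHexString(torn)); // 0xffffffff
    }
}
```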
I've also read elsewhere that this does, typically, not degrade performance. How is it possible to synchronize writes with only a minor speed impact?
Your statement that volatile only enforces a memory barrier (in the sense of flushing the processor cache) is false. It also implies a happens-before relationship for read-write combinations of volatile values. For example:
class Foo {
  volatile boolean x;
  boolean y;

  void qux() {
    y = true;
    x = true; // volatile write: publishes the preceding plain write to y
  }

  void baz() {
    System.out.print(x); // volatile read
    System.out.print(" ");
    System.out.print(y);
  }
}
When you run the two methods from two threads, the above code will print false false, false true or true true, but never true false: if the volatile read of x returns true, the happens-before relationship guarantees that the preceding plain write to y is visible as well. Without the volatile keyword, you are not guaranteed this, because the JIT compiler might reorder the statements. (Note that the plain write has to precede the volatile write for this to hold; a plain write placed after the volatile write gets no such ordering guarantee.)
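The guarantee holds for the safe-publication ordering, where the plain field is written first and the volatile field last, while the reader loads in the opposite order. A small racing harness (all names are mine; a sketch that exercises the guarantee, not a proof) can demonstrate it: per the JMM, the reader can never observe the volatile flag as true while the plain field still reads false.

```java
public class HappensBeforeDemo {
    static class Pair {
        volatile boolean flag; // volatile guard, written last
        boolean data;          // plain field, written first

        void writer() {
            data = true;  // plain write
            flag = true;  // volatile write: release
        }

        String reader() {
            boolean f = flag; // volatile read: acquire
            boolean d = data;
            return f + " " + d;
        }
    }

    // Returns true if the JMM-forbidden outcome "true false" was observed.
    static boolean raceOnce() throws InterruptedException {
        Pair p = new Pair();
        String[] out = new String[1];
        Thread w = new Thread(p::writer);
        Thread r = new Thread(() -> out[0] = p.reader());
        r.start(); w.start();
        w.join(); r.join();
        return out[0].equals("true false");
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 2_000; i++) {
            if (raceOnce()) throw new AssertionError("forbidden outcome observed");
        }
        System.out.println("forbidden outcome never observed");
    }
}
```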
In the same way as the JIT compiler can ensure this ordering, it can guard 64-bit reads and writes in the generated assembly. volatile values are treated specially by the JIT compiler to ensure their atomicity. Some processor instruction sets support this directly with specific 64-bit instructions; otherwise the JIT compiler emulates it.
The JVM is more complex than you might expect, and it is often explained without its full scope. Consider reading this excellent article, which covers all the details.
volatile ensures that what a thread reads is the latest value written at that point, but it doesn't synchronize two writes with each other.
If a thread writes a normal variable, the value may stay local to that thread until certain events happen. If a thread writes a volatile variable, the change is made visible to other threads immediately.
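The visibility point can be seen with the classic stop-flag pattern (a minimal sketch; the class and field names are mine): a worker spins on a volatile boolean, and the main thread's write to it is guaranteed to become visible, so the loop terminates. Without volatile, the JIT could legally hoist the flag read out of the loop and spin forever.

```java
public class StopFlag {
    // Volatile guarantees the writer's update becomes visible to the
    // spinning reader thread.
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { /* spin until the flag is cleared */ }
            System.out.println("worker stopped");
        });
        worker.start();
        Thread.sleep(100);  // let the worker enter the loop
        running = false;    // volatile write: visible to the worker
        worker.join(2000);  // must terminate promptly
        if (worker.isAlive()) throw new AssertionError("worker never saw the write");
    }
}
```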
On a 32 bit system, if two threads write to a 64 bit location in parallel and without "proper" synchronization (i.e. some kind of lock), it should be possible for the result to be a mixup
This is indeed what can happen if a variable isn't marked volatile. Now, what does the system do if the field is marked volatile? Here is a resource that explains this: http://gee.cs.oswego.edu/dl/jmm/cookbook.html
Nearly all processors support at least a coarse-grained barrier instruction, often just called a Fence, that guarantees that all loads and stores initiated before the fence will be strictly ordered before any load or store initiated after the fence [...] if available, you can implement volatile store as an atomic instruction (for example XCHG on x86) and omit the barrier. This may be more efficient if atomic instructions are cheaper than StoreLoad barriers
Essentially the processors provide facilities to implement the guarantee, and what facility is available depends on the processor.
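For illustration only (a sketch, not what any particular JIT actually emits), Java 9's VarHandle fence methods expose these barriers directly; the mapping below mirrors the cookbook's conceptual recipe for a volatile store and a volatile load. The class, field, and method names are my own:

```java
import java.lang.invoke.VarHandle;

public class FenceSketch {
    static int data;
    static int guard; // plays the role of a volatile flag

    // A release store, sketched with explicit fences:
    static void publish(int v) {
        data = v;
        VarHandle.releaseFence(); // keep prior accesses ordered before the store
        guard = 1;
        VarHandle.fullFence();    // includes StoreLoad, the expensive barrier
    }

    // The matching acquire load:
    static int consume() {
        int g = guard;
        VarHandle.acquireFence(); // keep later accesses ordered after the load
        return g == 1 ? data : -1;
    }

    public static void main(String[] args) {
        publish(42);
        System.out.println(consume()); // 42 (single-threaded sanity check)
    }
}
```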
I am a little confused...
Is it true that reads and writes from several threads are atomic operations for all types except long and double, and that volatile is therefore needed only for long and double?
It sounds like you're referring to this section of the JLS. It is guaranteed for all primitive types -- except double and long -- that all threads will see some value that was actually written to that variable. (With double and long, the first four bytes might have been written by one thread, and the last four bytes by another thread, as specified in that section of the JLS.) But they won't necessarily see the same value at the same time unless the variable is marked volatile.
Even using volatile, x += 3 is not atomic, because it's x = x + 3, which does a read and a write, and there might be writes to x between the read and the write. That's why we have things like AtomicInteger and the other utilities in java.util.concurrent.
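For example (a minimal sketch; the class and method names are mine), AtomicInteger.addAndGet performs the read-modify-write as one indivisible step, so concurrent x += 3 updates are never lost:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicDemo {
    // Each of nThreads threads performs iters atomic "x += 3" updates.
    static int addConcurrently(int nThreads, int iters) throws InterruptedException {
        AtomicInteger x = new AtomicInteger();
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iters; j++) {
                    x.addAndGet(3); // one indivisible read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return x.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // With a plain int and "x += 3", this count could come up short.
        System.out.println(addConcurrently(4, 10_000)); // 120000
    }
}
```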
Let's not confuse atomic with thread-safe. Long and double writes are not atomic underneath on 32-bit JVMs, because each may be performed as two separate 32-bit stores. Stores and loads of non-long/double fields are perfectly atomic, assuming they are not compound writes (i++, for example).
By atomic I mean you will not read some garbled object as a result of many threads writing different objects to the same field.
From Java Concurrency In Practice 3.1.2
Out-of-thin-air safety: When a thread reads a variable without
synchronization, it may see a stale value, but at least it sees a
value that was actually placed there by some thread rather than some
random value. This is true for all variables, except 64-bit long and
double variables that are not declared volatile. The JVM is permitted
to treat a 64-bit read or write as two separate 32-bit operations,
which are not atomic.
That doesn't sound right.
An atomic operation is one that appears indivisible: no other thread can ever observe it half-completed. I don't see why some data types would be atomic and others not.
volatile has other semantics than just writing the value atomically
it means that other threads can see the updated value immediately (and that the read can't be optimized away)