I've been reading about the Java memory model and I'm aware that it's possible for the compiler to reorder statements to optimize code.
Suppose I had the following code:
long tick = System.nanoTime();
function_or_block_whose_time_i_intend_to_measure();
long tock = System.nanoTime();
would the compiler ever reorganize the code in a way that what I intend to measure is not executed between tick and tock? For example,
long tick = System.nanoTime();
long tock = System.nanoTime();
function_or_block_whose_time_i_intend_to_measure();
If so, what's the right way to preserve execution order?
EDIT:
Example illustrating out-of-order execution with nanoTime :
public class Foo {
    public static void main(String[] args) {
        while (true) {
            long x = 0;
            long tick = System.nanoTime();
            for (int i = 0; i < 10000; i++) { // This for block takes ~15sec on my machine
                for (int j = 0; j < 600000; j++) {
                    x = x + x * x;
                }
            }
            long tock = System.nanoTime();
            System.out.println("time=" + (tock - tick));
            x = 0;
        }
    }
}
Output of above code:
time=3185600
time=16176066510
time=16072426522
time=16297989268
time=16063363358
time=16101897865
time=16133391254
time=16170513289
time=16249963612
time=16263027561
time=16239506975
In the above example, the time measured in the first iteration is significantly lower than the time measured in subsequent runs. I thought this was due to out-of-order execution. What am I doing wrong in the first iteration?
would the compiler ever reorganize the code in a way that what I intend to measure is not executed between tick and tock?
Nope. That would never happen. If that compiler optimization ever messed up like that, it would be a very serious bug. Quoting a statement from the wiki:
The runtime (which, in this case, usually refers to the dynamic compiler, the processor and the memory subsystem) is free to introduce any useful execution optimizations as long as the result of the thread in isolation is guaranteed to be exactly the same as it would have been had all the statements been executed in the order the statements occurred in the program (also called program order).
So the optimization may be done as long as the result is the same as when executed in program order. In the case you cited, I would assume the optimization is local and that there are no other threads interested in this data. These optimizations are done to reduce the number of trips to main memory, which can be costly. You will only run into trouble with these optimizations when multiple threads are involved and they need to know each other's state.
Now if two threads need to see each other's state consistently, they can use volatile variables or a memory barrier (synchronized) to force serialization of writes/reads to main memory. InfoQ ran a nice article on this that might interest you.
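To make that concrete, here is a minimal sketch of my own (not from the article): thread B publishes data through a volatile flag, and the volatile write/read pair guarantees the reader also sees the plain write that preceded it in program order.

public class VolatileFlag {
    private static volatile boolean ready = false; // without volatile, the reader may never see the update
    private static int payload = 0;                // plain field, published via the volatile write below

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { /* spin until the writer publishes */ }
            System.out.println("payload = " + payload); // guaranteed to print 42
        });
        reader.start();
        payload = 42; // plain write, happens-before the volatile write in program order
        ready = true; // volatile write: everything before it becomes visible to the volatile reader
        reader.join();
    }
}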
The Java Memory Model (JMM) defines a partial ordering called happens-before on all actions within the program. There are seven rules defined to ensure happens-before ordering. One of them is the program order rule:
Program order rule. Each action in a thread happens-before every action in that thread that comes later in the program order.
According to this rule, the compiler will not reorder your code in any way that is observable from within the executing thread.
The book Java Concurrency in Practice gives an excellent explanation on this topic.
Related
I measure performance with the example code at the end.
If I call the checkPerformanceResult method with the parameter numberOfTimes set to 100, the parallel stream outperforms the sequential stream significantly (sequential=346, parallel=78).
If I set the parameter to 1000, the sequential stream outperforms the parallel stream significantly (sequential=3239, parallel=9337).
I did a lot of runs and the result is the same.
Can someone explain this behaviour to me and tell me what is going on under the hood here?
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class ParallelStreamExample {
    public static long checkPerformanceResult(Supplier<Integer> s, int numberOfTimes) {
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < numberOfTimes; i++) {
            s.get();
        }
        long endTime = System.currentTimeMillis();
        return endTime - startTime;
    }

    public static int sumSequentialStreamThread() {
        IntStream.rangeClosed(1, 10000000).sum();
        return 0;
    }

    public static int sumParallelStreamThread() {
        IntStream.rangeClosed(1, 10000000)
                 .parallel().sum();
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumSequentialStreamThread, 1000));
        System.out.println("break");
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumParallelStreamThread, 1000));
    }
}
Using threads doesn't always make the code run faster
When working with several threads there is always the overhead of managing each one (CPU time assigned to each thread by the OS, tracking the next line of code that needs to run in case of a context switch, etc.).
In this specific case
Each thread created in sumParallelStreamThread does very simple in-memory operations (calling a function that returns a number).
So the difference between sumSequentialStreamThread and sumParallelStreamThread is that in sumParallelStreamThread each simple operation carries the overhead of creating a thread and running it (assuming there isn't any thread optimization happening in the background),
while sumSequentialStreamThread does the same thing without the overhead of managing all the threads; that's why it runs faster.
When to use threads
The most common use case for working with threads is when you need to perform a bunch of I/O tasks.
What is considered an I/O task?
It depends on several factors; you can find a debate on it here.
But I think people will generally agree that making an HTTP request to somewhere or executing a database query can be considered an I/O operation.
Why is it more suitable?
Because I/O operations usually involve some period of waiting for a response.
For example, when querying a database, the thread performing the query will wait for the database to return the response (even if it's less than half a second). While that thread is waiting, a different thread can perform other actions, and that is where we gain performance; a rough sketch of this effect follows below.
I find that running tasks that involve only RAM and CPU operations on multiple threads usually makes the code run slower than with one thread.
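As a rough sketch of this point (the 100 ms sleep is a made-up stand-in for a database or HTTP round trip, and the class and method names here are my own), ten simulated I/O calls take about the time of one when their waits overlap on a thread pool:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IoBoundDemo {
    static void fakeIoCall() {
        try {
            Thread.sleep(100); // stands in for waiting on a database or HTTP response
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {
            pool.submit(IoBoundDemo::fakeIoCall); // the ten 100ms waits overlap
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // prints roughly 100ms instead of the ~1000ms a single thread would need
        System.out.println("took " + (System.currentTimeMillis() - start) + "ms");
    }
}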
Benchmark discussion
Regarding the benchmark remarks I see in the comments, I am not sure whether they are correct or not, but in these types of situations I would double-check my benchmark against a profiling tool (or just use one to begin with) like JProfiler or YourKit; they are usually very accurate.
I am working on someone's code and came across the equivalent of this:
for (int i = 0; i < someVolatileMember; i++) {
    // Removed for SO
}
Where someVolatileMember is defined like this:
private volatile int someVolatileMember;
If some thread, A, is running the for loop, and another thread, B, writes to someVolatileMember, then I assume the number of iterations to do could change while thread A is running the loop, which is not great. I assume this would fix it:
final int someLocalVar = someVolatileMember;
for (int i = 0; i < someLocalVar; i++) {
    // Removed for SO
}
My questions are:
1. Just to confirm that the number of iterations thread A does can be changed while the for loop is active if thread B modifies someVolatileMember.
2. That the local non-volatile copy is sufficient to make sure that when thread A runs the loop, thread B cannot change the number of iterations.
Your understanding is correct:
Per the Java Language Specification, the semantics of a volatile field ensure that values written by one thread are seen consistently by other threads:
The Java programming language provides a second mechanism, volatile fields, that is more convenient than locking for some purposes.
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
Note that even without the volatile modifier, the loop count is liable to change mid-loop depending on many factors; without volatile there is simply no guarantee about when, or whether, thread A sees the update.
Once a final variable is assigned, its value never changes, so the loop count will not change.
Well, first of all, that field is private (unless you omitted some methods that might actually alter it)...
That loop is a bit of nonsense the way it is written, assuming there are methods that might actually alter someVolatileMember: you might never know when it finishes, or whether it does at all. It might even turn out to be a much more expensive loop than with a non-volatile field, because volatile means invalidating caches and draining buffers at the CPU level much more often than with ordinary variables.
Your solution of first reading the volatile into a local and using that is actually a very common pattern; it has also given birth to a very common anti-pattern: "check then act"... You read it into a local variable because if it later changes, you don't care - you are working with the freshest copy you had at that moment. So yes, your solution of copying it locally is fine.
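For contrast, the classic "check then act" anti-pattern looks roughly like this (a made-up sketch, not from the question; sharedList is assumed to be accessed by multiple threads):

import java.util.List;

class CheckThenAct {
    void removeFirstIfPresent(List<String> sharedList) {
        if (!sharedList.isEmpty()) { // check: another thread may empty the list right here
            sharedList.remove(0);    // act: may now throw IndexOutOfBoundsException
        }
    }
}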
There are also performance implications, since the value of a volatile is never fetched from the most local cache; additional steps are taken by the CPU to ensure that modifications are propagated (cache coherence protocols, deferring reads to the L3 cache, or reading from RAM). There are also implications for other variables in scope where the volatile variable is used (these get synced with main memory too, though I am not demonstrating that here).
Regarding performance, the following code:
private static volatile int limit = 1_000_000_000;

public static void main(String[] args) {
    long start = System.nanoTime();
    for (int i = 0; i < limit; i++) {
        limit--; // modifying and reading, otherwise the compiler will optimise the volatile out
    }
    System.out.println(limit + " took " + (System.nanoTime() - start) / 1_000_000 + "ms");
}
... prints 500000000 took 4384ms
Removing volatile keyword from above will result in output 500000000 took 275ms.
Consider the following piece of code (which isn't quite what it seems at first glance).
static class NumberContainer {
    int value = 0;

    void increment() {
        value++;
    }

    int getValue() {
        return value;
    }
}

public static void main(String[] args) {
    List<NumberContainer> list = new ArrayList<>();
    int numElements = 100000;
    for (int i = 0; i < numElements; i++) {
        list.add(new NumberContainer());
    }

    int numIterations = 10000;
    for (int j = 0; j < numIterations; j++) {
        list.parallelStream().forEach(NumberContainer::increment);
    }

    list.forEach(container -> {
        if (container.getValue() != numIterations) {
            System.out.println("Problem!!!");
        }
    });
}
My question is: In order to be absolutely certain that "Problem!!!" won't be printed, does the "value" variable in the NumberContainer class need to be marked volatile?
Let me explain how I currently understand this.
In the first parallel stream, NumberContainer-123 (say) is incremented by ForkJoinWorker-1 (say). So ForkJoinWorker-1 will have an up-to-date cache of NumberContainer-123.value, which is 1. (Other fork-join workers, however, will have out-of-date caches of NumberContainer-123.value - they will store the value 0. At some point, these other workers' caches will be updated, but this doesn't happen straight away.)
The first parallel stream finishes, but the common fork-join pool worker threads aren't killed. The second parallel stream then starts, using the very same common fork-join pool worker threads.
Suppose, now, that in the second parallel stream, the task of incrementing NumberContainer-123 is assigned to ForkJoinWorker-2 (say). ForkJoinWorker-2 will have its own cached value of NumberContainer-123.value. If a long period of time has elapsed between the first and second increments of NumberContainer-123, then presumably ForkJoinWorker-2's cache of NumberContainer-123.value will be up to date, i.e. the value 1 will be stored, and everything is good. But what if the time elapsed between the first and second increments of NumberContainer-123 is extremely short? Then perhaps ForkJoinWorker-2's cache of NumberContainer-123.value might be out of date, storing the value 0, causing the code to fail!
Is my description above correct? If so, can anyone please tell me what kind of time delay between the two incrementing operations is required to guarantee cache consistency between the threads? Or if my understanding is wrong, then can someone please tell me what mechanism causes the thread-local caches to be "flushed" in between the first parallel stream and the second parallel stream?
It should not need any delay. By the time you're out of ParallelStream's forEach, all the tasks have finished. That establishes a happens-before relation between the increment and the end of forEach. All the forEach calls are ordered by being called from the same thread, and the check, similarly, happens-after all the forEach calls.
int numIterations = 10000;
for (int j = 0; j < numIterations; j++) {
    list.parallelStream().forEach(NumberContainer::increment);
    // here, everything is "flushed", i.e. the ForkJoinTask is finished
}
Back to your question about the threads: the trick here is that the threads are irrelevant. The memory model hinges on the happens-before relation, and the fork-join task ensures a happens-before relation between the call to forEach and the operation body, and between the operation body and the return from forEach (even if the returned value is Void).
See also Memory visibility in Fork-join
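The same kind of visibility guarantee can be demonstrated with plain Thread.join(), which also establishes a happens-before edge; a minimal sketch of my own, not tied to the fork-join internals:

public class JoinVisibility {
    static int plainField = 0; // deliberately not volatile

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> plainField = 42);
        worker.start();
        worker.join(); // everything the worker did happens-before join() returns
        System.out.println(plainField); // guaranteed to print 42, no volatile needed
    }
}

The forEach of a parallel stream gives you the same kind of edge: it does not return until all of its fork-join tasks have completed.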
As #erickson mentions in comments,
If you can't establish correctness through happens-before relationships,
no amount of time is "enough." It's not a wall-clock timing issue; you
need to apply the Java memory model correctly.
Moreover, thinking about it in terms of "flushing" the memory is wrong, as there are many more things that can affect you. Flushing, for instance, is trivial: I have not checked, but I can bet that there's just a memory barrier on task completion; yet you can still get wrong data because the compiler decided to optimise non-volatile reads away (the variable is not volatile and is not changed in this thread, so it's not going to change, so we can allocate it to a register, et voila), reorder the code in any way allowed by the happens-before relation, etc.
Most importantly, all those optimizations can and will change over time, so even if you went to the generated assembly (which may vary depending on the load pattern) and checked all the memory barriers, it does not guarantee that your code will work unless you can prove that your reads happen-after your writes, in which case Java Memory Model is on your side (assuming there's no bug in JVM).
As for the great pain, making the synchronization trivial is the very goal of ForkJoinTask, so enjoy. It was (it seems) done by marking java.util.concurrent.ForkJoinTask#status volatile, but that's an implementation detail you should not care about or rely upon.
Some principles of clean code are:
functions should do one thing at one abstraction level
functions should be at most 20 lines long
functions should never have more than 2 input parameters
How many cpu cycles are "lost" by adding an extra function call in Java?
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
E.g.
void foo() {
    bar1();
    bar2();
}

void bar1() {
    a();
    b();
}

void bar2() {
    c();
    d();
}

Would become

void foo() {
    a();
    b();
    c();
    d();
}
How many cpu cycles are "lost" by adding an extra function call in Java?
This depends on whether it is inlined or not. If it is inlined, the cost will be nothing (or a notional amount).
If it is not compiled at runtime, it hardly matters, because the cost of interpreting outweighs such a micro-optimisation, and the code is likely not called enough to matter (which is why it wasn't optimised).
The only time it really matters is when the code is called often, however for some reason it is prevented from being optimised. I would only assume this is the case because you have a profiler telling you this is a performance issue, and in this case manual inlining might be the answer.
I design, develop and optimise latency-sensitive code in Java, and I choose to manually inline methods much less than 1% of the time, and only after a profiler, e.g. Flight Recorder, suggests there is a significant performance problem.
In the rare event it matters, how much difference does it make?
I would estimate between 0.03 and 0.1 microseconds in real applications for each extra call; in a micro-benchmark it would be far less.
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
Yes. In fact what can happen is that not only are all these methods inlined, but the methods which call them are inlined as well, and none of them matter at runtime - but only if the code is called enough to be optimised. I.e. not only are a, b, c and d and their code inlined, but foo is inlined as well.
By default the Oracle JVM can inline to a depth of 9 levels (until the code gets to more than 325 bytes of byte code).
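If you want to see HotSpot's inlining decisions for yourself, the diagnostic flags below should work on the Oracle/HotSpot JVM (defaults such as the depth and size thresholds vary between JVM versions, so treat the numbers as illustrative, and MyApp is a placeholder class name):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyApp
java -XX:MaxInlineLevel=9 -XX:FreqInlineSize=325 MyApp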
Will clean code help performance
The JVM runtime optimiser has common patterns it optimises for. Clean, simple code is generally easier to optimise, and when you try something tricky or non-obvious you can end up being much slower. If it is harder for a human to understand, there is a good chance it's hard for the optimiser to understand/optimise too.
Runtime behavior and cleanliness of code (a compile time or life time property of code) belong to different requirement categories. There might be cases where optimizing for one category is detrimental to the other.
The question is: which category really needs your attention?
In my view, cleanliness of code (or malleability of software) suffers from a huge lack of attention. You should focus on that first. Only if other requirements start to fall behind (e.g. performance) should you inquire whether that's due to how clean the code is. That means you need to really compare; you need to measure the difference it makes. With regard to performance, use a profiler of your choice: run the "dirty" code variant and the clean variant and check the difference. Is the difference marked? Only if the "dirty" variant is significantly faster should you lower the cleanliness.
Consider the following piece of code, which compares a code that does 3 things in one for loop to another that has 3 different for loops for each task.
@Test
public void singleLoopVsMultiple() {
    for (int j = 0; j < 5; j++) {
        // single loop
        int x = 0, y = 0, z = 0;
        long l = System.currentTimeMillis();
        for (int i = 0; i < 100000000; i++) {
            x++;
            y++;
            z++;
        }
        l = System.currentTimeMillis() - l;

        // multiple loops doing the same thing
        int a = 0, b = 0, c = 0;
        long m = System.currentTimeMillis();
        for (int i = 0; i < 100000000; i++) {
            a++;
        }
        for (int i = 0; i < 100000000; i++) {
            b++;
        }
        for (int i = 0; i < 100000000; i++) {
            c++;
        }
        m = System.currentTimeMillis() - m;

        System.out.println(String.format("%d,%d", l, m));
    }
}
When I run it, here is the output I get for time in milliseconds.
6,5
8,0
0,0
0,0
0,0
After a few runs, the JVM is able to identify hotspots of intensive code and optimises parts of it to make them significantly faster. In the example above, after 2 runs the JVM had already optimised the code so much that the discussion around for-loops became redundant.
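This also shows why timing cold code with currentTimeMillis mostly measures the interpreter and the JIT itself. A minimal hand-rolled warm-up looks like the sketch below (class and method names are my own; for real measurements a harness like JMH is the usual answer):

public class WarmUp {
    static long work() {
        long x = 0;
        for (int i = 0; i < 1_000_000; i++) {
            x += i;
        }
        return x;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            work(); // warm-up: give the JIT a chance to compile work()
        }
        long start = System.nanoTime();
        long result = work(); // the timed run now hits compiled code
        System.out.println(result + " took " + (System.nanoTime() - start) + "ns");
    }
}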
Unless we know what's happening inside, we cannot predict the performance implications of changes like the introduction of for-loops. The only way to actually improve the performance of a system is to measure it and focus only on fixing the actual bottlenecks.
There is a chance that cleaning your code may make it faster for the JVM. But even if that is not the case, every performance optimisation comes with added code complexity. Ask yourself whether the added complexity is worth the future maintenance effort. After all, the most expensive resource on any team is the developer, not the servers, and any additional complexity slows the developer down, adding to the project cost.
The way to deal with it is to figure out your benchmarks: what kind of application you're making, and where the bottlenecks are. If you're making a web app, perhaps the DB is taking most of the time, and reducing the number of function calls will not make a difference. On the other hand, if it's an app running on a system where performance is everything, every small thing counts.
Refer to this blog and this topic.
It seems code can be reordered even in a single thread?
public int hashCode() {
    if (hash == 0) { // (1)
        int off = offset;
        char val[] = value;
        int len = count;

        int h = 0;
        for (int i = 0; i < len; i++) {
            h = 31 * h + val[off++];
        }
        hash = h;
    }
    return hash; // (2)
}
But it's really confusing to me: why could (2) return 0 while the check at (1) saw a non-zero value?
If I used this code in a single thread it wouldn't even work under that reading, so how can this happen?
The first point of java memory model is:
Each action in a thread happens before every action in that thread
that comes later in the program's order.
That's why reordering in a single thread is impossible to observe. As long as the code is not synchronized, such guarantees are not provided across multiple threads.
Have a look at the String.hashCode() implementation. It first loads hash into a local variable and only then performs the check and the return. That's how such reorderings are prevented. But this does not save us from the hash possibly being calculated multiple times.
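For reference, the shape of that idiom looks roughly like this (a simplified sketch in the style of String.hashCode, reusing the field names from the snippet above): the field is read into a local exactly once, so the value checked and the value returned are necessarily the same.

public int hashCode() {
    int h = hash; // single read of the shared field
    if (h == 0) {
        for (int i = 0; i < count; i++) {
            h = 31 * h + value[offset + i];
        }
        hash = h; // benign race: any thread that writes, writes the same value
    }
    return h; // no second read of hash, so a stale 0 can never be returned
}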
First question:
Will reordering of instructions happen in single threaded execution?
Answer:
Reordering of instructions is a compiler optimization. The reordering applied to the instructions of one thread is the same no matter how many threads are involved. So: yes, it happens in single-threaded execution, too.
Second question:
Why could this lead to a problem in multi-threading but not with one thread?
Answer:
The rules for this reordering are designed to guarantee that there are no strange effects in single-threaded or correctly synchronized code. That means: if we write code that is neither single-threaded nor correctly synchronized, there might be strange effects, and we have to understand the rules and take care to avoid those effects.
So, again, as the author of the original blog said: don't try this if you're not really sure you understand those rules. Every compiler will be tested not to break String.hashCode(), but compilers won't be tested with your code.
Edit:
Third question:
And again what is really happening?
Answer:
Looking at the code, it deals fine with not seeing changes made by another thread. So the first thing we have to understand is: a method doesn't return a variable, nor a constant, nor a literal. No, a method returns whatever is on top of the stack when the program counter is reset. That value has to be initialized at some point, and it can be overwritten later on. This means the return value can first be initialized with the content of hash (0 at that moment); then another thread finishes the calculation and sets hash to something non-zero; and only then does the check hash == 0 run. The check fails, so the return value is never overwritten, and 0 is returned.
So the point is: the return value can change independently of the returned variable, as they are not the same thing. Modern programming languages make them look the same to make our lives easier. But this abstraction breaks down when you don't adhere to the rules.