The Java specification guarantees that primitive variable assignments are always atomic (except for the long and double types).
By contrast, fetch-and-add, the operation corresponding to the familiar i++ increment, is non-atomic, because it expands to a read-modify-write sequence.
Assuming this code:
public void assign(int b) {
int a = b;
}
The generated bytecode is:
public void assign(int);
Code:
0: iload_1
1: istore_2
2: return
Thus, we see the assignment is composed of two steps (loading and storing).
Assuming this code:
public void assign(int b) {
int i = b++;
}
Bytecode:
public void assign(int);
Code:
0: iload_1
1: iinc 1, 1 // extra step here compared with the previous sample
4: istore_2
5: return
Knowing that x86 processors (at least modern ones) can perform the increment operation atomically, as stated:
In computer science, the fetch-and-add CPU instruction is a special
instruction that atomically modifies the contents of a memory
location. It is used to implement mutual exclusion and concurrent
algorithms in multiprocessor systems, a generalization of semaphores.
Thus, the first question: despite the fact that the bytecode requires both steps (loading and storing), does Java rely on assignment being an operation that is always carried out atomically whatever the processor's architecture, and can it therefore guarantee permanent atomicity (for primitive assignments) in its specification?
Second question: is it wrong to claim that, on a very modern x86 processor and without sharing compiled code across different architectures, there is no need at all to synchronize the i++ operation (or to use AtomicInteger), considering it already atomic?
Even if i++ translated into an x86 fetch-and-add instruction, that would change nothing, because the memory referred to by the fetch-and-add instruction is the CPU's local registers, not the general memory of the device/application. On a modern CPU this property extends to the CPU's local memory caches, and can even extend to the caches used by the different cores of a multicore CPU, but in a multithreaded application there is absolutely no guarantee that it extends to the working copies of the variables used by the threads themselves.
In short, in a multithreaded application, if a variable can be modified by different threads running at the same time, then you must use a synchronization mechanism provided by the system; you cannot rely on the fact that i++ occupies a single line of Java code to make it atomic.
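For example, a minimal sketch of the usual remedy with intrinsic locking (class and method names are illustrative, not from the question):

class SafeCounter {
    private int count;

    synchronized void increment() {
        count++;   // the whole read-modify-write now happens under the lock
    }

    synchronized int get() {
        return count;   // reads take the same lock, so they see the latest value
    }
}

java.util.concurrent.atomic.AtomicInteger achieves the same effect without a lock.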
Considering the second question: you imply that i++ will translate into the x86 fetch-and-add instruction, which is not true. If the code is compiled and optimized by the JVM it may be true (one would have to check the JVM source code to confirm that), but the same code can also run in interpreted mode, where the fetch and the add are separate and not synchronized.
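As a side note, you can force interpreted-only execution with the standard -Xint HotSpot flag if you want to observe that mode for yourself (exact behaviour varies by JVM build):

java -Xint -cp . Main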
Out of curiosity I checked what assembly code is generated for this Java code:
public class Main {
volatile int a;
static public final void main (String[] args) throws Exception {
new Main ().run ();
}
private void run () {
for (int i = 0; i < 1000000; i++) {
increase ();
}
}
private void increase () {
a++;
}
}
I used Java HotSpot(TM) Server VM (17.0-b12-fastdebug) for windows-x86 JRE (1.6.0_20-ea-fastdebug-b02), built on Apr 1 2010 03:25:33 version of JVM (this one I had somewhere on my drive).
This is the crucial output of running it (java -server -XX:+PrintAssembly -cp . Main):
At first it is compiled into this:
00c PUSHL EBP
SUB ESP,8 # Create frame
013 MOV EBX,[ECX + #8] # int ! Field VolatileMain.a
016 MEMBAR-acquire ! (empty encoding)
016 MEMBAR-release ! (empty encoding)
016 INC EBX
017 MOV [ECX + #8],EBX ! Field VolatileMain.a
01a MEMBAR-volatile (unnecessary so empty encoding)
01a LOCK ADDL [ESP + #0], 0 ! membar_volatile
01f ADD ESP,8 # Destroy frame
POPL EBP
TEST PollPage,EAX ! Poll Safepoint
029 RET
Then it is inlined and compiled into this:
0a8 B11: # B11 B12 <- B10 B11 Loop: B11-B11 inner stride: not constant post of N161 Freq: 0.999997
0a8 MOV EBX,[ESI] # int ! Field VolatileMain.a
0aa MEMBAR-acquire ! (empty encoding)
0aa MEMBAR-release ! (empty encoding)
0aa INC EDI
0ab INC EBX
0ac MOV [ESI],EBX ! Field VolatileMain.a
0ae MEMBAR-volatile (unnecessary so empty encoding)
0ae LOCK ADDL [ESP + #0], 0 ! membar_volatile
0b3 CMP EDI,#1000000
0b9 Jl,s B11 # Loop end P=0.500000 C=126282.000000
As you can see it does not use Fetch-And-Add instructions for a++.
Regarding your first question: the read and the write are atomic, but the read/write operation as a whole is not. I could not find a specific reference for primitives, but JLS §17.7 says something similar regarding references:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
So in your case, both the iload and istore are atomic, but the whole (iload, istore) operation is not.
Is it wrong to [consider that] there's no need at all to synchronize the i++ operation?
Regarding your second question, the code below prints 982 on my x86 machine (and not 1,000), which shows that some ++ operations got lost in translation: you need to properly synchronize a ++ operation even on a processor architecture that supports a fetch-and-add instruction.
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Test1 {
private static int i = 0;
public static void main(String args[]) throws InterruptedException {
ExecutorService executor = Executors.newFixedThreadPool(10);
final CountDownLatch start = new CountDownLatch(1);
final Set<Integer> set = new ConcurrentSkipListSet<>();
Runnable r = new Runnable() {
@Override
public void run() {
try {
start.await();
} catch (InterruptedException ignore) {}
for (int j = 0; j < 100; j++) {
set.add(i++);
}
}
};
for (int j = 0; j < 10; j++) {
executor.submit(r);
}
start.countDown();
executor.shutdown();
executor.awaitTermination(1, TimeUnit.SECONDS);
System.out.println(set.size());
}
}
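For comparison, a hedged sketch of the same experiment with AtomicInteger in place of i++ (the class name Test2 is illustrative); since getAndIncrement is atomic, the set should reliably contain all 1000 distinct values:

import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Test2 {
    private static final AtomicInteger i = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(10);
        final Set<Integer> set = new ConcurrentSkipListSet<>();
        for (int j = 0; j < 10; j++) {
            executor.submit(new Runnable() {
                public void run() {
                    for (int k = 0; k < 100; k++) {
                        set.add(i.getAndIncrement()); // atomic post-increment
                    }
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(set.size()); // prints 1000: no lost updates
    }
}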
Related
Why is i++ not atomic in Java?
To get a bit deeper into Java, I tried to count how often the loops in my threads are executed.
So I used a
private static int total = 0;
in the main class.
I have two threads.
Thread 1: Prints System.out.println("Hello from Thread 1!");
Thread 2: Prints System.out.println("Hello from Thread 2!");
And I count the lines printed by thread 1 and thread 2. But the lines of thread 1 plus the lines of thread 2 don't match the total number of lines printed.
Here is my code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Level;
import java.util.logging.Logger;
public class Test {
private static int total = 0;
private static int countT1 = 0;
private static int countT2 = 0;
private boolean run = true;
public Test() {
ExecutorService newCachedThreadPool = Executors.newCachedThreadPool();
newCachedThreadPool.execute(t1);
newCachedThreadPool.execute(t2);
try {
Thread.sleep(1000);
}
catch (InterruptedException ex) {
Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
}
run = false;
try {
Thread.sleep(1000);
}
catch (InterruptedException ex) {
Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
}
System.out.println((countT1 + countT2 + " == " + total));
}
private Runnable t1 = new Runnable() {
@Override
public void run() {
while (run) {
total++;
countT1++;
System.out.println("Hello #" + countT1 + " from Thread 2! Total hello: " + total);
}
}
};
private Runnable t2 = new Runnable() {
@Override
public void run() {
while (run) {
total++;
countT2++;
System.out.println("Hello #" + countT2 + " from Thread 2! Total hello: " + total);
}
}
};
public static void main(String[] args) {
new Test();
}
}
i++ is probably not atomic in Java because atomicity is a special requirement which is not present in the majority of the uses of i++. That requirement has a significant overhead: there is a large cost in making an increment operation atomic; it involves synchronization at both the software and hardware levels that need not be present in an ordinary increment.
You could make the argument that i++ should have been designed and documented as specifically performing an atomic increment, so that a non-atomic increment is performed using i = i + 1. However, this would break the "cultural compatibility" between Java, and C and C++. As well, it would take away a convenient notation which programmers familiar with C-like languages take for granted, giving it a special meaning that applies only in limited circumstances.
Basic C or C++ code like for (i = 0; i < LIMIT; i++) would translate into Java as for (i = 0; i < LIMIT; i = i + 1); because it would be inappropriate to use the atomic i++. What's worse, programmers coming from C or other C-like languages to Java would use i++ anyway, resulting in unnecessary use of atomic instructions.
Even at the machine instruction set level, an increment-type operation is usually not atomic, for performance reasons. In x86, a special "lock" prefix must be used to make the inc instruction atomic, for the same reasons as above. If inc were always atomic, it would simply be avoided wherever a non-atomic increment suffices; programmers and compilers would generate code that loads, adds 1 and stores, because it would be way faster.
In some instruction set architectures, there is no atomic inc or perhaps no inc at all; to do an atomic inc on MIPS, you have to write a software loop which uses the ll and sc: load-linked, and store-conditional. Load-linked reads the word, and store-conditional stores the new value if the word has not changed, or else it fails (which is detected and causes a re-try).
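The same retry pattern shows up at the Java level. Here is a hedged sketch of an atomic increment built on compareAndSet, which plays the role of the store-conditional (the JIT typically compiles it to the platform's CAS or ll/sc sequence; the class name is illustrative):

import java.util.concurrent.atomic.AtomicInteger;

class CasRetry {
    static int incrementAndGet(AtomicInteger counter) {
        int current;
        do {
            current = counter.get();                            // read, like load-linked
        } while (!counter.compareAndSet(current, current + 1)); // fails and retries if another thread interfered
        return current + 1;
    }
}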
i++ involves two operations:
read the current value of i
increment the value and assign it to i
When two threads perform i++ on the same variable at the same time, they may both get the same current value of i, and then increment and set it to i+1, so you'll get a single incrementation instead of two.
Example:
int i = 5;
Thread 1 : i++;
// reads value 5
Thread 2 : i++;
// reads value 5
Thread 1 : // increments i to 6
Thread 2 : // increments i to 6
// i == 6 instead of 7
Java specification
The important thing is the JLS (Java Language Specification) rather than how various implementations of the JVM may or may not have implemented a certain feature of the language.
The JLS defines the ++ postfix operator in clause 15.14.2, which says, among other things, that "the value 1 is added to the value of the variable and the sum is stored back into the variable". Nowhere does it mention or hint at multithreading or atomicity.
For multithreading or atomicity, the JLS provides volatile and synchronized. Additionally, there are the Atomic… classes.
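A minimal sketch of the distinction (field names are illustrative): volatile gives visibility, but the ++ remains a racy read-modify-write, whereas the other two tools make it atomic:

import java.util.concurrent.atomic.AtomicInteger;

class CounterVariants {
    volatile int racy;                                 // visible to all threads, but ++ still not atomic
    int guarded;                                       // guarded by this
    final AtomicInteger atomic = new AtomicInteger();

    void increment() {
        racy++;                                        // lost updates possible
        synchronized (this) { guarded++; }             // atomic via locking
        atomic.incrementAndGet();                      // atomic via CAS
    }
}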
Why is i++ not atomic in Java?
Let's break the increment operation into multiple statements:
Thread 1 & 2 :
Fetch value of total from memory
Add 1 to the value
Write back to the memory
If there is no synchronization then let's say Thread one has read the value 3 and incremented it to 4, but has not written it back. At this point, the context switch happens. Thread two reads the value 3, increments it and the context switch happens. Though both threads have incremented the total value, it will still be 4 - race condition.
i++ is a statement which simply involves 3 operations:
Read the current value
Add one to it
Write the new value back
These three operations are not executed in a single step; in other words, i++ is a compound, non-atomic operation. As a result, all sorts of things can go wrong when more than one thread is involved in it.
Consider the following scenario:
Time 1:
Thread A fetches i
Thread B fetches i
Time 2:
Thread A computes a new value for i, say -foo-
Thread B computes a new value for i, say -bar-
Thread B stores -bar- in i
// At this time thread B seems to be more 'active'. Not only does it overwrite
// its local copy of i but also makes it in time to store -bar- back to
// 'main' memory (i)
Time 3:
Thread A attempts to store -foo- in memory effectively overwriting the -bar-
value (in i) which was just stored by thread B in Time 2.
Thread B has nothing to do here. Its work was done by Time 2. However it was
all for nothing as -bar- was eventually overwritten by another thread.
And there you have it. A race condition.
That's why i++ is not atomic. If it was, none of this would have happened and each fetch-update-store would happen atomically. That's exactly what AtomicInteger is for and in your case it would probably fit right in.
P.S.
An excellent book covering all of those issues and then some is this:
Java Concurrency in Practice
In the JVM, an increment involves a read and a write, so it's not atomic.
If the operation i++ were atomic, you wouldn't have the chance to read the value from it. But that is exactly what you want to do when using i++ (instead of ++i).
For example look at the following code:
public static void main(final String[] args) {
int i = 0;
System.out.println(i++);
}
In this case we expect the output to be: 0
(because we post increment, e.g. first read, then update)
This is one of the reasons the operation can't be atomic, because you need to read the value (and do something with it) and then update the value.
The other important reason is that doing something atomically usually takes more time because of locking. It would be silly to have all the operations on primitives take a little bit longer for the rare cases when people want to have atomic operations. That is why they've added AtomicInteger and other atomic classes to the language.
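For completeness, the atomic classes expose exactly this read-then-update pairing as a single atomic operation; a minimal sketch (class name illustrative):

import java.util.concurrent.atomic.AtomicInteger;

public class PostIncrementDemo {
    public static void main(String[] args) {
        AtomicInteger i = new AtomicInteger(0);
        System.out.println(i.getAndIncrement()); // prints 0: read-then-update, like i++, but atomic
        System.out.println(i.incrementAndGet()); // prints 2: update-then-read, like ++i
    }
}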
There are two steps:
fetch i from memory
store i+1 back into i
so it's not an atomic operation.
When thread1 executes i++, and thread2 executes i++ at the same time, the final value of i may be i+1 instead of i+2.
In the JVM, or any VM, i++ is equivalent to the following:
int temp = i; // 1. read
i = temp + 1; // 2. increment the value then 3. write it back
that is why i++ is non-atomic.
Concurrency (the Thread class and such) is an added feature in v1.0 of Java. i++ was added in the beta before that, and as such it is still more than likely in its (more or less) original implementation.
It is up to the programmer to synchronize variables. Check out Oracle's tutorial on this.
Edit: To clarify, i++ is a well defined procedure that predates Java, and as such the designers of Java decided to keep the original functionality of that procedure.
The ++ operator was defined in B (1969), which predates Java and threading by just a tad.
We know that it is expensive to catch exceptions. But, is it also expensive to use a try-catch block in Java even if an exception is never thrown?
I found the Stack Overflow question/answer Why are try blocks expensive?, but it is for .NET.
try has almost no expense at all. Instead of doing the work of setting up the try at runtime, the code's metadata is structured at compile time such that when an exception is thrown, it now does a relatively expensive operation of walking up the stack and seeing if any try blocks exist that would catch this exception. From a layman's perspective, try may as well be free. It's actually throwing the exception that costs you - but unless you're throwing hundreds or thousands of exceptions, you still won't notice the cost.
try has some minor costs associated with it. Java cannot do some optimizations on code in a try block that it would otherwise do. For example, Java will often re-arrange instructions in a method to make it run faster - but Java also needs to guarantee that if an exception is thrown, the method's execution is observed as though its statements, as written in the source code, executed in order up to some line.
Because in a try block an exception can be thrown (at any line in the try block! Some exceptions are thrown asynchronously, such as by calling stop on a Thread (which is deprecated), and even besides that OutOfMemoryError can happen almost anywhere) and yet it can be caught and code continue to execute afterwards in the same method, it is more difficult to reason about optimizations that can be made, so they are less likely to happen. (Someone would have to program the compiler to do them, reason about and guarantee correctness, etc. It'd be a big pain for something meant to be 'exceptional') But again, in practice you won't notice things like this.
Let's measure it, shall we?
public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
abstract int run(int iterations) throws Throwable;
private BigDecimal time() {
try {
int nextI = 1;
int i;
long duration;
do {
i = nextI;
long start = System.nanoTime();
run(i);
duration = System.nanoTime() - start;
nextI = (i << 1) | 1;
} while (duration < 100000000 && nextI > 0);
return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public String toString() {
return name + "\t" + time() + " ns";
}
public static void main(String[] args) throws Exception {
Benchmark[] benchmarks = {
new Benchmark("try") {
@Override int run(int iterations) throws Throwable {
int x = 0;
for (int i = 0; i < iterations; i++) {
try {
x += i;
} catch (Exception e) {
e.printStackTrace();
}
}
return x;
}
}, new Benchmark("no try") {
@Override int run(int iterations) throws Throwable {
int x = 0;
for (int i = 0; i < iterations; i++) {
x += i;
}
return x;
}
}
};
for (Benchmark bm : benchmarks) {
System.out.println(bm);
}
}
}
On my computer, this prints something like:
try 0.598 ns
no try 0.601 ns
At least in this trivial example, the try statement had no measurable impact on performance. Feel free to measure more complex ones.
Generally speaking, I recommend not to worry about the performance cost of language constructs until you have evidence of an actual performance problem in your code. Or as Donald Knuth put it: "premature optimization is the root of all evil".
try/catch may have some impact on performance. This is because it prevents JVM from doing some optimizations. Joshua Bloch, in "Effective Java," said the following:
• Placing code inside a try-catch block inhibits certain optimizations that modern JVM implementations might otherwise perform.
Yep, as the others have said, a try block inhibits some optimizations across the {} characters surrounding it. In particular, the optimizer must assume that an exception could occur at any point within the block, so there's no assurance that statements get executed.
For example:
try {
    int x = a + b * c * d;
    // ... other stuff ...
}
catch (Exception e) {
    // ...
}
int y = a + b * c * d;
// use y somehow
Without the try, the value calculated to assign to x could be saved as a "common subexpression" and reused to assign to y. But because of the try there is no assurance that the first expression was ever evaluated, so the expression must be recomputed. This isn't usually a big deal in "straight-line" code, but can be significant in a loop.
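A hedged sketch of the loop case (the arithmetic and exception type are illustrative): without the try, b * c * d is a loop invariant the JIT could hoist out; the try forces more conservative treatment:

class TryInLoop {
    static long sum(int a, int b, int c, int d, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            try {
                total += a + b * c * d;        // invariant subexpression, a hoisting candidate
            } catch (RuntimeException e) {
                // the mere possibility of landing here constrains reordering
            }
        }
        return total;
    }
}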
It should be noted, however, that this applies ONLY to JITCed code. javac does only a piddling amount of optimization, and there is zero cost to the bytecode interpreter to enter/leave a try block. (There are no bytecodes generated to mark the block boundaries.)
And for bestsss:
public class TryFinally {
public static void main(String[] argv) throws Throwable {
try {
throw new Throwable();
}
finally {
System.out.println("Finally!");
}
}
}
Output:
C:\JavaTools>java TryFinally
Finally!
Exception in thread "main" java.lang.Throwable
at TryFinally.main(TryFinally.java:4)
javap output:
C:\JavaTools>javap -c TryFinally.class
Compiled from "TryFinally.java"
public class TryFinally {
public TryFinally();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]) throws java.lang.Throwable;
Code:
0: new #2 // class java/lang/Throwable
3: dup
4: invokespecial #3 // Method java/lang/Throwable."<init>":()V
7: athrow
8: astore_1
9: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
12: ldc #5 // String Finally!
14: invokevirtual #6 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
17: aload_1
18: athrow
Exception table:
from to target type
0 9 8 any
}
No "GOTO".
To understand why the optimizations cannot be performed, it is useful to understand the underlying mechanisms. The most succinct example I could find was implemented in C macros at: http://www.di.unipi.it/~nids/docs/longjump_try_trow_catch.html
#include <stdio.h>
#include <setjmp.h>
#define TRY do{ jmp_buf ex_buf__; switch( setjmp(ex_buf__) ){ case 0: while(1){
#define CATCH(x) break; case x:
#define FINALLY break; } default:
#define ETRY } }while(0)
#define THROW(x) longjmp(ex_buf__, x)
Compilers often have difficulty determining whether a jump can be localized to X, Y and Z, so they skip optimizations that they can't guarantee to be safe, but the implementation itself is rather light.
Yet another microbenchmark (source).
I created a test in which I measure try-catch and no-try-catch versions of the code, varying the exception percentage. A percentage of 10% means that 10% of the test cases had a division by zero. In one variant it is handled by a try-catch block, in the other by a conditional check. Here is my results table:
OS: Windows 8 6.2 x64
JVM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 23.25-b01
Percentage | Result (try/if, ns)
0% | 88/90
1% | 89/87
10% | 86/97
90% | 85/83
Which says that there is no significant difference between any of these cases.
I have found catching NullPointerException quite expensive. For 1.2k operations the time was 200 ms, versus 12 ms when I handled the same case with if (object == null), which was a considerable improvement for me.
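A hedged sketch of that comparison (method names are illustrative): a cheap branch versus paying for exception creation and stack unwinding:

class NullCheckVsCatch {
    static int lengthByCatch(String s) {
        try {
            return s.length();                  // throws NullPointerException when s is null
        } catch (NullPointerException e) {
            return 0;                           // expensive path: stack trace fill-in, unwinding
        }
    }

    static int lengthByCheck(String s) {
        return (s == null) ? 0 : s.length();    // plain branch, no exception machinery
    }
}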
I'm attempting to test some benchmarking tools by running them against a simple program which increments a variable as many times as possible for 1000 milliseconds.
How many increments of a single 64-bit number should I expect to be able to perform on an Intel i7 chip with the JDK for Mac OS X?
My current methodology is:
start a thread (t2) that continually increments "i" in an infinite loop (for(;;)).
let the main thread (call it t1) sleep for 1000 milliseconds.
have t1 interrupt (or stop, since this deprecated method works on Apple's JDK 6) t2.
Currently, I am reproducibly getting about 2E8 increments (tabulated below: each value shown is what is printed when the incrementing thread is interrupted after a 1000 millisecond sleep() in the calling thread).
217057470
223302277
212337757
215177075
214785738
213849329
215645992
215651712
215363726
216135710
How can I know whether this benchmark is reasonable or not, i.e., what is the theoretical fastest speed at which an i7 chip should be able to increment a single 64-bit number? The code runs in the JVM and is below:
package net.rudolfcode.jvm;
/**
* How many instructions can the JVM execute in a second?
* @author jayunit100
*/
public class Example3B {
public static void main(String[] args){
for(int i =0 ; i < 10 ; i++){
Thread addThread = createThread();
runForASecond(addThread,1000);
}
}
private static Thread createThread() {
Thread addThread = new Thread(){
Long i =0L;
public void run() {
boolean t=true;
for (;;) {
try {
i++;
}
catch (Exception e) {
e.printStackTrace();
}
}
}
@Override
public void interrupt() {
System.out.println(i);
super.interrupt();
}
};
return addThread;
}
private static void runForASecond(Thread addThread, int milli) {
addThread.start();
try{
Thread.sleep(milli);
}
catch(Exception e){
}
addThread.interrupt();
//stop() works on some JVMs...
addThread.stop();
}
}
Theoretically, making some assumptions which are probably not valid:
Assume that a number can be incremented in 1 instruction (probably not, because you're running in a JVM and not natively)
Assume that a 2.5 GHz processor can execute 2,500,000,000 instructions per second (but in reality, it's more complicated than that)
Then you could say that 2,500,000,000 increments in 1 second is a "reasonable" upper bound based on the simplest possible back-of-the-envelope estimation.
How far off is that from your measurement?
2,500,000,000 is O(1,000,000,000)
2E8 is O(100,000,000)
So we're only off by 1 order of magnitude. Given the wildly unfounded assumptions – sounds reasonable to me.
First of all, beware of JVM optimisations! You must be sure you measure exactly what you think you measure. Since Long i = 0L; is not volatile and the intermediate values are effectively unused, the JIT can do pretty nasty things.
As for the estimation, you can expect no more than X*10^9 operations per second on an X GHz machine. You can safely divide that value by about 10, because bytecode instructions aren't mapped 1:1 to machine instructions.
So you're pretty close :)
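If you want to avoid interrupt()/stop() altogether, a hedged alternative is a fixed-duration loop on a local primitive long (the JIT may still transform the loop, so treat the result as a rough figure; the class name is illustrative):

public class IncrementRate {
    public static void main(String[] args) {
        long i = 0;
        long deadline = System.nanoTime() + 1000000000L;   // run for roughly one second
        while (System.nanoTime() < deadline) {
            for (int k = 0; k < 1000; k++) {
                i++;                                        // batch increments between clock reads
            }
        }
        System.out.println(i + " increments in ~1s");
    }
}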
I've been told at school that it's a bad practice to modify the index variable of a for loop:
Example :
for(int i = 0 ; i < limit ; i++){
if(something){
i+=2; //bad
}
if(something){
limit+=2; //bad
}
}
The argument was that some compiler optimizations can optimize the loop by not recalculating the index and bound on each iteration.
I've made some tests in Java, and it seems that by default the index and bound are recalculated each time.
I'm wondering if it's possible to activate this kind of feature in the HotSpot JVM?
For example to optimize this kind of loop :
for(int i = 0 ; i < foo.getLength() ; i++){ }
without having to write :
int length = foo.getLength();
for(int i = 0 ; i < length ; i++){ }
It's just an example; I'm curious to try it and see the improvements.
EDIT
Following Peter Lawrey's answer: why, in this simple example, doesn't the JVM inline the getLength() method?
public static void main(String[] args) {
Too t = new Too();
for(int j=0; j<t.getLength();j++){
}
}
class Too {
int l = 10;
public Too() {
}
public int getLength(){
//System.out.println("test");
return l;
}
}
In the output "test" is print 10 times.
I think it could be nice to optimize this kind of execution.
EDIT 2 :
Seems I made a misunderstood...
I have remove the println and indeed the profiler tell me that the method getLength() is not even call once in this case.
I've made some tests in Java, and it seems that by default the index and bound are recalculated each time.
According to the Java Language Specification, this:
for(int i = 0 ; i < foo.getLength() ; i++){ }
means that getLength() is called on each loop iteration. Java compilers are only allowed to move the getLength() call out of the loop if they can effectively prove that it does not alter the observable behavior.
(For instance, if getLength() just returns the value of some variable, then there is a chance that the JIT compiler can inline the call. If after inlining it can deduce that the variable won't change (under certain assumptions) it can apply a hoisting optimization. On the other hand, if getLength() involves getting the length of a concurrent or synchronized collection, the chances are slim to none that the hoisting optimization will be permitted ... because of potential actions of other threads.)
So that's what a compiler is allowed to do.
I'm wondering if it's possible to activate this kind of feature in the JVM HotSpot?
The simple answer is No.
You seem to be suggesting a compiler switch that tells / allows the compiler to ignore the JLS rules. There is no such switch. Such a switch would be a BAD IDEA. It would be liable to cause correct/valid/working programs to break. Consider this:
class Test {
int count;
int test(String[] arg) {
for (int i = 0; i < getLength(arg); i++) {
// ...
}
return count;
}
int getLength(String[] arg) {
count++;
return arg.length;
}
}
If the compiler was permitted to move the getLength(arg) call out of the loop, it would change the number of times that the method was called, and therefore change the value returned by the test method.
Java optimizations that change the behaviour of a properly written Java program are not valid optimizations. (Note that multi-threading tends to muddy the waters. The JLS, and specifically the memory model rules, permit a compiler to perform optimizations that could result in different threads seeing inconsistent versions of the application's state ... if they don't synchronize properly, resulting in behaviour that is incorrect from the developer's perspective. But the real problem is with the application, not the compiler.)
By the way, a more convincing reason that you shouldn't change the loop variable in the loop body is that it makes your code harder to understand.
It depends on what foo.getLength() does. If it can be inlined, it can be effectively the same thing. If it cannot be inlined, the JVM cannot determine whether the result is the same.
BTW you can write the for as a one-liner.
for(int i = 0, length = foo.getLength(); i < length; i++){ }
EDIT: It is worth noting that:
methods and loops are usually not optimised until they have been called about 10,000 times.
profilers sub-sample invocations to reduce overhead. They might count every 10th, 100th, or rarer invocation, so a trivial example may not show up.
The main reason not to do that is that it makes it much harder to understand and maintain the code.
Whatever the JVM optimizes, it won't compromise the correctness of the program. If it can't do an optimization because the index is modified inside the loop, then it won't optimize it. I fail to see how a Java test could show if there is or not such an optimization.
Anyway, Hotspot will optimize a whole lot of things for you. And your second example is a kind of explicit optimization that Hotspot will happily do for you.
Before we go into more reasoning about why the field access isn't inlined, maybe we should show that yes, if you know what you're looking for (which really is non-trivial in Java), the field access is inlined just fine.
First we need a basic understanding of how the JIT works - and I really can't cover that in one answer. Suffice it to say that the JIT only kicks in after a function has been called often enough (usually >10,000 times).
So we use the following code for actual testing stuff:
public class Test {
private int length;
public Test() {
length = 10000;
}
public static void main(String[] args) {
for (int i = 0; i < 14000; i++) {
foo();
}
}
public static void foo() {
Test bar = new Test();
int sum = 0;
for (int i = 0; i < bar.getLength(); i++) {
sum += i;
}
System.out.println(sum);
}
public int getLength() {
System.out.print("_");
return length;
}
}
Now we compile this code and run it with java.exe -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Test.foo Test >Test.txt, which results in an unholy long output, but the interesting part is:
0x023de0e7: mov %esi,0x24(%esp)
0x023de0eb: mov %edi,0x28(%esp)
0x023de0ef: mov $0x38fba220,%edx ; {oop(a 'java/lang/Class' = 'java/lang/System')}
0x023de0f4: mov 0x6c(%edx),%ecx ;*getstatic out
; - Test::getLength#0 (line 24)
; - Test::foo#14 (line 17)
0x023de0f7: cmp (%ecx),%eax ;*invokevirtual print
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; implicit exception: dispatches to 0x023de29b
0x023de0f9: mov $0x3900e9d0,%edx ;*invokespecial write
; - java.io.PrintStream::print#9
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; {oop("_")}
0x023de0fe: nop
0x023de0ff: call 0x0238d1c0 ; OopMap{[32]=Oop off=132}
;*invokespecial write
; - java.io.PrintStream::print#9
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; {optimized virtual_call}
0x023de104: mov 0x20(%esp),%eax
0x023de108: mov 0x8(%eax),%ecx ;*getfield length
; - Test::getLength#9 (line 25)
; - Test::foo#14 (line 17)
0x023de10b: mov 0x24(%esp),%esi
0x023de10f: cmp %ecx,%esi
0x023de111: jl 0x023de0d8 ;*if_icmpge
; - Test::foo#17 (line 17)
which is the inner loop we're actually executing. Note that the following 0x023de108: mov 0x8(%eax),%ecx loads the length value into a register - the stuff above it is for the System.out call (I'd have removed it since it makes things more complicated, but since more than one person thought this would hinder inlining, I left it in there). Even if you aren't that fluent in x86 assembly, you can clearly see: no call instruction anywhere except for the native write call.
All,
While going through some of the files in the Java API, I noticed many instances where the looping counter is decremented rather than incremented, e.g. in for and while loops in the String class. Though this might be trivial, is there any significance to decrementing the counter rather than incrementing it?
I've compiled two simple loops with Eclipse 3.6 (Java 6) and looked at the bytecode to see whether there are any differences. Here's the code:
for(int i = 2; i >= 0; i--){}
for(int i = 0; i <= 2; i++){}
And this is the bytecode:
// 1st for loop - decrement 2 -> 0
0 iconst_2
1 istore_1 // i:=2
2 goto 8
5 iinc 1 -1 // i+=(-1)
8 iload_1
9 ifge 5 // if (i >= 0) goto 5
// 2nd for loop - increment 0 -> 2
12 iconst_0
13 istore_1 // i:=0
14 goto 20
17 iinc 1 1 // i+=1
20 iload_1
21 iconst_2
22 if_icmple 17 // if (i <= 2) goto 17
The increment/decrement operation should make no difference; it's either +1 or +(-1). The main difference in this typical(!) example is that in the first case we compare against 0 (ifge), in the second against a constant (if_icmple with 2). And the comparison is done in each iteration. So if there is any (slight) performance gain, I think it's because it's less costly to compare with 0 than with other values. So I guess it's not incrementing/decrementing that makes the difference but the stop criterion.
So if you need to do some micro-optimization at the source code level, try to write your loops in a way that compares with zero; otherwise keep it as readable as possible (and incrementing is much easier to understand):
for (int i = 0; i <= 2; i++) {} // readable
for (int i = -2; i <= 0; i++) {} // micro-optimized and "faster" (hopefully)
Addition
Yesterday I did a very basic test - just created a 2000x2000 array and populated the cells based on calculations with the cell indices, once counting forward from 0->1999 for both rows and columns, and another time backwards from 1999->0. I wasn't surprised that both scenarios had similar performance (185..210 ms on my machine).
So yes, there is a difference at the bytecode level (Eclipse 3.6) but, hey, we're in 2010 now; it doesn't seem to make a significant difference nowadays. So again, using Stephen's words, "don't waste your time" with this kind of optimization. Keep the code readable and understandable.
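For reference, a hedged reconstruction of that 2000x2000 test (the exact per-cell calculation is an assumption; the original only says it used the cell indices):

public class ArrayFillDirection {
    public static void main(String[] args) {
        final int n = 2000;
        long[][] cells = new long[n][n];

        long t0 = System.currentTimeMillis();
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++)
                cells[r][c] = (long) r * c;                 // counting up

        long t1 = System.currentTimeMillis();
        for (int r = n - 1; r >= 0; r--)
            for (int c = n - 1; c >= 0; c--)
                cells[r][c] = (long) r * c;                 // counting down

        long t2 = System.currentTimeMillis();
        System.out.println("up: " + (t1 - t0) + " ms, down: " + (t2 - t1) + " ms");
    }
}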
When in doubt, benchmark.
public class IncDecTest
{
public static void main(String[] av)
{
long up = 0;
long down = 0;
long upStart, upStop;
long downStart, downStop;
long upStart2, upStop2;
long downStart2, downStop2;
upStart = System.currentTimeMillis();
for( long i = 0; i < 100000000; i++ )
{
up++;
}
upStop = System.currentTimeMillis();
downStart = System.currentTimeMillis();
for( long j = 100000000; j > 0; j-- )
{
down++;
}
downStop = System.currentTimeMillis();
upStart2 = System.currentTimeMillis();
for( long k = 0; k < 100000000; k++ )
{
up++;
}
upStop2 = System.currentTimeMillis();
downStart2 = System.currentTimeMillis();
for( long l = 100000000; l > 0; l-- )
{
down++;
}
downStop2 = System.currentTimeMillis();
assert (up == down);
System.out.println( "Up: " + (upStop - upStart));
System.out.println( "Down: " + (downStop - downStart));
System.out.println( "Up2: " + (upStop2 - upStart2));
System.out.println( "Down2: " + (downStop2 - downStart2));
}
}
With the following JVM:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
It has the following output (I ran it multiple times to make sure the JVM was loaded and to make sure the numbers settled down a little).
$ java -ea IncDecTest
Up: 86
Down: 84
Up2: 83
Down2: 84
These all come extremely close to one another and I have a feeling that any discrepancy is a fault of the JVM loading some code at some points and not others, or a background task happening, or simply falling over and getting rounded down on a millisecond boundary.
While at one point (early days of Java) there might have been some performance voodoo to be had, it seems to me that that is no longer the case.
Feel free to try running/modifying the code to see for yourself.
It is possible that this is a result of Sun engineers doing a whole lot of profiling and micro-optimization, and those examples that you found are the result of that. It is also possible that they are the result of Sun engineers "optimizing" based on deep knowledge of the JIT compilers ... or based on shallow / incorrect knowledge / voodoo thinking.
It is possible that these sequences:
are faster than the increment loops,
are no faster or slower than increment loops, or
are slower than increment loops for the latest JVMs, and the code is no longer optimal.
Either way, you should not emulate this practice in your code, unless thorough profiling with the latest JVMs demonstrates that:
your code really will benefit from optimization, and
the decrementing loop really is faster than the incrementing loop for your particular application.
And even then, you may find that your carefully hand optimized code is less than optimal on other platforms ... and that you need to repeat the process all over again.
These days, it is generally recognized that the best first strategy is to write simple code and leave optimization to the JIT compiler. Writing complicated code (such as loops that run in reverse) may actually foil the JIT compiler's attempts to optimize.