Why is the max recursion depth I can reach non-deterministic?


I decided to try a few experiments to see what I could discover about the size of stack frames, and how far through the stack the currently executing code was. There are two interesting questions we might investigate here:
How many levels deep into the stack is the current code?
How many levels of recursion can the current method reach before it hits a StackOverflowError?
Stack depth of currently executing code
Here's the best I could come up with for this:
public static int levelsDeep() {
try {
throw new SomeKindOfException();
} catch (SomeKindOfException e) {
return e.getStackTrace().length;
}
}
This seems a bit hacky. It generates and catches an exception, and then looks to see what the length of the stack trace is.
Unfortunately it also seems to have a fatal limitation, which is that the maximum length of the stack trace returned is 1024. Anything beyond that gets axed, so the maximum that this method can return is 1024.
Question: Is there a better way of doing this that isn't so hacky and doesn't have this limitation?
For what it's worth, my guess is that there isn't: Throwable.getStackTraceDepth() is a native call, which suggests (but doesn't prove) that it can't be done in pure Java.
Determining how much more recursion depth we have left
The number of levels we can reach will be determined by (a) size of stack frame, and (b) amount of stack left. Let's not worry about size of stack frame, and just see how many levels we can reach before we hit a StackOverflowError.
Here's my code for doing this:
public static int stackLeft() {
try {
return 1+stackLeft();
} catch (StackOverflowError e) {
return 0;
}
}
It does its job admirably, even if it's linear in the amount of stack remaining. But here is the very, very weird part. On 64-bit Java 7 (OpenJDK 1.7.0_65), the results are perfectly consistent: 9,923, on my machine (Ubuntu 14.04 64-bit). But Oracle's Java 8 (1.8.0_25) gives me non-deterministic results: I get a recorded depth of anywhere between about 18,500 and 20,700.
Now why on earth would it be non-deterministic? There's supposed to be a fixed stack size, isn't there? And all of the code looks deterministic to me.
I wondered whether it was something weird with the error trapping, so I tried this instead:
public static long badSum(int n) {
if (n==0)
return 0;
else
return 1+badSum(n-1);
}
Clearly this will either return the input it was given, or overflow.
Again, the results I get are non-deterministic on Java 8. If I call badSum(14500), it gives me a StackOverflowError about half the time and returns 14500 the other half. But on Java 7 OpenJDK it's consistent: badSum(9160) completes fine, and badSum(9161) overflows.
Question: Why is the maximum recursion depth non-deterministic on Oracle's Java 8? And why is it deterministic on OpenJDK 7?

The observed behavior is affected by the HotSpot optimizer, but that is not the only cause. When I run the following code
public static void main(String[] argv) {
System.out.println(System.getProperty("java.version"));
System.out.println(countDepth());
System.out.println(countDepth());
System.out.println(countDepth());
System.out.println(countDepth());
System.out.println(countDepth());
System.out.println(countDepth());
System.out.println(countDepth());
}
static int countDepth() {
try { return 1+countDepth(); }
catch(StackOverflowError err) { return 0; }
}
with JIT enabled, I get results like:
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2097
4195
4195
4195
12587
12587
12587
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2095
4193
4193
4193
12579
12579
12579
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2087
4177
4177
12529
12529
12529
12529
Here, the effect of the JIT is clearly visible: the optimized code evidently needs less stack space. The two jumps also show that tiered compilation is enabled (indeed, using -XX:-TieredCompilation shows a single jump if the program runs long enough).
In contrast, with disabled JIT I get the following results:
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2104
2104
2104
2104
2104
2104
2104
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2076
2076
2076
2076
2076
2076
2076
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2105
2105
2105
2105
2105
2105
2105
The values still vary between runs, but not within a single run, and by a smaller magnitude.
So, there is a (rather small) difference that becomes much larger if the optimizer can reduce the stack space required per method invocation, e.g. due to inlining.
What can cause such a difference? I don't know how this JVM does it, but one scenario could be that the way the stack limit is enforced requires a certain alignment of the stack end address (e.g. matching memory page sizes), while the memory allocation returns memory whose start address has a weaker alignment guarantee. Combine such a scenario with ASLR and there might always be a difference, bounded by the size of the alignment requirement.

Why is the maximum recursion depth non-deterministic on Oracle's Java 8? And why is it deterministic on OpenJDK 7?
This possibly relates to changes in garbage collection: Java can choose a different GC mode on each run. http://vaskoz.wordpress.com/2013/08/23/java-8-garbage-collectors/

It's deprecated, but you could try Thread.countStackFrames() like
public static int levelsDeep() {
return Thread.currentThread().countStackFrames();
}
Per the Javadoc,
Deprecated. The definition of this call depends on suspend(), which is deprecated. Further, the results of this call were never well-defined.
Counts the number of stack frames in this thread. The thread must be suspended.
As for why you observe non-deterministic behaviour, I can only assume it is some combination of the JIT and garbage collector.

You don't need to catch the exception, just create one like this:
new Throwable().getStackTrace()
Or:
Thread.currentThread().getStackTrace()
It's still hacky, as the result is JVM implementation specific, and the JVM may decide to trim the result for better performance.
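To make the suggestion concrete, here is a minimal sketch that reads the trace length at two recursion depths and compares them. The class and method names are made up for illustration, and the usual caveat from the question still applies: the JVM may truncate long traces, so this is only meaningful for shallow stacks.

```java
final class StackTraceDepthDemo {
    // Number of frames visible in a freshly created Throwable's stack trace.
    static int levelsDeep() {
        return new Throwable().getStackTrace().length;
    }

    // Recurse `extra` levels, then report the depth seen from there.
    static int depthAfter(int extra) {
        return extra == 0 ? levelsDeep() : depthAfter(extra - 1);
    }

    public static void main(String[] args) {
        int base = depthAfter(0);
        System.out.println("base depth: " + base);
        System.out.println("10 levels deeper adds: " + (depthAfter(10) - base));
    }
}
```

Comparing two readings taken from the same caller avoids having to know how many frames the launcher itself contributes.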

Related

How does a large -Xss setting affect server performance? [closed]

I have a Java server that reads a large serialised file at startup. This requires a large -Xss setting solely for the main thread at startup. All threads that handle server requests require much less stack space. (Xss is 20M).
How will this (continually increasing) Xss value affect the server's memory usage?

The answer is complicated. You're also asking the wrong question - make sure you read the entire answer all the way to the end.
Answering your question: How bad is large -Xss?
The amount of RAM a JVM needs is, basically, heap+metaspace+MAX_THREADS*STACK_SIZE.
Heap is simple: That's what the -Xmx parameter is for. metaspace is a more or less constant (I'm oversimplifying) and not particularly large amount.
Furthermore, assume the usual server setup, where the JVM gets a static amount of memory: it's a server, it has a set amount of RAM, and the best option, usually, is to spend it all. Give every major process running on the system a locked-in, configured amount of RAM. If the JVM is the only major software running on there (e.g. there is a database involved but it runs on another machine), and you have 8GB in the box, then give the JVM ~7GB. Why wouldn't you? Use -Xmx and -Xms, set to the same value, and make it large. If postgres is also running on it, give the JVM perhaps 3GB and postgres 4GB (depends on how db-heavy your app is, of course). Et cetera.
The point is, if you have both a large stacksize and a decently large max threads, let's say an -Xss of 20MB and max-threads of 100, then you lose 2GB of your allocated 7: On a box with 8GB installed and only the JVM as major consumer of resources, this setting:
java -Xmx7g -Xms7g -Xss20m
would be completely wrong and cause all sorts of trouble - that adds up to 9 GB, and I haven't even started accounting for metaspace yet, or the needs of the kernel. The box doesn't have that much! Instead you should be doing perhaps -Xmx5g -Xms5g -Xss20m.
And now you know what the performance cost is of this: The cost is having to reduce your -Xmx -Xms value from 7 to 5. It gets disastrously worse if you had to knock it down from 3 to 1 because it's a 4GB box - at that point what you're doing is basically impossible unless you first launch a new server with more ram in it.
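The arithmetic above can be sketched in a few lines. This is only the rough heap + threads * stack estimate from this answer; metaspace and kernel needs are left out, and the class and method names are invented for the example.

```java
final class JvmMemoryBudget {
    // Rough estimate in MB: heap + maxThreads * stack size.
    // Metaspace and kernel needs are deliberately ignored here.
    static long estimateMb(long heapMb, int maxThreads, long stackMb) {
        return heapMb + (long) maxThreads * stackMb;
    }

    public static void main(String[] args) {
        long boxMb = 8 * 1024; // the 8GB box from the example
        System.out.println("-Xmx7g, 100 threads, -Xss20m: ~"
                + estimateMb(7 * 1024, 100, 20) + " MB of " + boxMb + " MB");
        System.out.println("-Xmx5g, 100 threads, -Xss20m: ~"
                + estimateMb(5 * 1024, 100, 20) + " MB of " + boxMb + " MB");
    }
}
```

The first configuration lands above what the box has; the second leaves some headroom for everything the estimate ignores.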
Actually helping you solve your problem
Forget about all of the above, that's the wrong way to solve this problem. Keep your -Xss nice and low, or don't set it.
Instead, take your init code and isolate it, then run it in a separately set up thread, and just .join() on that thread to wait for it to complete. join() also establishes the happens-before relationship needed for the fields your init code modified to be visible afterwards. Use this thread constructor:
Runnable initCode = () -> {
// your init stuff goes here
};
ThreadGroup tg = Thread.currentThread().getThreadGroup();
Thread initThread = new Thread(tg, initCode, "init", 20L * 1024L * 1024L);
initThread.start();
initThread.join();
But, do some research first. The API of Thread is horribly designed and makes all sorts of grave errors. In particular, the stack size number (20MB here) is just a hint and the javadoc says any VM is free to just completely ignore it. Good API design would have of course specced that an exception is thrown instead if your requested stacksize is not doable by the VM.
I've done a quick test; adoptopenjdk 11 on a mac seems to have no problem with it.
Here's my test setup:
> cat Test.java
public class Test {
public static void main(String[] args) throws Exception {
Runnable r = () -> {
System.out.println("Allowed stack depth: " + measure());
};
r.run();
r.run();
Thread t = new Thread(Thread.currentThread().getThreadGroup(), r, "init", 1024L * 1024L);
t.start();
t.join();
r.run();
}
public static int measure() {
int min = 1;
int max = 50000;
while (min < max) {
int mid = (max + min) / 2;
try {
attempt(mid);
if (min == mid) return min;
min = mid;
} catch (StackOverflowError e) {
max = mid;
}
}
return min;
}
public static void attempt(int depth) {
if (depth == 0) return;
attempt(depth - 1);
}
}
> javac Test.java; java -Xss200k Test
Allowed stack depth: 2733
Allowed stack depth: 6549
Allowed stack depth: 49999
Allowed stack depth: 6549
You can't check the size of the stack trace, as the JVM has a hard limit and won't store more than 1024 stack trace elements, thus the binary search for the answer.
I can't quite explain why the value isn't constant (it hops from 2733 to 6549), or, for a real What The Heck????, why an -Xss of 150k produces higher numbers. I'll ask a question about that right after posting this answer. But it does show that the thread that was made with a larger stack does indeed let you have a far deeper method callstack.
Run this test code on the target environment with the target JDK just to be sure it'll work, and then you have your actual solution :)

Why does the count of calls of a recursive method causing a StackOverflowError vary between program runs? [duplicate]

A simple class for demonstration purposes:
public class Main {
private static int counter = 0;
public static void main(String[] args) {
try {
f();
} catch (StackOverflowError e) {
System.out.println(counter);
}
}
private static void f() {
counter++;
f();
}
}
I executed the above program 5 times, the results are:
22025
22117
15234
21993
21430
Why are the results different each time?
I tried setting the max stack size (for example -Xss256k). The results were then a bit more consistent but again not equal each time.
Java version:
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
EDIT
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.

The observed variance is caused by background JIT compilation.
This is what the process looks like:
Method f() starts execution in interpreter.
After a number of invocations (around 250) the method is scheduled for compilation.
The compiler thread works in parallel to the application thread. Meanwhile the method continues execution in interpreter.
As soon as the compiler thread finishes compilation, the method entry point is replaced, so the next call to f() will invoke the compiled version of the method.
There is basically a race between the application thread and the JIT compiler thread. The interpreter may perform a different number of calls before the compiled version of the method is ready. At the end there is a mix of interpreted and compiled frames.
No wonder that the compiled frame layout differs from the interpreted one. Compiled frames are usually smaller; they don't need to store all the execution context on the stack (method reference, constant pool reference, profiler data, all arguments, expression variables etc.)
Furthermore, there are even more race possibilities with Tiered Compilation (default since JDK 8). There can be a combination of 3 types of frames: interpreter, C1 and C2 (see below).
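To see why the moment of the switch matters so much, here is a toy model of the race. It is only a sketch: the stack size and both frame sizes are made-up illustrative constants, not values measured on any JVM, and it ignores C1 versus C2 entirely.

```java
final class MixedFrameModel {
    // All three constants are illustrative assumptions, not measurements.
    static final long STACK_BYTES = 1024L * 1024L; // assumed usable stack
    static final long INTERPRETED_FRAME = 88;      // assumed bytes per interpreted frame
    static final long COMPILED_FRAME = 44;         // assumed bytes per compiled frame

    // Depth reached if the first `interpretedCalls` frames are interpreted
    // and every later frame is compiled.
    static long maxDepth(long interpretedCalls) {
        long remaining = STACK_BYTES - interpretedCalls * INTERPRETED_FRAME;
        return interpretedCalls + remaining / COMPILED_FRAME;
    }

    public static void main(String[] args) {
        for (long k = 0; k <= 3000; k += 1000) {
            System.out.println(k + " interpreted frames -> depth " + maxDepth(k));
        }
    }
}
```

In this model every extra interpreted frame costs stack that two compiled frames could have used, so a compiler thread that finishes a little later directly lowers the final count.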
Let's have some fun experiments to support the theory.
Pure interpreted mode. No JIT compilation.
No races => stable results.
$ java -Xint Main
11895
11895
11895
Disable background compilation. JIT is ON, but is synchronized with the application thread.
No races again, but the number of calls is now higher due to compiled frames.
$ java -XX:-BackgroundCompilation Main
23462
23462
23462
Compile everything with C1 before execution. Unlike previous case there will be no interpreted frames on the stack, so the number will be a bit higher.
$ java -Xcomp -XX:TieredStopAtLevel=1 Main
23720
23720
23720
Now compile everything with C2 before execution. This will produce the most optimized code with the smallest frame. The number of calls will be the highest.
$ java -Xcomp -XX:-TieredCompilation Main
59300
59300
59300
Since the default stack size is 1M, this should mean the frame now is only 16 bytes long. Is it?
$ java -Xcomp -XX:-TieredCompilation -XX:CompileCommand=print,Main.f Main
0x00000000025ab460: mov %eax,-0x6000(%rsp) ; StackOverflow check
0x00000000025ab467: push %rbp ; frame link
0x00000000025ab468: sub $0x10,%rsp
0x00000000025ab46c: movabs $0xd7726ef0,%r10 ; r10 = Main.class
0x00000000025ab476: addl $0x2,0x68(%r10) ; Main.counter += 2
0x00000000025ab47b: callq 0x00000000023c6620 ; invokestatic f()
0x00000000025ab480: add $0x10,%rsp
0x00000000025ab484: pop %rbp ; pop frame
0x00000000025ab485: test %eax,-0x23bb48b(%rip) ; safepoint poll
0x00000000025ab48b: retq
In fact, the frame here is 32 bytes, but JIT has inlined one level of recursion.
Finally, let's look at the mixed stack trace. In order to get it, we'll crash JVM on StackOverflowError (option available in debug builds).
$ java -XX:AbortVMOnException=java.lang.StackOverflowError Main
The crash dump hs_err_pid.log contains the detailed stack trace where we can find interpreted frames at the bottom, C1 frames in the middle and lastly C2 frames on the top.
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5958 [0x00007f21251a5900+0x0000000000000058]
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
// ... repeated 19787 times ...
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
// ... repeated 1866 times ...
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
j Main.f()V+8
j Main.f()V+8
// ... repeated 1839 times ...
j Main.f()V+8
j Main.main([Ljava/lang/String;)V+0
v ~StubRoutines::call_stub

First of all, the following has not been researched. I have not "deep dived" the OpenJDK source code to validate any of the following, and I don't have access to any inside knowledge.
I tried to validate your results by running your test on my machine:
$ java -version
openjdk version "1.8.0_71"
OpenJDK Runtime Environment (build 1.8.0_71-b15)
OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)
I get the "count" varying over a range of ~250. (Not as much as you are seeing)
First some background. A thread stack in a typical Java implementation is a contiguous region of memory that is allocated before the thread is started, and that is never grown or moved. A stack overflow happens when the JVM tries to create a stack frame to make a method call, and the frame goes beyond the limits of the memory region. The test could be done by testing the SP explicitly, but my understanding is that it is normally implemented using a clever trick with the memory page settings.
When a stack region is allocated, the JVM makes a syscall to tell the OS to mark a "red zone" page at the end of the stack region read-only or non-accessible. When a thread makes a call that overflows the stack, it accesses memory in the "red zone" which triggers a memory fault. The OS tells the JVM via a "signal", and the JVM's signal handler maps it to a StackOverflowError that is "thrown" on the thread's stack.
So here are a couple of possible explanations for the variability:
The granularity of hardware-based memory protection is the page boundary. So if the thread stack has been allocated using malloc, the start of the region is not going to be page aligned. Therefore the distance from the start of the stack frame to the first word of the "red zone" (which >is< page aligned) is going to be variable.
The "main" stack is potentially special, because that region may be used while the JVM is bootstrapping. That might lead to some "stuff" being left on the stack from before main was called. (This is not convincing ... and I'm not convinced.)
Having said this, the "large" variability that you are seeing is baffling. Page sizes are too small to explain a difference of ~7000 in the counts.
UPDATE
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
Interesting. Among other things, that could cause stack limit checking to be done differently.
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Plausible. The size of the stack frame could well be different after the f() method has been JIT compiled. Assuming f() was JIT compiled at some point, your stack will have a mixture of "old" and "new" frames. If the JIT compilation occurred at different points, then the ratio will be different ... and hence the count will be different when you hit the limit.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
Little chance of that, I'm afraid ... unless you are prepared to PAY someone to do a few days research for you.
1) No such (public) reference documentation exists, AFAIK. At least, I've never been able to find a definitive source for this kind of thing ... apart from deep diving the source code.
2) Looking at the JIT compiled code tells you nothing of how the bytecode interpreter handled things before the code was JIT compiled. So you won't be able to see if the frame size has changed.

The exact functioning of the Java stack is undocumented, but it depends entirely on the memory allocated to that thread.
Just try using the Thread constructor with a stack size and see if the result becomes constant. I have not tried it, so please share the results.
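For anyone who wants to try that suggestion, a minimal sketch follows. The class name, the probe() helper and the 512 KB figure are arbitrary choices for the example, and the Thread javadoc allows a JVM to treat the requested stack size as a hint only, so no particular output is promised.

```java
final class FixedStackProbe {
    // Same recursion counter as in the question.
    static int countDepth() {
        try {
            return 1 + countDepth();
        } catch (StackOverflowError e) {
            return 0;
        }
    }

    // Runs countDepth() on a new thread that requests `stackBytes` of stack.
    static int probe(long stackBytes) {
        int[] result = new int[1];
        Thread t = new Thread(null, () -> result[0] = countDepth(), "probe", stackBytes);
        t.start();
        try {
            t.join(); // join() makes the write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while probing", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(probe(512 * 1024));
        }
    }
}
```

Running it once normally and once with -Xint should show whether any remaining variation comes from the stack size or from JIT compilation.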

Tuning Java 7 to match performance of Java 6

We have a simple unit test as part of our performance test suite which we use to verify that the base system is sane and performs before we even start testing our code. This way we usually verify that a machine is suitable for running actual performance tests.
When we compare Java 6 and Java 7 using this test, Java 7 takes considerably longer to execute! We see an average of 22 seconds for Java 6 and 24 seconds for Java 7. The test only computes fibonacci, so only bytecode execution in a single thread should be relevant here and not I/O or anything else.
Currently we run it with default settings on Windows with or without "-server", with both 32 and 64 bit JVM, all runs indicate a similar degradation for Java 7.
Which tuning options may be suitable here to try to match Java 7 against Java 6?
import org.junit.Before;
import org.junit.Test;
public class BaseLinePerformance {
@Before
public void setup() throws Exception{
fib(46);
}
@Test
public void testBaseLine() throws Exception {
long start = System.currentTimeMillis();
fib(46);
fib(46);
System.out.println("Time: " + (System.currentTimeMillis() - start));
}
public static void fib(final int n) throws Exception {
for (int i = 0; i < n; i++) {
System.out.println("fib(" + i + ") = " + fib2(i));
}
}
public static int fib2(final int n) {
if (n == 0)
return 0;
else if (n == 1)
return 1;
else
return fib2(n - 2) + fib2(n - 1);
}
}
Update: I have reduced the test to not do any sleeps and followed the other suggestions from How do I write a correct micro-benchmark in Java?, I still see the same difference between Java 7 and Java 6, additional JVM options to print compilation and GC do not show any output during the actual test, only initially compilation information is printed.

One of my colleagues found out the reason for this after a bit more digging:
There is a JVM flag -XX:MaxRecursiveInlineLevel which has a default value of 1. It seems the handling of this setting was slightly incorrect in previous versions, so Sun/Oracle "fixed" this in Java 7, however it has the side-effect that sometimes the inlining now is done less aggressively and thus pure runtime/CPU time of recursive code can be longer than before.
We are testing setting it to 2 to get the same behavior as in Java 6 at least for the test in question.

This is not an easy answer, there are plenty of things that can account for those 2 seconds.
I am assuming from your comments that you are already familiar with micro benchmarking, that your benchmark runs after the JVM has warmed up (so your code has reached an optimized JIT state and no GCs are happening), and that your hardware setup has not changed.
I would recommend CPU profiling your benchmark, that will help you identify where those two seconds are being accounted and perhaps act accordingly.
If you are curious about the bytecode you can take a peek at it.
To do this, compile your class and run javap -c ClassName with both JDKs. This disassembles the class file bytecode and shows it to you, so you can compare the two compiled classes for differences.
In conclusion, profile your application and tune it according to the data to get back to 22 seconds; there is nothing you can do about the bytecode implementation anyway.

Stack performance in programming languages

Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is mainly the same in all languages; I'll post a Java version:
public class Fib {
public static int fib(int n) {
if (n < 2) return 1;
return fib(n-1) + fib(n-2);
}
public static void main(String[] args) {
System.out.println(fib(Integer.valueOf(args[0])));
}
}
Ok so the point is that using this algorithm with input 40 I got these timings:
C: 2.796s
Ocaml: 2.372s
Python: 106.407s
Java: 1.336s
C#(mono): 2.956s
They are taken in a Ubuntu 10.04 box using the versions of each language available in the official repositories, on a dual core intel machine.
I know that functional languages like OCaml have the slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time, because it's the only interpreted language in this test. But I was impressed by the Java running time, which is half that of C for the same algorithm! Would you attribute this to the JIT compilation?
How would you explain these results?
EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)
EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)

You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.
Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.
In addition, make sure you're measuring CPU time rather than clock time.
Anything less than that, I would not consider a decent statistical analysis and it may well be subject to noise, rendering your results useless.
For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
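For the Java side of the comparison, the same procedure can be approximated in-process. This is a sketch, not a proper benchmark harness: the class name and the fib(30) workload are arbitrary, and getCurrentThreadCpuTime() is only usable on JVMs that support thread CPU time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

final class TrimmedTiming {
    static int fib(int n) {
        return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
    }

    // Mean of the samples after dropping the single smallest and the single
    // largest value. Needs at least three samples.
    static double trimmedMean(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        long sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return (double) sum / (sorted.length - 2);
    }

    public static void main(String[] args) {
        ThreadMXBean cpu = ManagementFactory.getThreadMXBean();
        long[] nanos = new long[10];
        long sink = 0; // keeps the results live so the calls are not dead code
        for (int i = 0; i < nanos.length; i++) {
            long start = cpu.getCurrentThreadCpuTime();
            sink += fib(30);
            nanos[i] = cpu.getCurrentThreadCpuTime() - start;
        }
        System.out.printf("trimmed mean CPU time: %.3f ms (sink=%d)%n",
                trimmedMean(nanos) / 1e6, sink);
    }
}
```

Ten runs with the fastest and slowest removed leave eight samples, matching the procedure described above.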
In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fault that you duplicate a lot of calculations. The iterative version:
int fib(unsigned int n) {
int t, a, b;
if (n < 2) return 1;
a = b = 1;
while (n-- >= 2) {
t = a + b;
a = b;
b = t;
}
return b;
}
further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of about 90% and a whopping reduction of 96% from the original code.
And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:
static int fib(unsigned int n) {
static int lookup[] = {
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
46368, 75025, 121393, 196418, 317811, 514229, 832040,
1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
24157817, 39088169, 63245986, 102334155, 165580141 };
int t, a, b;
if (n < sizeof(lookup)/sizeof(*lookup))
return lookup[n];
a = lookup[sizeof(lookup)/sizeof(*lookup)-2];
b = lookup[sizeof(lookup)/sizeof(*lookup)-1];
while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
t = a + b;
a = b;
b = t;
}
return b;
}
That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.

You say very little about your configuration (in benchmarking, details are everything: commandlines, computer used, ...)
When I try to reproduce for OCaml I get:
let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))
let () = Format.printf "%d@." (f 40)
$ ocamlopt fib.ml
$ time ./a.out
165580141
real 0m1.643s
This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).

One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.
Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.
You may be able to see a difference by inverting the logic of the if:
public static int fib(int n) {
if (n >= 2) return fib(n-1) + fib(n-2);
return 1;
}
It's worth a try, anyway :)
As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)

I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...
As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...

Note that if the Java Hotspot VM is smart enough to memoise fib() calls, it can cut down the exponential cost of the algorithm to something nearer to linear cost.
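To illustrate what memoisation means here, below is a hand-written sketch. It says nothing about whether HotSpot actually does this; it only shows how caching results makes each fib(k) get computed once. The class name and the array-based cache are choices made for the example.

```java
final class MemoFib {
    // Same convention as the question: fib(0) = fib(1) = 1.
    // memo[k] == 0 means "not computed yet".
    static long fib(int n, long[] memo) {
        if (n < 2) return 1;
        if (memo[n] != 0) return memo[n];
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo);
        return memo[n];
    }

    static long fib(int n) {
        return fib(n, new long[Math.max(n + 1, 1)]);
    }

    public static void main(String[] args) {
        System.out.println(fib(40));
    }
}
```

With the cache, the number of additions grows linearly with n instead of exponentially.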

I wrote a C version of the naive Fibonacci function and compiled it to assembler in gcc (4.3.2 Linux). I then compiled it with gcc -O3.
The unoptimised version was 34 lines long and looked like a straight translation of the C code.
The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.

With C, you should either declare the fibonacci function "inline", or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why with -O3 you get better performance, it activates -finline-functions.
Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, when I tested it myself, declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, made my recursive fibonacci code faster than with -O2 or -O2 -finline-functions.

One C trick which you can try is to disable the stack checking (i.e. built-in code which makes sure that the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, which forces the stack-checking to reallocate the entire stack several times during the actual run.
Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.

C++ and Java performance

This question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
void testvector(int x)
{
vector<string> v;
char aux[20];
int a = x * 2000;
int z = a + 2000;
string s("X-");
for (int i = a; i < z; i++)
{
sprintf(aux, "%d", i);
v.push_back(s + aux);
}
}
int main()
{
for (int i = 0; i < 10000; i++)
{
if (i % 1000 == 0) cout << i << endl;
testvector(i);
}
}
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
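The Java version is not shown in the question. A minimal sketch of what an equivalent might look like, assuming ArrayList<String> and plain string concatenation (the class and method names are hypothetical, and returning the list is an addition so the result can be inspected):

```java
import java.util.ArrayList;
import java.util.List;

final class TestVector {
    // Builds 2000 strings "X-<i>" for i in [x * 2000, x * 2000 + 2000).
    // Unlike the C++ version, the list is returned rather than discarded.
    static List<String> testVector(int x) {
        List<String> v = new ArrayList<>();
        int a = x * 2000;
        int z = a + 2000;
        for (int i = a; i < z; i++) {
            v.add("X-" + i);
        }
        return v;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 10000; i++) {
            if (i % 1000 == 0) System.out.println(i);
            total += testVector(i).size();
        }
        System.out.println("strings built: " + total);
    }
}
```

Summing the list sizes gives the loop an observable result, which makes it harder for the JIT to discard the work entirely.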
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think that should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performances in this stuff? (writing actual code of these things can be very long, so, discussing about it would be great)...
Thank you.

You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically, a good modern generational compacting GC is far better optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, so every block is moderately expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation), and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
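To make the "nearly as cheap as stack allocation" point concrete, here is a rough C++ sketch of bump-pointer allocation, the general technique that makes young-generation allocation so cheap. BumpArena and its members are hypothetical names invented for this illustration; it is not how any particular JVM or CLR is implemented, and it assumes the underlying buffer starts at an address aligned at least as strictly as the requested alignment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration: a fixed-size arena that allocates by advancing
// an offset and frees everything at once, instead of tracking each block.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns nullptr when the arena is exhausted.
    // `alignment` must be a power of two; it is applied relative to the
    // start of the buffer (assumed to be suitably aligned itself).
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > buffer_.size() || size > buffer_.size() - aligned) {
            return nullptr;
        }
        offset_ = aligned + size;          // the whole cost of an allocation
        return buffer_.data() + aligned;
    }

    // "Collecting" short-lived blocks is a single assignment.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

A general-purpose malloc/free has to find a suitable free block and later return it to a free list, which is where the per-block cost described above comes from.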

All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!

Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
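#include <boost/pool/pool_alloc.hpp>
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
typedef vector<string, boost::pool_allocator<string> > str_vector;
// Overwrites the elements in [it, end) instead of building a new vector on every call.
void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
        sprintf(aux+2, "%d", a);   // format the number directly after the "X-" prefix
        *it++ = aux;
    }
}
int main(int argc, char** argv)
{
    str_vector v(2000);            // allocated once and reused by every call
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i, v.begin(), v.begin()+2000);
    }
    return 0;
}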
real 0m4.089s
user 0m3.686s
sys 0m0.000s
The Java version has these times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct Java port of your original program, except that it passes the ArrayList as a parameter to cut down on useless allocations.)
So, to sum up, small allocations are faster in Java, and memory management is a bit more hassle in C++. But we knew that already :)

HotSpot optimises hot spots in code. Typically, it tries to optimise anything that gets executed 10,000 times.
For this code, that means that after 5 calls to the method (5 × 2000 = 10,000 passes through the inner loop) it will try to optimise the inner loop that adds the strings to the vector. The optimisation will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that HotSpot will remove all of the code in the method and turn it into a no-op. To test this, try returning the results from the method. Even then, be careful to do something meaningful with the result: just getting its length, for example, can be optimised away, as HotSpot can see that the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like HotSpot: using runtime analysis, it can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is: not executing any code is always going to be faster.
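The "return the results and do something meaningful with them" advice applies to the C++ side of the benchmark as well. Here is a minimal sketch, assuming a renamed variant of the question's function (testvector_checked is a hypothetical name) that returns a value depending on the contents of every string, not just on the element count.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Variant of the question's testvector that makes its work observable:
// the return value depends on the contents of every string built, so it
// cannot be derived from the loop count alone.
std::size_t testvector_checked(int x) {
    std::vector<std::string> v;
    char aux[20];
    const int a = x * 2000;
    const int z = a + 2000;
    const std::string s("X-");
    for (int i = a; i < z; i++) {
        std::snprintf(aux, sizeof aux, "%d", i);
        v.push_back(s + aux);
    }
    std::size_t total = 0;
    for (const std::string& item : v) {
        total += item.size();   // sum of the lengths of all 2000 strings
    }
    return total;
}
```

The caller would then accumulate these totals across all 10,000 calls and print the sum at the end, so that neither compiler can treat the calls as dead code.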

On my box, this program runs in approx. 12 seconds; amazingly, a similar implementation in Java (using String and ArrayList) runs a lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I've modified the code so that both the Java and the C++ versions print the last element of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward; it uses ArrayList instead of Vector, although I doubt that this would affect performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

It should help to call vector::reserve to reserve space for the 2000 (z - a) elements in v before the loop. (The same thing should also speed up the Java equivalent of this code.)
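A small sketch of what reserving buys you: the hypothetical helper below (count_reallocations is not from the original posts) counts how many times the vector's capacity changes while n strings are appended, with and without a prior reserve call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts how often the vector's capacity changes while n strings are
// appended. Each change means a new buffer was allocated and the existing
// elements were moved or copied into it.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<std::string> v;
    if (reserve_first) {
        v.reserve(n);   // one allocation up front
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back("X-");
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set, the count is zero for any n; without it, the exact number depends on the standard library's growth factor. Note that this only removes the vector's own reallocations, not the per-string allocations that dominate this benchmark.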

To suggest why the C++ and Java performance differ, it would be essential to see the source for both. I can see a number of performance issues in the C++; for some of them it would be useful to know whether you were doing the same in the Java. For example, the C++ flushes the output stream via std::endl: do you call System.out.flush(), or just append a '\n'? If the latter, then you've just given the Java a distinct advantage.
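The difference between std::endl and '\n' can be made visible with a stream buffer that counts flush requests. This is only a sketch: CountingBuf and count_flushes are hypothetical names, and an in-memory buffer is used instead of the real standard output.

```cpp
#include <ostream>
#include <sstream>

// Hypothetical helper: a string buffer that counts how often the stream
// asks it to flush (std::ostream::flush ends up calling sync()).
class CountingBuf : public std::stringbuf {
public:
    int flushes = 0;

protected:
    int sync() override {
        ++flushes;
        return std::stringbuf::sync();
    }
};

// Writes `lines` numbers to a stream backed by CountingBuf and returns how
// many flushes that caused.
int count_flushes(int lines, bool use_endl) {
    CountingBuf buf;
    std::ostream out(&buf);
    for (int i = 0; i < lines; ++i) {
        if (use_endl) {
            out << i << std::endl;   // newline plus an explicit flush
        } else {
            out << i << '\n';        // newline only
        }
    }
    return buf.flushes;
}
```

In the question's program only ten lines are printed, so the flushes are unlikely to explain a ten-second gap, but the point about comparing like with like stands.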

What are you actually trying to measure here? Putting ints into a vector?
You can start by pre-allocating space in the vector, since its size is known.
Instead of:
void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
try:
void testvector(int x)
{
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    vector<string> v;
    v.reserve(z - a);   // room for exactly the 2000 elements pushed below
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}

In your inner loop, you are pushing ints into a string vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of the time goes into allocating and formatting the strings, and then some goes into the push_back (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.
