general concept-java code and cycle clocks


I am just curious: how can we know how many clock cycles the CPU needs by looking through some Java code?
For example:
public class Factorial
{
public static void main(String[] args)
{ final int NUM_FACTS = 100;
for(int i = 0; i < NUM_FACTS; i++)
System.out.println( i + "! is " + factorial(i));
}
public static int factorial(int n)
{ int result = 1;
for(int i = 2; i <= n; i++)
result *= i;
return result;
}
}

I am just curious: how can we know how many clock cycles the CPU needs by looking through some Java code?
If you are talking about real hardware clock cycles, the answer is "You can't know" [1].
The reason that it is so hard is that a program goes through a number of complicated (and largely opaque) transformations before and during execution:
The source code is compiled to bytecodes ahead of time. This depends on the bytecode compiler used.
The bytecodes are JIT compiled to native code, at some time during the execution. This depends on the JIT compiler in the execution platform AND on the execution behavior of the application.
The number of clock cycles taken to execute a given instruction sequence depends on native code, the CPU model including such things as memory cache sizes and ... the application's memory access patterns.
On top of that, the JVM has various sources of "under the hood" non-determinism, and various launch-time tuning parameters that influence the behavior ... and cycle counts.
But fortunately, there are practical ways to examine software performance that don't depend on measuring hardware clock cycles. You can:
measure or estimate native instructions executed,
measure or estimate bytecodes executed,
estimate Java-level operations or statements executed, or
run the code and measure the time taken.
The last two are generally the most practical.
[1] ... except by running the application / JVM on an accurate hardware-level simulator for your exact hardware configuration and getting the simulator to count the clock cycles. And to be honest, I don't know if simulators that operate to that level actually exist. If they do, they are most likely proprietary to Intel, AMD and so on.
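As a rough illustration of the last two options (counting Java-level operations and timing a run), here is a minimal sketch based on the factorial loop from the question. The class name FactorialOps and the multiplications counter are made up for this example, and long is used instead of int only to delay overflow.

```java
public class FactorialOps {
    // Number of multiplications performed by the most recent call to factorial().
    static long multiplications;

    // Same loop as in the question, with a counter added.
    // Assumption: long instead of int; this only delays overflow (20! still fits).
    static long factorial(int n) {
        multiplications = 0;
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
            multiplications++;
        }
        return result;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long value = factorial(20);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("20! = " + value);
        System.out.println("multiplications: " + multiplications);
        // The elapsed time varies from run to run and from machine to machine.
        System.out.println("elapsed ns: " + elapsedNanos);
    }
}
```

The operation count is exact and repeatable (n - 1 multiplications for n >= 2), while the elapsed time is only a noisy sample; neither of them is a hardware cycle count.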

I don't think you'd be able to know the clock cycles.
But you could measure the CPU time it took to run the code.
You'd need to use the java.lang.management API.
Take a look here:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/ThreadMXBean.html
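A minimal sketch of that approach, assuming the JVM supports per-thread CPU time (the code checks for it). ThreadMXBean and ManagementFactory are the real java.lang.management classes; CpuTimeDemo, busyWork and cpuNanos are names invented for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    // Written to so that the loop below has an observable result.
    static volatile long sink;

    // Some arbitrary CPU-bound work to measure.
    static void busyWork() {
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += i % 7;
        }
        sink = sum;
    }

    // CPU time in nanoseconds used by the current thread while running the task,
    // or -1 if this JVM does not support measuring it for the current thread.
    static long cpuNanos(Runnable task) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            return -1;
        }
        long before = bean.getCurrentThreadCpuTime();
        task.run();
        return bean.getCurrentThreadCpuTime() - before;
    }

    public static void main(String[] args) {
        long wallStart = System.nanoTime();
        long cpu = cpuNanos(CpuTimeDemo::busyWork);
        long wall = System.nanoTime() - wallStart;
        System.out.println("CPU time ns:  " + cpu);
        System.out.println("wall time ns: " + wall);
    }
}
```

CPU time excludes the periods when the thread was not scheduled, so it is usually less sensitive to other load on the machine than wall-clock time. It is still a time measurement, not a cycle count.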

Related

java for loop performance difference

I am running the simple program below. I know this is not the best way to measure performance, but the results surprised me, hence I wanted to post the question here.
public class findFirstTest {
public static void main(String[] args) {
for(int q=0;q<10;q++) {
long start2 = System.currentTimeMillis();
int k = 0;
for (int j = 0; j < 5000000; j++) {
if (j > 4500000) {
k = j;
break;
}
}
System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
}
}
}
The results look like this after running the code multiple times:
for value 4500001 with time 3
for value 4500001 with time 25 ( surprised as it took 25 ms in 2nd iteration)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
So I don't understand why the 2nd iteration took 25 ms while the 1st took 3 ms and the later ones 0 ms, and why it is always the 2nd iteration when I run the code.
If I move the start and end-time printing outside of the outer for loop, then the result I get is:
for value 4500001 with time 10

In first iteration, the code is running interpreted.
In second iteration, JIT kicks in, slowing it down a bit while it compiles to native code.
In remaining iterations, native code runs very fast.
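To look at this more closely, the per-round timings can be recorded with nanoTime instead of currentTimeMillis. This is only a sketch with invented names (RoundTimings, findFirst, timeRounds); the loop itself is the one from the question, and which round is slow can differ between JVMs and runs.

```java
public class RoundTimings {
    // Same inner loop as in the question.
    static int findFirst() {
        int k = 0;
        for (int j = 0; j < 5_000_000; j++) {
            if (j > 4_500_000) {
                k = j;
                break;
            }
        }
        return k;
    }

    // Runs the loop 'rounds' times and returns the elapsed nanoseconds of each round.
    static long[] timeRounds(int rounds) {
        long[] nanos = new long[rounds];
        for (int q = 0; q < rounds; q++) {
            long start = System.nanoTime();
            int k = findFirst();
            nanos[q] = System.nanoTime() - start;
            if (k != 4_500_001) {
                throw new IllegalStateException("unexpected result " + k);
            }
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(10);
        for (int q = 0; q < nanos.length; q++) {
            System.out.println("round " + q + ": " + nanos[q] + " ns");
        }
    }
}
```

On HotSpot, running this with -XX:+PrintCompilation prints a line for each method as it gets compiled, which should let you line up the slow round with the compilation of findFirst.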

Because your winamp needed to decode another few frames of your mp3 to queue it into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in east Croydon farted and your computer is subscribed to the 'smells from London' twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple levels of cache. A core can only operate on memory that is in its cache, so if it runs an instruction that touches memory which is not currently cached, the core will stall for a while: it sends the memory controller a request to load the cache line containing the memory it needs, and will then wait until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds or even thousands of processes and threads, many of them internal to the kernel, pre-empting like there is no tomorrow, and trying to give extra precedence to processes that are time sensitive, such as the aforementioned winamp which must get a chance to decode some more mp3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: On ye olde windows you just couldn't get this done which is why ye olde winamp was a magical marvel of engineering, more or less hacking into windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself which is doing all sorts of borderline voodoo magic, as it has both a hotspot engine (which is doing bookkeeping on your code so that it can eventually conclude that it is worth spending considerable CPU resources to analyse the heck out of a method to rewrite it in optimized machine code because that method seems to be taking a lot of CPU time), and a garbage collector.
The solution is to forget entirely about trying to measure time using such mere banalities as measuring currentTimeMillis or nanoTime and writing a few loops. It's just way too complicated for that to actually work.
No. Use JMH.

What class is loaded when creating a thread with a lambda expression? [duplicate]

I have the following code
import java.util.stream.IntStream;
public class BenchMark {
public static void main(String args[]) {
doLinear();
doLinear();
doLinear();
doLinear();
}
private static void doParallel() {
IntStream range = IntStream.range(1, 6).parallel();
long startTime = System.nanoTime();
int reduce = range
.reduce((a, item) -> a * item).getAsInt();
long endTime = System.nanoTime();
System.out.println("parallel: " +reduce + " -- Time: " + (endTime - startTime));
}
private static void doLinear() {
IntStream range = IntStream.range(1, 6);
long startTime = System.nanoTime();
int reduce = range
.reduce((a, item) -> a * item).getAsInt();
long endTime = System.nanoTime();
System.out.println("linear: " +reduce + " -- Time: " + (endTime - startTime));
}
}
I was trying to benchmark streams but noticed that the execution time steadily decreases when calling the same function again and again.
Output:
linear: 120 -- Time: 57008226
linear: 120 -- Time: 23202
linear: 120 -- Time: 17192
linear: 120 -- Time: 17802
Process finished with exit code 0
There is a huge difference between first and second execution time.
I'm sure the JVM might be doing some tricks behind the scenes, but can anybody help me understand what's really going on there?
Is there any way to avoid this optimization so I can benchmark the true execution time?

I'm sure the JVM might be doing some tricks behind the scenes, but can anybody help me understand what's really going on there?
The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.
The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).
After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.
Is there any way to avoid this optimization so I can benchmark the true execution time?
You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.
For the sake of exploration you can see how your code behaves with JIT compilation disabled: pass -Xint to the java command. There are many more flags which disable various aspects of optimization.

UPDATE: Refer to @Marko's answer for an explanation of the initial latency due to lambda linkage.
The higher execution time for the first call is probably a result of the JIT effect. In short, the JIT compilation of the bytecode into native machine code occurs the first time your method is called. The JVM then attempts further optimization by identifying frequently called (hot) methods and regenerating their code for higher performance.
Is there any way to avoid this optimization so I can benchmark the true execution time?
You can certainly account for the JVM's initial warm-up by excluding the first few results. Then increase the number of repeated calls to your method in a loop of tens of thousands of iterations, and average the results.
There are a few more options that you might want to consider adding to your execution to help reduce noise, as discussed in this post. There are also some good tips in this post too.
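The warm-up-then-average idea can be sketched roughly like this. The names (WarmupAverage, product, averageNanos) and the iteration counts are arbitrary choices for the example, and a hand-rolled loop like this is still far less trustworthy than a proper harness.

```java
import java.util.stream.IntStream;

public class WarmupAverage {
    // The reduction from the question: 1 * 2 * 3 * 4 * 5.
    static int product() {
        return IntStream.range(1, 6).reduce((a, item) -> a * item).getAsInt();
    }

    // Runs 'warmup' untimed calls, then returns the average nanoseconds per call
    // over 'measured' timed calls.
    static double averageNanos(int warmup, int measured) {
        long sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += product();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            sink += product();
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated results so the calls cannot be treated as dead code.
        if (sink != 120L * (warmup + measured)) {
            throw new IllegalStateException("unexpected sum " + sink);
        }
        return (double) elapsed / measured;
    }

    public static void main(String[] args) {
        System.out.println("avg ns per call: " + averageNanos(20_000, 50_000));
    }
}
```

The first call (with the lambda linkage cost described in the other answer) lands in the warm-up phase and is therefore excluded from the average.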

true execution time
There's no such thing as "true execution time". If you need to solve this task only once, the true execution time would be the time of the first test (along with the time to start up the JVM itself). In general the time spent executing a given piece of code depends on many things:
Whether this piece of code is interpreted, JIT-compiled by C1 or C2 compiler. Note that there are not just three options. If you call one method from another, one of them might be interpreted and another might be C2-compiled.
For C2 compiler: how this code was executed previously, so what's in branch and type profile. The polluted type profile can drastically reduce the performance.
Garbage collector state: whether it interrupts the execution or not
Compilation queue: whether JIT-compiler compiles other code simultaneously (which may slow down the execution of current code)
The memory layout: how objects are located in memory, and how many cache lines must be loaded to access all the necessary data.
CPU branch predictor state which depends on the previous code execution and may increase or decrease number of branch mispredictions.
And so on and so forth. So even if you measure something in an isolated benchmark, this does not mean that the speed of the same code in production will be the same. It may differ by an order of magnitude. So before measuring something you should ask yourself why you want to measure this thing. Usually you don't care how long some part of your program takes to execute. What you usually care about is the latency and the throughput of the whole program. So profile the whole program and optimize the slowest parts. Probably the thing you are measuring is not the slowest.

The Java VM loads a class into memory the first time the class is used.
So the difference between 1st and 2nd run may be caused by class loading.
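A small sketch that makes the "first use" moment visible. Strictly speaking it shows class initialization (the static initializer running) rather than loading, but both happen lazily on first use in practice. LazyLoadingDemo and Helper are invented names.

```java
public class LazyLoadingDemo {
    // Collects events in the order they happen.
    static final StringBuilder LOG = new StringBuilder();

    static class Helper {
        // Runs once, the first time Helper is actually used.
        static {
            LOG.append("[Helper initialized]");
        }

        static int answer() {
            return 42;
        }
    }

    static String run() {
        LOG.setLength(0);
        LOG.append("before;");
        int value = Helper.answer(); // first use of Helper triggers its initializer
        LOG.append("after:").append(value);
        return LOG.toString();
    }

    public static void main(String[] args) {
        System.out.println("first call:  " + run());
        System.out.println("second call: " + run());
    }
}
```

On the first call the initializer entry appears between "before;" and "after:42"; on the second call it is absent, because the class is already set up. The same one-time cost applies to every JDK class your code touches for the first time, such as the stream classes in the question.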

What is the current state of affairs in the world of Java timers?

From time to time I encounter mentions of System.nanoTime() being a lot slower (the call could cost up to microseconds) than System.currentTimeMillis(), but prooflinks are often outdated, or lead to some fairly opinionated blog posts that can't be really trusted, or contain information pertaining to specific platform, or this, or that and so on.
I didn't run benchmarks since I'm being realistic about my ability to conduct an experiment concerning such a sensitive matter, but my conditions are really well-defined, so I'm expecting quite a simple answer.
So, on an average 64-bit Linux (implying 64-bit JRE), Java 8 and modern hardware, will switching to nanoTime() cost me those microseconds per call? Should I stay with currentTimeMillis()?

As always, it depends on what you're using it for. Since others are bashing nanoTime, I'll put a plug in for it. I exclusively use nanoTime to measure elapsed time in production code.
I shy away from currentTimeMillis in production because I typically need a clock that doesn't jump backwards and forwards the way the wall clock can (and does). This is critical in my systems, which make important timer-based decisions. nanoTime should be monotonically increasing at the rate you'd expect.
In fact, one of my co-workers says "currentTimeMillis is only useful for human entertainment," (such as the time in debug logs, or displayed on a website) because it cannot be trusted to measure elapsed time.
But really, we try not to use time as much as possible, and attempt to keep time out of our protocols; then we try to use logical clocks; and finally if absolutely necessary, we use durations based on nanoTime.
Update: There is one place where we use currentTimeMillis as a sanity check when connecting two hosts, but we're checking if the hosts' clocks are more than 5 minutes apart.
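For what it's worth, a duration based on nanoTime might look like the sketch below. The helper names (elapsedNanos, deadlineReached) are made up; the one real rule, taken from the System.nanoTime() javadoc, is to compare two readings by subtracting them rather than with < or >, because the counter is allowed to overflow.

```java
public class ElapsedTime {
    // Difference of two nanoTime() readings. The subtraction stays correct
    // even if the underlying counter wrapped around between the readings.
    static long elapsedNanos(long start, long end) {
        return end - start;
    }

    // True once at least timeoutNanos have passed since 'start'.
    static boolean deadlineReached(long start, long now, long timeoutNanos) {
        return now - start >= timeoutNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(50);
        long elapsed = elapsedNanos(start, System.nanoTime());
        System.out.println("slept for about " + elapsed / 1_000_000 + " ms");
        System.out.println("10 ms deadline reached: "
                + deadlineReached(start, System.nanoTime(), 10_000_000L));
    }
}
```

Note that the values returned by nanoTime are only meaningful as differences within one JVM; they have an arbitrary origin and say nothing about the time of day.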

If you are currently using currentTimeMillis() and are happy with the resolution, then you definitely shouldn't change.
According to the javadoc:
This method provides nanosecond precision, but not necessarily
nanosecond resolution (that is, how frequently the value changes) -
no guarantees are made except that the resolution is at least as
good as that of {@link #currentTimeMillis()}.
So depending on the OS implementation, there is no guarantee that the value returned actually changes every nanosecond! It may well advance only as often as currentTimeMillis() does.
As far as resolution is concerned, a perfectly valid implementation could be based on currentTimeMillis() * 1000000
Therefore, I don't think you really gain a benefit from nano seconds even if there wasn't a performance issue.

I want to stress that even if the calls would be very cheap, you will not get the nanosecond resolution of your measurements.
Let me give you an example (code from http://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--):
long startTime = System.nanoTime();
// ... the code being measured ...
long estimatedTime = System.nanoTime() - startTime;
So while both long values are expressed in nanoseconds, the JVM does not guarantee that every call you make to nanoTime() will return a new value.
To illustrate this, I wrote a simple program and ran it on Win7x64 (feel free to run it and report the results as well):
package testNano;
public class Main {
public static void main(String[] args) {
long attempts = 10_000_000L;
long stale = 0;
long prevTime;
for (int i = 0; i < attempts; i++) {
prevTime = System.nanoTime();
long nanoTime = System.nanoTime();
if (prevTime == nanoTime) stale++;
}
System.out.format("nanoTime() returned stale value in %d out of %d tests%n", stale, attempts);
}
}
It prints out nanoTime() returned stale value in 9117171 out of 10000000 tests.
EDIT
I also recommend reading the Oracle article on this: https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks. The conclusions of the article are:
If you are interested in measuring absolute time then always use System.currentTimeMillis(). Be aware that its resolution may be quite coarse (though this is rarely an issue for absolute times.)
If you are interested in measuring/calculating elapsed time, then always use System.nanoTime(). On most systems it will give a resolution on the order of microseconds. Be aware though, this call can also take microseconds to execute on some platforms.
Also you might find this discussion interesting: Why is System.nanoTime() way slower (in performance) than System.currentTimeMillis()?.
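As a complement to the stale-value count above, a rough way to estimate the effective granularity is to record the smallest non-zero step between consecutive readings. This is just a sketch (NanoTimeGranularity and smallestStep are invented names); the number it prints includes the cost of the call itself and will differ between platforms.

```java
public class NanoTimeGranularity {
    // Smallest positive difference seen between consecutive nanoTime() readings,
    // or Long.MAX_VALUE if the value never changed during sampling.
    static long smallestStep(int samples) {
        long smallest = Long.MAX_VALUE;
        long previous = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            long now = System.nanoTime();
            long delta = now - previous;
            if (delta > 0 && delta < smallest) {
                smallest = delta;
            }
            previous = now;
        }
        return smallest;
    }

    public static void main(String[] args) {
        System.out.println("smallest observed step: " + smallestStep(10_000_000) + " ns");
    }
}
```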

Running this very simple test:
public static void main(String[] args) {
// Warmup loops
long l;
for (int i=0;i<1000000;i++) {
l = System.currentTimeMillis();
}
for (int i=0;i<1000000;i++) {
l = System.nanoTime();
}
// Full loops
long start = System.nanoTime();
for (int i=0;i<10000000;i++) {
l = System.currentTimeMillis();
}
start = System.nanoTime()-start;
System.err.println("System.currentTimeMillis() "+start/1000);
start = System.nanoTime();
for (int i=0;i<10000000;i++) {
l = System.nanoTime();
}
start = System.nanoTime()-start;
System.err.println("System.nanoTime() "+start/1000);
}
On Windows 7 this shows millis to be just over 2 times as fast:
System.currentTimeMillis() 138615
System.nanoTime() 299575
On other platforms, the difference isn't as large, with nanoTime() actually being slightly (~10%) faster:
On OS X:
System.currentTimeMillis() 463065
System.nanoTime() 432896
On Linux with OpenJDK:
System.currentTimeMillis() 352722
System.nanoTime() 312960

Well the best thing to do in such situations is always to benchmark it. And since the timing depends solely on your platform and OS there's really nothing we can do for you here, particularly since you nowhere explain what you actually use the timer for.
Neither nanoTime nor currentTimeMillis generally guarantee monotonicity (nanoTime does on HotSpot for Solaris only and otherwise relies on an existing monotone time source of the OS - so for most people it will be monotonic even if currentTimeMillis is not).
Luckily for you, writing benchmarks in Java is relatively easy these days thanks to JMH (the Java Microbenchmark Harness), and even luckier for you, Aleksey Shipilёv actually investigated nanoTime a while ago: See here - including source code to do the interesting benchmarking yourself. It's also a nice primer to JMH itself; if you want to write accurate benchmarks with only relatively little knowledge, that's the one to pick. It's just amazing how far the engineers behind that project went to make benchmarking as straightforward as possible for the general populace! Although you certainly can still mess up if you're not careful ;-)
To summarize the results for a modern linux distribution or Solaris and a x86 CPU:
Precision: 30ns
Latency: 30ns best case
Windows:
Precision: Hugely variable, 370ns to 15 µs
Latency: Hugely variable, 15ns to 15 µs
But note that Windows is also known to give you a precision of up to 100ms for currentTimeMillis in some rare situations, so pick your poison.
Mac OS X:
Precision: 1µs
Latency: 50ns
Be wary: these results will differ greatly depending on your platform (CPU/MB - there are some interesting older hardware combinations around, although they're luckily getting older) and OS. Heck, obviously if you just run this on an 800 MHz CPU your results will be rather different compared to a 3.6GHz server.

Stack performance in programming languages

Just for fun, I tried to compare the stack performance of a couple of programming languages by calculating the Fibonacci series using the naive recursive algorithm. The code is mainly the same in all languages; I'll post a Java version:
public class Fib {
public static int fib(int n) {
if (n < 2) return 1;
return fib(n-1) + fib(n-2);
}
public static void main(String[] args) {
System.out.println(fib(Integer.valueOf(args[0])));
}
}
Ok so the point is that using this algorithm with input 40 I got these timings:
C: 2.796s
Ocaml: 2.372s
Python: 106.407s
Java: 1.336s
C#(mono): 2.956s
They are taken in a Ubuntu 10.04 box using the versions of each language available in the official repositories, on a dual core intel machine.
I know that functional languages like ocaml have the slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time because it's the only interpreted language in this test, but I was impressed by the java running time, which is half that of C for the same algorithm! Would you attribute this to the JIT compilation?
How would you explain these results?
EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)
EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)

You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.
Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.
In addition, make sure you're measuring CPU time rather than clock time.
Anything less than that, I would not consider a decent statistical analysis and it may well be subject to noise, rendering your results useless.
For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
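The "drop the fastest and slowest, average the rest" step is simple enough to sketch. TrimmedMean is an invented name and the sample timings in main are made up purely for illustration.

```java
import java.util.Arrays;

public class TrimmedMean {
    // Average of the timings after removing the single smallest and single largest value.
    static double trimmedMean(double[] timings) {
        if (timings.length < 3) {
            throw new IllegalArgumentException("need at least three timings");
        }
        double[] sorted = timings.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    public static void main(String[] args) {
        // Hypothetical timings in seconds for ten runs, not real measurements.
        double[] runs = {0.77, 0.76, 0.79, 0.75, 1.40, 0.76, 0.78, 0.77, 0.60, 0.76};
        System.out.println("trimmed mean: " + trimmedMean(runs));
    }
}
```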
In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fault that you duplicate a lot of calculations. The iterative version:
int fib(unsigned int n) {
int t, a, b;
if (n < 2) return 1;
a = b = 1;
while (n-- >= 2) {
t = a + b;
a = b;
b = t;
}
return b;
}
further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of about 90% and a whopping reduction of 96% from the original code.
And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:
static int fib(unsigned int n) {
static int lookup[] = {
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
46368, 75025, 121393, 196418, 317811, 514229, 832040,
1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
24157817, 39088169, 63245986, 102334155, 165580141 };
int t, a, b;
if (n < sizeof(lookup)/sizeof(*lookup))
return lookup[n];
a = lookup[sizeof(lookup)/sizeof(*lookup)-2];
b = lookup[sizeof(lookup)/sizeof(*lookup)-1];
while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
t = a + b;
a = b;
b = t;
}
return b;
}
That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.
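Since the question posted a Java version, here is a sketch of the recursive and iterative variants side by side in Java, using the same convention as the question (fib(0) = fib(1) = 1). FibCompare and the method names are invented for this example; no timings are claimed.

```java
public class FibCompare {
    // Naive recursion, as in the question: exponential number of calls.
    static int fibRecursive(int n) {
        if (n < 2) return 1;
        return fibRecursive(n - 1) + fibRecursive(n - 2);
    }

    // Iterative version: one addition per step, linear in n.
    static int fibIterative(int n) {
        int a = 1;
        int b = 1;
        for (int i = 2; i <= n; i++) {
            int t = a + b;
            a = b;
            b = t;
        }
        return b;
    }

    public static void main(String[] args) {
        int n = 40;
        long start = System.nanoTime();
        int recursive = fibRecursive(n);
        long recursiveNanos = System.nanoTime() - start;

        start = System.nanoTime();
        int iterative = fibIterative(n);
        long iterativeNanos = System.nanoTime() - start;

        System.out.println("recursive: " + recursive + " in " + recursiveNanos + " ns");
        System.out.println("iterative: " + iterative + " in " + iterativeNanos + " ns");
    }
}
```

Both return 165580141 for n = 40, the same value as the last entry of the lookup table above. As with the C code, int overflows a few steps beyond that.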

You say very little about your configuration (in benchmarking, details are everything: commandlines, computer used, ...)
When I try to reproduce for OCaml I get:
let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))
let () = Format.printf "%d@." (f 40)
$ ocamlopt fib.ml
$ time ./a.out
165580141
real 0m1.643s
This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).

One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.
Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.
You may be able to see a difference by inverting the logic of the if:
public static int fib(int n) {
if (n >= 2) return fib(n-1) + fib(n-2);
return 1;
}
It's worth a try, anyway :)
As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)

I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...
As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...

Note that if the Java Hotspot VM is smart enough to memoise fib() calls, it can cut down the exponential cost of the algorithm to something nearer to linear cost.
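To show what memoisation would buy, here is a hand-written memoised version with a call counter. This is only an illustration of the technique, not a claim about what HotSpot actually does; FibMemo and the calls counter are invented names.

```java
public class FibMemo {
    // Number of fib(n, memo) invocations during the most recent fib(n) call.
    static long calls;

    static int fib(int n, int[] memo) {
        calls++;
        if (n < 2) return 1;
        if (memo[n] != 0) return memo[n]; // already computed, reuse it
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo);
        return memo[n];
    }

    static int fib(int n) {
        calls = 0;
        return fib(n, new int[Math.max(n + 1, 2)]);
    }

    public static void main(String[] args) {
        System.out.println("fib(40) = " + fib(40));
        System.out.println("calls: " + calls);
    }
}
```

With the cache, fib(40) needs 79 calls (2n - 1), whereas the naive recursion makes hundreds of millions of calls for the same input.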

I wrote a C version of the naive Fibonacci function and compiled it to assembler in gcc (4.3.2 Linux). I then compiled it with gcc -O3.
The unoptimised version was 34 lines long and looked like a straight translation of the C code.
The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.

With C, you should either declare the fibonacci function "inline", or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why with -O3 you get better performance, it activates -finline-functions.
Edit You need to at least specify -O/-O1 to have recursive inlining, also if the function is declared inline. Actually, testing myself I found that declaring the function inline and using -O as compilation flag, or just using -O -finline-functions, my recursive fibonacci code was faster than with -O2 or -O2 -finline-functions.

One C trick which you can try is to disable the stack checking (i.e. built-in code which makes sure that the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, which forces the stack-checking to reallocate the entire stack several times during the actual run.
Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.

C++ and Java performance

This question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
void testvector(int x)
{
vector<string> v;
char aux[20];
int a = x * 2000;
int z = a + 2000;
string s("X-");
for (int i = a; i < z; i++)
{
sprintf(aux, "%d", i);
v.push_back(s + aux);
}
}
int main()
{
for (int i = 0; i < 10000; i++)
{
if (i % 1000 == 0) cout << i << endl;
testvector(i);
}
}
In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think that should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performances in this stuff? (writing actual code of these things can be very long, so, discussing about it would be great)...
Thank you.

You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, and so blocks are, on average, moderately expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.

All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!

Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
typedef vector<string, boost::pool_allocator<string> > str_vector;
void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
char aux[25] = "X-";
int a = x * 2000;
for (; it != end; ++a)
{
sprintf(aux+2, "%d", a);
*it++ = aux;
}
}
int main(int argc, char** argv)
{
str_vector v(2000);
for (int i = 0; i < 10000; i++)
{
if (i % 1000 == 0) cout << i << endl;
testvector(i, v.begin(), v.begin()+2000);
}
return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)

Hotspot optimises hot spots in code. Typically, anything that gets executed 10000 times it tries to optimise.
For this code, after 5 iterations it will try to optimise the inner loop that adds the strings to the vector. The optimisation will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that it will remove all of the code in the method and turn it into a no-op. To test this, try returning the results from the method. Even then, be careful to do something meaningful with the result - just getting its length, for example, can be optimised away, as HotSpot can see the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
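A sketch of what "returning the results and doing something meaningful with them" could look like on the Java side. The original Java program was not posted, so EscapeDemo, testVector and run are a guessed reconstruction, and whether HotSpot would really remove the original code is the answer's hypothesis, not something this example proves.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeDemo {
    // Builds the same 2000 strings as the C++ testvector(), but returns them
    // so the list escapes the method.
    static List<String> testVector(int x) {
        List<String> v = new ArrayList<>();
        int a = x * 2000;
        int z = a + 2000;
        for (int i = a; i < z; i++) {
            v.add("X-" + i);
        }
        return v;
    }

    // Folds every string into a checksum that depends on the actual contents,
    // not just on the number of elements.
    static long run(int rounds) {
        long checksum = 0;
        for (int i = 0; i < rounds; i++) {
            for (String s : testVector(i)) {
                checksum += s.hashCode();
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = run(10_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum " + checksum + " in " + elapsedMillis + " ms");
    }
}
```

Printing the checksum at the end gives the JIT a reason to keep the allocations and string building, which makes the comparison with the C++ program a little fairer.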

In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I’ve modified the code so that both the Java and the C++ versions print the last result of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

It should help if you use vector::reserve to reserve space for the 2000 elements in v before the loop (however, the same thing should also speed up the Java equivalent of this code).
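On the Java side, the counterpart of reserve is the ArrayList constructor that takes an initial capacity. A minimal sketch follows; PresizedList and build are invented names, and the asker's actual Java code was not shown.

```java
import java.util.ArrayList;
import java.util.List;

public class PresizedList {
    // Builds 'count' strings starting at x * count, with the backing array
    // sized up front so it never has to grow while adding.
    static List<String> build(int x, int count) {
        List<String> v = new ArrayList<>(count); // capacity only; the list is still empty
        int a = x * count;
        for (int i = a; i < a + count; i++) {
            v.add("X-" + i);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> v = build(3, 2000);
        System.out.println(v.size() + " elements, last = " + v.get(v.size() - 1));
    }
}
```

Unlike the C++ vector<string> v(n) constructor, new ArrayList<>(n) does not create n elements; it only sets the capacity, which is what reserve does.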

To suggest why the performance of the C++ and Java versions differs, it would be essential to see the source for both. I can see a number of performance issues in the C++; for some it would be useful to see whether you were doing the same in the Java (e.g. flushing the output stream via std::endl: do you call System.out.flush() or just append a '\n'? If the latter, then you've just given the Java a distinct advantage).

What are you actually trying to measure here? Putting ints into a vector?
You can start by pre-allocating space in the vector, since its size is known:
instead of:
void testvector(int x)
{
vector<string> v;
int a = x * 2000;
int z = a + 2000;
string s("X-");
for (int i = a; i < z; i++)
v.push_back(s + to_string(i));
}
try:
void testvector(int x)
{
int a = x * 2000;
int z = a + 2000;
string s("X-");
vector<string> v;
v.reserve(z - a); // room for the 2000 elements, without creating them
for (int i = a; i < z; i++)
v.push_back(s + to_string(i));
}

In your inner loop, you are pushing ints into a string vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the pushback (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.
