I have the following two programs:
long startTime = System.currentTimeMillis();
for (int i = 0; i < N; i++);
long endTime = System.currentTimeMillis();
System.out.println("Elapsed time: " + (endTime - startTime) + " msecs");
and
long startTime = System.currentTimeMillis();
for (long i = 0; i < N; i++);
long endTime = System.currentTimeMillis();
System.out.println("Elapsed time: " + (endTime - startTime) + " msecs");
Note: the only difference is the type of the loop variable (int and long).
When I run this, the first program consistently prints between 0 and 16 msecs, regardless of the value of N. The second takes a lot longer. For N == Integer.MAX_VALUE, it runs in about 1800 msecs on my machine. The run time appears to be more or less linear in N.
So why is this?
I suppose the JIT-compiler optimizes the int loop to death. And for good reason, because obviously it doesn't do anything. But why doesn't it do so for the long loop as well?
A colleague thought we might be measuring the JIT compiler doing its work in the long loop, but since the run time seems to be linear in N, this probably isn't the case.
I'm using JDK 1.6.0 update 17:
C:\>java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
I'm on Windows XP Professional x64 Edition, Service Pack 2, with an Intel Core2 Quad CPU at 2.40GHz.
DISCLAIMER
I know that microbenchmarks aren't useful in production. I also know that System.currentTimeMillis() isn't as accurate as its name suggests. This is just something I noticed while fooling around, and I was simply curious as to why this happens; nothing more.
It's an interesting question, but to be honest I'm not convinced that considering Hotspot's behaviour here will yield useful information. Any answers you do get are not going to be transferable in a general case (because we're looking at the optimisations that Hotspot performs in one specific situation), so they'll help you understand why one no-op is faster than another, but they won't help you write faster "real" programs.
It's also incredibly easy to write very misleading micro benchmarks around this sort of thing - see this IBM DW article for some of the common pitfalls, how to avoid them and some general commentary on what you're doing.
So really this is a "no comment" answer, but I think that's the only valid response: a trivially-empty loop doesn't need to be fast in real code, so the compiler writers had no reason to optimise every variant of it.
You are probably using a 32-bit JVM. The results will be probably different with a 64bit JVM. In a 32-bit JVM an int can be mapped to a native 32-bit integer and be incremented with a single operation. The same doesn't hold for a long, which will require more operations to increment.
See this question for a discussion about int and long sizes.
My guess - and it is only a guess - is this:
The JVM concludes that the first loop effectively does nothing, so it removes it entirely. No variable "escapes" from the for-loop.
In the second case, the loop also does nothing. But it might be that the JVM code that determines that the loop does nothing has a "if (type of i) == int" clause. In which case the optimisation to remove the do-nothing for-loop only works with int.
An optimisation that removes code has to be sure there are no side-effects. The JVM coders seem to have erred on the side of caution.
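If the dead-code-elimination guess above is right, making the loop's result observable should prevent removal and restore a linear run time even for int. A minimal sketch (the sum accumulator is my addition, not part of the original programs):

```java
public class LoopSink {
    public static void main(String[] args) {
        long n = Integer.MAX_VALUE;
        long sum = 0;
        long startTime = System.currentTimeMillis();
        for (long i = 0; i < n; i++) {
            sum += i; // the result now "escapes" the loop, so it cannot be removed
        }
        long endTime = System.currentTimeMillis();
        // Printing sum forces the JIT to keep the computation alive
        System.out.println("sum = " + sum + ", elapsed: " + (endTime - startTime) + " msecs");
    }
}
```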
Micro-benchmarking like this does not make a lot of sense, because the results depend a lot on the internal workings of the Hotspot JIT.
Also, note that the system clock values that you get with System.currentTimeMillis() don't have a 1 ms resolution on all operating systems. You can't use this to very accurately time very short duration events.
Have a look at this article, that explains why doing micro-benchmarks in Java is not as easy as most people think: Anatomy of a flawed microbenchmark
Related
Simple question, which I've been wondering. Of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be pretty accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
    if (time - timestamp > 600000L) {
        // Do something
    }
}
Or this (no caching):
for (long timestamp : times) {
    if (System.currentTimeMillis() - timestamp > 600000L) {
        // Do something
    }
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?
A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's loop-invariant, the JIT can even optimize your loop condition to -timestamp < 600000L - time or timestamp > time - 600000L. i.e. the loop condition becomes a trivial compare between the iterator and a loop-invariant constant in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result1. So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some other simple calculation, or anything else that out-of-order execution can do a good job with, the timing call could be most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Small vs. huge will depend on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.
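The hoisting described above can be sketched like this (countStale, the times array, and the 10-minute cutoff are hypothetical names echoing the question's loop, not a real API):

```java
import java.util.concurrent.TimeUnit;

public class HoistDemo {
    // Hypothetical cutoff: entries older than 10 minutes count as stale
    private static final long CUTOFF_MS = TimeUnit.MINUTES.toMillis(10);

    static int countStale(long[] times) {
        long now = System.currentTimeMillis(); // hoisted: one call for the whole loop
        int stale = 0;
        for (long timestamp : times) {
            if (now - timestamp > CUTOFF_MS) stale++;
        }
        return stale;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long[] times = { now, now - TimeUnit.MINUTES.toMillis(11) };
        System.out.println(countStale(times)); // prints 1: only the 11-minute-old entry is stale
    }
}
```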
In your case, method calls should always be hoisted out of loops.
System.currentTimeMillis() simply reads a value from OS memory, so it is very cheap (a few nanoseconds), as opposed to System.nanoTime(), which involves a call to hardware, and therefore can be orders of magnitude slower.
I have the following code:
import java.util.stream.IntStream;

public class BenchMark {

    public static void main(String args[]) {
        doLinear();
        doLinear();
        doLinear();
        doLinear();
    }

    private static void doParallel() {
        IntStream range = IntStream.range(1, 6).parallel();
        long startTime = System.nanoTime();
        int reduce = range.reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("parallel: " + reduce + " -- Time: " + (endTime - startTime));
    }

    private static void doLinear() {
        IntStream range = IntStream.range(1, 6);
        long startTime = System.nanoTime();
        int reduce = range.reduce((a, item) -> a * item).getAsInt();
        long endTime = System.nanoTime();
        System.out.println("linear: " + reduce + " -- Time: " + (endTime - startTime));
    }
}
I was trying to benchmark streams but came across this: the execution time steadily decreases upon calling the same function again and again.
Output:
linear: 120 -- Time: 57008226
linear: 120 -- Time: 23202
linear: 120 -- Time: 17192
linear: 120 -- Time: 17802
Process finished with exit code 0
There is a huge difference between first and second execution time.
I'm sure the JVM is doing some tricks behind the scenes, but can anybody help me understand what's really going on there?
Is there any way to avoid this optimization so I can benchmark the true execution time?
I'm sure JVM might be doing some tricks behind the scenes but can anybody help me understand whats really going on there?
The massive latency of the first invocation is due to the initialization of the complete lambda runtime subsystem. You pay this only once for the whole application.
The first time your code reaches any given lambda expression, you pay for the linkage of that lambda (initialization of the invokedynamic call site).
After some iterations you'll see additional speedup due to the JIT compiler optimizing your reduction code.
Is there any way to avoid this optimization so I can benchmark true execution time?
You are asking for a contradiction here: the "true" execution time is the one you get after warmup, when all optimizations have been applied. This is the runtime an actual application would experience. The latency of the first few runs is not relevant to the wider picture, unless you are interested in single-shot performance.
For the sake of exploration you can see how your code behaves with JIT compilation disabled: pass -Xint to the java command. There are many more flags which disable various aspects of optimization.
UPDATE: Refer to @Marko's answer for an explanation of the initial latency due to lambda linkage.
The higher execution time for the first call is probably a result of JIT compilation. In short, the byte code is compiled into native machine code the first time a method is called. The JVM then attempts further optimization by identifying frequently called (hot) methods and regenerating their code for higher performance.
Is there any way to avoid this optimization so I can benchmark true execution time?
You can certainly account for the JVM's initial warm-up by excluding the first few results. Then increase the number of repeated calls to your method to tens of thousands of iterations, and average the results.
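That advice can be sketched as a tiny harness (the iteration counts and the task are arbitrary placeholders of mine, not from the answer):

```java
public class WarmupBench {
    static long sink; // side effect keeps the task from being optimised away

    static double averageNanos(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) task.run();   // discarded: JIT warm-up
        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) task.run(); // measured after warm-up
        return (System.nanoTime() - start) / (double) measured;
    }

    public static void main(String[] args) {
        double avg = averageNanos(() -> sink += System.nanoTime(), 10_000, 100_000);
        System.out.println("avg ns/op: " + avg);
    }
}
```

This still falls short of a real harness like jmh, but it illustrates the discard-then-average idea.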
There are a few more options that you might want to consider adding to your execution to help reduce noises as discussed in this post. There are also some good tips from this post too.
true execution time
There's no such thing as "true execution time". If you need to solve this task only once, the true execution time would be the time of the first test (along with the time to start up the JVM itself). In general, the time spent executing a given piece of code depends on many things:
Whether this piece of code is interpreted, JIT-compiled by C1 or C2 compiler. Note that there are not just three options. If you call one method from another, one of them might be interpreted and another might be C2-compiled.
For C2 compiler: how this code was executed previously, so what's in branch and type profile. The polluted type profile can drastically reduce the performance.
Garbage collector state: whether it interrupts the execution or not
Compilation queue: whether JIT-compiler compiles other code simultaneously (which may slow down the execution of current code)
The memory layout: how objects are located in memory, and how many cache lines must be loaded to access all the necessary data.
CPU branch predictor state which depends on the previous code execution and may increase or decrease number of branch mispredictions.
And so on and so forth. So even if you measure something in an isolated benchmark, this does not mean that the speed of the same code in production will be the same. It may differ by an order of magnitude. So before measuring something you should ask yourself why you want to measure it. Usually you don't care how long some part of your program takes to execute. What you usually care about is the latency and throughput of the whole program. So profile the whole program and optimize the slowest parts. Probably the thing you are measuring is not the slowest.
The Java VM loads a class into memory the first time the class is used, so the difference between the 1st and 2nd run may be caused by class loading.
From time to time I encounter mentions of System.nanoTime() being a lot slower (the call could cost up to microseconds) than System.currentTimeMillis(), but the proof links are often outdated, or lead to fairly opinionated blog posts that can't really be trusted, or contain information pertaining only to a specific platform, and so on.
I didn't run benchmarks since I'm being realistic about my ability to conduct an experiment concerning such a sensitive matter, but my conditions are really well-defined, so I'm expecting quite a simple answer.
So, on an average 64-bit Linux (implying 64-bit JRE), Java 8 and a modern hardware, will switching to nanoTime() cost me that microseconds to call? Should I stay with currentTimeMillis()?
As always, it depends on what you're using it for. Since others are bashing nanoTime, I'll put a plug in for it. I exclusively use nanoTime to measure elapsed time in production code.
I shy away from currentTimeMillis in production because I typically need a clock that doesn't jump backwards and forwards around like the wall clock can (and does). This is critical in my systems which use important timer-based decisions. nanoTime should be monotonically increasing at the rate you'd expect.
In fact, one of my co-workers says "currentTimeMillis is only useful for human entertainment," (such as the time in debug logs, or displayed on a website) because it cannot be trusted to measure elapsed time.
But really, we try not to use time as much as possible, and attempt to keep time out of our protocols; then we try to use logical clocks; and finally if absolutely necessary, we use durations based on nanoTime.
Update: There is one place where we use currentTimeMillis as a sanity check when connecting two hosts, but we're checking if the hosts' clocks are more than 5 minutes apart.
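The "durations based on nanoTime" pattern can be sketched as a small deadline helper (the Deadline class is hypothetical, not from our codebase):

```java
public class Deadline {
    private final long startNanos = System.nanoTime();
    private final long timeoutNanos;

    Deadline(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    boolean expired() {
        // Comparing a difference is robust against wall-clock jumps
        // and against nanoTime's arbitrary origin
        return System.nanoTime() - startNanos >= timeoutNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        Deadline d = new Deadline(50);
        System.out.println(d.expired()); // false right after creation
        Thread.sleep(100);
        System.out.println(d.expired()); // true once the timeout elapses
    }
}
```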
If you are currently using currentTimeMillis() and are happy with the resolution, then you definitely shouldn't change.
According to the javadoc:
This method provides nanosecond precision, but not necessarily
nanosecond resolution (that is, how frequently the value changes)
no guarantees are made except that the resolution is at least as
good as that of {@link #currentTimeMillis()}.
So depending on the OS implementation, there is no guarantee that the nano time returned is even meaningful at nanosecond scale! It might just be a 9-digit value that carries the same millisecond information as currentTimeMillis().
A perfectly valid implementation could be currentTimeMillis() * 1000000
Therefore, I don't think you really gain a benefit from nano seconds even if there wasn't a performance issue.
I want to stress that even if the calls would be very cheap, you will not get the nanosecond resolution of your measurements.
Let me give you an example (code from http://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--):
long startTime = System.nanoTime();
// ... the code being measured ...
long estimatedTime = System.nanoTime() - startTime;
So while both long values are expressed in nanoseconds, the JVM gives no guarantee that every call you make to nanoTime() will return a new value.
To illustrate this, I wrote a simple program and ran it on Win7x64 (feel free to run it and report the results as well):
package testNano;

public class Main {
    public static void main(String[] args) {
        long attempts = 10_000_000L;
        long stale = 0;
        long prevTime;
        for (int i = 0; i < attempts; i++) {
            prevTime = System.nanoTime();
            long nanoTime = System.nanoTime();
            if (prevTime == nanoTime) stale++;
        }
        System.out.format("nanoTime() returned stale value in %d out of %d tests%n", stale, attempts);
    }
}
It prints out nanoTime() returned stale value in 9117171 out of 10000000 tests.
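A related follow-up (my own sketch, not from the answer) is to measure the smallest positive step nanoTime() actually takes on your machine, which approximates its effective resolution:

```java
public class NanoGranularity {
    public static void main(String[] args) {
        long minDelta = Long.MAX_VALUE;
        for (int i = 0; i < 1_000_000; i++) {
            long t1 = System.nanoTime();
            long t2 = System.nanoTime();
            // Only count calls where the clock actually advanced
            if (t2 > t1 && t2 - t1 < minDelta) minDelta = t2 - t1;
        }
        System.out.println("smallest observed tick: " + minDelta + " ns");
    }
}
```

On my understanding, typical values are tens of nanoseconds on Linux and often much coarser on Windows, consistent with the stale-value counts above.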
EDIT
I also recommend reading the Oracle article on this: https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks. The conclusions of the article are:
If you are interested in measuring absolute time then always use System.currentTimeMillis(). Be aware that its resolution may be quite coarse (though this is rarely an issue for absolute times.)
If you are interested in measuring/calculating elapsed time, then always use System.nanoTime(). On most systems it will give a resolution on the order of microseconds. Be aware though, this call can also take microseconds to execute on some platforms.
Also you might find this discussion interesting: Why is System.nanoTime() way slower (in performance) than System.currentTimeMillis()?.
Running this very simple test:
public static void main(String[] args) {
    // Warmup loops
    long l;
    for (int i = 0; i < 1000000; i++) {
        l = System.currentTimeMillis();
    }
    for (int i = 0; i < 1000000; i++) {
        l = System.nanoTime();
    }
    // Full loops
    long start = System.nanoTime();
    for (int i = 0; i < 10000000; i++) {
        l = System.currentTimeMillis();
    }
    start = System.nanoTime() - start;
    System.err.println("System.currentTimeMillis() " + start / 1000);
    start = System.nanoTime();
    for (int i = 0; i < 10000000; i++) {
        l = System.nanoTime();
    }
    start = System.nanoTime() - start;
    System.err.println("System.nanoTime() " + start / 1000);
}
On Windows 7 this shows millis to be just over 2 times as fast:
System.currentTimeMillis() 138615
System.nanoTime() 299575
On other platforms, the difference isn't as large, with nanoTime() actually being slightly (~10%) faster:
On OS X:
System.currentTimeMillis() 463065
System.nanoTime() 432896
On Linux with OpenJDK:
System.currentTimeMillis() 352722
System.nanoTime() 312960
Well, the best thing to do in such situations is always to benchmark it. And since the timing depends solely on your platform and OS, there's really nothing we can do for you here, particularly since you don't explain what you actually use the timer for.
Neither nanoTime nor currentTimeMillis generally guarantee monotonicity (nanoTime does on HotSpot for Solaris only and otherwise relies on an existing monotone time source of the OS - so for most people it will be monotonic even if currentTimeMillis is not).
Luckily for you, writing benchmarks in Java is relatively easy these days thanks to jmh (the Java Microbenchmark Harness), and even luckier for you, Aleksey Shipilёv actually investigated nanoTime a while ago: see here, including source code to do the interesting benchmarking yourself. It's also a nice primer to jmh itself: if you want to write accurate benchmarks with relatively little background knowledge, that's the one to pick. It's amazing how far the engineers behind that project went to make benchmarking as straightforward as possible for the general populace, although you can certainly still mess things up if you're not careful ;-)
To summarize the results for a modern linux distribution or Solaris and a x86 CPU:
Precision: 30ns
Latency: 30ns best case
Windows:
Precision: Hugely variable, 370ns to 15 µs
Latency: Hugely variable, 15ns to 15 µs
But note that Windows is also known to give you a precision as coarse as 100 ms for currentTimeMillis in some rare situations, so... pick your poison.
Mac OS X:
Precision: 1µs
Latency: 50ns
Be aware that these results will differ greatly depending on the platform used (CPU/motherboard: there are some interesting older hardware combinations around, although they're luckily getting rarer) and the OS. Obviously, just running this on an 800 MHz CPU will give rather different results compared to a 3.6 GHz server.
Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is mostly the same in all languages; I'll post the Java version:
public class Fib {
    public static int fib(int n) {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(Integer.valueOf(args[0])));
    }
}
Ok so the point is that using this algorithm with input 40 I got these timings:
C: 2.796s
Ocaml: 2.372s
Python: 106.407s
Java: 1.336s
C#(mono): 2.956s
They are taken in a Ubuntu 10.04 box using the versions of each language available in the official repositories, on a dual core intel machine.
I know that functional languages like OCaml have the slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time, given that it's the only interpreted language in this test, but I was impressed by the Java running time, which is half that of the C version for the same algorithm! Would you attribute this to JIT compilation?
How would you explain these results?
EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)
EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)
You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.
Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.
In addition, make sure you're measuring CPU time rather than clock time.
Anything less than that, I would not consider a decent statistical analysis and it may well be subject to noise, rendering your results useless.
For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fault that you duplicate a lot of calculations. The iterative version:
int fib(unsigned int n) {
    int t, a, b;
    if (n < 2) return 1;
    a = b = 1;
    while (n-- >= 2) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of 89% and a whopping reduction of 96% from the original code.
And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:
static int fib(unsigned int n) {
    static int lookup[] = {
        1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
        610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
        46368, 75025, 121393, 196418, 317811, 514229, 832040,
        1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
        24157817, 39088169, 63245986, 102334155, 165580141 };
    int t, a, b;
    if (n < sizeof(lookup)/sizeof(*lookup))
        return lookup[n];
    a = lookup[sizeof(lookup)/sizeof(*lookup) - 2];
    b = lookup[sizeof(lookup)/sizeof(*lookup) - 1];
    while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.
You say very little about your configuration (in benchmarking, details are everything: commandlines, computer used, ...)
When I try to reproduce for OCaml I get:
let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))
let () = Format.printf "%d@." (f 40)
$ ocamlopt fib.ml
$ time ./a.out
165580141
real 0m1.643s
This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).
One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.
Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.
You may be able to see a difference by inverting the logic of the if:
public static int fib(int n) {
    if (n >= 2) return fib(n - 1) + fib(n - 2);
    return 1;
}
It's worth a try, anyway :)
As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)
I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...
As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...
Note that if the Java Hotspot VM is smart enough to memoise fib() calls, it can cut the exponential cost of the algorithm down to something nearer to linear cost.
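As far as I know, HotSpot will not actually memoise fib() calls for you, but the effect imagined above can be written by hand. A sketch (the FibMemo class and its cache are my own, using the same fib(0) == fib(1) == 1 convention as the question's code):

```java
import java.util.HashMap;
import java.util.Map;

public class FibMemo {
    private static final Map<Integer, Long> cache = new HashMap<>();

    // Each distinct n is computed once; later calls are a map lookup
    static long fib(int n) {
        if (n < 2) return 1;
        Long cached = cache.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        cache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(40)); // 165580141, now in linear rather than exponential time
    }
}
```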
I wrote a C version of the naive Fibonacci function and compiled it to assembler in gcc (4.3.2 Linux). I then compiled it with gcc -O3.
The unoptimised version was 34 lines long and looked like a straight translation of the C code.
The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.
With C, you should either declare the fibonacci function "inline", or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why with -O3 you get better performance, it activates -finline-functions.
Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, testing myself, I found that declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, made my recursive Fibonacci code faster than with -O2 or -O2 -finline-functions.
One C trick you can try is to disable stack checking (i.e. the built-in code which makes sure the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, forcing stack checking to reallocate the entire stack several times during the actual run.
Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.
This question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i);
    }
}
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think that should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performances in this stuff? (writing actual code of these things can be very long, so, discussing about it would be great)...
Thank you.
You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, and so are averagely expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!
Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <boost/pool/pool_alloc.hpp>

using namespace std;

typedef vector<string, boost::pool_allocator<string> > str_vector;

void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
        sprintf(aux + 2, "%d", a);
        *it++ = aux;
    }
}

int main(int argc, char** argv)
{
    str_vector v(2000);
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i, v.begin(), v.begin() + 2000);
    }
    return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)
Hotspot optimises hot spots in code. Typically, anything that gets executed 10000 times it tries to optimise.
For this code, after a few iterations it will try to optimise the inner loop that adds the strings to the vector. The optimisation will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that the JVM will remove all of the code in the method and turn it into a no-op. To test this, try returning the result from the method. Even then, be careful to do something meaningful with the result: just getting its length, for example, can be optimised away, because Hotspot can see the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I’ve modified the code so that both the Java and the C++ versions print the last element of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
It should help if you use vector::reserve to reserve space for the 2000 elements of v before the loop (the same pre-sizing should also speed up the Java equivalent of this code).
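On the Java side, the closest equivalent of vector::reserve is giving the ArrayList an initial capacity. A sketch (the Presize class and build method are my own names; the 2000-element shape mirrors the question's loop):

```java
import java.util.ArrayList;
import java.util.List;

public class Presize {
    static List<String> build(int x) {
        int a = x * 2000;
        int z = a + 2000;
        // Initial capacity avoids repeated internal array copies while growing
        List<String> v = new ArrayList<>(z - a);
        for (int i = a; i < z; i++) {
            v.add("X-" + i);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> v = build(3);
        System.out.println(v.size() + " " + v.get(0)); // 2000 X-6000
    }
}
```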
To suggest why the performance of the C++ and Java versions differs, it would be essential to see the source for both. I can see a number of performance issues in the C++; for some, it would be useful to know whether you were doing the same in the Java (e.g. flushing the output stream via std::endl: do you call System.out.flush() or just append a '\n'? If the latter, you've just given the Java a distinct advantage).
What are you actually trying to measure here? Putting ints into a vector?
You can start by pre-allocating space in the vector, using the known size of the vector:
instead of:
void testvector(int x)
{
    vector<string> v;
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
        v.push_back(s + to_string(i)); // std::to_string replaces the sprintf step
}
try:
void testvector(int x)
{
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    vector<string> v;
    v.reserve(z - a); // pre-allocate so push_back never has to grow the buffer
    for (int i = a; i < z; i++)
        v.push_back(s + to_string(i));
}
In your inner loop, you are formatting ints into strings and pushing them into a vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the push_back (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.