Stack performance in programming languages - java

Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is essentially the same in all languages; I'll post the Java version:
public class Fib {
    public static int fib(int n) {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(Integer.valueOf(args[0])));
    }
}
Ok so the point is that using this algorithm with input 40 I got these timings:
C: 2.796s
Ocaml: 2.372s
Python: 106.407s
Java: 1.336s
C#(mono): 2.956s
They were taken on an Ubuntu 10.04 box, using the version of each language available in the official repositories, on a dual-core Intel machine.
I know that functional languages like OCaml have some slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time, given that it's the only interpreted language in this test, but I was impressed by the Java running time, which is half that of C for the same algorithm! Would you attribute this to JIT compilation?
How would you explain these results?
EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)
EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)

You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.
Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.
In addition, make sure you're measuring CPU time rather than clock time.
Anything less than that I would not consider a decent statistical sample, and the results may well be subject to noise, rendering them useless.
For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fault that you duplicate a lot of calculations. The iterative version:
int fib(unsigned int n) {
    int t, a, b;
    if (n < 2) return 1;
    a = b = 1;
    while (n-- >= 2) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of 89% and a whopping reduction of 96% from the original code.
And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:
static int fib(unsigned int n) {
    static int lookup[] = {
        1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
        610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
        46368, 75025, 121393, 196418, 317811, 514229, 832040,
        1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
        24157817, 39088169, 63245986, 102334155, 165580141 };
    int t, a, b;
    if (n < sizeof(lookup)/sizeof(*lookup))
        return lookup[n];
    a = lookup[sizeof(lookup)/sizeof(*lookup) - 2];
    b = lookup[sizeof(lookup)/sizeof(*lookup) - 1];
    while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.

You say very little about your configuration (in benchmarking, details are everything: commandlines, computer used, ...)
When I try to reproduce for OCaml I get:
let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))
let () = Format.printf "%d@." (f 40)
$ ocamlopt fib.ml
$ time ./a.out
165580141
real 0m1.643s
This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).

One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.
Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.
You may be able to see a difference by inverting the logic of the if:
public static int fib(int n) {
    if (n >= 2) return fib(n-1) + fib(n-2);
    return 1;
}
It's worth a try, anyway :)
As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)
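For what it's worth, a minimal sketch of that idea (the harness below is illustrative, not from the original test): warm the method up first, then time repeated calls so each measurement spans long enough for HotSpot to have done its work.
public class FibBench {
    static int fib(int n) {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // Warm-up: give HotSpot a chance to profile and compile fib() before timing.
        for (int i = 0; i < 5; i++) fib(35);

        // Timed runs: the outer loop keeps the work going long enough for a fair measurement.
        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            int result = fib(40);
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println(result + " took " + elapsedMs + " ms");
        }
    }
}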

I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...
As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...

Note that if the Java Hotspot VM were smart enough to memoise fib() calls, it could cut down the exponential cost of the algorithm to something nearer to linear cost.
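HotSpot does not in fact memoise arbitrary calls on its own, but an explicitly memoised variant (a sketch for illustration, not part of the original benchmark) shows how dramatic that reduction would be:
import java.util.HashMap;
import java.util.Map;

public class MemoFib {
    private static final Map<Integer, Integer> cache = new HashMap<Integer, Integer>();

    static int fib(int n) {
        if (n < 2) return 1;
        // Reuse earlier results instead of recomputing them exponentially many times.
        Integer cached = cache.get(n);
        if (cached != null) return cached;
        int value = fib(n - 1) + fib(n - 2);
        cache.put(n, value);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(fib(40)); // finishes in microseconds rather than seconds
    }
}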

I wrote a C version of the naive Fibonacci function and compiled it to assembly with gcc (4.3.2, Linux), first unoptimised and then with gcc -O3.
The unoptimised version was 34 lines long and looked like a straight translation of the C code.
The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.

With C, you should either declare the Fibonacci function "inline", or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why you get better performance with -O3: it activates -finline-functions.
Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, testing myself I found that declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, my recursive Fibonacci code was faster than with -O2 or -O2 -finline-functions.

One C trick which you can try is to disable the stack checking (i.e. built-in code which makes sure that the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, which forces the stack-checking to reallocate the entire stack several times during the actual run.
Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.

Related

Why is this JOML (JVM) code so much faster than the equivalent GSL (C)?

I am attempting to optimize a small library for doing arithmetic on vectors.
To roughly check my progress, I decided to benchmark the performance of two popular vector arithmetic libraries written in two different languages, the GNU Scientific Library (GSL, C), and the Java OpenGL Math Library (JOML, JVM). I expected GSL, as a large project written in C and compiled ahead of time, to be significantly faster than JOML, with extra baggage from object management, method calls, and conforming to the Java specifications.
Surprisingly, instead JOML (JVM) ended up being around 4X faster than GSL (C). I wish to understand why this is the case.
The benchmark I performed was to compute 4,000,000 iterations of Leibniz's formula to calculate Pi, in chunks of 4 at a time via 4-dimensioned vectors. The exact algorithm doesn't matter, and doesn't have to make sense. This was just the first and simplest thing I thought of that would let me use multiple vector operations per iteration.
This is the C code in question:
#include <stdio.h>
#include <time.h>
#include <gsl/gsl_vector.h>
#include <unistd.h>
#include <math.h>
#include <string.h>
#define IT 1000000
double pibench_inplace(int it) {
gsl_vector* d = gsl_vector_calloc(4);
gsl_vector* w = gsl_vector_calloc(4);
for (int i=0; i<4; i++) {
gsl_vector_set(d, i, (double)i*2+1);
gsl_vector_set(w, i, (i%2==0) ? 1 : -1);
}
gsl_vector* b = gsl_vector_calloc(4);
double pi = 0.0;
for (int i=0; i<it; i++) {
gsl_vector_memcpy(b, d);
gsl_vector_add_constant(b, (double)i*8);
for (int i=0; i<4; i++) {
gsl_vector_set(b, i, pow(gsl_vector_get(b, i), -1.));
}
gsl_vector_mul(b, w);
pi += gsl_vector_sum(b);
}
return pi*4;
}
double pibench_fast(int it) {
double pi = 0;
int eq_it = it * 4;
for (int i=0; i<eq_it; i++) {
pi += (1 / ((double)i * 2 + 1) * ((i%2==0) ? 1 : -1));
}
return pi*4;
}
int main(int argc, char* argv[]) {
if (argc < 2) {
printf("Please specific a run mode.\n");
return 1;
}
double pi;
struct timespec start = {0,0}, end={0,0};
clock_gettime(CLOCK_MONOTONIC, &start);
if (strcmp(argv[1], "inplace") == 0) {
pi = pibench_inplace(IT);
} else if (strcmp(argv[1], "fast") == 0) {
pi = pibench_fast(IT);
} else {
sleep(1);
printf("Please specific a valid run mode.\n");
}
clock_gettime(CLOCK_MONOTONIC, &end);
printf("Pi: %f\n", pi);
printf("Time: %f\n", ((double)end.tv_sec + 1.0e-9*end.tv_nsec) - ((double)start.tv_sec + 1.0e-9*start.tv_nsec));
return 0;
}
This is how I built and ran the C code:
$ gcc GSL_pi.c -O3 -march=native -static $(gsl-config --cflags --libs) -o GSL_pi && ./GSL_pi inplace
Pi: 3.141592
Time: 0.061561
This is the JVM-platform code in question (written in Kotlin):
package joml_pi
import org.joml.Vector4d
import kotlin.time.measureTimedValue
import kotlin.time.DurationUnit
fun pibench(count: Int=1000000): Double {
val d = Vector4d(1.0, 3.0, 5.0, 7.0)
val w = Vector4d(1.0, -1.0, 1.0, -1.0)
val c = Vector4d(1.0, 1.0, 1.0, 1.0)
val scratchpad = Vector4d()
var pi = 0.0
for (i in 0..count) {
scratchpad.set(i*8.0)
scratchpad.add(d)
c.div(scratchpad, scratchpad)
scratchpad.mul(w)
pi += scratchpad.x + scratchpad.y + scratchpad.z + scratchpad.w
}
return pi * 4.0
}
@kotlin.time.ExperimentalTime
fun <T> benchmark(func: () -> T, name: String="", count: Int=5) {
val times = mutableListOf<Double>()
val results = mutableListOf<T>()
for (i in 0..count) {
val result = measureTimedValue<T>( { func() } )
results.add(result.value)
times.add(result.duration.toDouble(DurationUnit.SECONDS))
}
println(listOf<String>(
"",
name,
"Results:",
results.joinToString(", "),
"Times:",
times.joinToString(", ")
).joinToString("\n"))
}
@kotlin.time.ExperimentalTime
fun main(args: Array<String>) {
benchmark<Double>(::pibench, "pibench")
}
This is how I built and ran the JVM-platform code:
$ kotlinc -classpath joml-1.10.5.jar JOML_pi.kt && kotlin -classpath joml-1.10.5.jar:. joml_pi/JOML_piKt.class
pibench
Results:
3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464
Times:
0.026850784, 0.014998012, 0.013095291, 0.012805373, 0.012977388, 0.012948186
There are multiple possibilities I have considered for why this operation, run in the JVM, is apparently several times faster than the equivalent C code. I do not think any of them is particularly compelling:
I'm doing different iteration counts by an order of magnitude in the two languages. — It's possible I'm grossly misreading the code, but I'm pretty sure this isn't the case.
I've fudged up the algorithm and am doing vastly different things in each case. — Again maybe I've misread it, but I don't think that's happening, and both cases do produce numerically correct results.
The timing mechanism I use for C introduces a lot of overhead. — I also tested simpler and no-op functions. They completed and were measured as expected in much less time.
The JVM code is parallelized across multiple processor cores — With many more iterations, I watched my CPU use over a longer period and it did not exceed one core.
The JVM code makes better use of SIMD/vectorization. — I compiled the C with -O3 and -march=native, statically linking against libraries from Debian packages. In another case I even tried the -floop/-ftree parallelization flags. Either way performance didn't really change.
GSL has extra features that add overhead in this particular test. — I also have another version, with the vector class implemented and used through Cython, that does only the basics (iterating over a pointer), and performs roughly equivalently to GSL (with slightly more overhead, as expected). So that seems to be the limit for native code.
JOML is actually using native code. — The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call so even if it had magical native code that shouldn't help anyway at such a granular level.
The JVM has different specifics for floating point arithmetic. — The JOML class I used accepts and returns "doubles" just as the C code. In any case, having to emulate a specification that deviates from hardware capabilities probably shouldn't improve performance like this.
The exponential reciprocal step in my GSL code is less efficient than the division reciprocal step in my JOML code. — While commenting that out does reduce total execution time by around 25% (to ~0.045s), that still leaves a massive 3X gap with the JVM code (~0.015s).
The only remaining explanation I can think of is that most of the time spent in C is overhead from doing function calls. This would seem consistent with the fact that the implementations in C and Cython perform similarly. Then, the performance advantage of the Java/Kotlin/JVM implementation comes from its JIT being able to optimize away the function calls by effectively inlining everything in the loop. However, given the reputation of JIT compilers as being, at best, only slightly faster than native code under favourable conditions, that still seems like a massive speedup just from having a JIT.
I suppose if that is the case, then a follow-up question would be whether I could realistically or reliably expect these performance characteristics to carry over outside of a synthetic toy benchmark, in applications that may have much more scattered numeric calls rather than a single million-iteration loop.
First, a disclaimer: I am the author of JOML.
Now, you are probably not comparing apples with apples here. GSL is a general purpose linear algebra library supporting many different linear algebra algorithms and data structures.
JOML, on the other hand, is not a general-purpose linear algebra library, but a special-purpose library covering only the use cases of computer graphics, so it contains only very concrete classes for 2-, 3- and 4-dimensional vectors and 2x2, 3x3 and 4x4 (and non-square variant) matrices.
In other words, even if you wanted to allocate a 5-dimensional vector, you couldn't with JOML.
Therefore, all the algorithms and data structures in JOML are explicitly designed on classes with x, y, z and w fields. Without any loops.
So, a 4-dimensional vector add is literally just:
dest.x = this.x + v.x;
dest.y = this.y + v.y;
dest.z = this.z + v.z;
dest.w = this.w + v.w;
And there isn't even any SIMD involved in that, because as of now, there is no JVM JIT that can auto-vectorize over different fields of a class. Thus, a vector add (or multiply; or any lane-wise) operation right now will produce exactly these scalar operations.
Next, you say:
JOML is actually using native code. — The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call so even if it had magical native code that shouldn't help anyway at such a granular level.
JOML itself does not define and use native code via the JNI interface. Of course, the operators and JRE methods that JOML uses internally will get intrinsified to native code, but not via the JNI interface. Rather, all methods (such as Math.fma()) will get intrinsified directly into their machine code equivalents at JIT compilation time.
Now, as pointed out by others in the comments to your question: You are using a linked library (as opposed to a headers-only library like GLM, which would probably be a better fit for your C/C++ code). So, a C/C++ compiler probably won't be able to "see through" your call site to the callee and apply optimizations there based on the static information that it has at the call site (like you calling gsl_vector_calloc with the argument 4). So, every runtime check/branch on the argument that GSL needs to do will still have to happen at runtime. This would be quite different when using a headers-only library (like GLM), where any half-decent C/C++ compiler will for sure optimize all of that away based on the static knowledge of your calls/code. And I would assume that, yes, an equivalent C/C++ program would beat a Java/Scala/Kotlin/JVM program in speed.
So, your comparison of GSL and JOML is somewhat like comparing the performance of Microsoft Excel evaluating a cell with content = 1 + 2 with writing C code that effectively outputs printf("%f\n", 1.0 + 2.0);. The former (Microsoft Excel, here being GSL) is much more general and versatile while the latter (JOML) is highly specialized.
It just so happens that the specialization fits to your exact use-case right now, making it even possible to use JOML for that.

General concept - Java code and clock cycles

I am just curious how we can know how many clock cycles the CPU needs by looking through certain Java code.
ex:
public class Factorial
{
    public static void main(String[] args)
    {
        final int NUM_FACTS = 100;
        for (int i = 0; i < NUM_FACTS; i++)
            System.out.println(i + "! is " + factorial(i));
    }

    public static int factorial(int n)
    {
        int result = 1;
        for (int i = 2; i <= n; i++)
            result *= i;
        return result;
    }
}
I am just curious how we can know how many clock cycles the CPU needs by looking through certain Java code.
If you are talking about real hardware clock cycles, the answer is "You can't know"1.
The reason that it is so hard is that a program goes through a number of complicated (and largely opaque) transformations before and during execution:
The source code is compiled to bytecodes ahead of time. This depends on the bytecode compiler used.
The bytecodes are JIT compiled to native code, at some time during the execution. This depends on the JIT compiler in the execution platform AND on the execution behavior of the application.
The number of clock cycles taken to execute a given instruction sequence depends on native code, the CPU model including such things as memory cache sizes and ... the application's memory access patterns.
On top of that, the JVM has various sources of "under the hood" non-determinism, and various launch-time tuning parameters that influence the behavior ... and cycle counts.
But fortunately, there are practical ways to examine software performance that don't depend on measuring hardware clock cycles. You can:
measure or estimate native instructions executed,
measure or estimate bytecodes executed,
estimate Java-level operations or statements executed, or
run the code and measure the time taken.
The last two are generally the most practical.
1 - ... except by running the application / JVM on an accurate hardware-level simulator for your exact hardware configuration and getting the simulator to count the clock cycles. And to be honest, I don't know if simulators that operate to that level actually exist. If they do, they are most likely proprietary to Intel, AMD and so on.
I don't think you'd be able to know the clock cycles.
But you could measure the CPU time it took to run the code.
You'd need to use the java.lang.management API.
Take a look here:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/ThreadMXBean.html
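A minimal sketch of using that API (assuming the JVM supports per-thread CPU time measurement):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("Thread CPU time is not supported on this JVM");
            return;
        }
        long start = bean.getCurrentThreadCpuTime(); // nanoseconds of CPU time, not wall time
        long result = 1;
        for (int i = 2; i <= 20; i++) result *= i;   // some work to measure
        long cpuNanos = bean.getCurrentThreadCpuTime() - start;
        System.out.println("20! = " + result + ", CPU time: " + cpuNanos + " ns");
    }
}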

When comparing Java with C++ for speed should I compile the C++ code with -O3 or -O2?

I am writing a variety of equivalent programs in Java and C++ to compare the two languages for speed. Those programs employ heavy mathematical computations in a loop.
Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.
Which g++ compiler optimization should I use to reach a conclusion about my comparisons?
I know this is not as simple to conclude as it sounds, but I would like to have some insights about latency/speed comparisons between Java and C++.
Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.
-O3 will certainly beat -O2 in microbenchmarks but when you benchmark a more realistic application (such as a FIX engine) you will see that -O2 beats -O3 in terms of performance.
As far as I know, -O3 does a very good job compiling small and mathematical pieces of code, but for more realistic and larger applications it can actually be slower than -O2. By trying to aggressively optimize everything (i.e. inlining, vectorization, etc.), the compiler will produce huge binaries leading to cpu cache misses (i.e. especially instruction cache misses). That's one of the reasons the Hotspot JIT chooses not to optimize big methods and/or non-hot methods.
One important thing to notice is that JIT uses methods as independent units eligible for optimization. In your previous questions, you have the following code:
int iterations = stoi(argv[1]);
int load = stoi(argv[2]);
long long x = 0;
for (int i = 0; i < iterations; i++) {
    long start = get_nano_ts(); // START clock
    for (int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }
    long end = get_nano_ts(); // STOP clock
    // (omitted for clarity)
}
cout << "My result: " << x << endl;
But this code is JIT-unfriendly because the hot block of code is not in its own method. For major JIT gains, you should have placed the block of code inside the loop in its own method. Your method executes a hot block of code instead of a hot method. The method that contains the for loop is probably called only once, so the JIT will not do anything about it.
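A sketch of that idea in Java (illustrative names; the point is only the method boundary): the hot inner loop lives in its own small method, so the JIT can treat it as an independent, aggressively optimisable unit.
public class HotMethodDemo {
    // The hot block is its own small method, so HotSpot can profile it,
    // compile it and optimise it independently of the (cold) outer loop.
    private static long innerLoad(int i, int load) {
        long x = 0;
        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
        return x;
    }

    public static void main(String[] args) {
        int iterations = Integer.parseInt(args[0]);
        int load = Integer.parseInt(args[1]);
        long x = 0;
        for (int i = 0; i < iterations; i++) {
            x += innerLoad(i, load); // the outer loop may stay interpreted; the hot method will not
        }
        System.out.println("My result: " + x);
    }
}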
When comparing Java with C++ for speed should I compile the C++ code with -O3 or -O2?
Well, if you use -O3 for microbenchmarks you will get amazing fast results that will be unrealistic for larger and more complex applications. That's why I think the judges use -O2 instead of -O3. For example, our garbage-free Java FIX engine is faster than C++ FIX engines and I have no idea if they are compiling with -O0, -O1, -O2, -O3 or a mix of them through executable linking.
In theory it is possible for a person to selectively compartmentalize an entire C++ application into executable pieces, choose which ones are going to be compiled with -O2 and which ones are going to be compiled with -O3, and then link everything into an ideal binary executable. But in reality, how feasible is that?
The approach the Hotspot chooses is much simpler. It says:
Listen, I am going to consider each method as an independent unit of execution instead of any block of code anywhere. If that method is hot enough (i.e. called often) and small enough I will try to aggressively optimize it.
That of course has the drawback of requiring code warmup but it is much simpler and produces the best results most of the time for realistic/large/complex applications.
And last but not least, you should probably consider this question if you want to compile your entire application with -O3: When can I confidently compile program with -O3?
If possible, compare it against both, since -O2 and -O3 are both options available to the C++ developer. Sometimes -O2 will win. Sometimes -O3 will win. If you have both available, that's just more information which can be used to support whatever you're trying to accomplish by doing these speed comparisons.

C++ and Java performance

this question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i);
    }
}
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think should be modified in the program above (or, I dunno, in the libraries used or in the memory allocator) to reach similar performance for this kind of thing? (Writing actual code for these things could take very long, so discussing it would be great)...
Thank you.
You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, and so every block is fairly expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!
Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <boost/pool/pool_alloc.hpp>

using namespace std;

typedef vector<string, boost::pool_allocator<string> > str_vector;

void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
        sprintf(aux + 2, "%d", a);
        *it++ = aux;
    }
}

int main(int argc, char** argv)
{
    str_vector v(2000);
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i, v.begin(), v.begin() + 2000);
    }
    return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)
HotSpot optimises hot spots in code. Typically, it tries to optimise anything that gets executed around 10,000 times.
For this code, after 5 iterations of the outer loop (10,000 executions of the inner loop body) it will try to optimise the inner loop that adds the strings to the vector. That optimisation will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that HotSpot will remove all of the code in the method and turn it into a no-op. To test this, try returning the result from the method. Even then, be careful to do something meaningful with the result - just getting its length, for example, can be optimised away, as HotSpot can see that the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
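A sketch of the suggestion above, applied to the Java side of the comparison (names are illustrative): return the list from the method and fold something call-dependent into a checksum, so the JIT cannot prove the work is dead.
import java.util.ArrayList;
import java.util.List;

public class TestVector {
    static List<String> testvector(int x) {
        List<String> v = new ArrayList<String>();
        int a = x * 2000;
        int z = a + 2000;
        for (int i = a; i < z; i++) {
            v.add("X-" + i);
        }
        return v; // returning the list keeps it from being a purely local, removable allocation
    }

    public static void main(String[] args) {
        long checksum = 0;
        for (int i = 0; i < 10000; i++) {
            List<String> v = testvector(i);
            // The last element differs on every call, so this cannot be folded to a constant.
            checksum += v.get(v.size() - 1).hashCode();
        }
        System.out.println("checksum = " + checksum);
    }
}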
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I've modified the code so that both the Java and the C++ versions print the last element of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
It should help if you use vector::reserve to reserve space for the 2000 elements added per call in v before the loop (the same thing should also speed up the Java equivalent of this code, as sketched below).
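For the Java side, the analogous change (a sketch, assuming the port uses ArrayList) is to construct the list with an initial capacity so it never has to grow during the loop:
import java.util.ArrayList;
import java.util.List;

class PreSizedList {
    static List<String> testvector(int x) {
        // Pre-size for the 2000 elements appended per call - the Java analogue
        // of calling reserve() on the C++ vector.
        List<String> v = new ArrayList<String>(2000);
        int a = x * 2000;
        for (int i = a; i < a + 2000; i++) {
            v.add("X-" + i);
        }
        return v;
    }
}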
To explain why the performance of the C++ and Java versions differs, it would be essential to see the source for both. I can see a number of performance issues in the C++, and for some it would be useful to know whether you are doing the same in the Java (e.g. you flush the output stream via std::endl; do you call System.out.flush() or just append a '\n'? If the latter, you've just given the Java version a distinct advantage).
What are you actually trying to measure here? Putting ints into a vector?
You can start by pre-allocating space in the vector, since its final size is known:
instead of:
void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
try:
void testvector(int x)
{
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    vector<string> v;
    v.reserve(z - a);   // pre-allocate the 2000 slots so push_back never reallocates
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
In your inner loop, you are pushing ints into a string vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the pushback (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.

How can I code Java to allow SSE use and bounds-check elimination (or other advanced optimizations)?

The Situation:
I'm optimizing a pure-java implementation of the LZF compression algorithm, which involves a lot of byte[] access and basic int mathematics for hashing and comparison. Performance really matters, because the goal of the compression is to reduce I/O requirements. I am not posting code because it isn't cleaned up yet, and may be restructured heavily.
The Questions:
How can I write my code to allow it to JIT-compile to a form using faster SSE operations?
How can I structure it so the compiler can easily eliminate array bounds checks?
Are there any broad references about the relative speed of specific math operations (how many increments/decrements does it take to equal a normal add/subtract, how fast is shift-or vs. an array access)?
How can I work on optimizing branching -- is it better to have numerous conditional statements with short bodies, or a few long ones, or short ones with nested conditions?
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
What I've already done:
Before I get attacked for premature optimization: the basic algorithm is already excellent, but the Java implementation is less than 2/3 the speed of equivalent C. I've already replaced copying loops with System.arraycopy, worked on optimizing loops and eliminated un-needed operations.
I make heavy use of bit twiddling and packing bytes into ints for performance, as well as shifting & masking.
For legal reasons, I can't look at implementations in similar libraries, and existing libraries have too restrictive license terms to use.
Requirements for a GOOD (accepted) answer:
Unacceptable answers: "this is faster" without an explanation of how much AND why, OR hasn't been tested with a JIT compiler.
Borderline answers: have not been tested with anything before Hotspot 1.4
Basic answers: will provide a general rule and explanation of why it is faster at the compiler level, and roughly how much faster
Good answers: include a couple of samples of code to demonstrate
Excellent answers: have benchmarks with both JRE 1.5 and 1.6
PERFECT answer: Is by someone who worked on the HotSpot compiler, and can fully explain or reference the conditions for an optimization to be used, and how much faster it typically is. Might include java code and sample assembly code generated by HotSpot.
Also: if anyone has links detailing the guts of Hotspot optimization and branching performance, those are welcome. I know enough about bytecode that a site analyzing performance at a bytecode rather than sourcecode level would be helpful.
(Edit) Partial Answer: Bounds-Check Elimination:
This is taken from supplied link to the HotSpot internals wiki at: https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
HotSpot will eliminate bounds checks in all for loops with the following conditions:
Array is loop invariant (not reallocated within the loop)
Index variable has a constant stride (increases/decreases by constant amount, in only one spot if possible)
Array is indexed by a linear function of the variable.
Example: int val = array[index*2 + 5]
OR: int val = array[index+9]
NOT: int val = array[Math.min(var,index)+7]
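A short sketch of loop shapes that do and do not meet those conditions (illustrative only; whether the checks are actually dropped depends on the HotSpot version):
class BoundsCheckShapes {
    // Eligible: 'data' is loop invariant, 'i' has a constant stride,
    // and the index is a linear function of 'i'.
    static int sumEven(int[] data) {
        int sum = 0;
        for (int i = 0; i < data.length / 2; i++) {
            sum += data[i * 2]; // linear index: the bounds check can be hoisted out of the loop
        }
        return sum;
    }

    // Not eligible: the index is not a linear function of the loop variable,
    // so a bounds check remains on every access.
    static int sumClamped(int[] data, int var) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[Math.min(var, i)];
        }
        return sum;
    }
}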
Early version of code:
This is a sample version. Do not steal it, because it is an unreleased version of code for the H2 database project. The final version will be open source. This is an optimization upon the code here: H2 CompressLZF code
Logically, this is identical to the development version, but that one uses a for(...) loop to step through the input, and an if/else block for the different logic between literal and backreference modes. It reduces array accesses and checks between modes.
public int compressNewer(final byte[] in, final int inLen, final byte[] out, int outPos){
int inPos = 0;
// initialize the hash table
if (cachedHashTable == null) {
cachedHashTable = new int[HASH_SIZE];
} else {
System.arraycopy(EMPTY, 0, cachedHashTable, 0, HASH_SIZE);
}
int[] hashTab = cachedHashTable;
// number of literals in current run
int literals = 0;
int future = first(in, inPos);
final int endPos = inLen-4;
// Loop through data until all of it has been compressed
while (inPos < endPos) {
future = (future << 8) | in[inPos+2] & 255;
// hash = next(hash,in,inPos);
int off = hash(future);
// ref = possible index of matching group in data
int ref = hashTab[off];
hashTab[off] = inPos;
off = inPos - ref - 1; //dropped for speed
// has match if bytes at ref match bytes in future, etc
// note: using ref++ rather than ref+1, ref+2, etc is about 15% faster
boolean hasMatch = (ref > 0 && off <= MAX_OFF && (in[ref++] == (byte) (future >> 16) && in[ref++] == (byte)(future >> 8) && in[ref] == (byte)future));
ref -=2; // ...EVEN when I have to recover it
// write out literals, if max literals reached, OR has a match
if ((hasMatch && literals != 0) || (literals == MAX_LITERAL)) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos - literals, out, outPos, literals);
outPos += literals;
literals = 0;
}
//literal copying split because this improved performance by 5%
if (hasMatch) { // grow match as much as possible
int maxLen = inLen - inPos - 2;
maxLen = maxLen > MAX_REF ? MAX_REF : maxLen;
int len = 3;
// grow match length as possible...
while (len < maxLen && in[ref + len] == in[inPos + len]) {
len++;
}
len -= 2;
// short matches write length to first byte, longer write to 2nd too
if (len < 7) {
out[outPos++] = (byte) ((off >> 8) + (len << 5));
} else {
out[outPos++] = (byte) ((off >> 8) + (7 << 5));
out[outPos++] = (byte) (len - 7);
}
out[outPos++] = (byte) off;
inPos += len;
//OPTIMIZATION: don't store hashtable entry for last byte of match and next byte
// rebuild neighborhood for hashing, but don't store location for this 3-byte group
// improves compress performance by ~10% or more, sacrificing ~2% compression...
future = ((in[inPos+1] & 255) << 16) | ((in[inPos + 2] & 255) << 8) | (in[inPos + 3] & 255);
inPos += 2;
} else { //grow literals
literals++;
inPos++;
}
}
// write out remaining literals
literals += inLen-inPos;
inPos = inLen-literals;
if(literals >= MAX_LITERAL){
out[outPos++] = (byte)(MAX_LITERAL-1);
System.arraycopy(in, inPos, out, outPos, MAX_LITERAL);
outPos += MAX_LITERAL;
inPos += MAX_LITERAL;
literals -= MAX_LITERAL;
}
if (literals != 0) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos, out, outPos, literals);
outPos += literals;
}
return outPos;
}
Final edit:
I've marked the best answer so far as accepted, since the deadline is nearly up. Since I took so long before deciding to post code, I will continue to upvote and respond to comments where possible. Apologies if the code is messy: this represented code in development, not polished up for committing.
Not a full answer; I simply don't have time to do the detailed benchmarks your question needs, but hopefully this is useful.
Know your enemy
You are targeting a combination of the JVM (in essence the JIT) and the underlying CPU/Memory subsystem. Thus "This is faster on JVM X" is not likely to be valid in all cases as you move into more aggressive optimisations.
If your target market/application will largely run on a particular architecture you should consider investing in tools specific to it.
* If your performance on x86 is the critical factor then intel's VTune is excellent for drilling down into the sort of jit output analysis you describe.
* The differences between 64 bit and 32 bit JITs can be considerable, especially on x86 platforms where calling conventions can change and enregistering opportunities are very different.
Get the right tools
You would likely want to get a sampling profiler. The overhead of instrumentation (and the associated knock-on effects on things like inlining, cache pollution and code size inflation) for your specific needs would be far too great. The Intel VTune analyser can actually be used for Java, though the integration is not as tight as others.
If you are using the sun JVM and are happy only knowing what the latest/greatest version is doing then the options available to investigate the output of the JIT are considerable if you know a bit of assembly.
This article details some interesting analysis using this functionality
Learn from other implementations
The change history indicates that previous inline assembly was in fact counterproductive and that allowing the compiler to take total control of the output (with tweaks in code rather than directives in assembly) yielded better results.
Some specifics
Since LZF, in an efficient unmanaged implementation on modern desktop CPUs, is largely memory-bandwidth limited (hence it being compared to the speed of an unoptimised memcpy), you will need your code to remain entirely within the level 1 cache.
As such any static fields you cannot make into constants should be placed within the same class as these values will often be placed within the same area of memory devoted to the vtables and meta data associated with classes.
Object allocations which cannot be trapped by Escape Analysis (only in 1.6 onwards) will need to be avoided.
The C code makes aggressive use of loop unrolling. However, the performance of this on older (1.4-era) VMs is heavily dependent on the mode the JVM is in. Apparently later Sun JVM versions are more aggressive at inlining and unrolling, especially in server mode.
The prefetch instructions generated by the JIT can make all the difference on code like this, which is nearly memory bound.
"It's coming straight for us"
Your target is moving, and will continue to. Again, from Marc Lehmann's previous experience:
default HLOG size is now 15 (cpu caches have increased)
Even minor updates to the jvm can involve significant compiler changes
6544668 Don't vectorize array operations that can't be aligned at runtime.
6536652 Implement some superword (SIMD) optimizations
6531696 don't use immediate 16-bits value store to memory on Intel cpus
6468290 Divide and allocate out of eden on a per cpu basis
Captain Obvious
Measure, measure, measure. If you can get your library to include (in a separate dll) a simple and easy-to-execute benchmark that logs the relevant information (VM version, CPU, OS, command-line switches, etc.) and makes this simple to send back to you, you will increase your coverage; best of all, you'll cover those people using it who care.
As far as bounds check elimination is concerned, I believe the new JDK will already include an improved algorithm that eliminates it, whenever it's possible. These are the two main papers on this subject:
V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002
Effective Enhancement of Loop Versioning in Java. Link. This paper is from the guys at Excelsior, who implemented the technique in their Jet JVM.
Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ. Link. Slightly based on the above paper, this is the implementation that I believe will be included in the next JDK. The achieved speedups are also presented.
There is also this blog entry, which discusses one of the papers superficially, and also presents some benchmarking results, not only for arrays but also for arithmetic in the new JDK. The comments of the blog entry are also very interesting, since the authors of the above papers present some very interesting comments and discuss arguments. Also, there are some pointers to other similar blog posts on this subject.
Hope it helps.
It's rather unlikely that you need to help the JIT compiler much with optimizing a straightforward number crunching algorithm like LZF. ShuggyCoUk mentioned this, but I think it deserves extra attention:
The cache-friendliness of your code will be a big factor.
You have to reduce the size of your working set and improve data access locality as much as possible. You mention "packing bytes into ints for performance". This sounds like using ints to hold byte values in order to have them word-aligned. Don't do that! The increased data set size will outweigh any gains (I once converted some ECC number crunching code from int[] to byte[] and got a 2x speed-up).
On the off chance that you don't know this: if you need to treat some data as both bytes and ints, you don't have to shift and |-mask it - use ByteBuffer.asIntBuffer() and related methods.
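A small sketch of that approach: wrap the byte[] once and read 32-bit values through an IntBuffer view instead of shifting and masking by hand (the buffer must hold at least four bytes per int read):
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

class IntView {
    static int wordAt(byte[] data, int wordIndex) {
        // Views the same backing bytes as big-endian ints; no copying, no manual shift/mask.
        IntBuffer ints = ByteBuffer.wrap(data).asIntBuffer();
        return ints.get(wordIndex);
    }
}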
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
Better do the benchmark yourself. When I did it way back when in Java 1.3 times, it was somewhere around 2000 elements.
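A deliberately naive sketch of such a benchmark (for trustworthy numbers you would add warm-up runs and repeat the measurement, as discussed elsewhere on this page):
class CopyBench {
    public static void main(String[] args) {
        int size = Integer.parseInt(args[0]); // element count to test, e.g. 16, 256, 4096
        byte[] src = new byte[size];
        byte[] dst = new byte[size];

        long t0 = System.nanoTime();
        for (int rep = 0; rep < 100000; rep++) {
            for (int i = 0; i < size; i++) dst[i] = src[i]; // manual copy loop
        }
        long loopNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int rep = 0; rep < 100000; rep++) {
            System.arraycopy(src, 0, dst, 0, size);          // intrinsified copy
        }
        long arraycopyNanos = System.nanoTime() - t1;

        System.out.println("loop: " + loopNanos + " ns, arraycopy: " + arraycopyNanos + " ns");
    }
}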
Lots of answers so far, but couple of additional things:
Measure, measure, measure. As much as most Java developers warn against micro-benchmarking, make sure you compare performance between changes. Optimizations that do not result in measurable improvements are generally not worth keeping (of course, sometimes it's combination of things, and that gets trickier)
Tight loops matter as much with Java as with C (and ditto with variable allocations -- although you don't directly control it, HotSpot will eventually have to do it). I managed to practically double the speed of UTF-8 decoding by rearranging the tight loop so that the single-byte case (7-bit ASCII) is handled in a tight(er) inner loop, leaving the other cases out of it.
Do not underestimate cost of allocating and/or clearing large arrays -- if you want lzf encoding/decoding to be faster for small/medium chunks too (not just megabyte sized), keep in mind that ALL allocations of byte[]/int[] are somewhat costly; not because of GC, but because JVM MUST clear the space.
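A sketch of the usual workaround (names are illustrative): keep a scratch buffer between calls and reallocate - and therefore pay the zeroing cost - only when it has to grow.
class BufferReuse {
    private byte[] buffer = new byte[0];

    // Returns a buffer of at least 'needed' bytes, allocating (and implicitly zeroing)
    // a new one only when the existing buffer is too small.
    byte[] scratch(int needed) {
        if (buffer.length < needed) {
            buffer = new byte[needed];
        }
        return buffer;
    }
}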
H2's implementation has also been optimized quite a bit (for example: it does not clear the hash array any more, which often makes sense); and I actually helped modify it for use in another Java project. My contribution was mostly just changing it to be more optimal for the non-streaming case, but that did not touch the tight encode/decode loops.
