JamVM on Motorola FX9500 Problems - what should I do? - java

I am using a Motorola FX9500 RFID reader, which runs Linux with JamVM on it (I can only deploy applications to it - I cannot change the Java VM or anything, so my options are limited). Here's what I see when I check the version:
[cliuser@FX9500D96335 ~]$ /usr/bin/jamvm -version
java version "1.5.0"
JamVM version 1.5.4
Copyright (C) 2003-2010 Robert Lougher <rob@jamvm.org.uk>
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2,
or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Build information:
Execution Engine: inline-threaded interpreter with stack-caching
Compiled with: gcc 4.2.2
Boot Library Path: /usr/lib/classpath
Boot Class Path: /usr/local/jamvm/share/jamvm/classes.zip:/usr/share/classpath/glibj.zip
I need to write an application, so I grabbed the Oracle Java SDK 1.5.0 and installed it onto my Windows 7 PC; it reports this version:
C:\>javac -version
javac 1.5.0
Am I being too idealistic in considering that an application I compile with that compiler would work correctly on the aforementioned JamVM? Anyway, pressing on in ignorance I write this little application:
public final class TestApp {
    public static void main(final String[] args) {
        long p = Long.MIN_VALUE;
        int o = (int)(-(p + 10) % 10);
        System.out.println(o);
    }
}
Compile it with the aforementioned javac compiler and run it on the PC like so:
C:\>javac TestApp.java
C:\>java TestApp
8
All fine there. Life is good, so I take that .class file and place it on the FX9500 and run it like so:
[cliuser@FX9500D96335 ~]$ /usr/bin/jamvm TestApp
-2
Eek, what the... As you can see, it returns a different result.
So why, and who's wrong? Or is the specification unclear about how this calculation should be handled (surely not)? Could it be that I need to compile it with a different compiler?
Why Do I Care About This?
The reason I came to this situation is that a calculation exactly like that happens inside java.lang.Long.toString, and I have a bug in my real application where I am logging out a long and getting a java.lang.ArrayIndexOutOfBoundsException, because the value I want to log may very well be at the extremes of a long.
I think I can work around it by checking for Long.MIN_VALUE and Long.MAX_VALUE and logging "Err, I can't tell you the number but it is really Long.XXX, believe me, would I lie to you?". But having found this, I feel like my application is built on a sandy foundation, and it needs to be really robust. I am seriously considering just saying that JamVM is not up to the job and writing the application in Python (since the reader also has a Python runtime).
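For what it's worth, the workaround I have in mind looks roughly like this (just a sketch; the method name is mine):

private static String safeLongString(long value) {
    // Dodge Long.toString for the extreme values that trigger the exception
    if (value == Long.MIN_VALUE) return "Long.MIN_VALUE";
    if (value == Long.MAX_VALUE) return "Long.MAX_VALUE";
    return Long.toString(value);
}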
I'm kind of hoping that someone tells me I'm a dullard and I should have compiled it on my Windows PC like .... and then it would work, so please tell me that (if it is true, of course)!
Update
Noofiz got me thinking (thanks) and I knocked up this additional test application:
public final class TestApp2 {
    public static void main(final String[] args) {
        long p = Long.MIN_VALUE + 10;
        if (p != -9223372036854775798L) {
            System.out.println("O....M.....G");
            return;
        }
        p = -p;
        if (p != 9223372036854775798L) {
            System.out.println("W....T.....F");
            return;
        }
        int o = (int)(p % 10);
        if (o != 8) {
            System.out.println("EEEEEK");
            return;
        }
        System.out.println("Phew, that was a close one");
    }
}
I, again, compile on the Windows machine and run it.
It prints Phew, that was a close one
I copy the .class file to the contraption in question and run it.
It prints...
...wait for it...
W....T.....F
Oh dear. I feel a bit woozy, I think I need a cup of tea...
Update 2
One other thing I tried, which did not make any difference, was to copy the classes.zip and glibj.zip files off the FX9500 to the PC and then do a cross-compile like so (that must mean the compiled file should be fine, right?):
javac -source 1.4 -target 1.4 -bootclasspath classes.zip;glibj.zip -extdirs "" TestApp2.java
But the resulting .class file, when run on the reader prints the same message.

I wrote JamVM. As you would probably guess, such errors would have been noticed by now, and JamVM wouldn't pass even the simplest of test suites with them (GNU Classpath has its own called Mauve, and OpenJDK has jtreg). I regularly run on ARM (the FX9500 uses a PXA270 ARM) and x86-64, but various platforms get tested as part of IcedTea.
So I haven't much of a clue as to what's happened here. I would guess it only affects Java longs as these are used infrequently and so most programs work. JamVM maps Java longs to C long longs, so my guess would be that the compiler used to build JamVM is producing incorrect code for long long handling on the 32-bit ARM.
Unfortunately there's not much you can do (apart from avoid longs) if you can't replace the JVM. The only thing you can do is try and turn the JIT off (a simple code-copying JIT, aka inline-threading). To do this use -Xnoinlining on the command line, e.g.:
jamvm -Xnoinlining ...

The problem is in different modulus implementations:
public static long mod(long a, long b){
    long result = a % b;
    if (result < 0)
    {
        result += b;
    }
    return result;
}
This code returns -2, while this:
public static long mod2(long a, long b){
    long result = a % b;
    if (result > 0 && a < 0)
    {
        result -= b;
    }
    return result;
}
returns 8. Why JamVM is doing it this way is beyond my understanding.
From JLS:
15.17.3. Remainder Operator %
The remainder operation for operands that are integers after binary
numeric promotion (§5.6.2) produces a result value such that
(a/b)*b+(a%b) is equal to a.
According to this, JamVM breaks the language specification. Very bad.
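A quick way to check that identity for the value from the question is something along these lines (my own small test, in the same spirit as TestApp2):

public final class RemainderCheck {
    public static void main(final String[] args) {
        long p = 9223372036854775798L; // the value of -(Long.MIN_VALUE + 10)
        long b = 10L;
        // JLS 15.17.3: (p / b) * b + (p % b) must equal p
        System.out.println((p / b) * b + (p % b) == p); // a conforming JVM prints true
        System.out.println(p % b);                      // a conforming JVM prints 8
    }
}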

I would have commented, but for some reason, that requires reputation.
Long negation doesn't work on this device. I don't understand its exact nature, but if you apply two unary minuses you do get back to where you started, e.g. x = 10; -x == 4294967286; -(-x) == 10. 4294967286 is very close to Integer.MAX_VALUE * 2 (2147483647 * 2 = 4294967294); in fact, it is exactly 2^32 - 10.
It seems to be isolated to this one operation and doesn't affect longs in a more fundamental way. It's simple to avoid the operation in your own code, and with some dextrous abuse of the bootclasspath you can avoid the calls in the GNU Classpath code, replacing them with multiplications by -1. (If you need to start your application from the device GUI, you can include the -Xbootclasspath=... switch in the args parameter so it gets passed to JamVM.)
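For illustration, the replacement amounts to something like this (the helper name is mine; it assumes, as described above, that long multiplication is unaffected):

static long negate(long value) {
    // Multiply instead of using unary minus, which is the operation broken for longs on this VM
    return value * -1L;
}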
The bug is actually already fixed in later (than the latest release) JamVM code:
* https://github.com/ansoncat/jamvm/commit/736c2cb76baf1fedddc1eda5825908f5a0511373
* https://github.com/ansoncat/jamvm/commit/ac83bdc886ac4f6e60d684de1b4d0a5e90f1c489
though that doesn't get the fixed version onto the device. Rob Lougher has mentioned this issue as a reason for releasing a new version of JamVM, though I don't know when that would be, or whether Motorola could be convinced to update their firmware.
The FX9500 is actually a repackaged Sirit IN610, meaning that both devices share this bug. Sirit are way friendlier than Motorola and are providing a firmware upgrade, to be available in the near future. I hope that Motorola will also include the fix, though I don't know the details of the arrangement between the two parties.
Either way, we have a very big application running on the FX9500, and the long negation operation hasn't proved to be an impassable barrier.
Good luck, Dan.

Related

_XReply() terminates app with _XIOError()

We're developing a complex application that consists of a Linux binary integrated with Java JNI calls (from a JVM created in the Linux binary) into our custom-made .jar file. All GUI work is implemented and done by the Java part. Each time some GUI property has to be changed or the GUI has to be repainted, it is done by a JNI call into the JVM.
The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. It is done iteratively and frequently, a few hundred or thousand iterations per second.
After a fairly exact time, the application is terminated with exit(1), which I caught with gdb and found to be called from _XIOError(). This termination can be reproduced after a more or less exact time period, e.g. after some 15 h on an x86 dual core 2.5 GHz. If I use a slower computer, it takes longer, as if it were proportional to CPU/GPU speed. One conclusion would be that some part of xorg runs out of some resource or something like that.
Here is my backtrace:
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2 0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3 0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4 0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5 0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6 0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7 0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8 0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9 0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so
I made my own exit() call in liboverrides.so and used it with LD_PRELOAD to capture the exit() call in gdb with the help of abort()/SIGABRT.
After some debugging of libX11 and libxcb, I noticed that _XReply() got a NULL reply (the response from xcb_wait_for_reply()), which causes the call to _XIOError() and exit(1). Digging deeper into libxcb's xcb_wait_for_reply() function, I noticed that one of the reasons it can return a NULL reply is when it detects a broken or closed socket connection, which could be my situation.
For test purposes, if I change xcb_io.c and ignore the _XIOError(), the application doesn't work any more. And if I repeat the request inside _XReply(), it fails each time, i.e. it gets a NULL response on each xcb_wait_for_reply().
So, my question would be why such an uncontrolled app termination with exit(1) via _XReply() -> _XIOError() -> exit(1) happens, or how I can find out the reason for it, so I can fix it or work around it.
For this problem to repeat, as I wrote above, I have to wait some 15 h, but currently I'm very short on time for debugging and can't find the cause of the problem/termination.
We also tried to reorganise the Java part that handles the GUI/display refresh, but the problem wasn't solved.
Some SW facts:
- java jre 1.8.0_20, even with java 7 can repeat the problem
- libX11.so 1.5.0
- libxcb.so 1.8.1
- debian wheezy
- kernel 3.2.0
This is likely a known issue in libX11 regarding the handling of request numbers used for xcb_wait_for_reply.
At some point after libxcb v1.5, code was introduced to use 64-bit sequence numbers internally everywhere, and logic was added to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.
Here is a quote from the submitted libxcb bug report (actual email addresses removed):
We have an application that does a lot of XDrawString and XDrawLine.
After several hours the application is exited by an XIOError.
The XIOError is called in libX11 in the file xcb_io.c, function
_XReply. It didn't get a response from xcb_wait_for_reply.
libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to
this commit:
commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date: Sat Oct 9 17:13:45 2010 -0700
xcb_in: Use 64-bit sequence numbers internally everywhere.
Widen sequence numbers on entry to those public APIs that still take
32-bit sequence numbers.
Signed-off-by: Jamey Sharp <jamey@xxxxxx.xxx>
Reverting it on top of 1.8.1 helps.
Adding traces to libxcb I found that the last request numbers used for
xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in
the while loop of the _XReply function), half a second later: 63215
(then XIOError is called). The widen_request is also 63215, I would
have expected 63215+2^32. Therefore it seems that the request is not
correctly widened.
The commit above also changed the compares in poll_for_reply from
XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening
never worked correctly, but it was never observed, because only the
lower 32bits were compared.
Reproducing the issue
Here's the original code snippet from the submitted bug report which was used to reproduce the issue:
for(;;) {
XDrawLine(dpy, w, gc, 10, 60, 180, 20);
XFlush(dpy);
}
and apparently the issue can be reproduced with even simpler code:
for(;;) {
XNoOp(dpy);
}
According to the submitted libxcb bug report, these conditions are needed to reproduce (assuming the reproduction code is in xdraw.c):
libxcb >= 1.8 (i.e. includes the commit ed37b08)
compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
the sequence counter wraps.
Proposed patch
The proposed patch which can be applied on top of libxcb 1.8.1 is this:
diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
static const xReq dummy_request;
static char const pad[3];
struct iovec vec[3];
- uint64_t requests;
+ unsigned long requests;
_XExtension *ext;
xcb_connection_t *c = dpy->xcb->connection;
if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
{
uint64_t sequence;
- for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+ for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
append_pending_request(dpy, sequence);
}
requests = dpy->request - dpy->xcb->last_flushed;
Detailed technical explanation
Please find below the detailed technical explanation by Jonas Petersen (also included in the aforementioned bug report):
Hi,
Here's two patches. The first one fixes a 32-bit sequence wrap bug.
The second patch only adds a comment to another relevant statement.
The patches contain some details. Here is the whole story for who
might be interested:
Xlib (libx11) will crash an application with a "Fatal IO error 11
(Resource temporarily unavailable)" after 4 294 967 296 requests to
the server. That is when the Xlib internal 32-bit sequence wraps.
Most applications probably will hardly reach this number, but if they
do, they have a chance to die a mysterious death. For example the
application I'm working on did always crash after about 20 hours when
I started to do some stress testing. It does some intensive drawing
through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per
second in full hd resolution (on Ubuntu). Some optimizations did
extend the grace to about 35 hours but it would still crash.
What then followed was some frustrating weeks of digging and debugging
to realize that it's not in my application, nor in gtkmm, gtk or glib
but that it's this little bug in Xlib which exists since 2006-10-06
apparently.
It took a while to turn out that the number 0x100000000 (2^32) has
some relevance. (Much) later it turned out it can be reproduced with
Xlib only, using this code for example:
while(1) {
XDrawPoint(display, drawable, gc, x, y);
XFlush(display); }
It might take one or two hours, but when it reaches the 4294 million
it will explode into a "Fatal IO error 11".
What I then learned is that even though Xlib uses internal 32bit
sequence numbers they get (smartly) widened to 64bit in the process
so that the 32bit sequence may wrap without any disruption in the
widened 64bit sequence. Obviously there must be something wrong with
that.
The Fatal IO error is issued in _XReply() when it's not getting a
reply where there should be one, but the cause is earlier in _XSend()
in the moment when the Xlib 32-bit sequence number wraps.
The problem is that when it wraps to 0, the value of 'last_flushed'
will still be at the upper boundary (e.g. 0xffffffff). There are two
locations in _XSend() (xcb_io.c) that fail in this state because they
rely on those values being sequential all the time. The first location is:
requests = dpy->request - dpy->xcb->last_flushed;
In case of request = 0x0 and last_flushed = 0xffffffff it will assign
0xffffffff00000001 to 'requests' and then pass that to XCB as the number
(amount) of requests. This is the main killer.
The second location is this:
for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request;
\
++sequence)
In case of request = 0x0 (less than last_flushed) there is no chance of
ever entering the loop, and as a result some requests are ignored.
The solution is to "unwrap" dpy->request at these two locations and
thus retain the sequence related to last_flushed.
uint64_t unwrapped_request = ((uint64_t)(dpy->request < \
dpy->xcb->last_flushed) << 32) + dpy->request;
It creates a temporary 64-bit request number which has bit 32 (the bit
just above the low 32 bits) set if 'request' is less than 'last_flushed'.
It is then used in the two locations instead of dpy->request.
I'm not sure if it might be more efficient to use that statement
in place, instead of using a variable.
There is another line in require_socket() that worried me at first:
dpy->xcb->last_flushed = dpy->request = sent;
That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to
32-bit when assigning it to 'request' and then also assign the
truncated value to the (64-bit) 'last_flushed'. But it seems intended.
I have added a note explaining that for the next poor soul debugging
sequence issues... :-)
Jonas
Jonas Petersen (2):
  xcb_io: Fix Xlib 32-bit request number wrapping
  xcb_io: Add comment explaining a mixed type double assignment

 src/xcb_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
--
1.7.10.4
Good luck!
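To make the wrap arithmetic above concrete, here is a small standalone illustration (written in Java simply because the rest of this page is Java-centric; the variable names are mine, not from xcb_io.c):

public final class SequenceWrapDemo {
    public static void main(String[] args) {
        long lastFlushed = 0xFFFFFFFFL; // 64-bit counter still holding the pre-wrap value
        long request = 0x0L;            // 32-bit counter has just wrapped to 0

        // What _XSend effectively computes before the fix:
        long broken = request - lastFlushed;
        System.out.println(Long.toHexString(broken)); // ffffffff00000001

        // The "unwrap" from the patch: add 2^32 when the 32-bit value has wrapped
        long unwrapped = ((request < lastFlushed ? 1L : 0L) << 32) + request;
        System.out.println(unwrapped - lastFlushed);  // 1, the expected request count
    }
}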

Tuning Java 7 to match performance of Java 6

We have a simple unit test as part of our performance test suite which we use to verify that the base system is sane and performs adequately before we even start testing our code. This way we usually verify that a machine is suitable for running the actual performance tests.
When we compare Java 6 and Java 7 using this test, Java 7 takes considerably longer to execute! We see an average of 22 seconds for Java 6 and 24 seconds for Java 7. The test only computes fibonacci, so only bytecode execution in a single thread should be relevant here and not I/O or anything else.
Currently we run it with default settings on Windows with or without "-server", with both 32 and 64 bit JVM, all runs indicate a similar degradation for Java 7.
Which tuning options may be suitable here to try to match Java 7 against Java 6?
import org.junit.Before;
import org.junit.Test;

public class BaseLinePerformance {

    @Before
    public void setup() throws Exception {
        fib(46);
    }

    @Test
    public void testBaseLine() throws Exception {
        long start = System.currentTimeMillis();
        fib(46);
        fib(46);
        System.out.println("Time: " + (System.currentTimeMillis() - start));
    }

    public static void fib(final int n) throws Exception {
        for (int i = 0; i < n; i++) {
            System.out.println("fib(" + i + ") = " + fib2(i));
        }
    }

    public static int fib2(final int n) {
        if (n == 0)
            return 0;
        else if (n == 1)
            return 1;
        else
            return fib2(n - 2) + fib2(n - 1);
    }
}
Update: I have reduced the test so that it does not do any sleeps and followed the other suggestions from How do I write a correct micro-benchmark in Java?. I still see the same difference between Java 7 and Java 6; additional JVM options to print compilation and GC activity do not show any output during the actual test, only compilation information at the beginning.
One of my colleagues found out the reason for this after a bit more digging:
There is a JVM flag -XX:MaxRecursiveInlineLevel which has a default value of 1. It seems the handling of this setting was slightly incorrect in previous versions, so Sun/Oracle "fixed" it in Java 7; however, this has the side effect that inlining is sometimes done less aggressively, and thus the pure runtime/CPU time of recursive code can be longer than before.
We are testing setting it to 2 to get the same behavior as in Java 6, at least for the test in question.
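For reference, the flag is passed like any other -XX option (this assumes a HotSpot-based JVM that still recognises it), e.g.:
java -XX:MaxRecursiveInlineLevel=2 ...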
This is not an easy answer; there are plenty of things that can account for those 2 seconds.
I am assuming from your comments that you are already familiar with micro-benchmarking, that your benchmark is run after warming up the JVM (so your code has reached an optimized JIT state) with no GCs happening, and that your hardware setup has not changed.
I would recommend CPU profiling your benchmark; that will help you identify where those two seconds are being spent, so you can act accordingly.
If you are curious about the bytecode, you can take a peek at it: compile your class and run javap -c ClassName on both machines. This will disassemble the class file bytecode and show it to you, and here you will surely see differences between the two compiled classes.
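For example (the output file names are just a suggestion):
javap -c BaseLinePerformance > java6.txt   # on the Java 6 machine
javap -c BaseLinePerformance > java7.txt   # on the Java 7 machine
diff java6.txt java7.txt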
In conclusion: after looking at the data, profile and tune your application accordingly to get back to 22 seconds; there is nothing you can do anyway about the bytecode implementation.

D lang simple benchmarking

I'm new to D and I'm comparing it against Java in simple tests, expecting to see that the native language will be faster (or roughly the same). But in my first test, with recursion, D is surprisingly slower than Java (almost two times).
Java (this is a bad Java performance test, but it gives the simple idea):
import java.util.concurrent.TimeUnit;

public class Fib {
    public static void main(String... args) {
        long before = System.nanoTime();
        System.out.println(fibonacci(40));
        System.out.println(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - before));
    }

    static int fibonacci(int n) {
        if (n < 2) {
            return n;
        }
        return fibonacci(n - 2) + fibonacci(n - 1);
    }
}
Environment: Win7 64bit, JDK: 1.7.0_10 x64.
D:
import std.stdio;
import std.datetime;
void main(string[] args)
{
    auto r = benchmark!(simplebench)(1);
    writefln("%s", r[0].to!("msecs", int));
}

void simplebench() {
    writeln(fibonacci(40));
}

int fibonacci(int n) {
    if (n < 2) {
        return n;
    }
    return fibonacci(n - 2) + fibonacci(n - 1);
}
Environment: Win7 64bit, dmd 2.061, compiler options: -noboundscheck -inline -O -release
Java ~570ms and D ~1011ms.
What am I doing wrong?
Thanks!
Java is also native via its JIT compiler. If you disable the JIT using -Xint (forces interpreter) then you'll see that D is significantly faster. For what it's worth, I tried a similar implementation in C and got the same speed as in D.
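For example, with the Java program above (here assumed to be in a class called Fib):
java Fib        # JIT enabled (the default)
java -Xint Fib  # interpreter only; expect this to be far slower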
These sorts of micro-benchmarks are not useful for testing general performance. All you are doing here is testing Java JIT-compiled code versus D compiled code; both are being compiled. Also, this is not typical Java code: typical Java programs allocate a lot of memory on the heap, whereas typical D programs do not.
If you want to learn about the real performance of D versus Java then you need to test it on real programs.
DMD is the reference compiler for D, however its backend doesn't produce code as fast as the other compilers, GDC and LDC, as CyberShadow mentioned earlier.
Timings measured on my computer:
D compilers, all using the following flags or their equivalents: -noboundscheck -inline -O -release
DMD ~905ms,
LDC ~663ms,
GDC ~382ms
Java ~445ms
g++ ~370ms under -O3
These (micro-)results show that the D code is as performant as the C++ equivalent when compiled using the same backends, and that it is faster than the Java code.
DMD's backend doesn't optimize as well as the GCC-based GDC or the LLVM-based LDC. Your test program runs slightly faster for me than Java when built with GDC. If performance is important for your project, don't use DMD for release builds.
You are not doing anything wrong. Java's JIT is very good and probably optimises certain code better than, let me guess, the DMD compiler? Try the GDC or LDC compilers and see what kind of results the produced executables give. I would also test LuaJIT; I would expect it to be extremely fast with these small algorithms that deal with PODs.

Stack performance in programming languages

Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is mainly the same in all languages; I'll post the Java version:
public class Fib {
    public static int fib(int n) {
        if (n < 2) return 1;
        return fib(n-1) + fib(n-2);
    }

    public static void main(String[] args) {
        System.out.println(fib(Integer.valueOf(args[0])));
    }
}
Ok so the point is that using this algorithm with input 40 I got these timings:
C: 2.796s
Ocaml: 2.372s
Python: 106.407s
Java: 1.336s
C#(mono): 2.956s
They are taken in a Ubuntu 10.04 box using the versions of each language available in the official repositories, on a dual core intel machine.
I know that functional languages like OCaml have the slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time since it's the only interpreted language in this test, but I was impressed by the Java running time, which is half of the C time for the same algorithm! Would you attribute this to JIT compilation?
How would you explain these results?
EDIT: thank you for the interesting replies! I recognize that this is not a proper benchmark (never said it was :P) and maybe I can make a better one and post it to you next time, in the light of what we've discussed :)
EDIT 2: I updated the runtime of the ocaml implementation, using the optimizing compiler ocamlopt. Also I published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)
You might want to crank up the optimisation level of your C compiler. With gcc -O3, that makes a big difference, a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.
Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.
In addition, make sure you're measuring CPU time rather than clock time.
Anything less than that and I would not consider it a decent statistical analysis; it may well be subject to noise, rendering your results useless.
For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from the fault that you duplicate a lot of calculations. The iterative version:
int fib(unsigned int n) {
    int t, a, b;
    if (n < 2) return 1;
    a = b = 1;
    while (n-- >= 2) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of 89% and a whopping reduction of 96% from the original code.
And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:
static int fib(unsigned int n) {
    static int lookup[] = {
        1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
        610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
        46368, 75025, 121393, 196418, 317811, 514229, 832040,
        1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
        24157817, 39088169, 63245986, 102334155, 165580141 };
    int t, a, b;
    if (n < sizeof(lookup)/sizeof(*lookup))
        return lookup[n];
    a = lookup[sizeof(lookup)/sizeof(*lookup)-2];
    b = lookup[sizeof(lookup)/sizeof(*lookup)-1];
    while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}
That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.
You say very little about your configuration (in benchmarking, details are everything: commandlines, computer used, ...)
When I try to reproduce for OCaml I get:
let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))
let () = Format.printf "%d@." (f 40)
$ ocamlopt fib.ml
$ time ./a.out
165580141
real 0m1.643s
This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).
One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.
Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.
You may be able to see a difference by inverting the logic of the if:
public static int fib(int n) {
    if (n >= 2) return fib(n-1) + fib(n-2);
    return 1;
}
It's worth a try, anyway :)
As always, check the optimization settings for all platforms, too. Obviously the compiler settings for C - and on Java, try using the client version of Hotspot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of Hotspot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)
I can explain the Python performance. Python's performance for recursion is abysmal at best, and it should be avoided like the plague when coding in it. Especially since stack overflow occurs by default at a recursion depth of only 1000...
As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...
Note that if the Java Hotspot VM is smart enough to memoise fib() calls, it can cut down the exponentional cost of the algorithm to something nearer to linear cost.
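For comparison, explicit memoisation (something the programmer has to add; as far as I know HotSpot does not do this by itself) looks roughly like this and turns the exponential cost into a roughly linear one:

import java.util.HashMap;
import java.util.Map;

public class MemoFib {
    private static final Map<Integer, Long> cache = new HashMap<Integer, Long>();

    static long fib(int n) {
        if (n < 2) return 1;                   // same base case as the benchmark above
        Long cached = cache.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2); // each value is computed only once
        cache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(40));           // 165580141, computed almost instantly
    }
}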
I wrote a C version of the naive Fibonacci function and compiled it to assembler in gcc (4.3.2 Linux). I then compiled it with gcc -O3.
The unoptimised version was 34 lines long and looked like a straight translation of the C code.
The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.
With C, you should either declare the fibonacci function "inline", or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why with -O3 you get better performance, it activates -finline-functions.
Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, testing myself, I found that declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, made my recursive fibonacci code faster than with -O2 or -O2 -finline-functions.
One C trick you can try is to disable stack checking (i.e. the built-in code which makes sure that the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, which forces the stack checking to reallocate the entire stack several times during the actual run.
Try to approximate the stack size you need and force the linker to allocate that much stack space. Then disable stack-checking and re-make the program.

C++ and Java performance

This question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i);
    }
}
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performance here? (Writing the actual code for these things could take very long, so discussing it would be great.)
Thank you.
You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, and so is fairly expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation), and if they turn out to be short-lived they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!
Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <boost/pool/pool_alloc.hpp>

using namespace std;

typedef vector<string, boost::pool_allocator<string> > str_vector;

void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
        sprintf(aux+2, "%d", a);
        *it++ = aux;
    }
}

int main(int argc, char** argv)
{
    str_vector v(2000);
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i, v.begin(), v.begin()+2000);
    }
    return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster in Java, and memory management is a bit more hassle in C++. But we knew that already :)
HotSpot optimises hot spots in code. Typically, it tries to optimise anything that gets executed around 10,000 times.
For this code, after 5 iterations it will try to optimise the inner loop that adds the strings to the vector. The optimisation it does will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that it will remove all of the code in the method and turn it into a no-op. To test this, try returning the result from the method. Even then, be careful to do something meaningful with the result - just getting its length, for example, can be optimised away, as HotSpot can see the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I've modified both programs so that the Java and the C++ code print the last element of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
It should help if you use vector::reserve to reserve space for the 2000 elements of v before the loop (however, the same thing should also speed up the Java equivalent of this code).
To suggest why the performance of C++ and Java differ, it would be essential to see the source for both. I can see a number of performance issues in the C++, and for some it would be useful to see whether you are doing the same in the Java (e.g. flushing the output stream via std::endl: do you call System.out.flush() or just append a '\n'? If the latter, you've just given the Java a distinct advantage).
What are you actually trying to measure here? Formatting ints into strings and putting them into a vector?
You can start by pre-allocating space in the vector, since you know its size up front:
instead of:
void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
try:
void testvector(int x)
{
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    vector<string> v;
    v.reserve(z - a);   // pre-allocate capacity for the 2000 strings up front
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
In your inner loop, you are formatting ints into strings and pushing them into a vector. If you single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the push_back (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.
