We're developing a complex application that consists of a Linux binary integrated, via JNI calls into a JVM created inside that binary, with our custom-made .jar file. All GUI work is implemented by the Java part. Each time a GUI property has to be changed or the GUI has to be repainted, it is done through a JNI call into the JVM.
The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. This is done iteratively and frequently, a few hundred or thousand iterations per second.
After a fairly fixed amount of time, the application is terminated with exit(1), which I caught with gdb and traced to a call from _XIOError(). The termination reproduces after a more or less constant period, e.g. after some 15 h on an x86 dual core at 2.5 GHz. On a slower computer it takes longer, as if it were proportional to CPU/GPU speed. One conclusion would be that some part of Xorg runs out of some resource.
Here is my backtrace:
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2 0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3 0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4 0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5 0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6 0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7 0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8 0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9 0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so
I made my own exit() in liboverrides.so and used it with LD_PRELOAD to capture the exit() call in gdb with the help of abort()/SIGABRT.
After some debugging of libX11 and libxcb, I noticed that _XReply() gets a NULL reply (the response from xcb_wait_for_reply()), which causes the call to _XIOError() and exit(1). Digging deeper into libxcb's xcb_wait_for_reply(), I noticed that one of the reasons it can return a NULL reply is when it detects a broken or closed socket connection, which could be my situation.
For test purposes, if I change xcb_io.c to ignore _XIOError(), the application no longer works. And if I repeat the request inside _XReply(), it fails every time, i.e. gets a NULL response on each xcb_wait_for_reply().
So my question is why this uncontrolled termination via _XReply() -> _XIOError() -> exit(1) happens, or how I can find out the reason for it, so I can fix it or work around it.
To reproduce the problem, as I wrote above, I have to wait for some 15 h, and I'm currently very short on time for debugging and can't find the cause of the termination.
We also tried to reorganise the Java part that handles the GUI/display refresh, but that didn't solve the problem.
Some SW facts:
- Java JRE 1.8.0_20 (the problem also reproduces with Java 7)
- libX11.so 1.5.0
- libxcb.so 1.8.1
- debian wheezy
- kernel 3.2.0
This is likely a known issue in libX11 regarding the handling of request numbers used for xcb_wait_for_reply.
At some point after libxcb v1.5, code was introduced to use 64-bit sequence numbers internally everywhere, and logic was added to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.
Here is a quote from the submitted libxcb bug report (actual email addresses removed):
We have an application that does a lot of XDrawString and XDrawLine.
After several hours the application is exited by an XIOError.
The XIOError is called in libX11 in the file xcb_io.c, function
_XReply. It didn't get a response from xcb_wait_for_reply.
libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to
this commit:
commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date: Sat Oct 9 17:13:45 2010 -0700
xcb_in: Use 64-bit sequence numbers internally everywhere.
Widen sequence numbers on entry to those public APIs that still take
32-bit sequence numbers.
Signed-off-by: Jamey Sharp <jamey#xxxxxx.xxx>
Reverting it on top of 1.8.1 helps.
Adding traces to libxcb I found that the last request numbers used for
xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in
the while loop of the _XReply function), half a second later: 63215
(then XIOError is called). The widen_request is also 63215, I would
have expected 63215+2^32. Therefore it seems that the request is not
correctly widened.
The commit above also changed the compares in poll_for_reply from
XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening
never worked correctly, but it was never observed, because only the
lower 32bits were compared.
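To make the wrap scenario concrete, here is a small Java sketch (an illustration with made-up values, not libxcb code) of the arithmetic that goes wrong when the 32-bit counter wraps to 0 while the 64-bit bookkeeping value is still at the old upper boundary:

// Illustration only: 32-bit sequence counter wraps while the 64-bit value does not.
long lastFlushed = 0xFFFFFFFFL;  // 64-bit bookkeeping value at the 32-bit boundary
long request     = 0L;           // 32-bit counter has just wrapped back to 0

long naive = request - lastFlushed;   // -4294967295: a nonsense "number of requests"

// Widening first restores the intended ordering: add 2^32 whenever the
// 32-bit value is numerically smaller than the value it must follow.
long widened = ((request < lastFlushed ? 1L : 0L) << 32) + request;
long fixed   = widened - lastFlushed; // 1, as intended

With 64-bit unsigned arithmetic, as in the C code, the naive subtraction yields the huge value 0xffffffff00000001 mentioned in the detailed explanation further down instead of a negative number, but the effect is the same: XCB is told to expect a wildly wrong number of requests.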
Reproducing the issue
Here's the original code snippet from the submitted bug report which was used to reproduce the issue:
for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
}
and apparently the issue can be reproduced with even simpler code:
for(;;) {
    XNoOp(dpy);
}
According to the submitted libxcb bug report, these conditions are needed to reproduce it (assuming the reproducing code is in xdraw.c):
libxcb >= 1.8 (i.e. includes the commit ed37b08)
compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
the sequence counter wraps.
Proposed patch
The proposed patch which can be applied on top of libxcb 1.8.1 is this:
diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
static const xReq dummy_request;
static char const pad[3];
struct iovec vec[3];
- uint64_t requests;
+ unsigned long requests;
_XExtension *ext;
xcb_connection_t *c = dpy->xcb->connection;
if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
{
uint64_t sequence;
- for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+ for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
append_pending_request(dpy, sequence);
}
requests = dpy->request - dpy->xcb->last_flushed;
Detailed technical explanation
Please find below the detailed technical explanation by Jonas Petersen (also included in the aforementioned bug report):
Hi,
Here's two patches. The first one fixes a 32-bit sequence wrap bug.
The second patch only adds a comment to another relevant statement.
The patches contain some details. Here is the whole story for those who
might be interested:
Xlib (libx11) will crash an application with a "Fatal IO error 11
(Resource temporarily unavailable)" after 4 294 967 296 requests to
the server. That is when the Xlib internal 32-bit sequence wraps.
Most applications probably will hardly reach this number, but if they
do, they have a chance to die a mysterious death. For example the
application I'm working on did always crash after about 20 hours when
I started to do some stress testing. It does some intensive drawing
through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per
second in full hd resolution (on Ubuntu). Some optimizations did
extend the grace to about 35 hours but it would still crash.
What then followed was some frustrating weeks of digging and debugging
to realize that it's not in my application, nor in gtkmm, gtk or glib
but that it's this little bug in Xlib which exists since 2006-10-06
apparently.
It took a while to turn out that the number 0x100000000 (2^32) has
some relevance. (Much) later it turned out it can be reproduced with
Xlib only, using this code for example:
while(1) {
    XDrawPoint(display, drawable, gc, x, y);
    XFlush(display);
}
It might take one or two hours, but when it reaches the 4294 million
it will explode into a "Fatal IO error 11".
What I then learned is that even though Xlib uses internal 32bit
sequence numbers they get (smartly) widened to 64bit in the process
so that the 32bit sequence may wrap without any disruption in the
widened 64bit sequence. Obviously there must be something wrong with
that.
The Fatal IO error is issued in _XReply() when it's not getting a
reply where there should be one, but the cause is earlier in _XSend()
in the moment when the Xlib 32-bit sequence number wraps.
The problem is that when it wraps to 0, the value of 'last_flushed'
will still be at the upper boundary (e.g. 0xffffffff). There are two
locations in _XSend() (xcb_io.c) that fail in this state because they
rely on those values being sequential all the time. The first location is:
requests = dpy->request - dpy->xcb->last_flushed;
In case of request = 0x0 and last_flushed = 0xffffffff it will assign
0xffffffff00000001 to 'requests' and then pass that to XCB as the number
(amount) of requests. This is the main killer.
The second location is this:
for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)

In case of request = 0x0 (less than last_flushed) there is no chance of
ever entering the loop, and as a result some requests are ignored.
The solution is to "unwrap" dpy->request at these two locations and
thus retain the sequence related to last_flushed.
uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;

It creates a temporary 64-bit request number which has bit 32 set if
'request' is less than 'last_flushed'. It is then used in the two
locations instead of dpy->request.
I'm not sure whether it might be more efficient to use that statement
in place, instead of using a variable.
There is another line in require_socket() that worried me at first:
dpy->xcb->last_flushed = dpy->request = sent;
That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to
32 bits when assigning it to 'request' and then also assign the
truncated value to the (64-bit) 'last_flushed'. But it seems intended.
I have added a note explaining that for the next poor soul debugging
sequence issues... :-)
Jonas
Jonas Petersen (2): xcb_io: Fix Xlib 32-bit request number wrapping
xcb_io: Add comment explaining a mixed type double assignment
src/xcb_io.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
--
1.7.10.4
Good luck!
Why is the max recursion depth I can reach non-deterministic?
A simple class for demonstration purposes:
public class Main {
    private static int counter = 0;

    public static void main(String[] args) {
        try {
            f();
        } catch (StackOverflowError e) {
            System.out.println(counter);
        }
    }

    private static void f() {
        counter++;
        f();
    }
}
I executed the above program 5 times; the results were:
22025
22117
15234
21993
21430
Why are the results different each time?
I tried setting the max stack size (for example -Xss256k). The results were then a bit more consistent but again not equal each time.
Java version:
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
EDIT
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
The observed variance is caused by background JIT compilation.
This is how the process looks:
1. Method f() starts executing in the interpreter.
2. After a number of invocations (around 250) the method is scheduled for compilation.
3. The compiler thread works in parallel with the application thread. Meanwhile, the method continues executing in the interpreter.
4. As soon as the compiler thread finishes compilation, the method entry point is replaced, so the next call to f() will invoke the compiled version of the method.
There is basically a race between the application thread and the JIT compiler thread. The interpreter may perform a different number of calls before the compiled version of the method is ready. At the end there is a mix of interpreted and compiled frames.
No wonder the compiled frame layout differs from the interpreted one. Compiled frames are usually smaller; they don't need to store all the execution context on the stack (method reference, constant pool reference, profiler data, all arguments, expression variables etc.).
Furthermore, there are even more race possibilities with Tiered Compilation (the default since JDK 8). There can be a combination of three types of frames: interpreter, C1 and C2 (see below).
Let's do some fun experiments to support the theory.
Pure interpreted mode. No JIT compilation.
No races => stable results.
$ java -Xint Main
11895
11895
11895
Disable background compilation. JIT is ON, but is synchronized with the application thread.
No races again, but the number of calls is now higher due to compiled frames.
$ java -XX:-BackgroundCompilation Main
23462
23462
23462
Compile everything with C1 before execution. Unlike the previous case, there will be no interpreted frames on the stack, so the number will be a bit higher.
$ java -Xcomp -XX:TieredStopAtLevel=1 Main
23720
23720
23720
Now compile everything with C2 before execution. This will produce the most optimized code with the smallest frame. The number of calls will be the highest.
$ java -Xcomp -XX:-TieredCompilation Main
59300
59300
59300
Since the default stack size is 1M, this should mean the frame is now only 16 bytes long (1 MB / ~59,300 calls ≈ 17 bytes per call). Is it?
$ java -Xcomp -XX:-TieredCompilation -XX:CompileCommand=print,Main.f Main
0x00000000025ab460: mov %eax,-0x6000(%rsp) ; StackOverflow check
0x00000000025ab467: push %rbp ; frame link
0x00000000025ab468: sub $0x10,%rsp
0x00000000025ab46c: movabs $0xd7726ef0,%r10 ; r10 = Main.class
0x00000000025ab476: addl $0x2,0x68(%r10) ; Main.counter += 2
0x00000000025ab47b: callq 0x00000000023c6620 ; invokestatic f()
0x00000000025ab480: add $0x10,%rsp
0x00000000025ab484: pop %rbp ; pop frame
0x00000000025ab485: test %eax,-0x23bb48b(%rip) ; safepoint poll
0x00000000025ab48b: retq
In fact, the frame here is 32 bytes, but JIT has inlined one level of recursion.
Finally, let's look at the mixed stack trace. In order to get it, we'll crash JVM on StackOverflowError (option available in debug builds).
$ java -XX:AbortVMOnException=java.lang.StackOverflowError Main
The crash dump hs_err_pid.log contains the detailed stack trace where we can find interpreted frames at the bottom, C1 frames in the middle and lastly C2 frames on the top.
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5958 [0x00007f21251a5900+0x0000000000000058]
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
// ... repeated 19787 times ...
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
// ... repeated 1866 times ...
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
j Main.f()V+8
j Main.f()V+8
// ... repeated 1839 times ...
j Main.f()V+8
j Main.main([Ljava/lang/String;)V+0
v ~StubRoutines::call_stub
First of all, the following has not been researched. I have not "deep dived" the OpenJDK source code to validate any of the following, and I don't have access to any inside knowledge.
I tried to validate your results by running your test on my machine:
$ java -version
openjdk version "1.8.0_71"
OpenJDK Runtime Environment (build 1.8.0_71-b15)
OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)
I get the "count" varying over a range of ~250. (Not as much as you are seeing)
First some background. A thread stack in a typical Java implementation is a contiguous region of memory that is allocated before the thread is started, and that is never grown or moved. A stack overflow happens when the JVM tries to create a stack frame to make a method call, and the frame goes beyond the limits of the memory region. The test could be done by testing the SP explicitly, but my understanding is that it is normally implemented using a clever trick with the memory page settings.
When a stack region is allocated, the JVM makes a syscall to tell the OS to mark a "red zone" page at the end of the stack region read-only or non-accessible. When a thread makes a call that overflows the stack, it accesses memory in the "red zone" which triggers a memory fault. The OS tells the JVM via a "signal", and the JVM's signal handler maps it to a StackOverflowError that is "thrown" on the thread's stack.
So here are a couple of possible explanations for the variability:
The granularity of hardware-based memory protection is the page boundary. So if the thread stack has been allocated using malloc, the start of the region is not going to be page aligned. Therefore the distance from the start of the stack frame to the first word of the "red zone" (which >is< page aligned) is going to be variable.
The "main" stack is potentially special, because that region may be used while the JVM is bootstrapping. That might lead to some "stuff" being left on the stack from before main was called. (This is not convincing ... and I'm not convinced.)
Having said this, the "large" variability that you are seeing is baffling. Page sizes are too small to explain a difference of ~7000 in the counts.
UPDATE
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
Interesting. Among other things, that could cause stack limit checking to be done differently.
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Plausible. The size of the stack frame could well be different after the f() method has been JIT-compiled. Assuming f() was JIT-compiled at some point, your stack will have a mixture of "old" and "new" frames. If the JIT compilation occurred at different points, then the ratio will be different ... and hence the count will be different when you hit the limit.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
Little chance of that, I'm afraid ... unless you are prepared to PAY someone to do a few days research for you.
1) No such (public) reference documentation exists, AFAIK. At least, I've never been able to find a definitive source for this kind of thing ... apart from deep diving the source code.
2) Looking at the JIT compiled code tells you nothing of how the bytecode interpreter handled things before the code was JIT compiled. So you won't be able to see if the frame size has changed.
The exact functioning of the Java stack is undocumented, but it depends entirely on the memory allocated to that thread.
Just try using the Thread constructor that takes a stack size and see whether the count becomes constant. I have not tried it, so please share the results.
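For reference, the constructor in question is Thread(ThreadGroup, Runnable, String, long stackSize). A minimal sketch (the class name and the 256 KB figure are arbitrary, and per the Javadoc the JVM is free to treat stackSize as no more than a hint):

// Run the same recursion on a thread with an explicitly requested stack size.
public class FixedStackTest {
    private static int counter = 0;

    private static void f() {
        counter++;
        f();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            try {
                f();
            } catch (StackOverflowError e) {
                System.out.println(counter);
            }
        };
        // The 4th argument is the requested stack size in bytes (only a hint to the JVM).
        Thread t = new Thread(null, task, "fixed-stack", 256 * 1024);
        t.start();
        t.join();
    }
}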
For simplicity, imagine this scenario: we have a 2-bit computer, which has a pair of 2-bit registers called r1 and r2 and only works with immediate addressing.
Let's say the bit sequence 00 means add to our CPU. Also, 01 means move data to r1 and 10 means move data to r2.
So there is an assembly language for this computer and an assembler, where a sample program would be written like:
mov r1,1
mov r2,2
add r1,r2
Simply put, when I assemble this code to the native language, the file will be something like:
0101 1010 0001
The 12 bits above are the native code for:
Put decimal 1 to R1, Put decimal 2 to R2, Add the data and store in R1.
So this is basically how compiled code works, right?
Let's say someone implements a JVM for this architecture. In Java I would write code like:
int x = 1 + 2;
How exactly will the JVM interpret this code? I mean, eventually the same bit pattern must be passed to the CPU, mustn't it? All CPUs have a set of instructions that they can understand and execute, and they are after all just bits. Let's say the compiled Java bytecode looks something like this:
1111 1100 1001
or whatever. Does that mean that interpreting changes this code to 0101 1010 0001 when executing? If it does, then it is already native code, so why is it said that the JIT only kicks in after the code has run a number of times? If it does not convert it exactly to 0101 1010 0001, then what does it do? How does it make the CPU do the addition?
Maybe there are some mistakes in my assumptions.
I know interpreting is slow, compiled code is faster but not portable, and a virtual machine "interprets" code, but how? I am looking for how exactly/technically interpreting is done. Any pointers (such as books or web pages) are welcome instead of answers as well.
The CPU architecture you describe is unfortunately too restricted to make this really clear with all the intermediate steps. Instead, I will write pseudo-C and pseudo-x86 assembler, hopefully in a way that is clear even if you are not terribly familiar with C or x86.
The compiled JVM bytecode might look something like this:
ldc 0 # push the first constant (== 1)
ldc 1 # push the second constant (== 2)
iadd # pop two integers and push their sum
istore_0 # pop result and store in local variable
The interpreter has (a binary encoding of) these instructions in an array, and an index referring to the current instruction. It also has an array of constants, a memory region used as a stack, and one for local variables. Then the interpreter loop looks like this:
while (true) {
    switch (instructions[pc]) {
    case LDC:
        sp += 1;                                   // make space for constant
        stack[sp] = constants[instructions[pc+1]];
        pc += 2;                                   // two-byte instruction
        break;
    case IADD:
        stack[sp-1] += stack[sp];                  // add to first operand
        sp -= 1;                                   // pop other operand
        pc += 1;                                   // one-byte instruction
        break;
    case ISTORE_0:
        locals[0] = stack[sp];
        sp -= 1;                                   // pop
        pc += 1;                                   // one-byte instruction
        break;
    // ... other cases ...
    }
}
This C code is compiled into machine code and run. As you can see, it's highly dynamic: it inspects each bytecode instruction each time that instruction is executed, and all values go through the stack (i.e. RAM).
While the actual addition itself probably happens in a register, the code surrounding the addition is rather different from what a Java-to-machine code compiler would emit. Here's an excerpt from what a C compiler might turn the above into (pseudo-x86):
.ldc:
incl %esi # increment the variable pc, first half of pc += 2;
movb %ecx, program(%esi) # load byte after instruction
movl %eax, constants(,%ecx,4) # load constant from pool
incl %edi # increment sp
movl %eax, stack(,%edi,4) # write constant onto stack
incl %esi # other half of pc += 2
jmp .EndOfSwitch
.addi:
movl %eax, stack(,%edi,4) # load first operand
decl %edi # sp -= 1;
addl stack(,%edi,4), %eax # add
incl %esi # pc += 1;
jmp .EndOfSwitch
You can see that the operands for the addition come from memory instead of being hardcoded, even though for the purposes of the Java program they are constant. That's because for the interpreter, they are not constant. The interpreter is compiled once and then must be able to execute all sorts of programs, without generating specialized code.
The purpose of the JIT compiler is to do just that: Generate specialized code. A JIT can analyze the ways the stack is used to transfer data, the actual values of various constants in the program, and the sequence of calculations performed, to generate code that more efficiently does the same thing. In our example program, it would allocate the local variable 0 to a register, replace the access to the constant table with moving constants into registers (movl %eax, $1), and redirect the stack accesses to the right machine registers. Ignoring a few more optimizations (copy propagation, constant folding and dead code elimination) that would normally be done, it might end up with code like this:
movl %ebx, $1 # ldc 0
movl %ecx, $2 # ldc 1
movl %eax, %ebx # (1/2) addi
addl %eax, %ecx # (2/2) addi
# no istore_0, local variable 0 == %eax, so we're done
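As an aside, you can inspect the bytecode javac actually produces for your own classes with javap -c. Note that for a literal 1 + 2 javac folds the constant at compile time, so to see a real iadd you need non-constant operands. A small sketch (class name arbitrary):

// Add.java -- compile with "javac Add.java", then inspect with "javap -c Add".
public class Add {
    static int add(int a, int b) {
        return a + b;        // javap shows: iload_0, iload_1, iadd, ireturn
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));
    }
}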
Not all computers have the same instruction set. Java bytecode is a kind of Esperanto - an artificial language to improve communication. The Java VM translates the universal Java bytecode to the instruction set of the computer it runs on.
So how does the JIT figure in here? The main purpose of the JIT compiler is optimization. There are often different ways to translate a certain piece of bytecode into the target machine code, and the most performant translation is often non-obvious because it might depend on the data. There are also limits to how far a program can analyze an algorithm without executing it - the halting problem is a well-known such limitation, but not the only one. So what the JIT compiler does is try different possible translations and measure how fast they execute with the real-world data the program processes. It therefore takes a number of executions until the JIT compiler has found the optimal translation.
One of the important steps in Java is that the compiler first translates the .java code into a .class file, which contains the Java bytecode. This is useful, as you can take .class files and run them on any machine that understands this intermediate language, by then translating it on the spot line-by-line, or chunk-by-chunk. This is one of the most important functions of the java compiler + interpreter. You can directly compile Java source code to native binary, but this negates the idea of writing the original code once and being able to run it anywhere. This is because the compiled native binary code will only run on the same hardware/OS architecture that it was compiled for. If you want to run it on another architecture, you'd have to recompile the source on that one. With the compilation to the intermediate-level bytecode, you don't need to drag around the source code, but the bytecode. It's a different issue, as you now need a JVM that can interpret and run the bytecode. As such, compiling to the intermediate-level bytecode, which the interpreter then runs, is an integral part of the process.
As for the actual realtime running of code: yes, the JVM will eventually interpret/run some binary code that may or may not be identical to natively compiled code. And in a one-line example, they may seem superficially the same. But the interpreter typically doesn't precompile everything; it goes through the bytecode and translates it to binary line-by-line or chunk-by-chunk. There are pros and cons to this (compared to natively compiled code, e.g. from C and C++ compilers) and lots of resources online to read up on further. See my answer here, or this, or this one.
Simplifying, an interpreter is an infinite loop with a giant switch inside.
It reads Java byte code (or some internal representation) and emulates a CPU executing it.
This way the real CPU executes the interpreter code, which emulates the virtual CPU.
This is painfully slow. A single virtual instruction adding two numbers requires three function calls and many other operations.
A single virtual instruction takes a couple of real instructions to execute.
This is also less memory-efficient, as you have both a real and an emulated stack, registers and instruction pointer.
while (true) {
    Operation op = methodByteCode.get(instructionPointer);
    switch (op) {
        case ADD:
            stack.pushInt(stack.popInt() + stack.popInt());
            instructionPointer++;
            break;
        case STORE:
            memory.set(stack.popInt(), stack.popInt());
            instructionPointer++;
            break;
        ...
    }
}
When some method has been interpreted multiple times, the JIT compiler kicks in.
It will read all the virtual instructions and generate one or more native instructions that do the same thing.
Here I'm generating a string of textual assembly, which would require an additional assembly-to-native-binary conversion.
for (Operation op : methodByteCode) {
    switch (op) {
        case ADD:
            compiledCode += "popi r1";
            compiledCode += "popi r2";
            compiledCode += "addi r1, r2, r3";
            compiledCode += "pushi r3";
            break;
        case STORE:
            compiledCode += "popi r1";
            compiledCode += "storei r1";
            break;
        ...
    }
}
After the native code is generated, the JVM will copy it somewhere, mark that region as executable and instruct the interpreter to invoke it instead of interpreting the bytecode the next time this method is invoked.
A single virtual instruction might still take more than one native instruction, but this will be nearly as fast as ahead-of-time compilation to native code (as in C or C++).
Compilation is usually much slower than interpreting, but has to be done only once and only for chosen methods.
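If you want to watch which methods HotSpot decides to compile, you can run any program with -XX:+PrintCompilation. A throwaway sketch (class name and loop counts arbitrary) that generates a hot method:

// Run with: java -XX:+PrintCompilation HotLoop
// After enough calls, a log line appears when HotSpot compiles work().
public class HotLoop {
    static long work(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 20_000; i++) {
            total += work(1_000);
        }
        System.out.println(total);
    }
}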
I know that a method cannot be larger than 64 KB with Java. The limitation causes us problems with generated code from a JavaCC grammar. We had problems with Java 6 and were able to fix this by changing the grammar. Has the limit been changed for Java 7 or is it planned for Java 8?
Just to make it clear: I don't need a method larger than 64 KB myself, but I wrote a grammar which compiles to a very large method.
According to JVMS7:
The fact that end_pc is exclusive is a historical mistake in the
design of the Java virtual machine: if the Java virtual machine code
for a method is exactly 65535 bytes long and ends with an instruction
that is 1 byte long, then that instruction cannot be protected by an
exception handler. A compiler writer can work around this bug by
limiting the maximum size of the generated Java virtual machine code
for any method, instance initialization method, or static initializer
(the size of any code array) to 65534 bytes.
But this is about Java 7. There are no final specs for Java 8 yet, so nobody (except its developers) can answer this question.
UPD (2015-04-06): According to the JVMS8, the same is true for Java 8.
Good question. As always, we should go to the source to find the answer ("The Java® Virtual Machine Specification"). The section in question does not mention the limit as explicitly as the Java 6 VM spec did, but states it somewhat circumspectly:
The greatest number of local variables in the local variables array of a frame created upon invocation of a method (§2.6) is limited to 65535 by the size of the max_locals item of the Code attribute (§4.7.3) giving the code of the method, and by the 16-bit local variable indexing of the Java Virtual Machine instruction set.
Cheers,
It has not changed. The limit of code in methods is still 64 KB in both Java 7 and Java 8.
References:
From the Java 7 Virtual Machine Specification (4.9.1 Static Constraints):
The static constraints on the Java Virtual Machine code in a class file specify how
Java Virtual Machine instructions must be laid out in the code array and what the
operands of individual instructions must be.
The static constraints on the instructions in the code array are as follows:
The code array must not be empty, so the code_length item cannot have the
value 0.
The value of the code_length item must be less than 65536.
From the Java 8 Virtual Machine Specification (4.7.3 The Code Attribute):
The value of the code_length item gives the number of bytes in the code array
for this method.
The value of code_length must be greater than zero (as the code array must
not be empty) and less than 65536.
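If you want to see the limit in practice, a quick way is to generate a source file whose single method clearly exceeds 64 KB of bytecode; javac then refuses it with "code too large". A throwaway generator sketch (file name, class name and statement count are arbitrary):

// Writes Big.java containing one huge method; "javac Big.java" then fails
// with "error: code too large".
import java.io.FileWriter;
import java.io.IOException;

public class BigMethodGen {
    public static void main(String[] args) throws IOException {
        StringBuilder src = new StringBuilder("public class Big {\n  static int f() {\n    int x = 0;\n");
        for (int i = 0; i < 20000; i++) {
            src.append("    x += ").append(i).append(";\n");  // each statement costs several bytes of bytecode
        }
        src.append("    return x;\n  }\n}\n");
        try (FileWriter out = new FileWriter("Big.java")) {
            out.write(src.toString());
        }
    }
}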
Andremoniy has already answered the Java 7 part of this question, but it seems that at that time it was too soon to say anything definite about Java 8, so I'll complete the answer to cover that part:
Quoting from the JVMS:
The fact that end_pc is exclusive is a historical mistake in the design of the Java Virtual Machine: if the Java Virtual Machine code for a method is exactly 65535 bytes long and ends with an instruction that is 1 byte long, then that instruction cannot be protected by an exception handler. A compiler writer can work around this bug by limiting the maximum size of the generated Java Virtual Machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes.
As you can see, this historical problem has not been remedied, at least not in this version (Java 8).
As a workaround, and if you have access to the parser's code, you could modify it to work within whatever limits are imposed by the JVM/compiler ...
(assuming it doesn't take forever to find the portions of the parser code to modify).
I am using a Motorola FX9500 RFID reader, which runs Linux with the jamvm version 1.5.0 on it (I can only deploy applications to it - I cannot change the Java VM or anything so my options are limited) - here's what I see when I check the version:
[cliuser#FX9500D96335 ~]$ /usr/bin/jamvm -version
java version "1.5.0"
JamVM version 1.5.4
Copyright (C) 2003-2010 Robert Lougher <rob#jamvm.org.uk>
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2,
or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Build information:
Execution Engine: inline-threaded interpreter with stack-caching
Compiled with: gcc 4.2.2
Boot Library Path: /usr/lib/classpath
Boot Class Path: /usr/local/jamvm/share/jamvm/classes.zip:/usr/share/classpath/glibj.zip
I need to write an application so I grabbed the Oracle Java SDK 1.5.0 and installed it onto my Windows 7 PC, so it has this version:
C:\>javac -version
javac 1.5.0
Am I being too idealistic in considering that an application I compile with that compiler would work correctly on the aforementioned JamVM? Anyway, pressing on in ignorance I write this little application:
public final class TestApp {
    public static void main(final String[] args) {
        long p = Long.MIN_VALUE;
        int o = (int)(-(p + 10) % 10);
        System.out.println(o);
    }
}
Compile it with the aforementioned javac compiler and run it on the PC like so:
C:\>javac TestApp.java
C:\>java TestApp
8
All fine there. Life is good, so I take that .class file and place it on the FX9500 and run it like so:
[cliuser#FX9500D96335 ~]$ /usr/bin/jamvm TestApp
-2
Eek, what the...as you can see - it returns a different result.
So, why, and who's wrong? Or is this a case where the specification is unclear about how to deal with this calculation (surely not)? Could it be that I need to compile it with a different compiler?
Why Do I Care About This?
The reason I came to this situation is that a calculation exactly like that happens inside java.lang.Long.toString, and I have a bug in my real application where I am logging a long and getting a java.lang.ArrayIndexOutOfBoundsException, because the value I want to log may very well be at the extremes of a long.
I think I can work around it by checking for Long.MIN_VALUE and Long.MAX_VALUE and logging "Err, I can't tell you the number but it is really Long.XXX, believe me, would I lie to you?". But when I find something like this, I feel like my application is built on a sandy foundation and it needs to be really robust. I am seriously considering just saying that JamVM is not up to the job and writing the application in Python (since the reader also has a Python runtime).
I'm kind of hoping that someone tells me I'm a dullard and I should have compiled it on my Windows PC like .... and then it would work, so please tell me that (if it is true, of course)!
Update
Noofiz got me thinking (thanks) and I knocked up this additional test application:
public final class TestApp2 {
    public static void main(final String[] args) {
        long p = Long.MIN_VALUE + 10;
        if (p != -9223372036854775798L) {
            System.out.println("O....M.....G");
            return;
        }
        p = -p;
        if (p != 9223372036854775798L) {
            System.out.println("W....T.....F");
            return;
        }
        int o = (int)(p % 10);
        if (o != 8) {
            System.out.println("EEEEEK");
            return;
        }
        System.out.println("Phew, that was a close one");
    }
}
I, again, compile on the Windows machine and run it.
It prints Phew, that was a close one
I copy the .class file to the contraption in question and run it.
It prints...
...wait for it...
W....T.....F
Oh dear. I feel a bit woozy, I think I need a cup of tea...
Update 2
One other thing I tried, that did not make any difference, was to copy the classes.zip and glibj.zip files off of the FX9500 to the PC and then do a cross compile like so (that must mean the compiled file should be fine right?):
javac -source 1.4 -target 1.4 -bootclasspath classes.zip;glibj.zip -extdirs "" TestApp2.java
But the resulting .class file, when run on the reader prints the same message.
I wrote JamVM. As you would probably guess, such errors would have been noticed by now, and JamVM wouldn't pass even the simplest of test suites with them (GNU Classpath has its own called Mauve, and OpenJDK has jtreg). I regularly run on ARM (the FX9500 uses a PXA270 ARM) and x86-64, but various platforms get tested as part of IcedTea.
So I haven't much of a clue as to what's happened here. I would guess it only affects Java longs as these are used infrequently and so most programs work. JamVM maps Java longs to C long longs, so my guess would be that the compiler used to build JamVM is producing incorrect code for long long handling on the 32-bit ARM.
Unfortunately there's not much you can do (apart from avoiding longs) if you can't replace the JVM. The only thing you can do is try turning the JIT off (it is a simple code-copying JIT, aka inline-threading). To do this, use -Xnoinlining on the command line, e.g.:
jamvm -Xnoinlining ...
The problem is in different modulus implementations:
public static long mod(long a, long b) {
    long result = a % b;
    if (result < 0) {
        result += b;
    }
    return result;
}
this code returns -2, while this:
public static long mod2(long a, long b) {
    long result = a % b;
    if (result > 0 && a < 0) {
        result -= b;
    }
    return result;
}
returns 8. The reasons why JamVM does it this way are beyond my understanding.
From JLS:
15.17.3. Remainder Operator %
The remainder operation for operands that are integers after binary
numeric promotion (§5.6.2) produces a result value such that
(a/b)*b+(a%b) is equal to a.
According to this, JamVM breaks the language specification. Very bad.
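For reference, a quick way to check these JLS 15.17.3 guarantees on any JVM (plain Java, illustrative only; the literal is just -(Long.MIN_VALUE + 10) written out):

long a = 9223372036854775798L;   // what -(Long.MIN_VALUE + 10) should evaluate to
long b = 10L;

System.out.println(a % b);                        // 8
System.out.println(-9223372036854775798L % b);    // -8: the remainder takes the sign of the dividend
System.out.println((a / b) * b + (a % b) == a);   // true: the JLS 15.17.3 identity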
I would have commented, but for some reason, that requires reputation.
Long negation doesn't work on this device. I don't understand its exact nature, but if you do two unary minuses you do get back to where you started, e.g. x=10; -x==4294967286; -x==10. 4294967286 is very close to Integer.MAX_VALUE*2 (2147483647*2 = 4294967294). It's even closer to Integer.MAX_VALUE*2-10!
It seems to be isolated to this one operation, and doesn't affect longs in any more fundamental way. It's simple to avoid the operation in your own code, and with some dextrous abuse of the bootclasspath you can avoid the calls in GNU Classpath code, replacing them with *-1s, as sketched below. If you need to start your application from the device GUI, you can include the -Xbootclasspath=... switch in the args parameter for it to be passed to JamVM.
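A minimal sketch of that *-1 workaround (illustrative only; on a correct JVM both lines compute the same value, the point is simply to avoid the unary minus that is mis-compiled on this device):

long p = Long.MIN_VALUE + 10;

long negated    = -p;       // unary negation: the operation that misbehaves on the affected JamVM build
long workaround = p * -1L;  // same mathematical result, but avoids the broken operation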
The bug is actually already fixed in JamVM code more recent than the latest release:
* https://github.com/ansoncat/jamvm/commit/736c2cb76baf1fedddc1eda5825908f5a0511373
* https://github.com/ansoncat/jamvm/commit/ac83bdc886ac4f6e60d684de1b4d0a5e90f1c489
though that doesn't help us, as the fixed version isn't on the device. Rob Lougher has mentioned this issue as a reason for releasing a new version of JamVM, though I don't know when this would be, or whether Motorola would be convinced enough to update their firmware.
The FX9500 is actually a repackaged Sirit IN610, meaning that both devices share this bug. Sirit are way friendlier than Motorola and are providing a firmware upgrade, to be available in the near future. I hope that Motorola will also include the fix, though I don't know the details of the arrangement between the two parties.
Either way, we have a very big application running on the FX9500, and the long negation operation hasn't proved to be an impassable barrier.
Good luck, Dan.
The relevant piece of the function looks like this:
JNIEXPORT jint JNICALL functionCall() {
    // Entrance
    printf("Time: %d\tFile: %s\tFunc: %s\tLine: %d\n", clock(), __FILE__, __FUNCTION__, __LINE__);
    // other codes
    ...
    // Exit
    printf("Time: %d\tFile: %s\tFunc: %s\tLine: %d\n", clock(), __FILE__, __FUNCTION__, __LINE__);
}
The whole project is compiled into an xxx.so file, which is called from Java code in an Android app.
Now I am debugging the app; it eventually crashes. According to the logs, the number of entrance log lines printed is only 14, but the number of exit log lines printed is more than 200.
How can this happen?
When printf format specifiers don't line up correctly with the corresponding arguments, bad things happen in almost every implementation. For example, if given a 64-bit value where a 32-bit value is expected, most implementations print the upper 32 bits of the 64-bit value. The next argument to print (in your case the %s for __FILE__) will then start out with the remaining 32 bits of the 64-bit value. In your example, though, this data will get treated like a pointer to a string (a C string). Since those lower 32 bits of the value point at nothing, bad things can happen.
Here, printf might assume %d means a 32-bit integer (it's platform-dependent). clock() returns a "clock_t". It could be that clock_t is larger than 32 bits, since it counts clock ticks. This would trigger the condition above.
Best to avoid all this nonsense and use std::cout.
Also, that method is a function that returns a value. If you have any return statements above the final logging line, that final log won't be executed, of course. In C it can be quite difficult to arrange things so that the final logging always happens; in C++ you would just write a local class with a destructor.
Assuming you have caught all the normal return points, is there a possible exception condition?
Again, using a class with a destructor would help trap this.