For simplicity, imagine this scenario: we have a 2-bit computer, which has a pair of 2-bit registers called r1 and r2 and only works with immediate addressing.
Let's say the bit sequence 00 means add to our CPU, 01 means move data to r1, and 10 means move data to r2.
So there is an assembly language for this computer and an assembler, where a sample program would be written like:
mov r1,1
mov r2,2
add r1,r2
Simply put, when I assemble this code to native code, the file will be something like:
0101 1010 0001
The 12 bits above are the native code for:
Put decimal 1 into r1, put decimal 2 into r2, add them and store the result in r1.
So this is basically how a compiled code works, right?
Let's say someone implements a JVM for this architecture. In Java I will be writing code like:
int x = 1 + 2;
How exactly will the JVM interpret this code? I mean, eventually the same bit pattern must be passed to the CPU, mustn't it? All CPUs have a set of instructions that they can understand and execute, and those instructions are, after all, just bits. Let's say the compiled Java bytecode looks something like this:
1111 1100 1001
or whatever. Does it mean that interpreting changes this code to 0101 1010 0001 when executing? If so, it is already in native code, so why is it said that the JIT only kicks in after a number of executions? If it does not convert it exactly to 0101 1010 0001, then what does it do? How does it make the CPU do the addition?
Maybe there are some mistakes in my assumptions.
I know interpreting is slow and compiled code is faster but not portable, and that a virtual machine "interprets" code, but how? I am looking for how exactly/technically interpreting is done. Any pointers (such as books or web pages) are welcome instead of answers as well.
The CPU architecture you describe is unfortunately too restricted to make this really clear with all the intermediate steps. Instead, I will write pseudo-C and pseudo-x86-assembler, hopefully in a way that is clear without being terribly familiar with C or x86.
The compiled JVM bytecode might look something like this:
ldc 0 # push the first constant (== 1)
ldc 1 # push the second constant (== 2)
iadd # pop two integers and push their sum
istore_0 # pop result and store in local variable
The interpreter has (a binary encoding of) these instructions in an array, and an index referring to the current instruction. It also has an array of constants, and a memory region used as stack and one for local variables. Then the interpreter loop looks like this:
while (true) {
    switch (instructions[pc]) {
    case LDC:
        sp += 1;                                   // make space for the constant
        stack[sp] = constants[instructions[pc+1]];
        pc += 2;                                   // two-byte instruction
        break;
    case IADD:
        stack[sp-1] += stack[sp];                  // add to the first operand
        sp -= 1;                                   // pop the other operand
        pc += 1;                                   // one-byte instruction
        break;
    case ISTORE_0:
        locals[0] = stack[sp];
        sp -= 1;                                   // pop
        pc += 1;                                   // one-byte instruction
        break;
    // ... other cases ...
    }
}
This C code is compiled into machine code and run. As you can see, it's highly dynamic: it inspects each bytecode instruction each time that instruction is executed, and all values go through the stack (i.e. RAM).
While the actual addition itself probably happens in a register, the code surrounding the addition is rather different from what a Java-to-machine code compiler would emit. Here's an excerpt from what a C compiler might turn the above into (pseudo-x86):
.ldc:
incl %esi # increment the variable pc, first half of pc += 2;
movb %ecx, program(%esi) # load byte after instruction
movl %eax, constants(,%ebx,4) # load constant from pool
incl %edi # increment sp
movl %eax, stack(,%edi,4) # write constant onto stack
incl %esi # other half of pc += 2
jmp .EndOfSwitch
.addi:
movl %eax, stack(,%edi,4) # load first operand
decl %edi # sp -= 1;
addl stack(,%edi,4), %eax # add
incl %esi # pc += 1;
jmp .EndOfSwitch
You can see that the operands for the addition come from memory instead of being hardcoded, even though for the purposes of the Java program they are constant. That's because for the interpreter, they are not constant. The interpreter is compiled once and then must be able to execute all sorts of programs, without generating specialized code.
The purpose of the JIT compiler is to do just that: Generate specialized code. A JIT can analyze the ways the stack is used to transfer data, the actual values of various constants in the program, and the sequence of calculations performed, to generate code that more efficiently does the same thing. In our example program, it would allocate the local variable 0 to a register, replace the access to the constant table with moving constants into registers (movl %eax, $1), and redirect the stack accesses to the right machine registers. Ignoring a few more optimizations (copy propagation, constant folding and dead code elimination) that would normally be done, it might end up with code like this:
movl %ebx, $1 # ldc 0
movl %ecx, $2 # ldc 1
movl %eax, %ebx # (1/2) addi
addl %eax, %ecx # (2/2) addi
# no istore_0, local variable 0 == %eax, so we're done
Not all computers have the same instruction set. Java bytecode is a kind of Esperanto - an artificial language to improve communication. The Java VM translates the universal Java bytecode to the instruction set of the computer it runs on.
So how does JIT figure in here? The main purpose of the JIT compiler is optimization. There are often different ways to translate a certain piece of bytecode into target machine code. The best-performing translation is often non-obvious because it might depend on the data. There are also limits to how far a program can analyze an algorithm without executing it - the halting problem is a well-known such limitation, but not the only one. So what the JIT compiler does is try different possible translations and measure how fast they execute with the real-world data the program processes. So it takes a number of executions until the JIT compiler finds the best translation.
One of the important steps in Java is that the compiler first translates the .java code into a .class file, which contains the Java bytecode. This is useful, as you can take .class files and run them on any machine that understands this intermediate language, by translating it on the spot line-by-line, or chunk-by-chunk. This is one of the most important functions of the Java compiler + interpreter. You can directly compile Java source code to a native binary, but this negates the idea of writing the original code once and being able to run it anywhere: the compiled native binary will only run on the hardware/OS architecture it was compiled for, and to run it on another architecture you would have to recompile the source there. With compilation to the intermediate-level bytecode, you don't need to drag the source code around, just the bytecode. That introduces a different issue, as you now need a JVM that can interpret and run the bytecode. As such, compiling to the intermediate-level bytecode, which the interpreter then runs, is an integral part of the process.
As for the actual realtime running of code: yes, the JVM will eventually interpret/run some binary code that may or may not be identical to natively compiled code. And in a one-line example, they may seem superficially the same. But the interpreter typically doesn't precompile everything; it goes through the bytecode and translates it to binary line-by-line or chunk-by-chunk. There are pros and cons to this (compared to natively compiled code, e.g. from C compilers) and lots of resources online to read up on further. See my answer here, or this, or this one.
Simplifying, an interpreter is an infinite loop with a giant switch inside.
It reads Java byte code (or some internal representation) and emulates a CPU executing it.
This way the real CPU executes the interpreter code, which emulates the virtual CPU.
This is painfully slow. A single virtual instruction adding two numbers requires three function calls and many other operations.
A single virtual instruction takes a couple of real instructions to execute.
This is also less memory efficient, as you have both a real and an emulated stack, registers and instruction pointer.
while (true) {
    Operation op = methodByteCode.get(instructionPointer);
    switch (op) {
        case ADD:
            stack.pushInt(stack.popInt() + stack.popInt());
            instructionPointer++;
            break;
        case STORE:
            memory.set(stack.popInt(), stack.popInt());
            instructionPointer++;
            break;
        ...
    }
}
When some method has been interpreted multiple times, the JIT compiler kicks in.
It will read all the virtual instructions and generate one or more native instructions that do the same thing.
Here I'm generating a string of assembly text, which would require an additional assembly-to-native-binary conversion.
for (Operation op : methodByteCode) {
    switch (op) {
        case ADD:
            compiledCode += "popi r1\n";
            compiledCode += "popi r2\n";
            compiledCode += "addi r1, r2, r3\n";
            compiledCode += "pushi r3\n";
            break;
        case STORE:
            compiledCode += "popi r1\n";
            compiledCode += "storei r1\n";
            break;
        ...
    }
}
After the native code is generated, the JVM will copy it somewhere, mark that region as executable and instruct the interpreter to invoke it instead of interpreting the byte code the next time this method is invoked.
A single virtual instruction might still take more than one native instruction, but this will be nearly as fast as ahead-of-time compilation to native code (as in C or C++).
Compilation is usually much slower than interpreting, but has to be done only once and only for chosen methods.
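If you want to watch this happen on a real HotSpot JVM, the -XX:+PrintCompilation flag prints a line each time a method is compiled (the exact output format varies between JVM versions), so you can see methods being picked up only after they have been invoked enough times:
$ java -XX:+PrintCompilation MyProgram
Here MyProgram is just a placeholder for your own main class.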
Why is the max recursion depth I can reach non-deterministic?
A simple class for demonstration purposes:
public class Main {
    private static int counter = 0;

    public static void main(String[] args) {
        try {
            f();
        } catch (StackOverflowError e) {
            System.out.println(counter);
        }
    }

    private static void f() {
        counter++;
        f();
    }
}
I executed the above program 5 times; the results were:
22025
22117
15234
21993
21430
Why are the results different each time?
I tried setting the max stack size (for example -Xss256k). The results were then a bit more consistent but again not equal each time.
Java version:
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
EDIT
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
The observed variance is caused by background JIT compilation.
This is how the process looks:
Method f() starts execution in interpreter.
After a number of invocations (around 250) the method is scheduled for compilation.
The compiler thread works in parallel to the application thread. Meanwhile the method continues execution in interpreter.
As soon as the compiler thread finishes compilation, the method entry point is replaced, so the next call to f() will invoke the compiled version of the method.
There is basically a race between the application thread and the JIT compiler thread. The interpreter may perform a different number of calls before the compiled version of the method is ready. At the end there is a mix of interpreted and compiled frames.
No wonder the compiled frame layout differs from the interpreted one. Compiled frames are usually smaller; they don't need to store all the execution context on the stack (method reference, constant pool reference, profiler data, all arguments, expression variables etc.).
Furthermore, with Tiered Compilation (the default since JDK 8) there are even more race possibilities. There can be a combination of 3 types of frames: interpreter, C1 and C2 (see below).
Let's have some fun experiments to support the theory.
Pure interpreted mode. No JIT compilation.
No races => stable results.
$ java -Xint Main
11895
11895
11895
Disable background compilation. JIT is ON, but is synchronized with the application thread.
No races again, but the number of calls is now higher due to compiled frames.
$ java -XX:-BackgroundCompilation Main
23462
23462
23462
Compile everything with C1 before execution. Unlike the previous case, there will be no interpreted frames on the stack, so the number will be a bit higher.
$ java -Xcomp -XX:TieredStopAtLevel=1 Main
23720
23720
23720
Now compile everything with C2 before execution. This will produce the most optimized code with the smallest frame. The number of calls will be the highest.
$ java -Xcomp -XX:-TieredCompilation Main
59300
59300
59300
Since the default stack size is 1M, this should mean the frame now is only 16 bytes long. Is it?
$ java -Xcomp -XX:-TieredCompilation -XX:CompileCommand=print,Main.f Main
0x00000000025ab460: mov %eax,-0x6000(%rsp) ; StackOverflow check
0x00000000025ab467: push %rbp ; frame link
0x00000000025ab468: sub $0x10,%rsp
0x00000000025ab46c: movabs $0xd7726ef0,%r10 ; r10 = Main.class
0x00000000025ab476: addl $0x2,0x68(%r10) ; Main.counter += 2
0x00000000025ab47b: callq 0x00000000023c6620 ; invokestatic f()
0x00000000025ab480: add $0x10,%rsp
0x00000000025ab484: pop %rbp ; pop frame
0x00000000025ab485: test %eax,-0x23bb48b(%rip) ; safepoint poll
0x00000000025ab48b: retq
In fact, the frame here is 32 bytes, but JIT has inlined one level of recursion.
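As a rough cross-check (my arithmetic, not part of the original answer): with a 1 MB default stack,
1,048,576 bytes / 59,300 counted calls ≈ 17.7 bytes per counted call
Since one level of recursion is inlined, each physical frame accounts for two counts, i.e. ≈ 35 bytes per frame, which matches the 32 bytes built above (8-byte return address + 8-byte saved %rbp + the 16 bytes of sub $0x10,%rsp); the remaining difference is the stack consumed before and around the recursion (main(), VM entry frames, and the reserved/shadow area).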
Finally, let's look at the mixed stack trace. In order to get it, we'll crash JVM on StackOverflowError (option available in debug builds).
$ java -XX:AbortVMOnException=java.lang.StackOverflowError Main
The crash dump hs_err_pid.log contains the detailed stack trace where we can find interpreted frames at the bottom, C1 frames in the middle and lastly C2 frames on the top.
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5958 [0x00007f21251a5900+0x0000000000000058]
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
// ... repeated 19787 times ...
J 164 C2 Main.f()V (12 bytes) # 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
// ... repeated 1866 times ...
J 163 C1 Main.f()V (12 bytes) # 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
j Main.f()V+8
j Main.f()V+8
// ... repeated 1839 times ...
j Main.f()V+8
j Main.main([Ljava/lang/String;)V+0
v ~StubRoutines::call_stub
First of all, the following has not been researched. I have not "deep dived" the OpenJDK source code to validate any of the following, and I don't have access to any inside knowledge.
I tried to validate your results by running your test on my machine:
$ java -version
openjdk version "1.8.0_71"
OpenJDK Runtime Environment (build 1.8.0_71-b15)
OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)
I get the "count" varying over a range of ~250. (Not as much as you are seeing)
First some background. A thread stack in a typical Java implementation is a contiguous region of memory that is allocated before the thread is started, and that is never grown or moved. A stack overflow happens when the JVM tries to create a stack frame to make a method call, and the frame goes beyond the limits of the memory region. The test could be done by testing the SP explicitly, but my understanding is that it is normally implemented using a clever trick with the memory page settings.
When a stack region is allocated, the JVM makes a syscall to tell the OS to mark a "red zone" page at the end of the stack region read-only or non-accessible. When a thread makes a call that overflows the stack, it accesses memory in the "red zone" which triggers a memory fault. The OS tells the JVM via a "signal", and the JVM's signal handler maps it to a StackOverflowError that is "thrown" on the thread's stack.
So here are a couple of possible explanations for the variability:
The granularity of hardware-based memory protection is the page boundary. So if the thread stack has been allocated using malloc, the start of the region is not going to be page aligned. Therefore the distance from the start of the stack frame to the first word of the "red zone" (which is page aligned) is going to be variable.
The "main" stack is potentially special, because that region may be used while the JVM is bootstrapping. That might lead to some "stuff" being left on the stack from before main was called. (This is not convincing ... and I'm not convinced.)
Having said this, the "large" variability that you are seeing is baffling. Page sizes are too small to explain a difference of ~7000 in the counts.
UPDATE
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
Interesting. Among other things, that could cause stack limit checking to be done differently.
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Plausible. The size of the stack frame could well be different after the f() method has been JIT compiled. Assuming f() was JIT compiled at some point, your stack will have a mixture of "old" and "new" frames. If the JIT compilation occurred at different points, then the ratio will be different ... and hence the count will be different when you hit the limit.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
Little chance of that, I'm afraid ... unless you are prepared to PAY someone to do a few days research for you.
1) No such (public) reference documentation exists, AFAIK. At least, I've never been able to find a definitive source for this kind of thing ... apart from deep diving the source code.
2) Looking at the JIT compiled code tells you nothing of how the bytecode interpreter handled things before the code was JIT compiled. So you won't be able to see if the frame size has changed.
The exact functioning of the Java stack is undocumented, but it totally depends on the memory allocated to that thread.
Just try using the Thread constructor that takes a stack size and see if the count becomes constant. I have not tried it, so please share the results.
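For completeness, a minimal sketch of that experiment (the class name is made up; note that per the Thread javadoc the stackSize argument is only a suggestion that the VM is free to round up or ignore, so the result may still be platform-dependent):

public class StackSizeTest {
    private static int counter = 0;

    private static void f() {
        counter++;
        f();
    }

    public static void main(String[] args) throws InterruptedException {
        // Request roughly 256 KB of stack for the recursing thread.
        Thread t = new Thread(null, new Runnable() {
            public void run() {
                try {
                    f();
                } catch (StackOverflowError e) {
                    System.out.println(counter);
                }
            }
        }, "deep-recursion", 256 * 1024);
        t.start();
        t.join();
    }
}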
I am writing a variety of equivalent programs in Java and C++ to compare the two languages for speed. Those programs employ heavy mathematical computations in a loop.
Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.
Which g++ compiler optimization should I use to reach a conclusion about my comparisons?
I know this is not as simple to conclude as it sounds, but I would like to have some insights about latency/speed comparisons between Java and C++.
Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.
-O3 will certainly beat -O2 in microbenchmarks but when you benchmark a more realistic application (such as a FIX engine) you will see that -O2 beats -O3 in terms of performance.
As far as I know, -O3 does a very good job compiling small and mathematical pieces of code, but for more realistic and larger applications it can actually be slower than -O2. By trying to aggressively optimize everything (inlining, vectorization, etc.), the compiler will produce huge binaries, leading to CPU cache misses (especially instruction cache misses). That's one of the reasons the HotSpot JIT chooses not to optimize big methods and/or non-hot methods.
One important thing to notice is that JIT uses methods as independent units eligible for optimization. In your previous questions, you have the following code:
int iterations = stoi(argv[1]);
int load = stoi(argv[2]);
long long x = 0;
for(int i = 0; i < iterations; i++) {
    long start = get_nano_ts(); // START clock
    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }
    long end = get_nano_ts(); // STOP clock
    // (omitted for clarity)
}
cout << "My result: " << x << endl;
But this code is JIT-unfriendly because the hot block of code is not in its own method. For major JIT gains, you should have put the block of code inside the loop in its own method. Your program executes a hot block of code instead of a hot method. The method that contains the for loop is probably called only once, so the JIT will not do anything about it.
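To make that concrete, here is a hedged Java sketch of the same loop restructured so the hot block becomes its own small method (names are made up; timing code is omitted). The inner method is invoked once per outer iteration, so it quickly becomes a hot method that HotSpot can profile, compile and inline as a unit:

public final class LoadLoop {

    // Hot method: called once per outer iteration, so it quickly
    // crosses the invocation threshold and gets JIT compiled.
    private static long innerLoad(int i, int load, long x) {
        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }
        return x;
    }

    public static void main(String[] args) {
        int iterations = Integer.parseInt(args[0]);
        int load = Integer.parseInt(args[1]);
        long x = 0;
        for (int i = 0; i < iterations; i++) {
            x = innerLoad(i, load, x);   // a hot method, not just a hot block
        }
        System.out.println("My result: " + x);
    }
}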
When comparing Java with C++ for speed should I compile the C++ code with -O3 or -O2?
Well, if you use -O3 for microbenchmarks you will get amazingly fast results that will be unrealistic for larger and more complex applications. That's why I think the judges use -O2 instead of -O3. For example, our garbage-free Java FIX engine is faster than C++ FIX engines, and I have no idea whether they are compiled with -O0, -O1, -O2, -O3 or a mix of them through executable linking.
In theory it would be possible to selectively compartmentalize an entire C++ application into executable pieces, choose which ones are going to be compiled with -O2 and which with -O3, and then link everything into an ideal binary executable. But in reality, how feasible is that?
The approach the Hotspot chooses is much simpler. It says:
Listen, I am going to consider each method as an independent unit of execution instead of any block of code anywhere. If that method is hot enough (i.e. called often) and small enough I will try to aggressively optimize it.
That of course has the drawback of requiring code warmup but it is much simpler and produces the best results most of the time for realistic/large/complex applications.
And last but not least, you should probably consider this question if you want to compile your entire application with -O3: When can I confidently compile program with -O3?
If possible, compare it against both, since -O2 and -O3 are both options available to the C++ developer. Sometimes -O2 will win. Sometimes -O3 will win. If you have both available, that's just more information which can be used to support whatever you're trying to accomplish by doing these speed comparisons.
How are keywords represented in binary form?
For example: in Java, how is sin() represented in binary? How are sqrt() and other functions represented?
And not only in Java - in any language, how is it represented? Because ultimately everything is translated into binary and then into on and off signals.
Thanks in advance.
Firstly, sin is not a keyword in Java. It is an identifier. Keywords are things like if, class, and so on.
It depends on what stage you are asking about.
In the source code, the sin identifier is represented as characters, and those characters are represented as bits (i.e. binary) ... if you want to look at it that way.
In the classfile that is output by the javac compiler, the word sin is represented as a string in the constant pool. (The JVM spec specifies the format of classfiles in great detail.)
When the classfile is first loaded by a JVM, the word sin becomes a Java String object.
When the code is linked by the JVM, the reference to the String is resolved to some kind of reference to a method. (The details are implementation specific. You'd need to read the JVM source code to find out more.)
When the code is JIT compiled, the reference to the method (typically) turns into the address in memory of the first native instruction of the JIT-compiled method. (Strictly speaking, this is not "assembly language", but the native instructions could be represented as assembly language. Assembly language is really just a "human friendly" textual representation of the instructions.)
So how does the computer know that when sin is written it has to compute the sine of a number?
What happens is that the Java runtime loads that class containing the method. Then it looks for the sin(double) method in the class that it loaded. What typically happens is that the named method resolves to some bytecodes that are the instructions that tell the runtime what the method should do. But in the case of sin, the method is a native method, and the instructions are actually native instructions that are part of one of the JVM's native libraries.
If not for methods, can we have a binary representation of keywords, like int and float etc.?
It depends on the actual keywords. But generally speaking, genuine Java keywords are transformed by the compiler into a form that doesn't have a distinct / discrete representation for the individual keywords.
And not only in Java - in any language, how is it represented? Because ultimately everything is translated into binary and then into on and off signals.
This tells me that you probably have a fundamental misunderstanding of how programming languages are implemented. So instead of answering this question (it doesn't really have a proper answer other than "well they're not represented at all"), I will try to help you understand why this question is the wrong one to ask.
Your computer runs machine code, and only machine code. You can feed it any random sequence of bytes; it doesn't matter what they were intended to be: as soon as you point the program counter at it, it will be interpreted as if it were machine code (of course, giving it bytes that were not intended to be machine code is probably a bad idea). As a running example, I'll use this x64 code:
48 01 F7 48 89 F8 C3
If you have no idea what's going on, that's normal at this level. Most people don't read machine code (but they could if they learned it, it's not magic). This is where the zeroes and ones are, to the processor it's not even in hexadecimal, that's just what humans like to read.
At a level above that there is assembly, which is in most cases really just a different way of looking at machine code, in such a way that humans find it easier to read. The example from earlier looks more sensible in assembly:
add rdi, rsi
mov rax, rdi
ret
Still not very clear what's going on to someone who doesn't know x64 assembly, but at least it gives some sort of clue: there's an add in it. It probably adds things.
At a yet higher level, you could have Java bytecode or Java, but I think the Java aspect of this question misses the point; it's probably there because the OP doesn't realize that Java is different from "the classic picture". Java just complicates matters without explaining the big picture. Let's use C instead. The example in C could look like:
int64_t foo_or_whatever(int64_t x, int64_t y)
{
    return x + y;
}
If you don't know C but you do know Java, the only strange thing here is int64_t, which is roughly the equivalent of a long in Java.
So yes, things were added, as the assembly code suggested. Now where did the keywords go?
That question doesn't make as much sense as you thought it did. The compiler understands keywords and uses them to create machine code that implements your program. After that point they stop being relevant. They only mean something in the context of the high-level language that you wrote the code in; you could say that at that level, they are stored as an ASCII or UTF-8 string in a file. They have nothing to do with machine code, they do not appear in any form there, and you can write machine code without having translated it from a high-level language that has keywords. That return and ret look vaguely similar is a bit of a red herring; they have something to do with each other, but the relation is far from simple (that it worked out simply in the example I'm using is of course no accident).
The int64_t has perhaps not entirely disappeared (mostly it has, though). The fact that the addition operates on 64-bit integers is encoded in the instruction 48 01 F7. Not the keyword int64_t (which isn't even a keyword, but let's not get into that), but "the fact that what you have there is an addition between 64-bit integers", which is a conceptually different thing, though caused here by the use of int64_t. To split that instruction out while skipping some of the detail (because this is a beginner question), there's:
48 = 01001000 encoding REX.W, meaning this instruction is 64-bit
01 = 00000001 encoding add rm64, r64 in this case
F7 = 11110111 encoding the operands rdi and rsi
To learn more about what the processor does with machine code (in case your follow-up question is "but how does it know what to do with something like 48 01 F7"), study computer architecture. If you want a book, I recommend Computer Architecture, Fifth Edition: A Quantitative Approach, which is quite accessible to beginners and commonly used in first-year courses about computer architectures.
To learn more about the journey from high level language to machine code, study compiler construction. If you want a book, I recommend Compilers: Principles, Techniques, and Tools, but it may be hard to get through it as a beginner. If you want a free course, you could follow Compilers on Coursera (the first few lectures especially will give you an overview of what compilers do without getting too technical yet).
Incidentally, if you give the example C code to GCC, it makes
lea rax, [rdi + rsi]
ret
It's still doing the same thing, but in a way that didn't fit my story, so I took the liberty of doing it in a slightly different way.
sin() is a function so it's represented as a memory address where its code block is.
Keywords (like for) aren't represented as binary; for, for example, is converted to a sequence of bytecode jump instructions, which are compiled into machine instructions, which are represented as binary.
My point is that you cannot convert most keywords directly into binary. You can unroll them into bytecode, which you could then convert to native machine code and binary, but not directly to binary.
Here, read this; then, after you understand it, move on to how bytecode is converted to native code.
Keywords and Functions
That said, a keyword in Java (and most languages) is a reserved word like for, while or return, but your examples are not keywords; they are function names, like sin() and sqrt().
Not really sure what you want to know here; so let's go "bytecode"...
Both the sin() and sqrt() methods are static methods of the Math class; therefore, for each call the compiler will generate a call site with its argument and a reference to the method, and then emit invokestatic.
Other than invokestatic, you have invokevirtual, invokespecial, invokeinterface and (since Java 7) invokedynamic.
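A small hedged sketch to make the distinction concrete; the comments indicate which invocation bytecode javac would normally emit for each call (the exact constant-pool indices you would see in javap -c output will of course vary):

import java.util.function.DoubleSupplier;

public class InvokeKinds {
    double field = 2.0;

    InvokeKinds() {
        super();                            // constructor call  -> invokespecial
    }

    double examples() {
        double a = Math.sqrt(field);        // static method     -> invokestatic
        String s = Double.toString(a);      // static method     -> invokestatic
        int len = s.length();               // instance method   -> invokevirtual
        CharSequence cs = s;
        char c = cs.charAt(0);              // interface method  -> invokeinterface
        DoubleSupplier sup = () -> field;   // lambda creation   -> invokedynamic
        return a + len + c + sup.getAsDouble();
    }
}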
Now, at runtime, the JIT will kick in; and the JIT may end up producing pure native code, but this is not a guarantee. In any event, the code will be fast enough.
And the same goes for the JDK libraries themselves; the JIT will kick in and maybe turn the byte code into native code given a sufficient time to analyze it (escape analysis, inlining etc).
And since the JIT does "whatever it wants", you cannot reliably have a "binary" representation of any method from any class.
We're developing a complex application which consists of a Linux binary integrated with Java JNI calls (into a JVM created by the Linux binary) from our custom-made .jar file. All GUI work is implemented and done by the Java part. Each time some GUI property has to be changed or the GUI has to be repainted, it is done by a JNI call into the JVM.
The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. It is done iteratively and frequently, a few hundred or thousand iterations per second.
After a more or less fixed amount of time, the application is terminated with exit(1), which I caught with gdb and found to be called from _XIOError(). This termination can be reproduced after a more or less exact time period, e.g. after some 15h on an x86 dual core 2.5GHz. If I use a slower computer, it lasts longer, as if it were proportional to cpu/gpu speed. One conclusion would be that some part of xorg ran out of some resource or something like that.
Here is my backtrace:
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2 0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3 0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4 0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5 0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6 0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7 0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8 0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9 0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so
I made my own exit() call in liboverrides.so and used it with LD_PRELOAD to capture exit() call in gdb with help of abort()/SIGABRT.
After some debugging of libX11 and libxcb, I noticed that _XReply() got NULL reply (response from xcb_wait_for_reply()) that causes call to _XIOError() and exit(1). Going more deeply in libxcb in xcb_wait_for_reply() function, I noticed that one of the reasons it can return NULL reply is when it detects broken or closed socket connection, which could be my situation.
For test purposes, if I change xcb_io.c and ignore _XIOError(), the application doesn't work any more. And if I repeat the request inside _XReply(), it fails each time, i.e. gets a NULL response on each xcb_wait_for_reply().
So, my question would be why such an uncontrolled app termination with exit(1) from _XReply() -> _XIOError() -> exit(1) happens, or how I can find out why it happens, so that I can fix it or work around it.
For this problem to reproduce, as I wrote above, I have to wait some 15h, but currently I'm very short on time for debugging and can't find the cause of the problem/termination.
We also tried to reorganise the Java part which handles the GUI/display refresh, but the problem wasn't solved.
Some SW facts:
- java jre 1.8.0_20, even with java 7 can repeat the problem
- libX11.so 1.5.0
- libxcb.so 1.8.1
- debian wheezy
- kernel 3.2.0
This is likely a known issue in libX11 regarding the handling of request numbers used for xcb_wait_for_reply.
At some point after libxcb v1.5, code to use 64-bit sequence numbers internally everywhere was introduced, and logic was added to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.
Here is a quote from submitted libxcb bug report (actual emails removed):
We have an application that does a lot of XDrawString and XDrawLine.
After several hours the application is exited by an XIOError.
The XIOError is called in libX11 in the file xcb_io.c, function
_XReply. It didn't get a response from xcb_wait_for_reply.
libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to
this commit:
commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date: Sat Oct 9 17:13:45 2010 -0700
xcb_in: Use 64-bit sequence numbers internally everywhere.
Widen sequence numbers on entry to those public APIs that still take
32-bit sequence numbers.
Signed-off-by: Jamey Sharp <jamey#xxxxxx.xxx>
Reverting it on top of 1.8.1 helps.
Adding traces to libxcb I found that the last request numbers used for
xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in
the while loop of the _XReply function), half a second later: 63215
(then XIOError is called). The widen_request is also 63215, I would
have expected 63215+2^32. Therefore it seems that the request is not
correctly widened.
The commit above also changed the compares in poll_for_reply from
XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening
never worked correctly, but it was never observed, because only the
lower 32bits were compared.
Reproducing the issue
Here's the original code snippet from the submitted bug report which was used to reproduce the issue:
for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
}
and apparently the issue can be reproduced with even simpler code:
for(;;) {
    XNoOp(dpy);
}
According to submitted libxcb bug report these conditions are needed to reproduce (assuming the reproduce code is in xdraw.c):
libxcb >= 1.8 (i.e. includes the commit ed37b08)
compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
the sequence counter wraps.
Proposed patch
The proposed patch which can be applied on top of libxcb 1.8.1 is this:
diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
static const xReq dummy_request;
static char const pad[3];
struct iovec vec[3];
- uint64_t requests;
+ unsigned long requests;
_XExtension *ext;
xcb_connection_t *c = dpy->xcb->connection;
if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
{
uint64_t sequence;
- for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+ for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
append_pending_request(dpy, sequence);
}
requests = dpy->request - dpy->xcb->last_flushed;
Detailed technical explanation
Please find below the detailed technical explanation by Jonas Petersen (also included in the aforementioned bug report):
Hi,
Here's two patches. The first one fixes a 32-bit sequence wrap bug.
The second patch only adds a comment to another relevant statement.
The patches contain some details. Here is the whole story for who
might be interested:
Xlib (libx11) will crash an application with a "Fatal IO error 11
(Resource temporarily unavailable)" after 4 294 967 296 requests to
the server. That is when the Xlib internal 32-bit sequence wraps.
Most applications probably will hardly reach this number, but if they
do, they have a chance to die a mysterious death. For example the
application I'm working on did always crash after about 20 hours when
I started to do some stress testing. It does some intensive drawing
through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per
second in full hd resolution (on Ubuntu). Some optimizations did
extend the grace to about 35 hours but it would still crash.
What then followed was some frustrating weeks of digging and debugging
to realize that it's not in my application, nor in gtkmm, gtk or glib
but that it's this little bug in Xlib which exists since 2006-10-06
apparently.
It took a while to turn out that the number 0x100000000 (2^32) has
some relevance. (Much) later it turned out it can be reproduced with
Xlib only, using this code for example:
while(1) {
    XDrawPoint(display, drawable, gc, x, y);
    XFlush(display);
}
It might take one or two hours, but when it reaches the 4294 million
it will explode into a "Fatal IO error 11".
What I then learned is that even though Xlib uses internal 32bit
sequence numbers they get (smartly) widened to 64bit in the process
so that the 32bit sequence may wrap without any disruption in the
widened 64bit sequence. Obviously there must be something wrong with
that.
The Fatal IO error is issued in _XReply() when it's not getting a
reply where there should be one, but the cause is earlier in _XSend()
in the moment when the Xlib 32-bit sequence number wraps.
The problem is that when it wraps to 0, the value of 'last_flushed' will still be at the upper boundary (e.g. 0xffffffff). There are two locations in _XSend() (xcb_io.c) that fail in this state because they rely on those values being sequential all the time. The first location is:
requests = dpy->request - dpy->xcb->last_flushed;
In the case of request = 0x0 and last_flushed = 0xffffffff it will assign
0xffffffff00000001 to 'requests' and then to XCB as a number (amount)
of requests. This is the main killer.
The second location is this:
for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
In the case of request = 0x0 (less than last_flushed) there is no chance to
enter the loop ever and as a result some requests are ignored.
The solution is to "unwrap" dpy->request at these two locations and
thus retain the sequence related to last_flushed.
uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;
It creates a temporary 64-bit request number which has bit 32 set if 'request' is less than 'last_flushed'. It is then used in the two locations instead of dpy->request.
I'm not sure if it might be more efficient to use that statement
inplace, instead of using a variable.
There is another line in require_socket() that worried me at first:
dpy->xcb->last_flushed = dpy->request = sent;
That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to 32 bits when assigning it to 'request' and then also assign the truncated value to the (64-bit) 'last_flushed'. But it seems intended.
I have added a note explaining that for the next poor soul debugging
sequence issues... :-)
Jonas
Jonas Petersen (2): xcb_io: Fix Xlib 32-bit request number wrapping
xcb_io: Add comment explaining a mixed type double assignment
src/xcb_io.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
--
1.7.10.4
Good luck!
The Situation:
I'm optimizing a pure-java implementation of the LZF compression algorithm, which involves a lot of byte[] access and basic int mathematics for hashing and comparison. Performance really matters, because the goal of the compression is to reduce I/O requirements. I am not posting code because it isn't cleaned up yet, and may be restructured heavily.
The Questions:
How can I write my code to allow it to JIT-compile to a form using faster SSE operations?
How can I structure it so the compiler can easily eliminate array bounds checks?
Are there any broad references about the relative speed of specific math operations (how many increments/decrements does it take to equal a normal add/subtract, how fast is shift-or vs. an array access)?
How can I work on optimizing branching -- is it better to have numerous conditional statements with short bodies, or a few long ones, or short ones with nested conditions?
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
What I've already done:
Before I get attacked for premature optimization: the basic algorithm is already excellent, but the Java implementation is less than 2/3 the speed of equivalent C. I've already replaced copying loops with System.arraycopy, worked on optimizing loops and eliminated un-needed operations.
I make heavy use of bit twiddling and packing bytes into ints for performance, as well as shifting & masking.
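For readers who haven't seen the technique, a minimal sketch of what "packing bytes into ints" with shifting and masking looks like (illustrative only, not the actual LZF code):

class BytePacking {
    // Pack three consecutive bytes into the low 24 bits of an int;
    // the & 0xFF masks keep sign extension of the byte values from leaking in.
    static int pack3(byte[] in, int pos) {
        return ((in[pos] & 0xFF) << 16)
             | ((in[pos + 1] & 0xFF) << 8)
             |  (in[pos + 2] & 0xFF);
    }

    // Recover the middle byte again with a shift and a mask.
    static int middleByte(int packed) {
        return (packed >>> 8) & 0xFF;
    }
}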
For legal reasons, I can't look at implementations in similar libraries, and existing libraries have too restrictive license terms to use.
Requirements for a GOOD (accepted) answer:
Unacceptable answers: "this is faster" without an explanation of how much AND why, OR hasn't been tested with a JIT compiler.
Borderline answers: have not been tested with anything before Hotspot 1.4
Basic answers: will provide a general rule and explanation of why it is faster at the compiler level, and roughly how much faster
Good answers: include a couple of samples of code to demonstrate
Excellent answers: have benchmarks with both JRE 1.5 and 1.6
PERFECT answer: Is by someone who worked on the HotSpot compiler, and can fully explain or reference the conditions for an optimization to be used, and how much faster it typically is. Might include java code and sample assembly code generated by HotSpot.
Also: if anyone has links detailing the guts of Hotspot optimization and branching performance, those are welcome. I know enough about bytecode that a site analyzing performance at a bytecode rather than sourcecode level would be helpful.
(Edit) Partial Answer: Bounds-Check Elimination:
This is taken from supplied link to the HotSpot internals wiki at: https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
HotSpot will eliminate bounds checks in for loops that meet the following conditions (a small sketch follows the examples):
Array is loop invariant (not reallocated within the loop)
Index variable has a constant stride (increases/decreases by constant amount, in only one spot if possible)
Array is indexed by a linear function of the variable.
Example: int val = array[index*2 + 5]
OR: int val = array[index+9]
NOT: int val = array[Math.min(var,index)+7]
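Putting those conditions together, a minimal sketch of a loop in the shape this optimization targets (whether the checks are actually removed still depends on the JVM version and flags):

// 'data' is loop invariant, 'i' has a constant stride of 1, and both
// indices are linear functions of 'i', so the bounds checks are
// candidates for elimination.
static long sumPairs(int[] data) {
    long sum = 0;
    for (int i = 0; i + 1 < data.length; i++) {
        sum += data[i] + data[i + 1];
    }
    return sum;
}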
Early version of code:
This is a sample version. Do not steal it, because it is an unreleased version of code for the H2 database project. The final version will be open source. This is an optimization upon the code here: H2 CompressLZF code
Logically, this is identical to the development version, but that one uses a for(...) loop to step through the input, and an if/else block for the different logic between literal and back-reference modes. It reduces array accesses and checks between modes.
public int compressNewer(final byte[] in, final int inLen, final byte[] out, int outPos){
    int inPos = 0;
    // initialize the hash table
    if (cachedHashTable == null) {
        cachedHashTable = new int[HASH_SIZE];
    } else {
        System.arraycopy(EMPTY, 0, cachedHashTable, 0, HASH_SIZE);
    }
    int[] hashTab = cachedHashTable;
    // number of literals in current run
    int literals = 0;
    int future = first(in, inPos);
    final int endPos = inLen-4;
    // Loop through data until all of it has been compressed
    while (inPos < endPos) {
        future = (future << 8) | in[inPos+2] & 255;
        // hash = next(hash,in,inPos);
        int off = hash(future);
        // ref = possible index of matching group in data
        int ref = hashTab[off];
        hashTab[off] = inPos;
        off = inPos - ref - 1; //dropped for speed
        // has match if bytes at ref match bytes in future, etc
        // note: using ref++ rather than ref+1, ref+2, etc is about 15% faster
        boolean hasMatch = (ref > 0 && off <= MAX_OFF && (in[ref++] == (byte) (future >> 16) && in[ref++] == (byte)(future >> 8) && in[ref] == (byte)future));
        ref -=2; // ...EVEN when I have to recover it
        // write out literals, if max literals reached, OR has a match
        if ((hasMatch && literals != 0) || (literals == MAX_LITERAL)) {
            out[outPos++] = (byte) (literals - 1);
            System.arraycopy(in, inPos - literals, out, outPos, literals);
            outPos += literals;
            literals = 0;
        }
        //literal copying split because this improved performance by 5%
        if (hasMatch) { // grow match as much as possible
            int maxLen = inLen - inPos - 2;
            maxLen = maxLen > MAX_REF ? MAX_REF : maxLen;
            int len = 3;
            // grow match length as possible...
            while (len < maxLen && in[ref + len] == in[inPos + len]) {
                len++;
            }
            len -= 2;
            // short matches write length to first byte, longer write to 2nd too
            if (len < 7) {
                out[outPos++] = (byte) ((off >> 8) + (len << 5));
            } else {
                out[outPos++] = (byte) ((off >> 8) + (7 << 5));
                out[outPos++] = (byte) (len - 7);
            }
            out[outPos++] = (byte) off;
            inPos += len;
            //OPTIMIZATION: don't store hashtable entry for last byte of match and next byte
            // rebuild neighborhood for hashing, but don't store location for this 3-byte group
            // improves compress performance by ~10% or more, sacrificing ~2% compression...
            future = ((in[inPos+1] & 255) << 16) | ((in[inPos + 2] & 255) << 8) | (in[inPos + 3] & 255);
            inPos += 2;
        } else { //grow literals
            literals++;
            inPos++;
        }
    }
    // write out remaining literals
    literals += inLen-inPos;
    inPos = inLen-literals;
    if(literals >= MAX_LITERAL){
        out[outPos++] = (byte)(MAX_LITERAL-1);
        System.arraycopy(in, inPos, out, outPos, MAX_LITERAL);
        outPos += MAX_LITERAL;
        inPos += MAX_LITERAL;
        literals -= MAX_LITERAL;
    }
    if (literals != 0) {
        out[outPos++] = (byte) (literals - 1);
        System.arraycopy(in, inPos, out, outPos, literals);
        outPos += literals;
    }
    return outPos;
}
Final edit:
I've marked the best answer so far as accepted, since the deadline is nearly up. Since I took so long before deciding to post code, I will continue to upvote and respond to comments where possible. Apologies if the code is messy: this represented code in development, not polished up for committing.
Not a full answer; I simply don't have time to do the detailed benchmarks your question needs, but hopefully useful.
Know your enemy
You are targeting a combination of the JVM (in essence the JIT) and the underlying CPU/Memory subsystem. Thus "This is faster on JVM X" is not likely to be valid in all cases as you move into more aggressive optimisations.
If your target market/application will largely run on a particular architecture, you should consider investing in tools specific to it.
* If your performance on x86 is the critical factor, then Intel's VTune is excellent for drilling down into the sort of JIT output analysis you describe.
* The differences between 64-bit and 32-bit JITs can be considerable, especially on x86 platforms where calling conventions can change and enregistering opportunities are very different.
Get the right tools
You would likely want to get a sampling profiler. The overhead of instrumentation (and the associated knock-on effects on things like inlining, cache pollution and code size inflation) for your specific needs would be far too great. The Intel VTune analyser can actually be used for Java, though the integration is not as tight as for others.
If you are using the Sun JVM and are happy only knowing what the latest/greatest version is doing, then the options available for investigating the output of the JIT are considerable, if you know a bit of assembly.
This article details some interesting analysis using this functionality
Learn from other implementations
The change history indicates that previous inline assembly was in fact counterproductive and that allowing the compiler to take total control of the output (with tweaks in code rather than directives in assembly) yielded better results.
Some specifics
Since LZF, in an efficient unmanaged implementation on modern desktop CPUs, is largely memory-bandwidth limited (hence it being compared to the speed of an unoptimised memcpy), you will need your code to remain entirely within the level 1 cache.
As such, any static fields you cannot make into constants should be placed within the same class, as these values will often be placed within the same area of memory devoted to the vtables and metadata associated with classes.
Object allocations which cannot be trapped by Escape Analysis (only in 1.6 onwards) will need to be avoided.
The C code makes aggressive use of loop unrolling. However, the performance of this on older (1.4 era) VMs is heavily dependent on the mode the JVM is in. Apparently later Sun JVM versions are more aggressive at inlining and unrolling, especially in server mode.
The prefetch instructions generated by the JIT can make all the difference on code like this, which is nearly memory bound.
"It's coming straight for us"
Your target is moving, and will continue to move. Again, from Marc Lehmann's previous experience:
default HLOG size is now 15 (cpu caches have increased)
Even minor updates to the jvm can involve significant compiler changes
6544668 Don't vectorize array operations that can't be aligned at runtime.
6536652 Implement some superword (SIMD) optimizations
6531696 don't use immediate 16-bits value store to memory on Intel cpus
6468290 Divide and allocate out of eden on a per cpu basis
Captain Obvious
Measure, measure, measure. If you can get your library to include (in a separate dll) a simple and easy to execute benchmark that logs the relevant information (VM version, CPU, OS, command line switches etc.) and makes it simple to send the results back to you, you will increase your coverage; best of all, you'll cover those people using it that care.
As far as bounds check elimination is concerned, I believe the new JDK will already include an improved algorithm that eliminates it, whenever it's possible. These are the two main papers on this subject:
V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002
Effective Enhancement of Loop Versioning in Java. Link. This paper is from the guys at Excelsior, who implemented the technique in their Jet JVM.
Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ. Link. Slightly based on the above paper, this is the implementation that I believe will be included in the next JDK. The achieved speedups are also presented.
There is also this blog entry, which discusses one of the papers superficially, and also presents some benchmarking results, not only for arrays but also for arithmetic in the new JDK. The comments of the blog entry are also very interesting, since the authors of the above papers present some very interesting comments and discuss arguments. Also, there are some pointers to other similar blog posts on this subject.
Hope it helps.
It's rather unlikely that you need to help the JIT compiler much with optimizing a straightforward number-crunching algorithm like LZF. ShuggyCoUk mentioned this, but I think it deserves extra attention:
The cache-friendliness of your code will be a big factor.
You have to reduce the size of your working set and improve data access locality as much as possible. You mention "packing bytes into ints for performance". This sounds like using ints to hold byte values in order to have them word-aligned. Don't do that! The increased data set size will outweigh any gains (I once converted some ECC number-crunching code from int[] to byte[] and got a 2x speed-up).
On the off chance that you don't know this: if you need to treat some data as both bytes and ints, you don't have to shift and |-mask it - use ByteBuffer.asIntBuffer() and related methods.
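For example, a small sketch (the byte order is explicit here; ByteBuffer defaults to big-endian, so pick whatever matches your format):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IntView {
    // View the backing bytes as ints without manual shifting/masking.
    // Element intIndex covers raw[intIndex*4 .. intIndex*4 + 3].
    static int intAt(byte[] raw, int intIndex) {
        return ByteBuffer.wrap(raw)
                         .order(ByteOrder.BIG_ENDIAN)
                         .asIntBuffer()
                         .get(intIndex);
    }
}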
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
You'd better do the benchmark yourself. When I did it, way back in Java 1.3 times, it was somewhere around 2000 elements.
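If you do benchmark it yourself, a rough hedged sketch like the one below is the sort of thing to start from (no benchmark harness here, so warm-up effects and dead-code elimination can easily skew the numbers; treat the output as an indication only):

public class CopyBench {
    // Manual element-by-element copy, timed with System.nanoTime().
    static long copyLoop(int[] src, int[] dst) {
        long t0 = System.nanoTime();
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return System.nanoTime() - t0;
    }

    // Same copy via System.arraycopy.
    static long copyArraycopy(int[] src, int[] dst) {
        long t0 = System.nanoTime();
        System.arraycopy(src, 0, dst, 0, src.length);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        for (int size = 8; size <= (1 << 16); size <<= 1) {
            int[] src = new int[size];
            int[] dst = new int[size];
            long loopNs = 0, sysNs = 0;
            for (int rep = 0; rep < 10000; rep++) { // crude warm-up + averaging
                loopNs += copyLoop(src, dst);
                sysNs += copyArraycopy(src, dst);
            }
            System.out.println(size + " elements: loop=" + loopNs
                    + " ns, System.arraycopy=" + sysNs + " ns");
        }
    }
}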
Lots of answers so far, but a couple of additional things:
Measure, measure, measure. As much as most Java developers warn against micro-benchmarking, make sure you compare performance between changes. Optimizations that do not result in measurable improvements are generally not worth keeping (of course, sometimes it's a combination of things, and that gets trickier).
Tight loops matter as much with Java as with C (and ditto with variable allocations -- although you don't directly control it, HotSpot will eventually have to do it). I managed to practically double the speed of UTF-8 decoding by rearranging the tight loop to handle the single-byte case (7-bit ASCII) as a tight(er) inner loop, leaving the other cases out of it.
Do not underestimate the cost of allocating and/or clearing large arrays -- if you want LZF encoding/decoding to be faster for small/medium chunks too (not just megabyte-sized ones), keep in mind that ALL allocations of byte[]/int[] are somewhat costly; not because of GC, but because the JVM MUST clear the space (see the sketch at the end of this answer).
The H2 implementation has also been optimized quite a bit (for example, it does not clear the hash array any more; this often makes sense), and I actually helped modify it for use in another Java project. My contribution was mostly just changing it to be more optimal for the non-streaming case, but that did not touch the tight encode/decode loops.
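Regarding the allocation/clearing point above, a common workaround is a reusable, thread-confined scratch buffer; here is a minimal sketch (not taken from the H2 code; the caller must remember that a reused buffer still contains old data beyond the length it actually wrote):

class ScratchBuffers {
    // One buffer per thread, grown on demand, so the JVM does not have to
    // zero a fresh array on every encode/decode call.
    private static final ThreadLocal<byte[]> SCRATCH = new ThreadLocal<byte[]>() {
        @Override protected byte[] initialValue() {
            return new byte[64 * 1024];
        }
    };

    static byte[] scratch(int minLen) {
        byte[] buf = SCRATCH.get();
        if (buf.length < minLen) {
            buf = new byte[Math.max(minLen, buf.length * 2)];
            SCRATCH.set(buf);
        }
        return buf;
    }
}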