Differences between Just-in-Time compilation and On-Stack Replacement in Java

Both of them pretty much do the same thing: identify that a method is hot and compile it instead of interpreting it. With OSR, you move to the compiled version while the method is still running, right after it gets compiled, unlike with plain JIT compilation, where the compiled code is only used the next time the method is called.
Other than this, are there any other differences?

In general, just-in-time compilation refers to compiling native code at runtime and executing it instead of (or in addition to) interpreting. Some VMs, such as early versions of Google V8, don't even have an interpreter; they JIT-compile every function that gets executed (with varying degrees of optimization).
On Stack Replacement (OSR) is a technique for switching between different implementations of the same function. For example, you could use OSR to switch from interpreted or unoptimized code to JITed code as soon as it finishes compiling.
OSR is useful in situations where you identify a function as "hot" while it is running. This might not necessarily be because the function gets called frequently; it might be called only once, but it spends a lot of time in a big loop which could benefit from optimization. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame which may have variables in different locations.
OSR can also occur in the other direction: from optimized code to unoptimized code or interpreted code. Optimized code may make some assumptions about the runtime behavior of the program based on past behavior. For instance, you could convert a virtual or dynamic method call into a static call if you've only ever seen one type of receiver object. If it turns out later that these assumptions were wrong, OSR can be used to fall back to a more conservative implementation: the optimized stack frame gets converted into an unoptimized stack frame. If the VM supports inlining, you might even end up converting an optimized stack frame into several unoptimized stack frames.
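As a rough, hypothetical illustration of that kind of speculative optimization (the class names here are invented for the example), consider a loop that only ever sees one receiver type. HotSpot may devirtualize and inline the call; loading a second receiver type invalidates that assumption and forces the optimized code to be deoptimized:

// Hypothetical example: run with -XX:+PrintCompilation to watch the
// compiled loop get deoptimized ("made not entrant") when the second
// receiver type shows up.
interface Shape { double area(); }

class Square implements Shape {
    public double area() { return 1.0; }
}

class Circle implements Shape {
    public double area() { return 3.14; }
}

public class DeoptDemo {
    public static void main(String[] args) {
        Shape[] shapes = { new Square() };
        double sum = 0;
        // Monomorphic phase: the JIT may devirtualize shape.area().
        for (int i = 0; i < 1_000_000; i++) {
            sum += shapes[i % shapes.length].area();
        }
        // Polymorphic phase: the earlier assumption no longer holds,
        // so the optimized code is thrown away and recompiled.
        shapes = new Shape[] { new Square(), new Circle() };
        for (int i = 0; i < 1_000_000; i++) {
            sum += shapes[i % shapes.length].area();
        }
        System.out.println(sum);
    }
}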

Yes, that's pretty much it. Just-in-time compilation can improve performance by compiling "hot spots" (spots of bytecode that are observed or expected to execute very often) to native instructions. On-stack replacement complements JIT capabilities by replacing long-running interpreted "hot" bytecode with its compiled version when it becomes available. The mentioned On-Stack Replacement article shows a nice example where JIT compilation would not be very useful without OSR.

Related

If a method with a while-true loop gets invoked only once, will it stay in the interpreter for the remaining lifetime of its thread?

I had this sort of concern: if a method contains a while-true loop, is only called once, and is interpreted, will it execute in the interpreter forever and kill performance? I first suspected this while testing an AOT-compiled Minecraft version called libminecraft 1.14.4 native next generation. I used OpenJDK 13 + JVMCI and saw better peak performance. I fully understand that Minecraft has a lot of while-true loops running on multiple threads, so I ran another test with inline-then-optimize whole-program optimization; it gave horribly bad performance unless AOT-compiled (the non-optimized version did well in the non-AOT test with the exact same OpenJDK version). Is it actually true that if a method with a while-true loop gets invoked only once, it will stay in the interpreter for the remaining lifetime of its thread? I can't run something as big as Minecraft with -XX:+PrintCompilation to tell.
A method with a long running loop can be JIT-compiled, too.
HotSpot JVM has a technique called on-stack replacement:
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod.
Most compiler features/optimizations are valid for OSR compilation just like for a regular compilation. However, there are cases (1, 2) where OSR stubs turn out to be less optimized than a fully compiled method. In a real application, though, it is uncommon for a long-running loop not to call other methods, so OSR is rarely a performance issue.
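If you want to observe OSR yourself on HotSpot, a minimal sketch: a hot loop inside main, which is entered exactly once, gets compiled while it runs. In -XX:+PrintCompilation output, OSR compilations are marked with a % and include the bytecode index of the loop's backward branch:

// Run with: java -XX:+PrintCompilation OsrDemo
// Lines marked with '%' are OSR compilations; you should see one
// for OsrDemo::main even though main() is entered exactly once.
public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 500_000_000; i++) {
            sum += i;  // hot loop in a method that is never re-invoked
        }
        System.out.println(sum);
    }
}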
As a general programming practice, a while-true loop is not a good thing to use; at best, try to look for alternatives.
The interpreter does not store any compiled code; it just executes the bytecode.
The key point here is that as long as the thread's entry function does not exit/return, the thread will stay in existence. However, this does not necessarily mean that the thread has to be actively executing code.
Moreover, there are various ways to implement such a loop: if you do not want the body to execute even once before the break condition is met, you can use while(!<some-condition>) instead of while(true).

How are coroutines implemented in JVM langs without JVM support?

This question came up after reading the Loom proposal, which describes an approach to implementing coroutines in the Java programming language.
In particular, the proposal says that to implement this feature in the language, additional JVM support will be required.
As I understand it there are already several languages on the JVM that have coroutines as part of their feature set such as Kotlin and Scala.
So how is this feature implemented without additional support and can it be implemented efficiently without it?
tl;dr Summary:
In particular, the proposal says that to implement this feature in the language, additional JVM support will be required.
When they say "required", they mean "required in order to be implemented in such a way that it is both performant and interoperable between languages".
So how is this feature implemented without additional support
There are many ways. The easiest way to understand how it can possibly work (though not necessarily the easiest to implement) is to implement your own VM with your own semantics on top of the JVM. (Note that this is not how it is actually done; it is only an intuition for why it can be done.)
and can it be implemented efficiently without it?
Not really.
Slightly longer explanation:
Note that one goal of Project Loom is to introduce this abstraction purely as a library. This has three advantages:
It is much easier to introduce a new library than it is to change the Java programming language.
Libraries can immediately be used by programs written in every single language on the JVM, whereas a Java language feature can only be used by Java programs.
A library with the same API that does not use the new JVM features can be implemented, which will allow you to write code that runs on older JVMs with a simple re-compile (albeit with less performance).
However, implementing it as a library precludes clever compiler tricks turning co-routines into something else, because there is no compiler involved. Without clever compiler tricks, getting good performance is much harder, ergo, the "requirement" for JVM support.
Longer explanation:
In general, all of the usual "powerful" control structures are equivalent in a computational sense and can be implemented using each other.
The most well-known of those "powerful" universal control-flow structures is the venerable GOTO; another is Continuations. Then there are Threads and Coroutines, and one that people don't often think about but that is also equivalent to GOTO: Exceptions.
A different possibility is a re-ified call stack, so that the call-stack is accessible as an object to the programmer and can be modified and re-written. (Many Smalltalk dialects do this, for example, and it is also kind-of like how this is done in C and assembly.)
As long as you have one of those, you can have all of those, by just implementing one on top of the other.
The JVM has two of those: Exceptions and GOTO, but the GOTO in the JVM is not universal, it is extremely limited: it only works inside a single method. (It is essentially intended only for loops.) So, that leaves us with Exceptions.
So, that is one possible answer to your question: you can implement co-routines on top of Exceptions.
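For intuition only (no production language implements coroutines this way), here is a minimal sketch of a generator, i.e. a restricted semi-coroutine, built on nothing but exceptions: every resume replays the producer from the top, and a private exception unwinds the stack at the "suspension point". All names are invented for the example:

// A hedged sketch of a generator built on exceptions: each next()
// re-runs the producer from the top, skips the values already
// delivered, and throws a private exception to unwind the stack as
// soon as the next value is captured.
import java.util.function.Consumer;

public class ExceptionGenerator<T> {
    // Thrown to unwind the producer once the next value is in hand.
    private static final class StopEmission extends RuntimeException {
        StopEmission() { super(null, null, false, false); } // no stack trace
    }

    private final Consumer<Consumer<T>> producer;
    private int consumed = 0;   // how many values were already handed out
    private T captured;

    public ExceptionGenerator(Consumer<Consumer<T>> producer) {
        this.producer = producer;
    }

    public T next() {
        final int[] seen = {0};
        try {
            producer.accept(value -> {          // 'emit' plays the role of yield
                if (seen[0]++ == consumed) {
                    captured = value;
                    throw new StopEmission();   // unwind: we have our value
                }
            });
        } catch (StopEmission e) {
            consumed++;
            return captured;
        }
        throw new IllegalStateException("producer exhausted");
    }

    public static void main(String[] args) {
        ExceptionGenerator<Integer> g = new ExceptionGenerator<>(emit -> {
            for (int i = 1; ; i++) emit.accept(i * i);  // 1, 4, 9, ...
        });
        System.out.println(g.next()); // 1
        System.out.println(g.next()); // 4
        System.out.println(g.next()); // 9
    }
}

Note that the replay makes n yields cost on the order of n² producer steps, and exceptions are abused as control flow; that is exactly the kind of machinery cost discussed below.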
Another possibility is to not use the JVM's control-flow at all and implement your own stack.
However, that is typically not the path that is actually taken when implementing co-routines on the JVM. Most likely, someone who implements co-routines would choose to use Trampolines and partially re-ify the execution context as an object. That is, for example, how Generators are implemented in C♯ on the CLI (not the JVM, but the challenges are similar). Generators (which are basically restricted semi-co-routines) in C♯ are implemented by lifting the local variables of the method into fields of a context object and splitting the method into multiple methods on that object at each yield statement, converting them into a state machine, and carefully threading all state changes through the fields on the context object. And before async/await came along as a language feature, a clever programmer implemented asynchronous programming using the same machinery as well.
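A hand-written Java sketch of that lifting-and-splitting transformation (hypothetical; this is the shape of what such a compiler generates, not any compiler's actual output). The source-level generator would conceptually be: for (int i = 0; i < 3; i++) yield i * 10; and the transformed state machine looks like:

// State-machine encoding of: for (int i = 0; i < 3; i++) yield i * 10;
// The local variable 'i' is lifted into a field, and each call to
// moveNext() resumes "after the yield" by dispatching on 'state'.
public class CounterStateMachine {
    private int state = 0;  // 0 = not started, 1 = suspended at yield, 2 = done
    private int i;          // lifted local variable
    private int current;    // last yielded value

    public boolean moveNext() {
        switch (state) {
            case 0:
                i = 0;
                break;          // fall through to the loop check
            case 1:
                i++;            // resume: this is the loop's increment
                break;
            default:
                return false;
        }
        if (i < 3) {
            current = i * 10;   // the 'yield' expression
            state = 1;          // remember where to resume
            return true;
        }
        state = 2;
        return false;
    }

    public int current() { return current; }

    public static void main(String[] args) {
        CounterStateMachine sm = new CounterStateMachine();
        while (sm.moveNext()) {
            System.out.println(sm.current()); // prints 0, 10, 20
        }
    }
}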
HOWEVER, and that is what the article you pointed to most likely referred to: all that machinery is costly. If you implement your own stack or lift the execution context into a separate object, or compile all your methods into one giant method and use GOTO everywhere (which isn't even possible because of the size limit on methods), or use Exceptions as control-flow, at least one of these two things will be true:
Your calling conventions become incompatible with the JVM stack layout that other languages expect, i.e. you lose interoperability.
The JIT compiler has no idea what the hell your code is doing, and is presented with byte code patterns, execution flow patterns, and usage patterns (e.g. throwing and catching ginormous amounts of exceptions) it doesn't expect and doesn't know how to optimize, i.e. you lose performance.
Rich Hickey (the designer of Clojure) once said in a talk: "Tail Calls, Performance, Interop. Pick Two." I generalized this to what I call Hickey's Maxim: "Advanced Control-Flow, Performance, Interop. Pick Two."
In fact, it is generally hard to achieve even one of interop or performance.
Also, your compiler will become more complex.
All of this goes away, when the construct is available natively in the JVM. Imagine, for example, if the JVM didn't have Threads. Then, every language implementation would create its own Threading library, which is hard, complex, slow, and doesn't interoperate with any other language implementation's Threading library.
A recent, and real-world, example are lambdas: many language implementations on the JVM had lambdas, e.g. Scala. Then Java added lambdas as well, but because the JVM doesn't support lambdas, they must be encoded somehow, and the encoding that Oracle chose was different from the one Scala had chosen before, which meant that you couldn't pass a Java lambda to a Scala method expecting a Scala Function. The solution in this case was that the Scala developers completely re-wrote their encoding of lambdas to be compatible with the encoding Oracle had chosen. This actually broke backwards-compatibility in some places.
From the Kotlin Documentation on Coroutines (emphasis mine):
Coroutines simplify asynchronous programming by putting the complications into libraries. The logic of the program can be expressed sequentially in a coroutine, and the underlying library will figure out the asynchrony for us. The library can wrap relevant parts of the user code into callbacks, subscribe to relevant events, schedule execution on different threads (or even different machines!), and the code remains as simple as if it was sequentially executed.
Long story short, they are compiled down to code that uses callbacks and a state machine to handle suspending and resuming.
Roman Elizarov, the project lead, gave two fantastic talks at KotlinConf 2017 on this subject. One is an Introduction to Coroutines, the second is a Deep Dive on Coroutines.
Coroutines do not rely on features of the operating system or the JVM. Instead, coroutines and suspend functions are transformed by the compiler, producing a state machine capable of handling suspensions in general and passing around suspending coroutines, keeping their state. This is enabled by Continuations, which are added as a parameter to each and every suspending function by the compiler; this technique is called “continuation-passing style” (CPS).
One example can be observed in the transformation of suspend functions:
suspend fun <T> CompletableFuture<T>.await(): T
The following shows its signature after CPS transformation:
fun <T> CompletableFuture<T>.await(continuation: Continuation<T>): Any?
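A rough Java rendering of what that signature means (hypothetical types; Kotlin's real kotlin.coroutines.Continuation and its COROUTINE_SUSPENDED marker are richer than this): the function either returns the result directly, or arranges for the continuation to be resumed later and returns a marker saying "suspended":

// Sketch of the CPS-transformed shape in Java terms (hypothetical types).
interface Continuation<T> {
    void resumeWith(T value);   // called when the suspended computation completes
}

final class Suspended {
    static final Object COROUTINE_SUSPENDED = new Object();
}

class FutureAwait {
    // CPS form of `suspend fun <T> CompletableFuture<T>.await(): T`:
    // either return the value directly (future already done), or arrange
    // for the continuation to be resumed later and return the marker.
    static <T> Object await(java.util.concurrent.CompletableFuture<T> future,
                            Continuation<T> continuation) {
        if (future.isDone() && !future.isCompletedExceptionally()) {
            return future.join();                       // fast path: no suspension
        }
        future.thenAccept(continuation::resumeWith);    // resume when ready
        return Suspended.COROUTINE_SUSPENDED;           // tell caller we suspended
    }
}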
If you want to know the hard details, you need to read this explanation.
Project Loom was preceded by the Quasar library by the same author.
Here is a quote from its docs:
Internally, a fiber is a continuation which is then scheduled in a scheduler. A continuation captures the instantaneous state of a computation, and allows it to be suspended and then resumed at a later time from the point where it was suspended. Quasar creates continuations by instrumenting (at the bytecode level) suspendable methods. For scheduling, Quasar uses ForkJoinPool, which is a very efficient, work-stealing, multi-threaded scheduler.
Whenever a class is loaded, Quasar's instrumentation module (usually run as a Java agent) scans it for suspendable methods. Every suspendable method f is then instrumented in the following way: It is scanned for calls to other suspendable methods. For every call to a suspendable method g, some code is inserted before (and after) the call to g that saves (and restores) the state of the local variables to the fiber's stack (a fiber manages its own stack), and records the fact that this (i.e. the call to g) is a possible suspension point. At the end of this "suspendable function chain", we'll find a call to Fiber.park. park suspends the fiber by throwing a SuspendExecution exception (which the instrumentation prevents you from catching, even if your method contains a catch(Throwable t) block).
If g indeed blocks, the SuspendExecution exception will be caught by the Fiber class. When the fiber is awakened (with unpark), method f will be called, and then the execution record will show that we're blocked at the call to g, so we'll immediately jump to the line in f where g is called, and call it. Finally, we'll reach the actual suspension point (the call to park), where we'll resume execution immediately following the call. When g returns, the code inserted in f will restore f's local variables from the fiber stack.
This process sounds complicated, but it incurs a performance overhead of no more than 3%-5%.
It seems that almost all pure-Java continuation libraries use a similar bytecode-instrumentation approach to capture and restore local variables on the stack frames.
Only the Kotlin and Scala compilers were brave enough to implement the more detached and potentially more performant approach of CPS transformation to state machines, mentioned in some other answers here.

Is assigning to String variable costly in terms of memory and time?

String timeStamp = currentCommentObjectObj.getTimeStamp();
holder.timeStamp.setText(timeStamp);
or
holder.timeStamp.setText(currentCommentObjectObj.getTimeStamp());
Which is better from a time and space optimisation perspective?
More information:
This code is inside onBindViewHolder of recycler view.
Although I prefer the second one, I believe that there is no difference, because somewhere in the compilation steps the compiler optimizes your code and would recognize such discrepancies, if any.
Please refer to http://www.noesispoint.com/jsp/scjp/SCJPch0.htm for more information.
Apparently, javac (the Java compiler) first compiles the code to Java bytecode, and then the Java Virtual Machine's JIT compiler optimizes and compiles the bytecode to machine code.
Hope that helps.
Regardless of the compiler/JVM behavior, this is an operation that should be essentially instantaneous.
On a machine level, the only possible difference between the two methods is
(a) saving a pointer to memory, then loading the pointer, or
(b) passing the pointer directly to the next method call.
The amount of time that is different between these two is so small that it essentially will never matter, for any Object. Even in a loop, Android UI code should not be executed enough times for this to ever possibly matter.
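If you want to check this yourself, compile a stripped-down version of the two variants (the getter and setter here are stand-ins, not the real Android APIs) and compare the javap -c output; the only difference should be an extra astore/aload pair for the local variable:

// Compile, then run: javap -c BindDemo
// The two methods should differ only by an astore/aload pair.
public class BindDemo {
    static String getTimeStamp() { return "12:00"; }   // stand-in getter
    static void setText(String s) { /* stand-in setter */ }

    static void withLocal() {
        String timeStamp = getTimeStamp();  // astore_0 ... aload_0
        setText(timeStamp);
    }

    static void direct() {
        setText(getTimeStamp());            // value stays on the operand stack
    }
}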

What are safe points and safe point polling in context of profiling?

I am facing a situation where some method calls are not being recorded by the VisualVM application. I wanted to find out the reason and came across this answer on SO. The third point mentions a potential issue with the sampling method (which is the only option I am seeing enabled, probably because I am doing remote profiling). It mentions safe points in code and safe-point polling by the code itself. What do these terms mean?
The issue of inaccuracy of Java sampling profiler tools and its relation to the safe points is very well discussed in Evaluating the Accuracy of Java Profilers (PLDI'10).
Essentially, Java profilers may produce inaccurate results when sampling, due to the fact that the sampling occurs at safe points. And since the occurrence of safe points can be modified by the compiler, the execution of some methods may never be sampled by the profiler. Therefore, the profiler is scheduled to record a sample of the code (the time interval is up), but it must wait for the occurrence of a safe point. And since the safe point may have been moved around by the compiler, the method that would ideally be sampled is never observed.
As already explained by the previous answer, a safepoint is an event or a position in the code where execution is interrupted in order to execute some internal VM code (for example, GC).
Safe-point polling is a method of implementing the safepoint trigger. It means that the code being executed regularly checks a flag to see if safe-point execution is required; if yes (e.g. due to a GC trigger), the thread is interrupted and the safepoint is executed. See e.g. GC safe-point (or safepoint) and safe-region.
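Conceptually, the emitted code behaves as if it did the following (a plain-Java analogy; HotSpot actually implements the poll as a cheap read of a guard page placed at loop back-edges and method returns, not as a volatile flag check):

// Plain-Java analogy of safepoint polling (not HotSpot's real mechanism).
class SafepointAnalogy {
    static volatile boolean safepointRequested; // set by the VM, e.g. for GC

    static void hotLoop(long n) {
        for (long i = 0; i < n; i++) {
            // ... actual work ...
            // The JIT emits a poll like this at loop back-edges and returns:
            if (safepointRequested) {
                parkAtSafepoint(); // block until the VM operation finishes
            }
        }
    }

    static void parkAtSafepoint() {
        // In a real VM: report this thread's state, wait for GC/deopt/etc.
    }
}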
This blog post discusses safe points.
Basically they are points in the code where the JITter allows interruptions for GC, stack traces etc.
The post also says that safe points delay stack samples, so samples cannot occur at the places where you might like them to, and that's a problem.
In my opinion, that's a small problem.
The whole reason you take a stack sample (as opposed to just a program-counter sample) is to show you all the call-sites leading to the current state, because those are likely to be much more juicy sources of slowness than whatever the program counter is doing.
(If it's doing anything. You might be in the middle of I/O, where the PC is meaningless, but the call-sites are still just as important.)
If the stack sample has to wait a few cycles to get to a safe point, all that means is it happens at the end of a block of instructions, not in the middle.
If you examine the sample you can still get a good idea what's happening.
I'm hoping profiler writers come to realize they don't need to sweat the small stuff.
What's more important is not to miss the big stuff.

Performance of new operator versus newInstance() in Java

I was using newInstance() in a sort-of performance-critical area of my code.
The method signature is:
<T extends SomethingElse> T create(Class<T> clasz)
I pass Something.class as the argument, and I get an instance of SomethingElse, created with newInstance().
Today I got back to clearing this performance TODO from the list, so I ran a couple of tests of the new operator versus newInstance(). I was very surprised by the performance penalty of newInstance().
I wrote a little about it, here: http://biasedbit.com/blog/new-vs-newinstance/
(Sorry about the self promotion... I'd place the text here, but this question would grow out of proportions.)
What I'd love to know is why the -server flag provides such a performance boost when the number of objects being created grows large, but not for "low" values, say, 100 or 1000.
I did learn my lesson with the whole reflection thing; this is just curiosity about the optimisations the JVM performs at runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Edit: I've added a warmup phase and the results are now more stable. Thanks for the input!
I did learn my lesson with the whole reflection thing; this is just curiosity about the optimisations the JVM performs at runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Answering the second part first, your code seems to be making the classic mistake for Java micro-benchmarks and not "warming up" the JVM before making your measurements. Your application needs to run the method that does the test a few times, ignoring the first few iterations ... at least until the numbers stabilize. The reason for this is that a JVM has to do a lot of work to get an application started; e.g. loading classes and (when they've run a few times) JIT compiling the methods where significant application time is being spent.
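A minimal sketch of that warm-up pattern (a hand-rolled harness like this is still crude; a framework such as JMH takes care of warm-up, dead-code elimination, and forking for you; also note that Class.newInstance() itself is deprecated since Java 9, so the reflective variant below uses getDeclaredConstructor().newInstance()):

// Crude warm-up pattern: discard early iterations, then measure.
public class NewVsNewInstance {
    static Object sink;  // prevents the allocations from being optimized away

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10; i++) {        // warm-up rounds, results ignored
            run(1_000_000);
        }
        long start = System.nanoTime();       // measured round
        run(1_000_000);
        System.out.printf("%.1f ms%n", (System.nanoTime() - start) / 1e6);
    }

    static void run(int n) throws Exception {
        for (int i = 0; i < n; i++) {
            sink = new StringBuilder();                       // 'new' variant
            sink = StringBuilder.class
                    .getDeclaredConstructor().newInstance();  // reflective variant
        }
    }
}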
I think the reason that "-server" is making a difference is that (among other things) it changes the rules that determine when to JIT-compile. The assumption is that for a "server" it is better to JIT sooner; this gives slower startup but better throughput. (By contrast, a "client" is tuned to defer JIT compiling so that the user gets a working GUI sooner.)
IMHO the performance penalty comes from the class-loading mechanism.
In the case of reflection, all the security mechanisms are used, and thus the creation penalty is higher.
In the case of the new operator, the classes are already loaded into the VM (checked and prepared by the default classloader) and the actual instantiation is a cheap process.
The -server parameter makes the JIT perform a lot of optimizations for frequently used code. You might also want to try the -Xbatch parameter, which trades off startup time, but then the code will run faster.
Among other things, the garbage collection profile for the -server option has significantly different survivor space sizing defaults.
On closer reading, I see that your example is a micro-benchmark and the results may be counter-intuitive. For example, on my platform, repeated calls to newInstance() are effectively optimized away during repeated runs, making newInstance() appear 12.5 times faster than new.
