I was using newInstance() in a sort-of performance-critical area of my code.
The method signature is:
<T extends SomethingElse> T create(Class<T> clasz)
I pass Something.class as the argument and get back an instance of Something (a subclass of SomethingElse), created with newInstance().
Today I got back to clear this performance TODO from the list, so I ran a couple of tests of the new operator versus newInstance(). I was very surprised by the performance penalty of newInstance().
I wrote a little about it, here: http://biasedbit.com/blog/new-vs-newinstance/
(Sorry about the self promotion... I'd place the text here, but this question would grow out of proportions.)
What I'd love to know is why the -server flag provides such a performance boost when the number of objects being created grows large, but not for "low" values, say, 100 or 1000.
I did learn my lesson with the whole reflection thing; this is just curiosity about the optimisations the JVM performs at runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Edit: I've added a warmup phase and the results are now more stable. Thanks for the input!
Answering the second part first, your code seems to be making the classic Java micro-benchmarking mistake of not "warming up" the JVM before taking measurements. Your application needs to run the method that does the test a few times, ignoring the first few iterations ... at least until the numbers stabilize. The reason for this is that a JVM has to do a lot of work to get an application started; e.g. loading classes and (once they have run a few times) JIT compiling the methods where significant application time is being spent.
I think the reason that -server makes a difference is that (among other things) it changes the rules that determine when and how aggressively to JIT compile. The assumption is that a "server" can afford to spend more time compiling and optimizing hot code: this gives slower startup but better long-run throughput. (By contrast, the "client" compiler is tuned for quick, lightweight compilation so that the user gets a working GUI sooner.)
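To make the warm-up point concrete, here is a minimal sketch of the kind of harness described above (the Something class and the iteration counts are invented; a real benchmark would be better served by a tool like JMH):

public class NewVsNewInstanceBenchmark {

    public static class Something {}

    public static void main(String[] args) throws Exception {
        // Warm-up: let the JIT compile both paths; these timings are discarded.
        for (int i = 0; i < 20; i++) {
            runNew(100_000);
            runNewInstance(100_000);
        }

        // Measured runs: only numbers taken after the warm-up are reported.
        long t0 = System.nanoTime();
        long c1 = runNew(1_000_000);
        long t1 = System.nanoTime();
        long c2 = runNewInstance(1_000_000);
        long t2 = System.nanoTime();

        System.out.println("new:           " + (t1 - t0) + " ns (checksum " + c1 + ")");
        System.out.println("newInstance(): " + (t2 - t1) + " ns (checksum " + c2 + ")");
    }

    // Returning a checksum keeps a data dependency on every created object,
    // so the JIT cannot simply eliminate the allocation loops.
    static long runNew(int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += new Something().hashCode();
        }
        return checksum;
    }

    static long runNewInstance(int iterations) throws Exception {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            // Class.newInstance() is what the question measures; newer JDKs
            // prefer getDeclaredConstructor().newInstance().
            checksum += Something.class.newInstance().hashCode();
        }
        return checksum;
    }
}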
IMHO the performance penalty comes from the class loading and reflection machinery.
In the case of reflection, all the security and access checks are applied on every creation, and thus the penalty is higher.
In the case of the new operator, the class is already loaded into the VM (checked and prepared by the default classloader) and the actual instantiation is a cheap operation.
The -server flag enables a lot of JIT optimizations for frequently used code. You might also want to try it with the -Xbatch flag, which disables background compilation: start-up gets slower because execution waits for each compilation to finish, but the compiled code is used as soon as it is ready.
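To see how much of the cost is the reflective lookup and its checks rather than the allocation itself, one option is to cache the Constructor per class and reuse it. This is only a sketch (not from the linked post), and the SomethingElse bound from the original signature is dropped for brevity:

import java.lang.reflect.Constructor;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Pays the constructor lookup (and its checks) once per class instead of
// on every create() call.
public class CachingFactory {

    private final Map<Class<?>, Constructor<?>> cache = new ConcurrentHashMap<>();

    public <T> T create(Class<T> clazz) {
        try {
            Constructor<?> ctor = cache.computeIfAbsent(clazz, c -> {
                try {
                    Constructor<?> found = c.getDeclaredConstructor();
                    // found.setAccessible(true) could also be called here to
                    // skip per-call access checks.
                    return found;
                } catch (NoSuchMethodException e) {
                    throw new IllegalArgumentException(c + " has no no-arg constructor", e);
                }
            });
            return clazz.cast(ctor.newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + clazz, e);
        }
    }
}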
Among other things, the garbage collection profile for the -server option has significantly different survivor space sizing defaults.
On closer reading, I see that your example is a micro-benchmark and the results may be counter-intuitive. For example, on my platform, repeated calls to newInstance() are effectively optimized away during repeated runs, making newInstance() appear 12.5 times faster than new.
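That kind of "impossible" result usually means the JIT has eliminated work whose result is never used. If you are not using JMH (whose Blackhole exists for exactly this purpose), a common hand-rolled guard, sketched here, is to write every created object to a volatile field so the measured loop has a side effect the JIT cannot remove:

public final class Sink {
    // A volatile write is a side effect the JIT cannot prove dead, so the
    // allocations feeding it inside a measured loop must actually happen.
    private static volatile Object sink;

    private Sink() {}

    public static void consume(Object o) {
        sink = o;
    }
}

Inside the measured loop you would then call Sink.consume(clazz.newInstance()) rather than discarding the created object.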
Related
Assume you have a quite long method with around 200 lines of very time-sensitive code. Is it possible that extracting some parts of the code into separate methods will slow down execution?
Most probably, you'll get a speedup. The problem is that optimizing a 200-line beast is hard; in fact, HotSpot gives up when a method is too long. I once achieved a speedup factor of 2 simply by splitting a long method.
Short methods are fine, and they'll be inlined as needed, so the method-call overhead gets minimized. Through inlining, HotSpot may re-create your original method (improbable given its excessive length) or create several compiled methods, some of which may contain code that was not present in the original method.
The answer is "yes, it may get slower." The problem is that the chosen inlining may be suboptimal. However, it's very improbable, and I'd expect a speedup instead.
The overhead is negligible; the compiler will inline the methods when executing them.
EDIT:
similar question
I don't think so.
Yes, some calls and stack frames would be added, but that doesn't cost a lot of time, and depending on your compiler it might even optimize the code so that there is basically no difference between the single-method version and the one with many methods.
The loss of readability and reusability you would get by implementing it all in one method is definitely not worth the (if at all existing) performance increase.
It is important that the factored-out methods are declared either private or final. The just-in-time compiler in the JVM will then inline everything, which means a single big method will be executed as a result.
However, always benchmark your code when modifying it.
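As a hedged illustration of the advice above (PixelFilter and its helpers are invented for this example), splitting a long hot method into small private helpers gives HotSpot compilation units it is happy to compile and inline; you can watch its decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining:

public class PixelFilter {

    // Instead of one 200-line method doing everything inline (which risks
    // hitting HotSpot's "huge method" limit and staying interpreted), the
    // work is split into small private helpers that are easy to inline.
    public long process(int[] pixels) {
        long acc = 0;
        for (int p : pixels) {
            acc += weight(luminance(p));
        }
        return acc;
    }

    private static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        // Integer approximation of the usual luminance weights.
        return (r * 77 + g * 150 + b * 29) >> 8;
    }

    private static int weight(int luma) {
        return luma > 128 ? luma * 2 : luma;
    }
}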
I put together a microbenchmark that seemed to show that the following types of calls took roughly the same amount of time across many iterations after warmup.
static.method(arg);
static.finalAnonInnerClassInstance.apply(arg);
static.modifiedNonFinalAnonInnerClassInstance.apply(arg);
Has anyone found evidence that these different types of calls will, in aggregate, have different performance characteristics? My finding is that they don't, but I found that a little surprising (especially knowing the bytecode is quite different for at least the static call), so I want to find out whether others have any evidence either way.
If they indeed have exactly the same performance, that would mean there is no penalty for that level of indirection in the modified non-final case.
I know the standard optimization advice is "write your code and profile", but I'm writing a framework code-generation kind of thing, so there is no specific code to profile, and the choice between static and non-final is fairly important for both flexibility and possibly performance. I am using framework code in the microbenchmark, which is why I can't include it here.
My test was run on Windows JDK 1.7.0_06.
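Since the framework code can't be included, here is a hedged, self-contained reconstruction of the three call shapes being compared (SomeHolder, Transform and the trivial method body are invented; only the field names mirror the question):

public class SomeHolder {

    // Minimal stand-in for whatever functional interface the framework uses.
    interface Transform {
        String apply(String s);
    }

    // 1. Plain static call.
    static String staticMethod(String s) {
        return s.trim();
    }

    // 2. Call through a static final field holding an anonymous inner class.
    static final Transform finalAnonInnerClassInstance = new Transform() {
        @Override
        public String apply(String s) {
            return s.trim();
        }
    };

    // 3. Same shape, but the field is non-final and may be reassigned later.
    static Transform modifiedNonFinalAnonInnerClassInstance = new Transform() {
        @Override
        public String apply(String s) {
            return s.trim();
        }
    };

    public static void main(String[] args) {
        String arg = "  hello  ";
        System.out.println(staticMethod(arg));
        System.out.println(finalAnonInnerClassInstance.apply(arg));
        System.out.println(modifiedNonFinalAnonInnerClassInstance.apply(arg));
    }
}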
If you benchmark it in a tight loop, the JVM will effectively cache the instance, so there's no apparent difference.
If the code is executed in a real application:
if it's expected to be executed back-to-back very quickly, for example String.length() used in for(int i=0; i<str.length(); i++){ short_code; }, the JVM will optimize it; no worries.
if it's executed frequently enough that the instance is most likely in the CPU's L1 cache, the extra load of the instance is very fast; no worries.
otherwise, there is a non-trivial overhead, but the code is executed so infrequently that the overhead is almost impossible to detect amid the overall cost of the application; no worries.
Both of them pretty much do the same thing: identify that a method is hot and compile it instead of interpreting it. With OSR, you move to the compiled version as soon as it is ready, even mid-execution, unlike plain JIT compilation, where the compiled code is only used the next time the method is called.
Other than this, are there any other differences?
In general, Just-in-time compilation refers to compiling native code at runtime and executing it instead of (or in addition to) interpreting. Some VMs, such as Google V8, don't even have an interpreter; they JIT compile every function that gets executed (with varying degrees of optimization).
On-Stack Replacement (OSR) is a technique for switching between different implementations of the same function. For example, you could use OSR to switch from interpreted or unoptimized code to JITed code as soon as it finishes compiling.
OSR is useful in situations where you identify a function as "hot" while it is running. This might not necessarily be because the function gets called frequently; it might be called only once, but it spends a lot of time in a big loop which could benefit from optimization. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame which may have variables in different locations.
OSR can also occur in the other direction: from optimized code to unoptimized code or interpreted code. Optimized code may make some assumptions about the runtime behavior of the program based on past behavior. For instance, you could convert a virtual or dynamic method call into a static call if you've only ever seen one type of receiver object. If it turns out later that these assumptions were wrong, OSR can be used to fall back to a more conservative implementation: the optimized stack frame gets converted into an unoptimized stack frame. If the VM supports inlining, you might even end up converting an optimized stack frame into several unoptimized stack frames.
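A classic way to observe OSR on HotSpot is a long-running loop inside main(): the method is invoked only once, so it never trips the normal invocation counter, but the loop's back-edge counter triggers an OSR compilation while the loop is still running. A minimal sketch; run it with -XX:+PrintCompilation, where OSR compilations are marked with a % sign:

public class OsrDemo {

    public static void main(String[] args) {
        // main() is called exactly once, so only on-stack replacement can get
        // this hot loop running as compiled code before the program ends.
        long sum = 0;
        for (int i = 0; i < 1_000_000_000; i++) {
            sum += i % 7;
        }
        System.out.println(sum);
    }
}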
Yes, that's pretty much it. Just-in-time compilation can improve performance by compiling "hot spots" of bytecode (spots that are known or expected to execute very often) to native instructions. On-stack replacement complements JIT capabilities by replacing long-running, interpreted "hot" bytecode with its compiled version when it becomes available. The mentioned on-stack replacement article shows a nice example where JIT compilation would not be very useful without OSR.
I have endeavored to implement Dixon's algorithm concurrently, with poor results. For small numbers (less than about 40 bits) it takes roughly twice as long as other implementations in my class, and beyond about 40 bits it takes far longer.
I've done everything I can, but I fear it has some fatal issue that I can't find.
My code (fairly lengthy) is located here. Ideally the algorithm would work faster than non-concurrent implementations.
Why would you think it would be faster? Spinning up threads and adding synchronized calls are HUGE time sinks. If you can't avoid the synchronized keyword, I highly recommend a single-threaded solution.
You may be able to avoid them in various ways: for instance, by ensuring that a given variable is only written by one thread even if it is read by others, or by acting like a functional language and making all your variables final, using recursion for variable storage (iffy; it's hard to imagine this would speed anything up).
If you really need to be fast, however, I did find some very counter-intuitive things out recently from my own attempt at finding a speedy solution...
Static methods didn't help over actual class instances.
Breaking the code down into smaller classes and methods actually INCREASED speed.
Final methods helped more than I would have thought they would.
Once I noticed that adding a method call helped speed things along.
Don't stress over one-time class or data allocations, but avoid allocating objects in loops (this one is obvious, but I think it's the most critical).
What I've been able to intuit is that the compiler is extremely smart at optimizing and is tuned to optimize "ideal" Java code. Static methods are nowhere near ideal; they are something of an anti-pattern.
I suggest you write the clearest, best OO code you can that actually runs correctly as a reference--then time it and start attempting tweaks to speed it up.
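As a hedged sketch of the "each variable written by only one thread" advice above (not taken from the linked code), you can often drop synchronized entirely by giving each worker its own slot in a result array and combining only after join():

public class PartitionedSum {

    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 97;
        }

        int workers = Runtime.getRuntime().availableProcessors();
        long[] partial = new long[workers];   // slot w is written only by thread w
        Thread[] threads = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                long sum = 0;
                // Each thread reads a disjoint stripe of the shared array.
                for (int i = id; i < data.length; i += workers) {
                    sum += data[i];
                }
                partial[id] = sum;            // single writer, no locking needed
            });
            threads[w].start();
        }

        long total = 0;
        for (int w = 0; w < workers; w++) {
            threads[w].join();                // join() provides the happens-before
            total += partial[w];              // needed to read partial[w] safely
        }
        System.out.println(total);
    }
}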
Is there a way to tell how expensive an operation is for the processor, in milliseconds or flops?
I would be interested in instanceof and casts (I heard they are very "expensive").
Are there any studies about that?
It will depend on which JVM you're using, and the cost of many operations can vary even within the same JVM, depending on the exact situation and how much optimization the JIT has performed.
For example, a virtual method call can still be inlined by the Hotspot JIT - so long as it hasn't been overridden by anything else. In some cases with the server JIT it can still be inlined with a quick type test, for up to a couple of types.
Basically, JITs are complex enough that there's unlikely to be a meaningful general-purpose answer to the question. You should benchmark your own specific situation in as real-world a way as possible. You should usually write code with the primary goals of simplicity and readability, but measure the performance regularly.
The days when counting instructions or cycles could give you a good idea of the performance of some code are long gone, thanks to the many, many optimizations happening at all levels of software execution.
This is especially true for VM-based languages, where the JVM can simply skip some steps because it knows they're not necessary.
For example, I've read some time ago in an article (I'll try to find and link it eventually) that these two methods are pretty much equivalent in cost (on the HotSpot JVM, that is):
public void frobnicate1(Object o) {
    if (!(o instanceof SomeClass)) {
        throw new IllegalArgumentException("Oh Noes!");
    }
    frobnicateSomeClass((SomeClass) o);
}

public void frobnicate2(Object o) {
    frobnicateSomeClass((SomeClass) o);
}
Obviously the first method does more work, but the JVM knows that the type of o has already been checked in the if and can actually skip the type-check on the cast later on and make it a no-op.
This and many other optimizations make counting "flops" or cycles pretty much useless.
Generally speaking an instanceof check is relatively cheap. On the HotSpot JVM it boils down to a numeric check of the type id in the object header.
This classic article describes why you should "Write Dumb Code".
There's also an article from 2002 that describes how instanceof is optimized in the HotSpot JVM.
Once the JVM has warmed up, most operations can be counted in nanoseconds (millionths of a millisecond). When talking about something being expensive, you usually have to say it's expensive relative to an alternative; it's next to impossible to describe something as expensive in all cases.
Usually, the most important expense is your time (and that of the other developers on your team). Using instanceof can be expensive in development and code-support time because it often indicates a poor design; using proper OOP techniques is usually a better idea. The 10 nanoseconds an instanceof might take are usually relatively trivial.
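As a hedged illustration of the "proper OOP instead of instanceof" point (the shape classes are invented), a chain of type checks and casts becomes a single virtual call when the behaviour lives in the types themselves:

// Instead of:
//   if (s instanceof Circle)      { ... ((Circle) s).radius ... }
//   else if (s instanceof Square) { ... ((Square) s).side ...   }
// push the behaviour into the types:

abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

class AreaSum {
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();   // one virtual call, no instanceof, no casts
        }
        return total;
    }
}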
The cost of specific operations performed inside the CPU is almost never relevant for performance. If performance is bad, it's almost always because of IO (network, disk) or inefficient code. Writing efficient code is much more about finding ways to reduce the overall amount of work than about avoiding "costly" operations (except those that are orders of magnitude more costly, like IO).