Java call type performance

I put together a microbenchmark that seemed to show that the following types of calls took roughly the same amount of time across many iterations after warmup.
static.method(arg);
static.finalAnonInnerClassInstance.apply(arg);
static.modifiedNonFinalAnonInnerClassInstance.apply(arg);
Has anyone found evidence that these different types of calls in the aggregate will have different performance characteristics? My findings are they don't, but I found that a little surprising (especially knowing the bytecode is quite different for at least the static call) so I want to find if others have any evidence either way.
If they indeed had the same exact performance, then that would mean there was no penalty to having that level of indirection in the modified non final case.
I know the standard optimization advice would be "write your code and profile", but I'm writing a framework code generation kind of thing, so there is no specific code to profile, and the choice between static and non-final is fairly important for both flexibility and possibly performance. I am using framework code in the microbenchmark, which is why I can't include it here.
My test was run on Windows JDK 1.7.0_06.

If you benchmark it in a tight loop, the JVM will cache the instance, so there is no apparent difference.
If the code is executed in a real application:
if it is expected to be executed back-to-back very quickly, for example String.length() used in for(int i=0; i<str.length(); i++){ short_code; }, the JVM will optimize it; no worries.
if it is executed frequently enough that the instance is most likely in the CPU's L1 cache, the extra load of the instance is very fast; no worries.
otherwise, there is a non-trivial overhead, but the code is executed so infrequently that the overhead is almost impossible to detect among the overall cost of the application; no worries.
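To make the three call shapes from the question concrete, here is a minimal sketch. The Fn interface, the field names and the arg + 1 body are all hypothetical stand-ins (the original framework code was not shown), and the timing in main is only a rough tight-loop comparison of the kind discussed above, not a rigorous benchmark.

```java
final class CallTypes {
    /** Hypothetical single-method interface standing in for the framework's callable type. */
    interface Fn { int apply(int arg); }

    // Case 1: plain static method.
    static int method(int arg) { return arg + 1; }

    // Case 2: anonymous inner class instance held in a static final field.
    static final Fn FINAL_FN = new Fn() {
        @Override public int apply(int arg) { return arg + 1; }
    };

    // Case 3: same kind of instance in a non-final field that could be reassigned later.
    static Fn mutableFn = new Fn() {
        @Override public int apply(int arg) { return arg + 1; }
    };

    static long sumStatic(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += method(i);
        return sum;
    }

    static long sumFinal(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += FINAL_FN.apply(i);
        return sum;
    }

    static long sumMutable(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += mutableFn.apply(i);
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: run every variant many times and discard the results.
        for (int i = 0; i < 20_000; i++) {
            sumStatic(1_000);
            sumFinal(1_000);
            sumMutable(1_000);
        }
        long t0 = System.nanoTime();
        long a = sumStatic(1_000_000);
        long t1 = System.nanoTime();
        long b = sumFinal(1_000_000);
        long t2 = System.nanoTime();
        long c = sumMutable(1_000_000);
        long t3 = System.nanoTime();
        // Printing the sums keeps the JIT from discarding the loops as dead code.
        System.out.println("static:  " + (t1 - t0) + " ns, sum " + a);
        System.out.println("final:   " + (t2 - t1) + " ns, sum " + b);
        System.out.println("mutable: " + (t3 - t2) + " ns, sum " + c);
    }
}
```

All three variants compute the same sum, so any timing difference comes from the call shape alone. For anything beyond a quick look, a dedicated benchmark harness is a safer choice than a hand-rolled loop like this one.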

Related

Why are floating point operations much faster with a warmup phase?

I initially wanted to test a different floating-point optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without warm-up, but faster by a factor of about 1.5 with it).
After studying the results I noticed that I had forgotten to add a warm-up phase, as suggested so often when doing performance optimisations, so I added it. And, to my utter surprise, it turned out to be about 25 times faster on average over multiple test runs.
I tested it with the following code:
import java.util.Random;

public class WarmupTest
{
    public static void main(String args[])
    {
        float[] test = new float[10000];
        float[] test_copy;
        //warmup
        for (int i = 0; i < 1000; i++)
        {
            fillRandom(test);
            test_copy = test.clone();
            divideByTwo(test);
            multiplyWithOneHalf(test_copy);
        }
        long divisionTime = 0L;
        long multiplicationTime = 0L;
        for (int i = 0; i < 1000; i++)
        {
            fillRandom(test);
            test_copy = test.clone();
            divisionTime += divideByTwo(test);
            multiplicationTime += multiplyWithOneHalf(test_copy);
        }
        System.out.println("Divide by 5.0f: " + divisionTime);
        System.out.println("Multiply with 0.2f: " + multiplicationTime);
    }

    // Despite its name, this method divides by 5.0f.
    public static long divideByTwo(float[] data)
    {
        long before = System.nanoTime();
        for (float f : data)
        {
            // f is a copy of each element, so the array itself is left unchanged
            f /= 5.0f;
        }
        return System.nanoTime() - before;
    }

    // Despite its name, this method multiplies by 0.2f.
    public static long multiplyWithOneHalf(float[] data)
    {
        long before = System.nanoTime();
        for (float f : data)
        {
            f *= 0.2f;
        }
        return System.nanoTime() - before;
    }

    public static void fillRandom(float[] data)
    {
        Random random = new Random();
        for (float f : data)
        {
            // assigns to the loop variable only; the array elements stay 0.0f
            f = random.nextInt() * random.nextFloat();
        }
    }
}
Results without warm-up phase:
Divide by 5.0f: 382224
Multiply with 0.2f: 490765
Results with warm-up phase:
Divide by 5.0f: 22081
Multiply with 0.2f: 10885
Another interesting change that I cannot explain is the turn in what operation is faster (division vs. multiplication). As earlier mentioned, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.
I tried adding an initialization block setting the values to something random, but it did not affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.
What is the reason for this behaviour? What is this warm-up phase and how does it influence the performance, why are the operations so much faster with a warm-up phase and why is there a turn in which operation is faster?

Before the warm-up, Java will be running the byte codes via an interpreter; think how you would write a program that could execute Java byte codes in Java. After warm-up, Hotspot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter will run many, many CPU instructions for a single byte code, whereas Hotspot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and to multiply will ultimately be down to the CPU that one is running on, and it will be just a single CPU instruction.
The second part to the puzzle is hotspot also records statistics that measure the runtime behaviour of your code, when it decides to optimise the code then it will use those statistics to perform optimisations that are not necessarily possible at compilation time. For example it can reduce the cost of null checks, branch mispredictions and polymorphic method invocation.
In short, one must discard the results pre-warmup.
Brian Goetz wrote a very good article here on this subject.
========
APPENDED: overview of what 'JVM Warm-up' means
JVM 'warm up' is a loose phrase, and is no longer, strictly speaking, a single phase or stage of the JVM. People tend to use it to refer to the point where JVM performance stabilizes after compilation of the JVM byte codes to native code. In truth, when one starts to scratch under the surface and delves deeper into the JVM internals, it is difficult not to be impressed by how much Hotspot is doing for us. My goal here is just to give you a better feel for what Hotspot can do in the name of performance; for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).
As already mentioned, the JVM starts by running Java through its interpreter. While strictly speaking not 100% correct, one can think of an interpreter as a large switch statement and a loop that iterates over every JVM byte code (command). Each case within the switch statement is a JVM byte code such as add two values together, invoke a method, invoke a constructor and so forth. The overhead of the iteration and of jumping around the commands is very large. Thus execution of a single command will typically use over 10x more assembly commands, which means it is more than 10x slower, as the hardware has to execute so many more commands and the caches get polluted by interpreter code when ideally we would rather have them focused on our actual program. Think back to the early days of Java, when Java earned its reputation of being very slow; this is because it was originally a fully interpreted language.
Later on JIT compilers were added to Java, these compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed the execution of code to be performed in hardware. While execution within hardware is much faster, this extra compilation created a stall on startup for Java. And this was partly where the terminology of 'warm up phase' took hold.
The introduction of Hotspot to the JVM was a game changer. Now the JVM would start up faster because it would start life running the Java programs with its interpreter, while individual Java methods would be compiled in a background thread and swapped out on the fly during execution. The generation of native code could also be done at differing levels of optimisation, sometimes using very aggressive optimisations that are strictly speaking incorrect and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour. For example, class hierarchies imply a large cost in figuring out which method will be called, as Hotspot has to search the hierarchy and locate the target method. Hotspot can become very clever here: if it notices that only one class has been loaded, it can assume that this will always be the case and optimise and inline methods accordingly. Should another class get loaded that tells Hotspot there is actually a decision between two methods to be made, it will drop its previous assumptions and recompile on the fly.
The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing. Hotspot's ability to record information and statistics about the environment that it is running in, and the work load that it is currently experiencing, makes the optimisations that are performed very flexible and dynamic. In fact, it is very possible that over the lifetime of a single Java process the code for that program will be regenerated many times over as the nature of its work load changes. This arguably gives Hotspot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered to be just as fast as writing C code.
It also makes understanding microbenchmarks a lot harder; in fact, it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in.
Take a minute to raise a pint to those guys, Hotspot and the JVM as a whole is a fantastic engineering triumph that rose to the fore at a time when people were saying that it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
So given that context, in summary we refer to warming up a JVM in microbenchmarks as running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server Hotspot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise having method calls between the core test runs: while Hotspot can do 'on stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
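The warm-up procedure summarised above can be sketched roughly as follows. This is a hand-rolled illustration, not a substitute for a benchmark harness; the class name, the sink field and the fixed random seed are assumptions added here so that the measured work is actually used and cannot be discarded as dead code.

```java
import java.util.Random;

final class WarmupSketch {
    // Results are written here so the JIT cannot remove the loops as dead code.
    static float sink;

    static long timeDivide(float[] data) {
        long before = System.nanoTime();
        float acc = 0f;
        for (float f : data) acc += f / 5.0f;
        sink = acc;
        return System.nanoTime() - before;
    }

    static long timeMultiply(float[] data) {
        long before = System.nanoTime();
        float acc = 0f;
        for (float f : data) acc += f * 0.2f;
        sink = acc;
        return System.nanoTime() - before;
    }

    /** Returns {total division nanos, total multiplication nanos} over the measured runs only. */
    static long[] measure(int warmupRuns, int measuredRuns) {
        float[] data = new float[10_000];
        Random random = new Random(42); // fixed seed, an arbitrary choice for repeatability
        for (int i = 0; i < data.length; i++) data[i] = random.nextFloat() * 1000f;

        // Warm-up: call the methods under test repeatedly and throw the timings away.
        for (int i = 0; i < warmupRuns; i++) {
            timeDivide(data);
            timeMultiply(data);
        }

        long divisionTime = 0L;
        long multiplicationTime = 0L;
        for (int i = 0; i < measuredRuns; i++) {
            divisionTime += timeDivide(data);
            multiplicationTime += timeMultiply(data);
        }
        return new long[] { divisionTime, multiplicationTime };
    }

    public static void main(String[] args) {
        long[] result = measure(10_000, 1_000);
        System.out.println("Divide by 5.0f: " + result[0]);
        System.out.println("Multiply with 0.2f: " + result[1]);
    }
}
```

The timed work lives in its own methods, which are invoked once per run, so the JIT can compile and swap whole methods rather than relying only on on-stack replacement of one long loop.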

You aren't measuring anything useful "without a warmup phase"; you're measuring the speed of interpreted code times how long it takes for the on-stack replacement to be generated. Maybe divisions cause compilation to kick in earlier.
There are sets of guidelines and various packages for building microbenchmarks that don't suffer from these sorts of issues. I would suggest that you read the guidelines and use the ready-made packages if you intend to continue doing this sort of thing.

Multi-thread state visibility in Java: is there a way to turn the JVM into the worst case scenario?

Suppose our code has 2 threads (A and B) that have a reference to the same instance of this class somewhere:
public class MyValueHolder {
    private int value = 1;
    // ... getter and setter
}
When Thread A does myValueHolder.setValue(7), there is no guarantee that Thread B will ever read that value: myValueHolder.getValue() could - in theory - keep returning 1 forever.
In practice however, the hardware will clear the second level cache sooner or later, so Thread B will read 7 sooner or later (usually sooner).
Is there any way to make the JVM emulate that worst case scenario for which it keeps returning 1 forever for Thread B? That would be very useful to test our multi-threaded code with our existing tests under those circumstances.

jcstress maintainer here. There are multiple ways to answer that question.
1. The easiest solution would be wrapping the getter in a loop and letting the JIT hoist it. This is allowed for non-volatile field reads, and simulates the visibility failure with a compiler optimization.
2. A more sophisticated trick involves getting a debug build of OpenJDK and using -XX:+StressLCM -XX:+StressGCM, effectively fuzzing the instruction scheduling. Chances are the load in question will float somewhere you can detect with the regular tests your product has.
3. I am not sure if there is practical hardware that holds the written value opaque to cache coherency for long enough, but it is somewhat easy to build the test case with jcstress. You have to keep in mind that the optimization in (1) can also happen, so we need to employ a trick to prevent that. I think something like this should work.
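As an illustration of the first approach in this answer only (spinning on a plain field read so that the JIT is allowed to hoist it), here is a minimal sketch. It is not the jcstress test referred to above. The class and method names are hypothetical, and whether the plain-field reader actually gets stuck depends on the JVM, its flags and timing, so no particular outcome is promised for that case.

```java
final class HoistingSketch {
    static final class Holder {
        int plainValue = 1;             // non-volatile: the JIT may hoist the read out of a loop
        volatile int volatileValue = 1; // volatile: each read must observe the latest write
    }

    /** Returns true if the spinning reader observed the update within timeoutMillis. */
    static boolean readerSawUpdate(boolean useVolatile, long timeoutMillis) {
        final Holder holder = new Holder();
        Thread reader = new Thread(() -> {
            if (useVolatile) {
                while (holder.volatileValue == 1) { /* spin */ }
            } else {
                while (holder.plainValue == 1) { /* spin; this read may be hoisted */ }
            }
        });
        reader.setDaemon(true); // do not keep the JVM alive if the reader never exits
        reader.start();
        try {
            Thread.sleep(200); // give the JIT a chance to compile the spin loop
            if (useVolatile) {
                holder.volatileValue = 7;
            } else {
                holder.plainValue = 7;
            }
            reader.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reader", e);
        }
        return !reader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("volatile read saw update: " + readerSawUpdate(true, 2_000));
        System.out.println("plain read saw update:    " + readerSawUpdate(false, 2_000));
    }
}
```

The volatile variant is there as a control: it must terminate, which shows that any hang in the plain variant comes from the missing happens-before edge rather than from the test setup.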

It would be great to have a Java compiler that would intentionally perform as many weird (but allowed) transformations as possible, to be able to break thread-unsafe code more easily, like Csmith for C. Unfortunately, such a compiler does not exist (as far as I know).
In the meantime, you can try the jcstress library* and exercise your code on several architectures, if possible with weaker memory models (i.e. not x86) to try and break your code:
The Java Concurrency Stress tests (jcstress) is an experimental harness and a suite of tests to aid research in the correctness of concurrency support in the JVM, class libraries, and hardware.
But in the end, unfortunately, the only way to prove that a piece of code is 100% correct is code inspection (and I don't know of a static code analysis tool able to detect all race conditions).
*I have not used it and I am unclear which of jcstress and the java-concurrency-torture library is more up to date (I would suspect jcstress).

Not on a real machine, sadly; testing multi-threaded code will remain difficult.
As you say, the hardware will clear the second level cache, and the JVM has no control over that. The JLS only specifies what must happen, and this is a case where B might never see the updated value of value.
The only way to force this to happen on a real machine is to alter the code in such a way to void your testing strategy i.e. you end up testing different code.
However, you might be able to run this on a simulator that simulates hardware that doesn't clear the second level cache. Sounds like a lot of effort though!

I think you are referring to the principle called "false sharing", where different CPUs must synchronize their caches or else face the possibility that data such as you describe could become mismatched. There is a very good article on false sharing on Intel's website. Intel describes some useful tools in their article for diagnosing this problem. This is a relevant quote:
The primary means of avoiding false sharing is through code
inspection. Instances where threads access global or dynamically
allocated shared data structures are potential sources of false
sharing. Note that false sharing can be obscured by the fact that
threads may be accessing completely different global variables that
happen to be relatively close together in memory. Thread-local storage
or local variables can be ruled out as sources of false sharing.
Although methods described in the article are not what you have asked for (forcing worst-case behavior from the JVM), as already stated this isn't really possible. The methods described in this article are the best way I know to try to diagnose and avoid false sharing.
There are other resources addressing this problem around the web. For example, this article has a suggestion for a way to avoid false sharing in Java. I have not tried this method, so I cannot vouch for it, but I think the author's idea is sound. You might consider trying out his suggestion.

I have previously suggested a worst case behaving JVM for test purposes on the memory model list but the idea didn't seem popular.
So how do you get "worst case JVM behaviour" with existing technology, i.e. how can you test the scenario in the question and get it to fail EVERY time? You could try to find the setup with the weakest memory model possible, but that is unlikely to be perfect.
What I have often considered is using a distributed JVM, similar to how I believe Terracotta works under the covers, so that your application runs on multiple JVMs (either remote or local), with threads of the same application running in different instances. In this setup, inter-JVM thread communication takes place only at memory barriers, e.g. the synchronized keywords that are missing in the buggy code (so it conforms to the Java Memory Model), and the placement is configured, i.e. you say that this class's thread runs here. No code change is required for your tests, just configuration. Any well-ordered Java application should run out of the box; however, this setup would be very intolerant of a badly ordered application. Normally that is a problem, but here it is an asset: the memory model exhibits very weak but legal behaviour. In the example above, if the code is loaded onto a cluster and the two threads run on different nodes, setValue has no effect visible to the other thread unless the code is changed to use synchronized, volatile and so on, at which point the code works as intended.
Now your test for the example above (configured correctly) would fail every time without correct "happens before" ordering, which is potentially very useful for tests. The flaw in the plan is that, for complete coverage, you would potentially need a node per application thread (on the same machine or across a cluster), or multiple test runs. If you have thousands of threads that could be prohibitive, though hopefully they would be pooled and scaled down for E2E test scenarios, or you could run it in a cloud. If nothing else, this kind of setup might be useful in demonstrating the issue.
inter thread communication across JVMs

The example you have given is described as Incorrectly Synchronized in http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4. I think this is always incorrect and will lead to bugs sooner or later. Most of the times later :-).
To find such incorrectly synchronized code blocks, I use the following algorithm:
Record the threads for all field modifications using instrumentation. If a field is modified by more than one thread without synchronization, I have found a data race.
I implemented this algorithm inside http://vmlens.com, which is a tool to find data races inside java programs.
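The recording idea described in this answer can be sketched in a few lines. This is a simplified illustration written for this page, not how vmlens is implemented: the WriteRecorder class, its method names and the use of Thread.holdsLock on a single known monitor are all assumptions, and a real tool would obtain the write events through bytecode instrumentation rather than explicit calls.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class WriteRecorder {
    // Field name -> threads that have written to it.
    private final Map<String, Set<Thread>> writers = new ConcurrentHashMap<>();
    // Fields that were written at least once without holding the expected monitor.
    private final Set<String> unsynchronizedFields = ConcurrentHashMap.newKeySet();

    /** Called (here by hand, in a real tool by instrumentation) on every field write. */
    void recordWrite(String field, Object monitor) {
        writers.computeIfAbsent(field, k -> ConcurrentHashMap.<Thread>newKeySet())
               .add(Thread.currentThread());
        if (!Thread.holdsLock(monitor)) {
            unsynchronizedFields.add(field);
        }
    }

    /** More than one writer thread plus at least one unsynchronized write is flagged. */
    boolean hasSuspectedRace(String field) {
        Set<Thread> threads = writers.get(field);
        return threads != null && threads.size() > 1 && unsynchronizedFields.contains(field);
    }

    /** Writes the same hypothetical field from two threads and reports the verdict. */
    static boolean demo(boolean synchronizedWrites) {
        WriteRecorder recorder = new WriteRecorder();
        Object lock = new Object();
        Runnable write = () -> {
            if (synchronizedWrites) {
                synchronized (lock) {
                    recorder.recordWrite("value", lock);
                }
            } else {
                recorder.recordWrite("value", lock);
            }
        };
        write.run(); // write from the calling thread
        Thread other = new Thread(write);
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        return recorder.hasSuspectedRace("value");
    }

    public static void main(String[] args) {
        System.out.println("unsynchronized writes flagged: " + demo(false));
        System.out.println("synchronized writes flagged:   " + demo(true));
    }
}
```

The check is deliberately coarse: it only knows about one monitor per write and ignores volatile fields, locks from java.util.concurrent and reads, so it can both miss races and report false positives.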

Here's a simple way: just comment out the code for setValue. You can uncomment it after testing. Since in many cases like this a mechanism is needed to fake failures, it would be a good idea to build a general mechanism for all such cases.

Java Concurrent Dixon's Algorithm

I have endeavored to concurrently implement Dixon's algorithm, with poor results. For small numbers <~40 bits, it operates in about twice the time as other implementations in my class, and after about 40 bits, takes far longer.
I've done everything I can, but I fear it has some fatal issue that I can't find.
My code (fairly lengthy) is located here. Ideally the algorithm would work faster than non-concurrent implementations.

Why would you think it would be faster? Spinning up a thread and adding synchronized calls are HUGE time sinks. If you can't avoid the synchronized keyword, I highly recommend a single-threaded solution.
You may be able to avoid them in various ways--for instance by ensuring that a given variable is only written by one thread even if read by others or by acting like a functional language and making all your variables final using Recursion for variable storage (Iffy, hard to imagine this would speed anything).
If you really need to be fast, however, I did find some very counter-intuitive things out recently from my own attempt at finding a speedy solution...
Static methods didn't help over actual class instances.
Breaking the code down into smaller classes and methods actually INCREASED speed.
Final methods helped more than I would have thought they would
Once I noticed that adding a method call helped speed things along
Don't stress over one-time class allocations or data allocations but avoid allocating objects in loops (This one is obvious but I think it's the most critical)
What I've been able to intuit is that the compiler is extremely smart at optimizing and is tuned to optimize "ideal" Java code. Static methods are nowhere near ideal; they are kind of a counter-pattern.
I suggest you write the clearest, best OO code you can that actually runs correctly as a reference--then time it and start attempting tweaks to speed it up.
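The earlier suggestion of making sure a given variable is written by only one thread can be sketched as follows. This is a generic illustration with a hypothetical parallelSum method, not code from the Dixon's algorithm implementation in question: each worker writes only its own slot of a result array, and the main thread reads the slots after join(), so no synchronized block is needed.

```java
final class SingleWriterSketch {
    /** Sums values using the given number of worker threads (threads must be at least 1). */
    static long parallelSum(int[] values, int threads) {
        final long[] partial = new long[threads]; // slot t is written only by worker t
        Thread[] workers = new Thread[threads];
        int chunk = (values.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int id = t;
            final int from = Math.min(values.length, id * chunk);
            final int to = Math.min(values.length, from + chunk);
            workers[t] = new Thread(() -> {
                long local = 0; // accumulate in a local variable, not in shared state
                for (int i = from; i < to; i++) local += values[i];
                partial[id] = local; // single write to this worker's own slot
            });
            workers[t].start();
        }
        long total = 0;
        try {
            for (int t = 0; t < threads; t++) {
                workers[t].join();   // join() makes the worker's write visible here
                total += partial[t];
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = new int[1_000_000];
        for (int i = 0; i < values.length; i++) values[i] = i % 10;
        System.out.println("sum = " + parallelSum(values, 4));
    }
}
```

Thread start-up cost still applies, so a split like this only pays off when each chunk carries enough work to outweigh it.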

Costs of operations in Java

Is there a way to tell how expensive an operation is for the processor, in milliseconds or flops?
I would be interested in "instanceof" and casts (I heard they are very "expensive").
Are there some studies about that?

It will depend on which JVM you're using, and the cost of many operations can vary even within the same JVM, depending on the exact situation and how much optimization the JIT has performed.
For example, a virtual method call can still be inlined by the Hotspot JIT - so long as it hasn't been overridden by anything else. In some cases with the server JIT it can still be inlined with a quick type test, for up to a couple of types.
Basically, JITs are complex enough that there's unlikely to be a meaningful general purpose answer to the question. You should benchmark your own specific situation in as real-world a way as possible. You should usually write code with primary goals of simplicity and readability, but measure the performance regularly.

The time where counting instructions or cycles could give you a good idea about the performance of some code are long gone, thanks to many, many optimizations happening on all levels of software execution.
This is especially true for VM-based languages, where the JVM can simply skip some steps because it knows that it's not necessary.
For example, I've read some time ago in an article (I'll try to find and link it eventually) that these two methods are pretty much equivalent in cost (on the HotSpot JVM, that is):
public void frobnicate1(Object o) {
    // explicit type check before the cast
    if (!(o instanceof SomeClass)) {
        throw new IllegalArgumentException("Oh Noes!");
    }
    frobnicateSomeClass((SomeClass) o);
}
public void frobnicate2(Object o) {
    // cast only; throws ClassCastException if o has the wrong type
    frobnicateSomeClass((SomeClass) o);
}
Obviously the first method does more work, but the JVM knows that the type of o has already been checked in the if and can actually skip the type-check on the cast later on and make it a no-op.
This and many other optimizations make counting "flops" or cycles pretty much useless.
Generally speaking an instanceof check is relatively cheap. On the HotSpot JVM it boils down to a numeric check of the type id in the object header.
This classic article describes why you should "Write Dumb Code".
There's also an article from 2002 that describes how instanceof is optimized in the HotSpot JVM.

Once the JVM has warmed up, most operations can be counted in nano-seconds (millionths of a milli-second). When talking about something being expensive, you usually have to say it's expensive relative to an alternative. It's next to impossible to describe something as expensive in all cases.
Usually, the most important expense is your time (and that of the other developers in your team). Using instanceof can be expensive in development and code support time because it often indicates a poor design. Using proper OOP techniques is usually a better idea. The 10 nano-seconds an instanceof might take is usually relatively trivial.

The cost of specific operations performed inside the CPU is almost never relevant for performance. If performance is bad, it's almost always because of IO (network, disk) or inefficient code. Writing efficient code is much more about finding a way to reduce the overall number of operations than about avoiding "costly" operations (except those that are orders of magnitude more costly, like IO).

Performance of new operator versus newInstance() in Java

I was using newInstance() in a sort-of performance-critical area of my code.
The method signature is:
<T extends SomethingElse> T create(Class<T> clasz)
I pass Something.class as argument, I get an instance of SomethingElse, created with newInstance().
Today I got back to clear this performance TODO from the list, so I ran a couple of tests of new operator versus newInstance(). I was very surprised with the performance penalty of newInstance().
I wrote a little about it, here: http://biasedbit.com/blog/new-vs-newinstance/
(Sorry about the self promotion... I'd place the text here, but this question would grow out of proportions.)
What I'd love to know is why does the -server flag provide such a performance boost when the number of objects being created grows largely and not for "low" values, say, 100 or 1000.
I did learn my lesson with the whole reflections thing, this is just curiosity about the optimisations the JVM performs in runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Edit: I've added a warmup phase and the results are now more stable. Thanks for the input!

I did learn my lesson with the whole reflections thing, this is just curiosity about the optimisations the JVM performs in runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Answering the second part first, your code seems to be making the classic mistake for Java micro-benchmarks and not "warming up" the JVM before making your measurements. Your application needs to run the method that does the test a few times, ignoring the first few iterations ... at least until the numbers stabilize. The reason for this is that a JVM has to do a lot of work to get an application started; e.g. loading classes and (when they've run a few times) JIT compiling the methods where significant application time is being spent.
I think the reason that "-server" is making a difference is that (among other things) it changes the rules that determine when to JIT compile. The assumption is that for a "server" it is better to JIT sooner; this gives slower startup but better throughput. (By contrast, a "client" is tuned to defer JIT compiling so that the user gets a working GUI sooner.)
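For readers who want to reproduce the comparison with a warm-up phase, here is a minimal sketch. The Something class and the timing helpers are hypothetical stand-ins for the code from the linked blog post, and the sketch uses getDeclaredConstructor().newInstance() in place of Class.newInstance(), which is deprecated in current JDKs; the reflective cost profile may therefore differ somewhat from the original test.

```java
final class FactorySketch {
    /** Hypothetical class to instantiate; it needs an accessible no-argument constructor. */
    static class Something {
        int x = 42;
    }

    // Accumulated field values are stored here so the allocations are not dead code.
    static long sink;

    /** Reflective factory in the spirit of the create(Class<T>) method from the question. */
    static <T> T create(Class<T> clazz) {
        try {
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + clazz, e);
        }
    }

    static long timeNew(int count) {
        long before = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < count; i++) acc += new Something().x;
        sink = acc;
        return System.nanoTime() - before;
    }

    static long timeReflective(int count) {
        long before = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < count; i++) acc += create(Something.class).x;
        sink = acc;
        return System.nanoTime() - before;
    }

    public static void main(String[] args) {
        // Warm-up: run both variants repeatedly and ignore the timings.
        for (int i = 0; i < 20; i++) {
            timeNew(100_000);
            timeReflective(100_000);
        }
        System.out.println("new:         " + timeNew(1_000_000) + " ns");
        System.out.println("newInstance: " + timeReflective(1_000_000) + " ns");
    }
}
```

Running it with different object counts, with and without the warm-up loop, makes it easier to see how much of the reported gap comes from JIT compilation kicking in rather than from reflection itself.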

IMHO the performance penalty comes from the class loading mechanism.
In the case of reflection, all the security mechanisms are used, and thus the creation penalty is higher.
In the case of the new operator, the classes are already loaded in the VM (checked and prepared by the default classloader) and the actual instantiation is a cheap process.
The -server parameter does a lot of JIT optimizations for the frequently used code. You might also want to try it with the -Xbatch parameter, which will trade off the startup time, but then the code will run faster.

Among other things, the garbage collection profile for the -server option has significantly different survivor space sizing defaults.
On closer reading, I see that your example is a micro-benchmark and the results may be counter-intuitive. For example, on my platform, repeated calls to newInstance() are effectively optimized away during repeated runs, making newInstance() appear 12.5 times faster than new.
