How does the JVM execute math calculations faster from the second time onwards? - java

I'm doing hundreds of logarithmic and power calculations inside a Java program. From the second run onwards, the time taken (measured using System.nanoTime()) is way faster than for the first run. Why? Does Java (I use JDK 8) use any in-memory caching for Math calculations?

At the very first Math calculation, the JVM needs at least to load the Math class from disk into memory, verify it (scan it for errors) and parse it to extract the methods, annotations, etc. That is much slower than calculating a logarithm. Thus the very first access to the class may be many times slower than subsequent accesses.
During further iterations, JIT compilation of your code can be triggered (so-called on-stack replacement) and your test method will be compiled, so you may see an even bigger speed-up: calls to Math methods are simply replaced with CPU instructions, which removes the overhead of passing parameters to native code as well as the interpreter's per-iteration work. Also, if your test is badly written and you don't use the results of the calculations, the JIT compiler may remove the calls to the Math library altogether.
Finally, for such fast methods as Math.log, System.nanoTime() may produce results that are too imprecise. Consider writing a proper JMH benchmark.
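For example, a minimal JMH sketch for Math.log could look like the following (the class and field names are made up; the annotations are standard JMH):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Thread)
public class MathLogBenchmark {

    double x;

    @Setup(Level.Iteration)
    public void setup() {
        x = ThreadLocalRandom.current().nextDouble(1.0, 1000.0);
    }

    // Returning the result stops the JIT from eliminating the call as dead code.
    @Benchmark
    public double log() {
        return Math.log(x);
    }
}

JMH handles the warm-up iterations, forking and dead-code pitfalls that make hand-rolled nanoTime() measurements unreliable for calls this cheap.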

Related

Measure time to complete individual streaming steps?

Is there a library that allows measuring Java Stream steps?
As in "when consuming this stream, X amount of time is spent in that filter, Y amount of time is spent in this map, etc..."?
Or do I have to tinker around a bit?
By definition, there is no 'between' time for steps. The method chain that you define only defines the steps to take (the methods build an internal structure containing the steps).
Once you call the terminal operator, all steps are executed in indeterminate order. If you never call a terminal operator, no steps are executed either.
What I mean by indeterminate order is this:
intstream.peek(operation1).peek(operation2).sum();
You don't know whether operation1 is called, then operation2, then operation1 again, or whether operation1 is called a bunch of times, then operation2 a bunch of times, or something else entirely. (I hope I'm correct in saying this. [edit] correct) It's probably left to the implementation to trade off storage against time complexity.
The best you can do is measure the time each operation takes, since you have full control over every operation. The time the stream engine itself takes is hard to gauge. However, be aware that when you do this you're breaking the Stream API requirement that operations should be side-effect-free, so don't use it in production code.
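A rough sketch of that idea (the filter and map bodies below are placeholders): wrap each operation's body with System.nanoTime() and accumulate the elapsed time into counters, accepting the side effects purely for measurement:

import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class StreamStepTiming {
    public static void main(String[] args) {
        AtomicLong filterNanos = new AtomicLong();
        AtomicLong mapNanos = new AtomicLong();

        long sum = IntStream.range(0, 1_000_000)
                .filter(i -> {
                    long t = System.nanoTime();
                    boolean keep = i % 2 == 0;                // the real predicate goes here
                    filterNanos.addAndGet(System.nanoTime() - t);
                    return keep;
                })
                .mapToLong(i -> {
                    long t = System.nanoTime();
                    long mapped = i * 3L;                     // the real mapping goes here
                    mapNanos.addAndGet(System.nanoTime() - t);
                    return mapped;
                })
                .sum();

        System.out.println("sum = " + sum);
        System.out.println("filter: " + filterNanos.get() + " ns, map: " + mapNanos.get() + " ns");
    }
}

This only measures the time spent inside your lambdas, not the stream plumbing around them.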
The only tool I can think of is JMH, which is a general benchmarking tool for Java. However note that it might be hard to understand the effect that different Stream operators have on the benchmark due to JIT. The fact that your code contains a Stream with a bunch of different operators doesn't mean that's exactly the way JIT compiles it.

How to benchmark with Method.invoke in Java

While writing a micro-benchmark service for an SQL wrapper I'm working on, I decided to use a benchmarking approach that involves annotating benchmark methods. The problem is that Method.invoke() has a slight overhead (depending on your system). While this would be acceptable if I were benchmarking very long tasks, these tests last only about half a second on average, and every microsecond can really change how a benchmark result looks.
Is there any way to work around this overhead, or is there a way to time the processing internally (e.g. via a Java method)?
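One possible approach (a sketch only; the method names below are hypothetical) is to take the timestamps inside the invoked benchmark method and return the elapsed time, so the reflective dispatch overhead stays outside the measured window:

import java.lang.reflect.Method;

public class ReflectiveBenchmark {

    // Hypothetical benchmark method: it times itself and returns the elapsed nanoseconds,
    // so the Method.invoke() overhead is not part of the measurement.
    public long benchmarkQuery() {
        long start = System.nanoTime();
        // ... the work being benchmarked ...
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        ReflectiveBenchmark target = new ReflectiveBenchmark();
        Method m = ReflectiveBenchmark.class.getMethod("benchmarkQuery");
        m.setAccessible(true);                 // skips access checks on repeated calls
        long total = 0L;
        for (int i = 0; i < 1_000; i++) {
            total += (Long) m.invoke(target);  // only the inside-method time is accumulated
        }
        System.out.println("average ns per run: " + total / 1_000);
    }
}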

Can method extraction negatively impact code performance?

Assume you have a quite long method with around 200 lines of very time-sensitive code. Is it possible that extracting some parts of the code into separate methods will slow down execution?
Most probably, you'll get a speedup. The problem is that optimizing a 200-line beast is hard. Actually, Hotspot gives up when the method is too long. I once achieved a speedup factor of 2 simply by splitting a long method.
Short methods are fine, and they'll be inlined as needed, so the method call overhead gets minimized. Through inlining, Hotspot may re-create your original method (improbable, given its excessive length) or create multiple methods, some of which may contain code not present in the original method.
The answer is "yes, it may get slower." The problem is that the chosen inlining may be suboptimal. However, it's very improbable, and I'd expect a speedup instead.
The overhead is negligible; the compiler will inline the methods when executing them.
EDIT:
similar question
I don't think so.
Yes, some calls and stack frames would be added, but that doesn't cost a lot of time, and depending on your compiler it might even optimize the code in such a way that there is basically no difference between the version with one method and the one with many.
The loss of readability and reusability you would suffer by implementing everything in one method is definitely not worth the performance increase (if there is one at all).
It is important that the factored-out methods are declared either private or final. The just-in-time compiler in the JVM will then inline everything, which means a single big method is executed as a result.
However, always benchmark your code when modifying it.
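As a minimal illustration of the extraction being discussed (the class and method names are made up), small private helpers like these stay well under HotSpot's inlining thresholds, so after warm-up the call overhead is typically inlined away:

public class PricingEngine {

    public double totalPrice(double[] prices, double taxRate) {
        double subtotal = sum(prices);          // extracted step 1
        return applyTax(subtotal, taxRate);     // extracted step 2
    }

    // small private methods are easy inlining candidates for the JIT compiler
    private double sum(double[] prices) {
        double s = 0.0;
        for (double p : prices) {
            s += p;
        }
        return s;
    }

    private double applyTax(double subtotal, double taxRate) {
        return subtotal * (1.0 + taxRate);
    }
}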

Why are floating point operations much faster with a warmup phase?

I initially wanted to test something different regarding floating-point performance optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without warm-up but faster with it, by a factor of about 1.5 in each case).
After studying the results I noticed that I had forgotten to add a warm-up phase, as suggested so often when doing performance optimisations, so I added it. And, to my utter surprise, it turned out to be about 25 times faster on average over multiple test runs.
I tested it with the following code:
public static void main(String[] args)
{
    float[] test = new float[10000];
    float[] test_copy;
    //warmup
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();
        divideByTwo(test);
        multiplyWithOneHalf(test_copy);
    }
    long divisionTime = 0L;
    long multiplicationTime = 0L;
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();
        divisionTime += divideByTwo(test);
        multiplicationTime += multiplyWithOneHalf(test_copy);
    }
    System.out.println("Divide by 5.0f: " + divisionTime);
    System.out.println("Multiply with 0.2f: " + multiplicationTime);
}

public static long divideByTwo(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f /= 5.0f;
    }
    return System.nanoTime() - before;
}

public static long multiplyWithOneHalf(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f *= 0.2f;
    }
    return System.nanoTime() - before;
}

public static void fillRandom(float[] data)
{
    Random random = new Random();
    for (float f : data)
    {
        f = random.nextInt() * random.nextFloat();
    }
}
Results without warm-up phase:
Divide by 5.0f: 382224
Multiply with 0.2f: 490765
Results with warm-up phase:
Divide by 5.0f: 22081
Multiply with 0.2f: 10885
Another interesting change that I cannot explain is the reversal of which operation is faster (division vs. multiplication). As mentioned earlier, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.
I tried adding an initialization block setting the values to something random, but it didn't affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.
What is the reason for this behaviour? What is this warm-up phase, how does it influence performance, why are the operations so much faster with a warm-up phase, and why does the faster operation change?
Before the warm-up, Java runs the bytecodes via an interpreter; think of how you would write a program that could execute Java bytecodes in Java. After warm-up, Hotspot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter runs many, many CPU instructions for a single bytecode, whereas Hotspot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and the time to multiply will ultimately come down to the CPU that one is running on, since each will be just a single CPU instruction.
The second part of the puzzle is that Hotspot also records statistics that measure the runtime behaviour of your code; when it decides to optimise the code, it uses those statistics to perform optimisations that are not necessarily possible at compilation time. For example, it can reduce the cost of null checks, branch mispredictions and polymorphic method invocations.
In short, one must discard the results pre-warmup.
Brian Goetz wrote a very good article here on this subject.
========
APPENDED: overview of what 'JVM Warm-up' means
JVM 'warm-up' is a loose phrase and is not, strictly speaking, a single phase or stage of the JVM. People tend to use it to refer to the point where JVM performance stabilizes after compilation of the JVM bytecodes to native code. In truth, when one starts to scratch under the surface and delve deeper into the JVM internals, it is difficult not to be impressed by how much Hotspot is doing for us. My goal here is just to give you a better feel for what Hotspot can do in the name of performance; for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).
As already mentioned, the JVM starts by running your Java code through its interpreter. While strictly speaking not 100% correct, one can think of an interpreter as a large switch statement inside a loop that iterates over every JVM bytecode (command). Each case within the switch statement handles a JVM bytecode such as add two values together, invoke a method, invoke a constructor and so forth. The overhead of the iteration and of jumping between the commands is very large. Thus executing a single command will typically use over 10x as many assembly instructions, which means > 10x slower, since the hardware has to execute so many more instructions and the caches get polluted by this interpreter code which we would ideally rather have focused on our actual program. Think back to the early days of Java, when Java earned its reputation for being very slow; this is because it was originally a fully interpreted language.
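To make that picture concrete, here is a toy dispatch loop in the spirit described above (an invented instruction set, nothing like real JVM bytecodes): every single instruction pays for the switch dispatch and the loop bookkeeping.

public class ToyInterpreter {
    // invented opcodes for a tiny stack machine
    static final int PUSH = 0, ADD = 1, PRINT = 2, HALT = 3;

    public static void main(String[] args) {
        int[] program = {PUSH, 2, PUSH, 3, ADD, PRINT, HALT};   // computes and prints 2 + 3
        int[] stack = new int[16];
        int sp = 0;   // stack pointer
        int pc = 0;   // program counter
        loop:
        while (true) {
            switch (program[pc++]) {                // dispatch overhead on every instruction
                case PUSH:  stack[sp++] = program[pc++]; break;
                case ADD:   stack[sp - 2] += stack[sp - 1]; sp--; break;
                case PRINT: System.out.println(stack[sp - 1]); break;
                case HALT:  break loop;
            }
        }
    }
}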
Later on, JIT compilers were added to Java; these compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed code to execute directly in hardware. While execution in hardware is much faster, the extra compilation created a stall at startup for Java, and this was partly where the terminology of the 'warm-up phase' took hold.
The introduction of Hotspot to the JVM was a game changer. Now the JVM starts up faster, because it begins life running Java programs with its interpreter while individual Java methods are compiled in a background thread and swapped in on the fly during execution. The generation of native code can also be done at different levels of optimisation, sometimes using very aggressive optimisations that are, strictly speaking, incorrect, and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour. For example, class hierarchies imply a large cost in figuring out which method will be called, as Hotspot has to search the hierarchy and locate the target method. Hotspot can become very clever here: if it notices that only one class has been loaded, it can assume that this will always be the case and optimise and inline methods accordingly. Should another class be loaded that tells Hotspot there is actually a decision between two methods to be made, it will remove its previous assumptions and recompile on the fly.

The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing. Hotspot's ability to record information and statistics about the environment it is running in, and about the workload it is currently experiencing, makes the optimisations very flexible and dynamic. In fact it is quite possible that over the lifetime of a single Java process the code for that program will be regenerated many times over as the nature of its workload changes. This arguably gives Hotspot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered just as fast as C code. It also makes understanding microbenchmarks a lot harder; in fact it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in. Take a minute to raise a pint to those guys; Hotspot and the JVM as a whole is a fantastic engineering triumph that rose to the fore at a time when people were saying that it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
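The class-hierarchy example can be sketched roughly like this (hypothetical types): while Circle is the only loaded implementation of Shape, Hotspot can treat the call site as monomorphic, devirtualise s.area() and inline it; loading a second implementation later would force it to deoptimise and recompile.

interface Shape {
    double area();
}

class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class DevirtualisationDemo {
    public static void main(String[] args) {
        Shape s = new Circle(2.0);
        double total = 0.0;
        // a hot loop: after enough iterations Hotspot compiles it and, as long as
        // Circle is the only Shape implementation it has seen, the virtual call
        // below can be devirtualised and inlined
        for (int i = 0; i < 1_000_000; i++) {
            total += s.area();
        }
        System.out.println(total);
    }
}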
So, given that context, in summary: in microbenchmarks we refer to warming up a JVM as running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server Hotspot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise putting the core test runs inside a method that is called repeatedly, because while Hotspot can do 'on-stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
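A hand-rolled harness along those lines might look like the sketch below (illustrative only; workload() and the run counts are made up, and a tool like JMH is still preferable):

public class WarmupHarness {

    static final int WARMUP_RUNS = 20_000;    // comfortably above the ~10k compilation threshold
    static final int MEASURED_RUNS = 1_000;

    public static void main(String[] args) {
        double sink = 0.0;
        // warm-up: results are discarded; these runs cover interpretation,
        // profiling and eventual JIT compilation of workload()
        for (int i = 0; i < WARMUP_RUNS; i++) {
            sink += workload(i);
        }
        // measurement: by now workload() should be running as compiled native code
        long start = System.nanoTime();
        for (int i = 0; i < MEASURED_RUNS; i++) {
            sink += workload(i);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("avg ns per call: " + elapsed / MEASURED_RUNS + " (sink=" + sink + ")");
    }

    // hypothetical workload; keeping it in its own method (rather than inlining the body
    // into the timing loop) avoids relying on on-stack replacement
    static double workload(int i) {
        return Math.log(i + 1) * Math.sqrt(i + 1);
    }
}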
You aren't measuring anything useful "without a warm-up phase"; you're measuring the speed of interpreted code plus how long it takes for the on-stack replacement to be generated. Maybe divisions cause compilation to kick in earlier.
There are sets of guidelines and various packages for building microbenchmarks that don't suffer from these sorts of issues. I would suggest that you read the guidelines and use the ready-made packages if you intend to continue doing this sort of thing.

How to measure number of operations performed

In my current project, I am measuring the complexity of algorithms written in Java. I work with asymptotic complexity (the expected result) and I want to validate that expectation by comparing it with the actual number of operations. Incrementing a counter per operation seems a bit clumsy to me. Is there any better approach to measuring operational complexity?
Thanks
Edit: more info
The algorithms might run on different machines
Some parts of divide-and-conquer algorithms might be precached, so it is likely that the procedure will be faster than expected
Also it is important for me to find out the multiplicative constant (or the additive one), which is not taken into consideration in asymptotic complexity
Is there a particular reason not to just measure the CPU time? The time utility or a profiler will get you the numbers. Just run each algorithm with a sufficient range of inputs and capture the CPU time (not wall-clock time) spent.
On an actual computer you want to measure execution time, e.g. with System.currentTimeMillis(). Vary the N parameter and get solid statistics. Do an error estimate. Your basic least squares will be fine.
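As a rough sketch of that approach (using Arrays.sort as a stand-in for the algorithm under test), vary N, time each run and feed the (N, time) pairs into a least-squares fit:

import java.util.Arrays;
import java.util.Random;

public class ScalingMeasurement {
    public static void main(String[] args) {
        Random random = new Random(42);
        // double N each round and record one timing per input size
        for (int n = 1_000; n <= 1_024_000; n *= 2) {
            int[] data = random.ints(n).toArray();
            long start = System.nanoTime();
            Arrays.sort(data);                        // the algorithm under test goes here
            long elapsed = System.nanoTime() - start;
            System.out.println(n + "\t" + elapsed);   // (N, time) pairs for the least-squares fit
        }
    }
}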
Counting operations in an algorithm is fine, but it has limited use. Different processors do things at different speeds. Counting the number of expressions or statements executed by your implemented algorithm is close to useless: at the algorithm level you can use it to make comparisons for tweaking, but in your implementation this is no longer the case, because compiler/JIT/CPU tricks will dominate.
The asymptotic behavior should be very close to what you calculated/expected if you do good measurements.
ByCounter can be used to instrument Java (across multiple classes) and count the number of bytecodes executed by the JVM at runtime.
