While writing a micro-benchmark service for an SQL wrapper I'm developing, I decided to use a benchmarking approach that involves annotating benchmark methods. The problem is that Method.invoke() has a slight overhead (depending on your system). That would be acceptable if I were benchmarking very long tasks, but these tests last only about half a second on average, and every microsecond can really change how a benchmark result looks.
Is there any way to work around this overhead, or is there a way to time the processing internally (e.g. via a Java method)?
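For context, the kind of annotation-driven harness I mean looks roughly like this (the annotation and class names are placeholders, not my actual code):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Bench {}

public class BenchHarness {
    public static void run(Object target) throws Exception {
        for (Method m : target.getClass().getDeclaredMethods()) {
            if (m.isAnnotationPresent(Bench.class)) {
                m.setAccessible(true);            // skip access checks on each reflective call
                long start = System.nanoTime();
                m.invoke(target);                 // the reflective call whose overhead worries me
                long elapsed = System.nanoTime() - start;
                System.out.printf("%s: %d ns%n", m.getName(), elapsed);
            }
        }
    }
}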
Related
I know how to measure time (e.g. prefer System.nanoTime()) and I'm measuring the execution time of methods in test builds of my applications here and there.
But sometimes bad performance can happen in a production app. It would be nice to know whether the reason is an HTTP call to another host, mapping code for DB entities, JSON/XML mappers, or some other part of the logic.
So I want to include that time measuring in the production code as well, where it can be enabled or disabled on demand.
I'm thinking of a naive approach: a TimeMeasure interface that is implemented twice, once with actual time measuring and once with empty method bodies to avoid too much overhead in production use.
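Something like this is what I have in mind (a minimal sketch; the interface and class names are made up):

interface TimeMeasure {
    void start(String label);
    void stop(String label);
}

// Real implementation: records and prints elapsed wall-clock time per label.
class RealTimeMeasure implements TimeMeasure {
    private final java.util.Map<String, Long> starts = new java.util.concurrent.ConcurrentHashMap<>();

    public void start(String label) {
        starts.put(label, System.nanoTime());
    }

    public void stop(String label) {
        Long begin = starts.remove(label);
        if (begin != null) {
            System.out.println(label + ": " + (System.nanoTime() - begin) + " ns");
        }
    }
}

// No-op implementation, to be swapped in when measuring is disabled in production.
class NoOpTimeMeasure implements TimeMeasure {
    public void start(String label) { }
    public void stop(String label) { }
}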
Does that approach sound valid, or is there a better technique to keep the overhead low for code that is only used in very rare cases? Is there a time-measuring library that already supports such a feature?
I'm doing hundreds of logarithm and power calculations inside a Java program. From the second time onwards, the time a calculation takes (measured with System.nanoTime()) is far shorter than the first time. Why? Does Java (I use JDK 8) use any in-memory caching for Math calculations?
At the very first Math calculation the JVM needs, at the very least, to load the Math class from disk into memory, verify its bytecode, and parse it to extract the methods, annotations, and so on. That is much slower than computing a logarithm, so the very first access to the class may be many times slower than subsequent accesses.
During further iterations, JIT compilation of your code can be triggered (so-called on-stack replacement) and your testing method will be compiled, so you may see an even bigger speed-up: calls to Math methods are replaced with CPU instructions, removing both the overhead of passing parameters to native code and the interpreter's work on each iteration. Also, if your test is badly written and you don't use the results of the calculations, the JIT compiler may remove the calls to the Math library entirely.
Finally, for methods as fast as Math.log, System.nanoTime() may produce results that are too imprecise. Consider writing a proper JMH benchmark.
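For reference, a minimal JMH benchmark for this case could look roughly like the following (assuming JMH is on the classpath; the class name and state value are arbitrary):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MathLogBenchmark {
    double x = 42.0;   // kept in benchmark state so the JIT cannot constant-fold the argument

    @Benchmark
    public double log() {
        return Math.log(x);   // returning the result prevents dead-code elimination
    }
}

JMH handles warm-up iterations, forking and result aggregation for you, which is exactly what hand-rolled nanoTime() loops tend to get wrong.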
I am trying to measure the complexity of an algorithm using a timer to measure the execution time, whilst changing the size of the input array.
The code I have at the moment is rather simple:
private long start;   // timestamp recorded by start(), in nanoseconds

public void start() {
    start = System.nanoTime();
}

public long stop() {
    long time = System.nanoTime() - start;
    start = 0;
    return time;
}
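It is used roughly like this (myAlgorithm is a placeholder for the algorithm under test; the sizes are arbitrary):

for (int n = 1_000; n <= 10_000_000; n *= 10) {
    int[] input = new int[n];
    start();
    myAlgorithm(input);                       // run the algorithm on an input of size n
    System.out.println(n + " elements: " + stop() + " ns");
}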
It appears to work fine until the size of the array becomes very large, at which point what I expect to be an O(n) algorithm appears to be O(n^2). I believe this is because of other processes and threads on the CPU cutting in for more time during the runs with larger values of n.
Basically, I want to measure how much time my process has been running for, rather than how long it has been since I invoked the algorithm. Is there an easy way to do this in Java?
Measuring execution time is a really interesting, but also complicated topic. To do it right in Java, you have to know a little bit about how the JVM works. Here is a good article from developerWorks about benchmarking and measuring. Read it, it will help you a lot.
The author also provides a small framework for doing benchmarks. You can use this framework; it will give you exactly what you need: the CPU time consumed, instead of just two timestamps taken before and after. The framework also handles JVM warm-up and keeps track of just-in-time compilation.
You can also use a performance monitor like this one for Eclipse. The problem with such a performance monitor is that it doesn't perform a benchmark. It just tracks the time, memory and other resources that your application is currently using. But that's not a real measurement; it's just a snapshot at a specific point in time.
Benchmarking in Java is a hard problem, not least because the JIT can have weird effects as your method gets more and more heavily optimized. Consider using a purpose-built tool like Caliper. Examples of how to use it and to measure performance on different input sizes are here.
If you want the actual CPU time of the current thread (or indeed, any arbitrary thread) rather than the wall clock time then you can get this via ThreadMXBean. Basically, do this at the start:
ThreadMXBean thx = ManagementFactory.getThreadMXBean();
thx.setThreadCpuTimeEnabled(true);
Then, whenever you want to get the elapsed CPU time for the current thread:
long cpuTime = thx.getCurrentThreadCpuTime();
You'll see that ThreadMXBean has calls to get CPU time and other info for arbitrary threads too.
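As an illustration, here is a small sketch that packages this into a stopwatch-style helper (the class and method names are made up):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuStopwatch {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    private long start;

    public CpuStopwatch() {
        if (threads.isThreadCpuTimeSupported()) {
            threads.setThreadCpuTimeEnabled(true);   // may already be enabled by default on some JVMs
        }
    }

    public void start() {
        start = threads.getCurrentThreadCpuTime();   // CPU time of the current thread, not wall-clock time
    }

    public long stopNanos() {
        return threads.getCurrentThreadCpuTime() - start;
    }
}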
Other comments about the complexities of timing also apply. The timing of an individual invocation of a piece of code can depend, among other things, on the state of the CPU and on what the JIT compiler decides to do at that particular moment. The overall scalability behaviour of an algorithm is generally a trend that emerges across a number of invocations, and you should always be prepared for some outliers in your timings.
Also, remember that just because a particular timing is expressed in nanoseconds (or indeed milliseconds) does not mean that the timing actually has that granularity.
I'm trying to do a typical "A/B testing" like approach on two different implementations of a real-life algorithm, using the same data-set in both cases. The algorithm is deterministic in terms of execution, so I really expect the results to be repeatable.
On the Core 2 Duo, this is indeed the case. Using just the Linux "time" command I get variations in execution time of around 0.1% (over 10 runs).
On the i7 I get all sorts of variations, easily 30% up or down from the average. I assume this is due to the various CPU optimizations that the i7 performs (dynamic overclocking etc.), but it really makes this kind of testing hard. Is there any other way to determine which of the two algorithms is "best"? Are there other sensible metrics I can use?
Edit: The algorithm does not run for very long, and that is exactly the real-life scenario I'm trying to benchmark, so running it repeatedly is not really an option as such.
See if you can turn off the dynamic overclocking in your BIOS. Also, shut down as many other processes as possible while benchmarking.
Well, you could use big-O notation to reason about the performance of the algorithms; it describes their theoretical growth rate.
http://en.wikipedia.org/wiki/Big_O_notation
If you absolutely must know the real-life speed of the algorithm, then of course you must benchmark it on a system. But using big-O notation you can see past all that and focus only on the factors and variables that actually matter.
You didn't indicate how you're benchmarking. You might want to read this if you haven't yet: How do I write a correct micro-benchmark in Java?
If you're running a sustained test, I doubt dynamic clocking is causing your variations; the CPU should stay at its maximum turbo speed. If the test runs for too long, it may be dropping one multiplier for heat, although I doubt that unless you're overclocking and close to the thermal envelope.
Hyper-Threading might be playing a role. You could disable that in your BIOS and see if it makes a difference in your numbers.
On Linux you can lock the CPU frequency to stop clock-speed variation. ;)
You need to make the benchmark as realistic as possible. For example, if you run an algorithm flat out and take an average, you may get very different results than if you perform the same task every 10 ms. I have seen 2x to 10x variation between flat-out and relatively low load, even with a locked clock speed.
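As a rough sketch of what "every 10 ms" could look like in practice (the class name, interval and iteration handling are all illustrative):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PacedRun {
    // Runs the task once every 10 ms for the given number of iterations,
    // printing the wall-clock time of each invocation.
    public static void run(Runnable task, int iterations) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(iterations);
        scheduler.scheduleAtFixedRate(() -> {
            long start = System.nanoTime();
            task.run();
            System.out.println((System.nanoTime() - start) + " ns");
            remaining.countDown();
        }, 0, 10, TimeUnit.MILLISECONDS);
        remaining.await();
        scheduler.shutdownNow();   // stop scheduling once enough iterations have run
    }
}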
I was using newInstance() in a sort-of performance-critical area of my code.
The method signature is:
<T extends SomethingElse> T create(Class<T> clasz)
I pass Something.class as argument, I get an instance of SomethingElse, created with newInstance().
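For reference, the method boils down to something like this (the surrounding class, the SomethingElse stub and the exception handling are simplified placeholders):

class SomethingElse { }   // stand-in for my actual base type

class ReflectiveFactory {
    static <T extends SomethingElse> T create(Class<T> clasz) {
        try {
            return clasz.newInstance();   // the reflective instantiation whose cost is in question
        } catch (InstantiationException | IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }
}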
Today I got back to clearing this performance TODO from the list, so I ran a couple of tests of the new operator versus newInstance(). I was very surprised by the performance penalty of newInstance().
I wrote a little about it, here: http://biasedbit.com/blog/new-vs-newinstance/
(Sorry about the self promotion... I'd place the text here, but this question would grow out of proportions.)
What I'd love to know is why the -server flag provides such a performance boost when the number of objects being created grows large, but not for "low" values, say 100 or 1000.
I did learn my lesson with the whole reflections thing, this is just curiosity about the optimisations the JVM performs in runtime, especially with the -server flag. Also, if I'm doing something wrong in the test, I'd appreciate your feedback!
Edit: I've added a warmup phase and the results are now more stable. Thanks for the input!
Answering the second part first, your code seems to be making the classic mistake for Java micro-benchmarks and not "warming up" the JVM before making your measurements. Your application needs to run the method that does the test a few times, ignoring the first few iterations ... at least until the numbers stabilize. The reason for this is that a JVM has to do a lot of work to get an application started; e.g. loading classes and (when they've run a few times) JIT compiling the methods where significant application time is being spent.
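In code, the warm-up advice amounts to something like this (iteration counts and the placeholder workload are arbitrary):

public class WarmupExample {
    static double sink;   // accumulate results so the JIT cannot discard the work as dead code

    public static void main(String[] args) {
        // Warm-up phase: run the workload until class loading and JIT compilation have settled.
        for (int i = 0; i < 10_000; i++) {
            sink += work(i);
        }
        // Measurement phase: only now take the timestamps.
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            sink += work(i);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(elapsed / 10_000 + " ns/op (sink=" + sink + ")");
    }

    // Placeholder workload; in the question this would be new versus newInstance().
    static double work(int i) {
        return Math.log(i + 1.0);
    }
}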
I think the reason that "-server" makes a difference is that (among other things) it changes the rules that determine when to JIT-compile. The assumption is that for a "server" it is better to JIT sooner: this gives slower startup but better throughput. (By contrast, a "client" is tuned to defer JIT compilation so that the user gets a working GUI sooner.)
IMHO the performance penalty comes from the class loading mechanism.
In the case of reflection, all the security checks are performed, and thus the creation penalty is higher.
In the case of the new operator, the class is already loaded in the VM (checked and prepared by its class loader) and the actual instantiation is a cheap operation.
The -server option enables more aggressive JIT optimization of frequently used code. You might also want to try it together with -Xbatch, which disables background compilation so the JVM waits for a method to finish compiling instead of interpreting it in the meantime (slower startup, but less interpreted execution).
Among other things, the garbage collection profile for the -server option has significantly different survivor space sizing defaults.
On closer reading, I see that your example is a micro-benchmark and the results may be counter-intuitive. For example, on my platform, repeated calls to newInstance() are effectively optimized away during repeated runs, making newInstance() appear 12.5 times faster than new.