I'm trying to do a typical "A/B testing" style comparison of two different implementations of a real-life algorithm, using the same data set in both cases. The algorithm is deterministic in terms of execution, so I really expect the results to be repeatable.
On the Core 2 Duo, this is also the case. Using just the Linux "time" command I get variations in execution time of around 0.1% (over 10 runs).
On the i7 I get all sorts of variations, easily 30% up and down from the average. I assume this is due to the various CPU optimizations the i7 performs (dynamic overclocking, etc.), but it really makes this kind of testing hard. Is there any other way to determine which of the two algorithms is "best"? Are there any other sensible metrics I can use?
Edit: The algorithm does not run for very long, and this is the real-life scenario I'm trying to benchmark, so running it repeatedly is not really an option as such.
See if you can turn off the dynamic overclocking in your BIOS. Also, kill off as many other running processes as possible when benchmarking.
Well, you could apply O-notation principles to determine the performance of the algorithms. This will tell you their theoretical speed.
http://en.wikipedia.org/wiki/Big_O_notation
If you absolutely must know the real-life speed of the algorithm, then of course you must benchmark it on a system. But using O-notation you can see past all that and focus only on the factors/variables that are important.
You didn't indicate how you're benchmarking. You might want to read this if you haven't yet: How do I write a correct micro-benchmark in Java?
If you're running a sustained test, I doubt dynamic clocking is causing your variations; it should stay at the maximum turbo speed. If you're running it too long, perhaps it drops one multiplier for heat, although I doubt that unless you're overclocking and near the thermal envelope.
Hyper-Threading might be playing a role. You could disable that in your BIOS and see if it makes a difference in your numbers.
On Linux you can lock the CPU speed to stop clock-speed variation. ;)
You need to make the benchmark as realistic as possible. For example, if you run an algorithm flat out and take an average, you might get very different results from performing the same tasks every 10 ms. That is, I have seen 2x to 10x variation (between flat out and relatively low load), even with a locked clock speed.
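For example, a rough sketch of the "every 10 ms" scenario (runTask() is a hypothetical stand-in for the algorithm under test):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PeriodicBenchmark {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Invoke the task the way production will: once every 10 ms,
            // timing each individual run rather than a flat-out loop.
            scheduler.scheduleAtFixedRate(() -> {
                long start = System.nanoTime();
                runTask();
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println("run took " + micros + " us");
            }, 0, 10, TimeUnit.MILLISECONDS);
        }

        static void runTask() {
            // placeholder work standing in for the real algorithm
            long sum = 0;
            for (int i = 0; i < 100_000; i++) sum += i;
            if (sum == -1) System.out.println();   // keep the work observable
        }
    }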
In my program, memory and CPU time are constraints, and this calculation will be done around 50,000 times every second. Will there be a performance gain if bitwise operators are used over arithmetic?
It is highly unlikely that it would make any difference; CPUs haven't cared about this sort of thing for decades.
In general, if you're worried about performance before you have any actual indication that the performance is below your needs, you're going to have a bad time. Modern hardware and the JVM's optimising compiler are so incredibly complicated that even the JVM performance engineers themselves are on record saying they have a very hard time just looking at code and guessing whether it can be made faster with cheap tricks like replacing a division with a bit shift.
The solution is simply never to engage in that sort of thing: if you have performance needs, write them down, and use profilers to figure out where to look (because generally 99% of the CPU resources are spent on 1% or less of the code, so before starting performance measurements, you need to know what to measure).
Once you know, use JMH to actually test performance. That's what it is for.
If JMH tells you that the bit shift is faster (I highly doubt it), know that this result does not necessarily translate to other CPU architectures.
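For example, a minimal JMH sketch for this exact question might look like the following (class and method names are illustrative, and it assumes the jmh-core and annotation-processor artifacts are on the classpath):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class OddCheckBenchmark {
        int i = 12345;

        @Benchmark
        public boolean oddByModulo() {
            return i % 2 == 1;      // arithmetic version
        }

        @Benchmark
        public boolean oddByMask() {
            return (i & 1) == 1;    // bitwise version
        }
    }

Returning the result from each benchmark method keeps HotSpot from optimising the comparison away as dead code.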
My observations from doing some simple testing indicate that yes, it makes a difference. No, I didn't use JMH, so I know I'll get some push-back. But regardless of the order of testing, the bitwise operations were always observed to be faster. I can't say whether faster equates to a real advantage, but until proven otherwise I will continue to favor them when possible.
(i & 1) == 1 is faster than i % 2 == 1
i >> 3 is faster than i / 8
And I have seen this in API code before, documented as being faster, but I haven't tried it:
(a << 6) + (a << 5) + (a << 2) vs a * 100 (the parentheses are required, since + binds more tightly than <<; see the sanity check below)
And then there is this for the bit shifting.
Arithmetic Benchmark
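For what it's worth, here is a quick self-contained sanity check (illustrative only, not a benchmark) that the parenthesized identity above holds:

    public class ShiftAddCheck {
        public static void main(String[] args) {
            for (int a = -1000; a <= 1000; a++) {
                // a*64 + a*32 + a*4 == a*100; without parentheses the
                // expression would parse as a << (6 + a) << ... instead
                int shifted = (a << 6) + (a << 5) + (a << 2);
                if (shifted != a * 100) {
                    throw new AssertionError("mismatch at a=" + a);
                }
            }
            System.out.println("(a<<6)+(a<<5)+(a<<2) == a*100 for all tested a");
        }
    }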
I have a Java application and one of the methods is performance-critical.
I created a loop that calls this method 10 times, and I am checking for performance issues by using the profiler on every iteration. It turned out that the execution time decreases with each iteration, so the 10th iteration has a smaller execution time than the 9th iteration.
Any idea why such case is happening?
Could it be due to the loop overheads?
You are warming up the CPU caches and the JVM, and thus the performance changes.
Profilers put the JVM into an unusual mode, and depending on the profiling approach you are using, they may only be sampling at a regular interval.
I find that profilers are good for giving you relative measurements and for improving your understanding of the code, but always take their readings with a pinch of salt.
Do not trust just a single measurement.
Outside of using profilers, microbenchmarking is a good way to go, although it is a very tricky subject.
Note that HotSpot tends not to kick in and optimise the bytecode until the target code has been called 10,000 or more times.
http://java.dzone.com/articles/microbenchmarking-java and How do I write a correct micro-benchmark in Java? may help you get started. There is also a lot of good advice on the Mechanical Sympathy forum.
A good microbenchmarking framework is here: http://openjdk.java.net/projects/code-tools/jmh/. It helps keep GC and other JVM stop-the-world events out of the timings, and gives some guidance on how to prevent HotSpot from optimising away the very code that you are trying to measure.
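If you want to see the warm-up effect without a framework, a minimal hand-rolled sketch might look like this (compute() is a hypothetical stand-in for the performance-critical method; 20,000 warm-up calls comfortably clears the 10,000-call threshold mentioned above):

    public class WarmupTimer {
        public static void main(String[] args) {
            for (int i = 0; i < 20_000; i++) {
                compute();                 // warm-up: let the JIT compile the method
            }
            long start = System.nanoTime();
            long result = 0;
            for (int i = 0; i < 1_000; i++) {
                result += compute();       // timed runs now hit compiled code
            }
            long elapsed = System.nanoTime() - start;
            System.out.println("avg ns/call: " + elapsed / 1_000 + " (" + result + ")");
        }

        static long compute() {            // placeholder for the real method
            long sum = 0;
            for (int i = 0; i < 1_000; i++) sum += i * i;
            return sum;
        }
    }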
I was wondering how I can estimate the total running time of a Java program on a specific machine before the program ends? I need to know how long it will take so I can announce the progress accordingly.
FYI, the main algorithm of my program has O(n^3) time complexity. Suppose n = 100000: how long would this program take to run on my machine (dual Intel Xeon E2650)?
Regards.
In theory, 1 GHz of computational power should give you about 1 billion simple operations per second. However, finding the number of simple operations is not always easy. Even if you know the time complexity of a given algorithm, that is not enough; you also need to know the constant factor. In theory it is possible to have a linear algorithm that takes several seconds to compute something for an input of size 10000 (and some algorithms like this exist, such as the linear pre-compute time RMQ).
What you do know, however, is that something O(n^3) will need to perform on the order of 100000^3 = 10^15 operations. So even if your constant factor were about 1/10^6 (which is highly improbable), that is still 10^9 operations, on the order of a second at 1 GHz; with any realistic constant, this computation will take a lot of time.
I believe #ArturMalinowski's proposal is the right way to approach your problem: benchmark the performance of your algorithm for a sequence of sizes known beforehand, e.g. {32, 64, 128, ...} or, as he proposes, {1, 10, 100, ...}. This way you will be able to determine the constant factor with relatively good precision.
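For illustration, a rough sketch of that idea (runAlgorithm is a hypothetical placeholder for the real O(n^3) routine, and the sizes are arbitrary): time a few small inputs, estimate the constant c in t(n) = roughly c * n^3, and extrapolate.

    public class ConstantFactorEstimator {
        public static void main(String[] args) {
            int[] sizes = {100, 200, 400, 800};
            double cSum = 0;
            for (int n : sizes) {
                long start = System.nanoTime();
                runAlgorithm(n);
                long elapsed = System.nanoTime() - start;
                cSum += elapsed / Math.pow(n, 3);   // per-run estimate of c (ns per n^3)
            }
            double c = cSum / sizes.length;
            long target = 100_000;
            double seconds = c * Math.pow(target, 3) / 1e9;
            System.out.printf("estimated time for n=%d: %.1f s%n", target, seconds);
        }

        static void runAlgorithm(int n) {
            // placeholder cubic work standing in for the real algorithm
            long sum = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        sum += i ^ j ^ k;
            if (sum == 42) System.out.println(); // defeat dead-code elimination
        }
    }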
In my current project, I am measuring the complexity of algorithms written in Java. I work with asymptotic complexity (the expected result) and I want to validate the expectation by comparing it against the actual number of operations. Incrementing a counter per operation seems a bit clumsy to me. Is there a better approach to measuring operational complexity?
Thanks
Edit: more info
The algorithms might run on different machines
Some parts of divide-and-conquer algorithms might be precached, so it is probable that the procedure will run faster than expected
Also, it is important for me to find out the multiplicative constant (or the additive one), which is not taken into consideration in asymptotic complexity
Is there a particular reason not to just measure the CPU time? The time utility or a profiler will get you the numbers. Just run each algorithm over a sufficient range of inputs and capture the CPU time (not wall-clock time) spent.
On an actual computer you want to measure execution time, e.g. with System.currentTimeMillis(). Vary the N parameter and gather solid statistics. Do an error estimate; your basic least squares will be fine.
Counting operations in an algorithm is fine, but it has limited use: different processors do things at different speeds. Counting the number of expressions or statements executed by your implemented algorithm is close to useless. On the abstract algorithm you can use it to make comparisons for tweaking; in your implementation this is no longer the case, as compiler/JIT/CPU tricks will dominate.
Asymptotic behavior should be very close to the calculated/expected behavior if you take good measurements.
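As a sketch of the least-squares idea (the timings below are made-up placeholders): fitting log t = log c + k * log n recovers both the exponent k and the multiplicative constant c.

    public class ComplexityFit {
        public static void main(String[] args) {
            long[] ns  = {1_000, 2_000, 4_000, 8_000};
            double[] t = {0.9, 3.7, 14.8, 60.1};   // measured CPU times in ms (placeholders)

            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            int m = ns.length;
            for (int i = 0; i < m; i++) {
                double x = Math.log(ns[i]);
                double y = Math.log(t[i]);
                sx += x; sy += y; sxx += x * x; sxy += x * y;
            }
            double k = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // slope = exponent
            double logC = (sy - k * sx) / m;                       // intercept = log c
            System.out.printf("t(n) ~ %.3g * n^%.2f%n", Math.exp(logC), k);
        }
    }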
ByCounter can be used to instrument Java (across multiple classes) and count the number of bytecodes executed by the JVM at runtime.
I'm writing a MOS 6502 processor emulator as part of a larger project I've undertaken in my spare time. The emulator is written in Java, and before you say it, I know it's not going to be as efficient and optimized as if it were written in C or assembly, but the goal is to make it run on various platforms, and it's pulling 2.5 MHz on a 1 GHz processor, which is pretty good for an interpreted emulator. My problem is quite the contrary: I need to limit the number of cycles to 1 MHz. I've looked around but not seen many strategies for doing this. I've tried a few things, including checking the time after a number of cycles and sleeping for the difference between the expected time and the actual time elapsed, but checking the time slows down the emulation by a factor of 8. So does anyone have any better suggestions, or perhaps ways to optimize time polling in Java to reduce the slowdown?
The problem with using sleep() is that you generally only get a granularity of 1 ms, and the actual sleep you get isn't necessarily accurate even to the nearest 1 ms, as it depends on what the rest of the system is doing. A couple of suggestions to try (off the top of my head; I've not actually written a CPU emulator in Java):
stick to your idea, but check the time after a largish number of emulated instructions (execution is going to be a bit "lumpy" anyway, especially on a uniprocessor machine, because the OS can potentially take the CPU away from your thread for several milliseconds at a time);
since you want to execute on the order of 1000 emulated instructions per millisecond, you could also try just hanging on to the CPU between "instructions": have your program periodically work out, by trial and error, how many runs through a loop it needs between instructions to "waste" enough CPU for the timing to average out at 1 million emulated instructions per second (you may want to see if setting your thread to low priority helps system performance in this case).
I would use System.nanoTime() in a busy wait as #pst suggested earlier.
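A rough sketch of that busy-wait approach (executeCycle() is a hypothetical stand-in for one emulated cycle; batching the clock checks also limits the time-polling overhead mentioned in the question):

    public class CycleThrottle {
        static final long NANOS_PER_CYCLE = 1_000;   // 1 MHz target: 1000 ns per cycle
        static final int  BATCH = 1_000;             // check the clock once per batch

        public static void run() {
            long nextDeadline = System.nanoTime();
            while (true) {
                for (int i = 0; i < BATCH; i++) {
                    executeCycle();                  // one emulated 6502 cycle
                }
                nextDeadline += BATCH * NANOS_PER_CYCLE;   // 1 ms budget per batch
                while (System.nanoTime() < nextDeadline) {
                    // busy wait; Thread.onSpinWait() can hint the CPU on Java 9+
                }
            }
        }

        static void executeCycle() {
            // hypothetical: fetch/decode/execute one cycle here
        }
    }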
You can speed up the emulation by generating byte code. Most instructions should translate quite well, and you can add a busy-wait call so that each instruction takes the amount of time the original instruction would have taken. You also have the option of increasing the delay so you can watch each instruction being executed.
To make it really cool, you could generate 6502 assembly code as text with matching line numbers in the byte code. This would allow you to use the debugger to step through the code, set breakpoints, and see what the application is doing. ;)
A simple way to emulate the memory is to use a direct ByteBuffer, or native memory accessed via the Unsafe class. This gives you a block of memory you can access as any data type in any order.
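For instance, a minimal sketch of the direct-ByteBuffer idea (the address and opcode value are just for illustration):

    import java.nio.ByteBuffer;

    public class EmulatedMemory {
        // 64 KiB of 6502 address space backed by a direct buffer
        private final ByteBuffer ram = ByteBuffer.allocateDirect(0x10000);

        int readByte(int addr)            { return ram.get(addr & 0xFFFF) & 0xFF; }
        void writeByte(int addr, int val) { ram.put(addr & 0xFFFF, (byte) val); }

        public static void main(String[] args) {
            EmulatedMemory mem = new EmulatedMemory();
            mem.writeByte(0x0200, 0xA9);                 // LDA #imm opcode
            System.out.printf("$0200 = %02X%n", mem.readByte(0x0200));
        }
    }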
You might be interested in examining the Java Apple Computer Emulator (JACE), which incorporates 6502 emulation. It uses Thread.sleep() in its TimedDevice class.
Have you looked into creating a Timer object that fires at the cycle length you need? You could have the timer itself initiate the next loop.
Here is the documentation for the Java 6 version:
http://download.oracle.com/javase/6/docs/api/java/util/Timer.html
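A hedged sketch of that approach (executeCycle() is a hypothetical stand-in; given the roughly 1 ms granularity discussed above, each tick has to cover a batch of cycles):

    import java.util.Timer;
    import java.util.TimerTask;

    public class TimerDrivenEmulator {
        public static void main(String[] args) {
            Timer timer = new Timer("cpu-clock");
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() {
                    // 1000 cycles per 1 ms tick works out to roughly 1 MHz
                    for (int i = 0; i < 1_000; i++) {
                        executeCycle();
                    }
                }
            }, 0, 1);   // fire every millisecond
        }

        static void executeCycle() {
            // hypothetical: one emulated 6502 cycle
        }
    }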