I have been developing with Java for some time now, and I always strive to do things in the most efficient way. Until now I have mostly been trying to condense the number of lines of code I write. But since starting to work with 2D rendering, it is more about how long it takes to compute a certain piece of code, since it is called many times a second.
My question:
Is there some way to measure how long it takes to compute a certain piece of code in Eclipse, Java, ... ?
First, some nitpicking. You title this question ...
Putting a number to the efficiency of an algorithm
There is no practical quantifiable measure of "efficiency" for an algorithm. Efficiency (as normally conceived) is a measure of "something" relative to an ideal / perfect; e.g. a hypothetical 100% efficient steam engine would convert all of the energy in the coal being burned into useful "work". But for software, there is no ideal to measure against. (Or if there is, we can't be sure that it is the ideal.) Hence "efficiency" is the wrong term.
What you actually mean is a measure of the "performance" of ...
Algorithms are an abstract concept, and their performance cannot be measured.
What you actually want is a measure of the performance of a specific implementation of an algorithm; i.e. some actual code.
So how do you quantify performance?
Well, ultimately there is only one sound way to quantify performance. You measure it, empirically. (How you do that ... and the limitations ... are a matter I will come to.)
But what about theoretical approaches?
A common theoretical approach is to analyse the algorithm to give you a measure of computational complexity. The classic measure is Big-O Complexity. This is a very useful measure, but unfortunately Big-O Complexity does not actually measure performance at all. Rather, it is a way of characterizing the behaviour of an algorithm as the problem size scales up.
To illustrate, consider these algorithms for adding N numbers together:
int sum(int[] input) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
int sum(int[] input) {
    int tmp = p(1000); // calculates the 1000th prime number
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
We can prove that both versions of sum have a complexity of O(N), according to the accepted mathematical definitions. Yet it is obvious that the first one will be faster than the second one ... because the second one does a large (and pointless) calculation as well.
In short: Big-O Complexity is NOT a measure of Performance.
What about theoretical measures of Performance?
Well, as far as I'm aware, there are none that really work. The problem is that real performance (as in time taken to complete) depends on various complicated things in the compilation of code to executables AND the way that real execution platforms (hardware) behave. It is too complicated to do a theoretical analysis that will reliably predict actual performance.
So how do you measure performance?
The naive answer is to benchmark like this:
Take a clock measurement
Run the code
Take a second clock measurement
Subtract the first measurement from the second ... and that is your answer.
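In Java, that naive approach might look something like the following sketch (the summing loop is just a placeholder for the code under test):
public class NaiveBenchmark {
    public static void main(String[] args) {
        long start = System.nanoTime();       // first clock measurement

        int sum = 0;                          // placeholder workload: the code being measured
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }

        long elapsed = System.nanoTime() - start;   // second measurement minus the first
        System.out.println("sum = " + sum + ", elapsed = " + elapsed + " ns");
    }
}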
But it doesn't work. Or more precisely, the answer you get may be wildly different from the performance that the code exhibits when you use it in a real world context.
Why?
There may be other things on the machine that are happening ... or have happened ... that influence the code's execution time. Another program might be running. You may have files pre-loaded into the file system cache. You may get hit by CPU clock scaling ... or a burst of network traffic.
Compilers and compiler flags can often make a lot of difference to how fast a piece of code runs.
The choice of inputs can often make a big difference.
If the compiler is smart, it might deduce that some or all of your benchmarked code does nothing "useful" (in the context) ... and optimize it away entirely.
And for languages like Java and C#, there are other important issues:
Implementations of these languages typically do a lot of work during startup to load and link the code.
Implementations of these languages are typically JIT compiled. This means that the language runtime system does the final translation of the code (e.g. bytecodes) to native code at runtime. The performance characteristics of your code after JIT compilation change drastically, and the time taken to do the compilation may be significant ... and may distort your time measurements.
Implementations of these languages typically rely on a garbage collected heap for memory management. The performance of a heap is often uneven, especially at startup.
These things (and possibly others) contribute to something that we call (in Java) JVM warmup overheads; particularly JIT compilation. If you don't take account of these overheads in your methodology, then your results are liable to be distorted.
So what is the RIGHT way to measure performance of Java code?
It is complicated, but the general principle is to run the benchmark code lots of times in the same JVM instance, measuring each iteration. The first few measurements (during JVM warmup) should be discarded, and the remaining measurements should be averaged.
These days, the recommended way to do Java benchmarking is to use a reputable benchmarking framework. The two prime candidates are Caliper and Oracle's JMH tool.
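For illustration, here is roughly what a JMH benchmark of the earlier sum method could look like. This is a minimal sketch: the class name, array size and iteration counts are arbitrary choices, and you would normally run it through the JMH runner rather than a plain main method.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)        // JMH discards these warmup iterations for you
@Measurement(iterations = 10)
@Fork(1)
@State(Scope.Thread)
public class SumBenchmark {

    int[] input;

    @Setup
    public void setup() {
        input = new int[10_000];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
    }

    @Benchmark
    public int sum() {
        int sum = 0;
        for (int i = 0; i < input.length; i++) {
            sum += input[i];
        }
        return sum;   // returning the result stops the JIT from optimizing the loop away
    }
}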
And what are the limitations of performance measurements that you mentioned?
Well I have alluded to them above.
Performance measurements can be distorted by various environmental factors on the execution platform.
Performance can be dependent on inputs ... and this may not be revealed by simple measurement.
Performance (e.g. of C / C++ code) can be dependent on the compiler and compiler switches.
Performance can be dependent on hardware; e.g. processor speed, number of cores, memory architecture, and so on.
These factors can make it difficult to make general statements about the performance of a specific piece of code, and to make general comparisons between alternative versions of the same code. As a rule, we can only make limited statements like "on system X, with compiler Y and input set Z the performance measures are P, Q, R".
The number of lines has very little correlation to the execution speed of a program.
Your program will look completely different after it's processed by the compiler. In general, large compilers perform many optimizations, such as loop unrolling, getting rid of variables that are not used, getting rid of dead code, and hundreds more.
So instead of trying to "squeeze" the last bit of performance/memory out of your program by using short instead of int, char[] instead of String, or whichever method you think will "optimize" (premature optimization) your program, just write it using objects or types that make sense to you, so it will be easier to maintain. Your compiler, interpreter, or VM should take care of the rest. If it doesn't, only then do you start looking for bottlenecks and playing with hacks.
So what makes programs fast then? Algorithmic efficiency (at least it tends to make the biggest difference if the algorithm/data structure was not designed right). This is what computer scientists study.
Let's say you're given 2 data structures. An array, and a singly linked list.
An array stores things in a block, one after the other.
+-+-+-+-+-+-+-+
|1|3|2|7|4|6|1|
+-+-+-+-+-+-+-+
To retrieve the element at index 3, you simply go to the 4th square and retrieve it. You know where it is because you know it's 3 after the first square.
A singly linked list will store things in a node, which may not be stored contiguously in memory, but each node will have a tag (pointer, reference) on it telling you where the next item in the list is.
+-+ +-+ +-+ +-+ +-+ +-+ +-+
|1| -> |3| -> |2| -> |7| -> |4| -> |6| -> |1|
+-+ +-+ +-+ +-+ +-+ +-+ +-+
To retrieve the element at index 3, you have to start with the first node, follow its link to the node at index 1, then to index 2, and finally you arrive at index 3. All because you don't know where they are, so you follow a path to them.
Now say you have an Array and an SLL, both containing the same data, with the length n, which one would be faster? Depends on how you use it.
Let's say you do a lot of insertions at the end of the list. The algorithms (pseudocode) would be:
Array:
    array[list.length] = element to add
    increment length field
SLL:
    currentNode = first element of SLL
    while currentNode has next node:
        currentNode = currentNode's next element
    currentNode's next element = new Node(element to add)
    increment length field
As you can see, in the array algorithm it doesn't matter what the size of the array is; it always takes a constant number of operations. Let's say indexing a[list.length] takes 1 operation, assigning to it is another operation, and incrementing the length field and writing it back to memory is 2 operations. That is 4 operations every time. But if you look at the SLL algorithm, it takes at least list.length operations just to find the last element in the list. In other words, the time it takes to add an element to the end of an SLL increases linearly as the size of the SLL increases, t(n) = n, whereas for the array it's more like t(n) = 4.
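To make that concrete, here is a minimal Java sketch of the two append operations (the class and field names are made up for illustration; a real ArrayList grows its backing array in much the same way):
class ArrayBackedList {
    int[] data = new int[16];
    int length = 0;

    void add(int element) {                 // O(1) amortized: write the slot, bump the length
        if (length == data.length) {
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[length] = element;
        length++;
    }
}

class SinglyLinkedList {
    static class Node {
        int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    Node head;
    int length = 0;

    void add(int element) {                 // O(n): must walk to the last node first
        Node newNode = new Node(element);
        if (head == null) {
            head = newNode;
        } else {
            Node current = head;
            while (current.next != null) {
                current = current.next;
            }
            current.next = newNode;
        }
        length++;
    }
}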
I suggest reading the free book written by my data structures professor. It even has working code in C++ and Java.
Generally speaking, speed vs. lines of code is not an effective measure of performance, since performance depends heavily on your hardware and your compiler. There is something called Big-O notation, which gives you a picture of how fast an algorithm will run as the number of inputs increases.
For example, if your algorithm is O(n), then the time it will take for the code to run scales linearly with the size of the input. If your algorithm is O(1), then the time it will take your code to run is constant.
I found this particular way of measuring performance useful because you learn that it's not really the number of lines of code that affects speed, it's your code's design that affects speed. Code with a more efficient way of handling the problem can be faster than code with a less efficient method, even if the latter has a tenth as many lines.
Related
I was wondering how I can estimate the total running time of a Java program on a specific machine before the program ends? I need to know how long it will take so I can report progress based on that.
FYI, the main algorithm of my program has O(n^3) time complexity. Suppose n = 100000; how long will it take to run this program on my machine (dual Intel Xeon E2650)?
In theory, 1 GHz of computational power should result in about 1 billion simple operations per second. However, finding the number of simple operations is not always easy. Even if you know the time complexity of a given algorithm, this is not enough - you also need to know the constant factor. In theory it is possible to have a linear algorithm that takes several seconds to compute something for an input of size 10000 (and some algorithms like this exist - like the linear pre-compute time RMQ).
What you do know, however, is that something that is O(n^3) will need to perform on the order of 100000^3 = 10^15 operations. So even if your constant factor is about 1/10^6 (which is highly improbable), this computation will take a lot of time.
I believe @ArturMalinowski's proposal is the right way to approach your problem: benchmark the performance of your algorithm for some sequence of input sizes known beforehand, e.g. {32,64,128,...} or, as he proposes, {1,10,100,...}. This way you will be able to determine the constant factor with relatively good precision.
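As a rough sketch of that idea (the algorithm method below is just a stand-in cubic workload, not your actual program, and the input sizes are arbitrary):
public class ConstantFactorEstimate {

    // Stand-in for the real O(n^3) algorithm being measured.
    static void algorithm(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    acc += i ^ j ^ k;
        if (acc == 42) System.out.println(acc);   // keep the JIT from discarding the loops
    }

    public static void main(String[] args) {
        for (int n : new int[] {50, 100, 200, 400}) {
            long start = System.nanoTime();
            algorithm(n);
            long elapsed = System.nanoTime() - start;
            // For an O(n^3) algorithm, elapsed / n^3 should settle towards the constant factor.
            System.out.printf("n=%d  time=%.3f ms  constant ~ %.3e ns/op%n",
                    n, elapsed / 1e6, elapsed / Math.pow(n, 3));
        }
    }
}
Once the constant factor stabilizes, you can extrapolate it to n = 100000 to estimate the total running time.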
I have read about time complexities only in theory. Is there any way to calculate them in a program? Not by assumptions like 'n' or anything, but by actual values.
For example, calculating the time complexities of merge sort and quick sort:
Merge sort = O(n log n)   // any case
Quick sort = O(n^2)       // worst case (when the pivot is the largest or smallest value)
There is a huge difference between n log n and n^2 mathematically.
So I tried this in my program:
public static void main(String[] args) {
    long t1 = System.nanoTime();
    // code of program..
    long t2 = System.nanoTime();
    long timeTaken = t2 - t1;
}
The answer I get for both algorithms, in fact for any algorithm I tried, is almost always about 20.
Is System.nanoTime() not precise enough, or should I use a slower system? Or is there any other way?
Is there any way to calculate them in a program? Not by assumptions like 'n' or anything but by actual values.
I think you misunderstand what complexity is. It is not a value. It is not even a series of values. It is a formula. If you get rid of the N it is meaningless as a complexity measure (except in the case of O(1) ... obviously).
Setting that issue on one side, it would be theoretically possible to automate the rigorous analysis of complexity. However this is a hard problem: automated theorem proving is difficult ... especially if there is no human being in the loop to "guide" the process. And the Halting Theorem implies that there cannot be an automated theorem prover that can prove the complexity of an arbitrary program. (Certainly there cannot be a complexity prover that works for all programs that may or may not terminate ...)
But there is one way to calculate a performance measure for a program with a given set of inputs. You just run it! And indeed, you do a series of runs, graphing performance against some problem size measure (i.e. an N) ... and make an educated guess at a formula that relates the performance and the N measure. Or you could attempt to fit the measurements to a formula.
However ...
it is only a guess, and
this approach is not always going to work.
For example, if you tried this on classic Quicksort, you would most likely conclude that the complexity is O(NlogN) and miss the important caveat that there is a "worst case" where it is O(N^2). Another example is where the observable performance characteristics change as the problem size gets big.
In short, this approach is liable to give you unreliable answers.
Well, in practice, with some assumptions about the program, you might be able to run your program on a large number of test cases (and measure the time it takes), use interpolation to estimate the growth rate and the complexity of the program, and use statistical hypothesis testing to show the probability that you are correct.
However, this cannot be done in ALL cases. In fact, you cannot even have an algorithm that tells, for each program, whether it is going to halt or run an infinite loop. This is known as the Halting Problem, which is proven to be unsolvable.
Microbenchmarks like this are inherently flawed, and you're never going to get brilliantly accurate readings using them - especially not in the nanoseconds range. The JIT needs time to "warm up" to your code, during which time it will optimise itself around what code is being called.
If you must go down this route, then you need a big test set for your algorithm that'll take seconds to run rather than nanoseconds, and preferably a "warm up" period in there as well - then you might see some differences close to what you're expecting. You're never going to just be able to take those timings though and calculate the time complexity from them directly - you'd need to run many cases with different sizes and then plot a graph as to the time taken for each input size. Even that approach won't gain you brilliantly accurate results, but it would be enough to give an idea.
Your question might be related to Can a program calculate the complexity of an algorithm? and Program/algorithm to find the time complexity of any given program. I think you could write a program that counts while or for loops and checks whether they are nested, but I don't see how you could calculate the complexity of some recursive functions.
The microbenchmark which you wrote is incorrect. When you want to gather some time metrics of your code for further optimization, JFF, etc. use JMH. This will help you a lot.
When we say that an algorithm exhibits O(nlogn) complexity, we're saying that the asymptotic upper bound for that algorithm is O(nlogn). That is, for sufficiently large values of n, the algorithm behaves like the function n log n. We're not saying that for n inputs there will definitely be n log n executions; simply that this is the class of functions that your algorithm's running time belongs to.
By taking time intervals on your system, you're actually exposing yourself to the various variables involved in the computer system. That is, you're dealing with system latency, wire resistance, CPU speed, RAM usage... etc etc. All of these things will have a measurable effect on your outcome. That is why we use asymptotics to compute the time complexity of an algorithm.
One way to check the time complexity is to run both algorithms on different sizes of n and check the ratio between the running times. From this ratio you can infer the time complexity.
For example:
If the time complexity is O(n), then the ratio will be linear, i.e. n1/n2.
If the time complexity is O(n^2), the ratio will be (n1/n2)^2.
If the time complexity is O(log(n)), the ratio will be log(n1)/log(n2).
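A minimal sketch of that ratio test (the quadratic workload here is just a stand-in for the algorithm being measured, and the sizes are arbitrary):
import java.util.function.IntConsumer;

public class RatioTest {

    static long timeIt(IntConsumer run, int n) {
        long start = System.nanoTime();
        run.accept(n);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        IntConsumer quadratic = n -> {          // stand-in workload, roughly O(n^2)
            long acc = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    acc += (long) i * j;
            if (acc == 42) System.out.println(acc);   // keep the work from being optimized away
        };

        int n1 = 2000, n2 = 4000;
        long t1 = timeIt(quadratic, n1);
        long t2 = timeIt(quadratic, n2);
        // For O(n^2), t2 / t1 should be close to (n2 / n1)^2 = 4.
        System.out.printf("t2/t1 = %.2f (expect ~%.1f for O(n^2))%n",
                (double) t2 / t1, Math.pow((double) n2 / n1, 2));
    }
}
Bear in mind the JVM warmup caveats mentioned earlier; in practice you would run each size several times (or use JMH) before trusting the ratio.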
In my current project, I am measuring the complexity of algorithms written in Java. I work with asymptotic complexity (the expected result) and I want to validate the expectation by comparing it with the actual number of operations. Incrementing a counter for every operation seems a bit clumsy to me. Is there any better approach to measuring operational complexity?
Edit: more info
The algorithms might run on different machines
Some parts of divide and conquer algorithms might be precached, hence it is probable that the procedure will be faster than expected
Also, it is important for me to find out the multiplicative constant (or the additive one), which is not taken into consideration in asymptotic complexity
Is there a particular reason not to just measure the CPU time? The time utility or a profiler will get you the numbers. Just run each algorithm with a sufficient range of inputs and capture the CPU time (not wall-clock time) spent.
On an actual computer you want to measure execution time, e.g. with System.currentTimeMillis(). Vary the N parameter and get solid statistics. Do an error estimate. Basic least squares will be fine.
Counting operations on an algorithm is fine, but it has limited use. Different processors do stuff at different speeds. Counting the number of expressions or statements executed by your implemented algorithm is close to useless. In the algorithm you can use it to make comparisons for tweaking; in your implementation this is no longer the case, because compiler/JIT/CPU tricks will dominate.
Asymptotic behavior should be very close to the calculated/expected if you do good measurements.
ByCounter can be used to instrument Java (across multiple classes) and count the number of bytecodes executed by the JVM at runtime.
First of all, I would like to know what the fundamental difference is between loop optimization and loop transformation. Also:
A simple loop in C follows:
for (i = 0; i < N; i++)
{
    a[i] = b[i]*c[i];
}
but we can unroll it to:
for (i = 0; i < N/2; i++)
{
    a[i*2] = b[i*2]*c[i*2];
    a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}
But we can unroll it even further. What is the limit up to which we can unroll it, and how do we find that?
There are many more techniques like loop tiling, loop distribution, etc. How do we determine when to use the appropriate one?
I will assume that the OP has already profiled his/her code and has discovered that this piece of code is actually important, and actually answer the question :-) :
The compiler will try to make the loop unrolling decision based on what it knows about your code and the processor architecture.
In terms of making things faster:
As someone pointed out, unrolling does reduce the number of loop termination condition compares and jumps.
Depending on the architecture, the hardware may also support an efficient way to index near memory locations (e.g., mov eax, [ebx + 4]) without adding additional instructions (this may expand to more micro-ops though - not sure).
Most modern processors use out of order execution, to find instruction level parallelism. This is hard to do, when the next N instructions are after multiple conditional jumps (i.e., the hardware would need to be able to discard variable levels of speculation).
There is more opportunity to reorder memory operations earlier so that the data fetch latency is hidden.
Code vectorization (e.g., converting to SSE/AVX), may also occur which allows parallel execution of the code in some cases. This is also a form of unrolling.
In terms of deciding when to stop unrolling:
Unrolling increases code size. The compiler knows that there are penalties for exceeding instruction code cache size (all modern processors), trace cache (P4), loop buffer cache (Core2/Nehalem/SandyBridge), micro-op cache (SandyBridge), etc. Ideally it uses static cost-benefit heuristics (a function of the specific code and architecture) to determine which level of unrolling will result in the best overall net performance. Depending on the compiler, the heuristics may vary (often I find that it would be nice to tweak this oneself).
Generally, if the loop contains a large amount of code it is less likely to be unrolled because the loop cost is already amortized, there is plenty of ILP available, and the code bloat cost of unrolling is excessive. For smaller pieces of code, the loop is likely to be unrolled, since the cost is likely to be low. The actual number of unrolls will depend on the specifics of the architecture, compiler heuristics and code, and will be what the compiler decides is optimal (it may not be :-) ).
In terms of when YOU should be doing these optimizations:
When you don't think the compiler did the correct thing. The compiler may not be sophisticated (or sufficiently up to date) enough to use the knowledge of the architecture you are working on optimally.
Possibly, the heuristics just failed (they are just heuristics after all). In general, if you know the piece of code is very important, try unrolling it, and if it improves performance, keep it; otherwise throw it out. Also, only do this when you have roughly the whole system in place, since what may be beneficial when your code's working set is 20k may not be beneficial when your code's working set is 31k.
This may seem rather off topic to your question but I cannot but stress the importance of this.
The key is to write correct code and get it working as per the requirements, without being bothered about micro-optimization.
If you later find your program lacking in performance, then you profile (!) your application to find the problem areas, and then try to optimize them.
Remember, as one of the wise guys said, it is only 10% of your code that runs 90% of the total run time of your application; the trick is to identify that code through profiling and then try to optimize it.
Well, considering that your first attempt at optimizing is already wrong in 50% of all cases (try any odd N), I really wouldn't try anything more complex.
Also, instead of multiplying your indices, just add 2 to i and loop up to N again; that avoids the unnecessary shifting (a minor effect as long as we stay with powers of 2, but still). See the sketch below.
To summarize: you created incorrect, slower code than what a compiler could produce; that's the perfect example of why you shouldn't do this stuff, I assume.
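For reference, here is roughly what the fix described above could look like, rendered in Java for consistency with the rest of this page: advance the index by 2 and handle an odd length with a remainder step. (A decent compiler would typically do this kind of unrolling for you.)
public class UnrollFixed {
    static void multiply(int[] a, int[] b, int[] c) {
        int n = a.length;
        int i;
        for (i = 0; i + 1 < n; i += 2) {    // two elements per iteration, no index multiplication
            a[i]     = b[i]     * c[i];
            a[i + 1] = b[i + 1] * c[i + 1];
        }
        if (i < n) {                        // remainder step when n is odd
            a[i] = b[i] * c[i];
        }
    }
}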
I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.
Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.
When M is a Mat1 instance the algorithm used was the straightforward one:
P[i][j] += M.A[i][k] * M.A[k][j]
for k in range(0, M.A.length)
In the case where M is a Mat2 I used:
P[i][j] += M.A[i][k] * M.T[j][k]
which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words in memory, because arrays are stored in row major order), none of the data for which is cached.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?
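For reference, here is a minimal Java sketch of the two inner loops being compared (the field names A and T follow the question; the class and method names are assumptions for illustration):
class MatMulSketch {
    static double[][] multiplyRowByColumn(double[][] A) {
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * A[k][j];   // A[k][j] strides down a column: cache-unfriendly
        return P;
    }

    static double[][] multiplyRowByRow(double[][] A, double[][] T) {   // T is the transpose of A
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * T[j][k];   // both operands walk along rows: cache-friendly
        return P;
    }
}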
This is because of the locality of your data.
In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated from both of the indices that you use.
This means that if you access the element at position x,y it will calculate x*row_length + y, and this will be the offset used to reference the element at the specified position.
What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages the RAM, by splitting it into chunks), so it has to load the correct page into the CPU cache if you try to access an element that is not already present.
As long as you traverse memory contiguously while doing your multiplication, you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be contained in a different memory page, so it has to ask RAM for a different page for almost every single multiplication you do; this is why the difference is so pronounced.
(I have rather simplified the whole explanation; it's just to give you the basic idea behind this problem.)
In any case, I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.
The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.
Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24MB.)
Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
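For example, something like this on the command line (the heap sizes and class name here are hypothetical; -Xms sets the initial heap and -Xmx the maximum):
java -Xms256m -Xmx256m MatrixBenchmark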
It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.
It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.
If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.
If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
You might try comparing performance between JDK6 and OpenJDK7, given this set of results...