test the speed of quicksort - java

I read the book Algorithms, 4th edition (Princeton) and watched the online course videos. I have found two interesting things.
It was said in the video that if we use a cutoff like this in quicksort, we will speed up the program by 10~20%:
if(hi - lo < CUTOFF) Insertion.sort(a);
The idea is that when we recursively divide the array a into subarrays and sort each subarray, we can switch to insertion sort once the size of a subarray is smaller than CUTOFF. However, when I tested it with CUTOFF sizes of 3, 7 and 10, that was not the case: it was about 10 times slower on my test data set, an array of 5000 random numbers. So I guess we'd better not use insertion sort for small subarrays.
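For reference, here is a minimal sketch of the cutoff as described in the course. It assumes a standard partition() helper and the Insertion.sort(a, lo, hi) overload from algs4; the key detail is that insertion sort is applied only to the small subarray a[lo..hi], after which the method returns:
private static final int CUTOFF = 10;   // threshold to experiment with (e.g. 3, 7, 10)

private static void quicksort(Comparable[] a, int lo, int hi) {
    if (hi <= lo + CUTOFF - 1) {
        Insertion.sort(a, lo, hi);      // sort only the small subarray, not the whole array
        return;
    }
    int j = partition(a, lo, hi);       // the usual partition step from the course
    quicksort(a, lo, j - 1);
    quicksort(a, j + 1, hi);
}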
When I tried to measure the running time of my code and compare it to the standard code from this course (i.e. the algs4.jar library), I found my time was longer, even after I changed my code to match the standard code. Finally, I realized that even if we quicksort the same array twice (copied into a1 and a2), the running time of the second sort is always around half the running time of the first sort, i.e. (pseudo code):
Stopwatch sw1 = new Stopwatch();
quicksort(a1);
System.out.println(sw1.elapsedTime());
Stopwatch sw2 = new Stopwatch();
quicksort(a2);
System.out.println(sw2.elapsedTime());
The second one takes about half the time, even though it is the same algorithm sorting the same data. I don't know why this happens. It's a very interesting phenomenon.

In theory it could be faster, but depending on the language, compiler, system and CPU you are using, the result may differ. I'll use your second point as an example. The CPU has a cache that holds frequently used data to increase speed. It is very small, but it is much faster than RAM. The first time you ran the program, the array was only in main memory and was pulled into the cache on cache misses. When you run the same code a second time, much of the data is already in the cache, so there are far fewer misses and no need to go out to RAM as often, which makes the second run much faster. If you want accurate results, you need to minimise these effects: clear the caches, shut down any other programs you are running, etc.
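Here is a small, self-contained sketch (using Arrays.sort as a stand-in for the quicksort in question) that makes the effect visible: it sorts fresh copies of the same data several times in one JVM run, and the later runs are typically faster because the caches are warm and the JIT has compiled the hot code by then.
import java.util.Arrays;
import java.util.Random;

public class WarmRunsDemo {
    public static void main(String[] args) {
        int[] original = new Random(42).ints(5_000_000).toArray();
        for (int run = 1; run <= 5; run++) {
            int[] copy = Arrays.copyOf(original, original.length);  // same data every time
            long t0 = System.nanoTime();
            Arrays.sort(copy);
            System.out.printf("run %d: %.1f ms%n", run, (System.nanoTime() - t0) / 1e6);
        }
    }
}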


HashSet performance Issues in fractal renderer

Use Case:
I'm trying to improve my application to render the Mandelbrot Set. I'm using HashSets to detect periodicity in the orbit of a point in the set.
For instance, the orbit of -1 is 0, -1, 0, -1... If I put each number I reach into a HashSet, I can detect infinite loops by just comparing the size of the HashSet to the iteration count. Doing this makes the render orders of magnitude faster.
Current Implementation:
As my program stands, the method that performs iterations receives a HashSet (of type Integer) constructed with the default constructor. This is what Java Mission Control shows me (typical output, regardless of the complexity or depth of the render):
The runtime of the iteration method is small, almost always less than .1 ms. (At high zooms, sometimes 10ms). As a result, a LOT of these hashsets are created, filled with ~10-100k entries, and then immediately dumped. This creates a lot of overhead, since the HashSet has to be resized quite frequently.
Things I've tried that don't work:
Making one HashSet and clearing it: The O(n) iteration through the backing map absolutely kills performance.
Making the HashSet large enough to contain the iterations, using the initialCapacity argument: I tried every power of 2 from 1024 to 524288, and all of them make the program slower. My conjecture as to why is the following: since we have so many HashSets, Java more quickly runs out of large blocks for the new sets, so we trigger very frequent GC, or some similar issue.
Ideally, I would want the best of both worlds: Make one object that's large enough, and then clear it. However, I can't seem to locate such a data structure. What's the best approach to storing this data?
In my experience, periodic sequences are rather short, so you should use an array and do a sequential search backwards instead. You can also experiment: for instance, if you only compare against the element you had 60 iterations ago, you already capture periodic sequences of length 2, 3, 4, 5, 6, 12, 15 and 30. Calculating one element of the Mandelbrot orbit is usually much faster than finding (and inserting) an element in a HashSet.
It is also easier to tune this method further, e.g. ignore the first n elements or use some epsilon to avoid rounding errors. With your method, as I understand it, you would be checking double values for strict equality.
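A rough sketch of that fixed-lag idea (the lag of 60, the epsilon and all the names here are illustrative, not taken from the original renderer):
// Iterates z -> z^2 + c and bails out early when the orbit escapes or
// when the current value matches the value from LAG iterations ago.
static int iterate(double cre, double cim, int maxIter) {
    final int LAG = 60;          // catches periods that divide 60
    final double EPS = 1e-12;    // tolerance instead of strict double equality
    double[] histRe = new double[maxIter];
    double[] histIm = new double[maxIter];
    double zre = 0.0, zim = 0.0;
    for (int i = 0; i < maxIter; i++) {
        double tmp = zre * zre - zim * zim + cre;
        zim = 2.0 * zre * zim + cim;
        zre = tmp;
        if (zre * zre + zim * zim > 4.0) return i;   // escaped: not in the set
        if (i >= LAG
                && Math.abs(zre - histRe[i - LAG]) < EPS
                && Math.abs(zim - histIm[i - LAG]) < EPS) {
            return maxIter;                          // periodic orbit: treat as inside the set
        }
        histRe[i] = zre;
        histIm[i] = zim;
    }
    return maxIter;
}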
Good luck!

Putting a number to the efficiency of an algorithm

I have been developing with Java for some time now, and I always strive to do things in the most efficient way. So far I have mostly been trying to condense the number of lines of code I write. But since starting to work with 2D rendering, it is more about how long a certain piece of code takes to compute, as it is called many times a second.
My question:
Is there some way to measure how long it takes to compute a certain piece of code in Eclipse, Java, ... ?
First, some nitpicking. You title this question ...
Putting a number to the efficiency of an algorithm
There is no practical quantifiable measure of "efficiency" for an algorithm. Efficiency (as normally conceived) is a measure of "something" relative to an ideal / perfect; e.g. a hypothetical 100% efficient steam engine would convert all of the energy in the coal being burned into useful "work". But for software, there is no ideal to measure against. (Or if there is, we can't be sure that it is the ideal.) Hence "efficiency" is the wrong term.
What you actually mean is a measure of the "performance" of ...
Algorithms are an abstract concept, and their performance cannot be measured.
What you actually want is a measure of the performance of a specific implementation of an algorithm; i.e. some actual code.
So how do you quantify performance?
Well, ultimately there is only one sound way to quantify performance. You measure it, empirically. (How you do that ... and the limitations ... are a matter I will come to.)
But what about theoretical approaches?
A common theoretical approach is to analyse the algorithm to give you a measure of computational complexity. The classic measure is Big-O Complexity. This is a very useful measure, but unfortunately Big-O Complexity does not actually measure performance at all. Rather, it is a way of characterizing the behaviour of an algorithm as the problem size scales up.
To illustrate, consider these algorithms for adding N numbers together:
int sum(int[] input) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
int sum(int[] input) {
    int tmp = p(1000); // calculates the 1000th prime number
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
We can prove that both versions of sum have a complexity of O(N), according to the accepted mathematical definitions. Yet it is obvious that the first one will be faster than the second one ... because the second one does a large (and pointless) calculation as well.
In short: Big-O Complexity is NOT a measure of Performance.
What about theoretical measures of Performance?
Well, as far as I'm aware, there are none that really work. The problem is that real performance (as in time taken to complete) depends on various complicated things in the compilation of code to executables AND on the way that real execution platforms (hardware) behave. It is too complicated to do a theoretical analysis that will reliably predict actual performance.
So how do you measure performance?
The naive answer is to benchmark like this:
Take a clock measurement
Run the code
Take a second clock measurement
Subtract the first measurement from the second ... and that is your answer.
But it doesn't work. Or more precisely, the answer you get may be wildly different from the performance that the code exhibits when you use it in a real world context.
Why?
Other things may be happening ... or may have happened ... on the machine that influence the code's execution time. Another program might be running. You may have files pre-loaded into the file system cache. You may get hit by CPU clock scaling ... or a burst of network traffic.
Compilers and compiler flags can often make a lot of difference to how fast a piece of code runs.
The choice of inputs can often make a big difference.
If the compiler is smart, it might deduce that some or all of your benchmarked code does nothing "useful" (in the context) ... and optimize it away entirely.
And for languages like Java and C#, there are other important issues:
Implementations of these languages typically do a lot of work during startup to load and link the code.
Implementations of these languages are typically JIT compiled. This means that the language runtime system does the final translation of the code (e.g. bytecodes) to native code at runtime. The performance characteristics of your code after JIT compilation change drastically, and the time taken to do the compilation may be significant ... and may distort your time measurements.
Implementations of these languages typically rely on a garbage collected heap for memory management. The performance of a heap is often uneven, especially at startup.
These things (and possibly others) contribute to something that we call (in Java) JVM warmup overheads; particularly JIT compilation. If you don't take account of these overheads in your methodology, then your results are liable to be distorted.
So what is the RIGHT way to measure performance of Java code?
It is complicated, but the general principle is to run the benchmark code lots of times in the same JVM instance, measuring each iteration. The first few measurements (during JVM warmup) should be discarded, and the remaining measurements should be averaged.
These days, the recommended way to do Java benchmarking is to use a reputable benchmarking framework. The two prime candidates are Caliper and Oracle's jmh tool.
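For illustration, here is a minimal JMH sketch (the warm-up and measurement counts and the workload are arbitrary); JMH itself handles the forking, warm-up and averaging described above:
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
public class SumBenchmark {
    int[] input;

    @Setup
    public void setup() {
        input = new Random(42).ints(1_000_000).toArray();
    }

    @Benchmark
    public long sum() {
        long total = 0;
        for (int value : input) {
            total += value;
        }
        return total;   // returning the result stops the JIT optimising the loop away
    }
}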
And what are the limitations of performance measurements that you mentioned?
Well I have alluded to them above.
Performance measurements can be distorted by various environmental factors on the execution platform.
Performance can be dependent on inputs ... and this may not be revealed by simple measurement.
Performance (e.g. of C / C++ code) can be dependent on the compiler and compiler switches.
Performance can be dependent on hardware; e.g. processor speed, number of cores, memory architecture, and so on.
These factors can make it difficult to make general statements about the performance of a specific piece of code, and to make general comparisons between alternative versions of the same code. As a rule, we can only make limited statements like "on system X, with compiler Y and input set Z the performance measures are P, Q, R".
The number of lines has very little correlation to the execution speed of a program.
Your program will look completely different after it's processed by the compiler. In general, large compilers perform many optimizations, such as loop unrolling, getting rid of variables that are not used, getting rid of dead code, and hundreds more.
So instead of trying to "squeeze" the last bit of performance/memory out of your program by using short instead of int, char[] instead of String, or whichever trick you think will "optimize" your program (premature optimization), just write it using objects and types that make sense to you, so it will be easier to maintain. Your compiler, interpreter or VM should take care of the rest. If it doesn't, only then should you start looking for bottlenecks and playing with hacks.
So what makes programs fast then? Algorithmic efficiency (at least it tends to make the biggest difference if the algorithm/data structure was not designed right). This is what computer scientists study.
Let's say you're given 2 data structures. An array, and a singly linked list.
An array stores things in a block, one after the other.
+-+-+-+-+-+-+-+
|1|3|2|7|4|6|1|
+-+-+-+-+-+-+-+
To retrieve the element at index 3, you simply just go to the 4th square and retrieve it. You know where it is because you know it's 3 after the first square.
A singly linked list will store things in a node, which may not be stored contiguously in memory, but each node will have a tag (pointer, reference) on it telling you where the next item in the list is.
+-+ +-+ +-+ +-+ +-+ +-+ +-+
|1| -> |3| -> |2| -> |7| -> |4| -> |6| -> |1|
+-+ +-+ +-+ +-+ +-+ +-+ +-+
To retrieve the element at index 3, you have to start at the first node, follow its link to the node at index 1, then to index 2, and only then do you arrive at index 3. This is because you don't know where the nodes are, so you have to follow a path to them.
Now say you have an array and an SLL, both containing the same data, with length n. Which one would be faster? It depends on how you use it.
Let's say you do a lot of insertions at the end of the list. The algorithms (pseudocode) would be:
Array:
array[list.length] = element to add
increment length field
SLL:
currentNode = first element of SLL
while currentNode has a next node:
    currentNode = currentNode's next node
currentNode's next node = new Node(element to add)
increment length field
As you can see, in the array algorithm it doesn't matter what the size of the array is; it always takes a constant number of operations. Say indexing a[list.length] takes 1 operation, assigning into it is another, and incrementing the length field and writing it back to memory is 2 more: about 4 operations every time. But the SLL algorithm takes at least list.length operations just to find the last element in the list. In other words, the time it takes to add an element to the end of an SLL grows linearly with the size of the SLL, t(n) = n, whereas for the array it's more like t(n) = 4.
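A direct, deliberately minimal Java rendering of the two pseudocode algorithms above (the class names are made up for illustration, and the array version omits the resize-when-full step):
class IntArrayList {
    int[] data = new int[1000];
    int length = 0;

    void add(int element) {            // constant number of steps, whatever the length
        data[length] = element;
        length++;
    }
}

class IntSinglyLinkedList {
    static class Node { int value; Node next; Node(int v) { value = v; } }
    Node first;
    int length = 0;

    void add(int element) {            // must walk the whole list to find the end first
        Node node = new Node(element);
        if (first == null) {
            first = node;
        } else {
            Node current = first;
            while (current.next != null) {
                current = current.next;
            }
            current.next = node;
        }
        length++;
    }
}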
I suggest reading the free book written by my data structures professor. It even has working code in C++ and Java.
Generally speaking, speed vs. lines of code is not an effective measure of performance, since it depends heavily on your hardware and your compiler. There is something called Big O notation, which gives you a picture of how fast an algorithm will run as the number of inputs increases.
For example, if your algorithm is O(n), then its running time scales linearly with the input size. If your algorithm is O(1), then its running time is constant.
I found this way of thinking about performance useful because you learn that it's not really the number of lines of code that affects speed, it's your code's design. Code that handles the problem more efficiently can be faster than less efficient code that has a tenth as many lines.
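A tiny illustration of the point (the array and target are arbitrary): both snippets are short, but they scale very differently as the data grows.
int[] values = {4, 8, 15, 16, 23, 42};
int target = 23;

// O(1): reading one element costs the same no matter how long the array is.
int first = values[0];

// O(n): a linear search may have to look at every element,
// so its cost grows with the length of the array.
int index = -1;
for (int i = 0; i < values.length; i++) {
    if (values[i] == target) {
        index = i;
        break;
    }
}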

Calculating time complexities of algorithms practically

I have read about time complexities only in theory. Is there any way to calculate them in a program? Not by assumptions like 'n' or anything, but from actual values?
For example, calculating the time complexities of merge sort and quicksort:
Merge sort = O(n log n)   // in any case
Quicksort = O(n^2)        // worst case (when the pivot is the largest or smallest value)
There is a huge difference between n log n and n^2 mathematically.
So I tried this in my program:
public static void main(String[] args) {
    long t1 = System.nanoTime();
    // code of program..
    long t2 = System.nanoTime();
    long timeTaken = t2 - t1;
}
The answer I get for both algorithms, in fact for any algorithm I tried, is mostly 20.
Is System.nanoTime() not precise enough, or should I use a slower system? Or is there any other way?
Is there any way to calculate them in a program? Not by assumptions like 'n' or anything but by actual values.
I think you misunderstand what complexity is. It is not a value. It is not even a series of values. It is a formula. If you get rid of the N it is meaningless as a complexity measure (except in the case of O(1) ... obviously).
Setting that issue on one side, it would be theoretically possible to automate the rigorous analysis of complexity. However this is a hard problem: automated theorem proving is difficult ... especially if there is no human being in the loop to "guide" the process. And the Halting Theorem implies that there cannot be an automated theorem prover that can prove the complexity of an arbitrary program. (Certainly there cannot be a complexity prover that works for all programs that may or may not terminate ...)
But there is one way to calculate a performance measure for a program with a given set of inputs. You just run it! And indeed, you do a series of runs, graphing performance against some problem-size measure (i.e. an N) ... and make an educated guess at a formula that relates the performance and the N measure. Or you could attempt to fit the measurements to a formula.
However ...
it is only a guess, and
this approach is not always going to work.
For example, if you tried this on classic quicksort, you would most likely conclude that the complexity is O(N log N) and miss the important caveat that there is a "worst case" where it is O(N^2). Another example is where the observable performance characteristics change as the problem size gets big.
In short, this approach is liable to give you unreliable answers.
Well, in practice, with some assumptions about the program, you might be able to run it on a large number of test cases (and measure the time each takes), use interpolation to estimate the growth rate and the complexity of the program, and use statistical hypothesis testing to show the probability that you are correct.
However, this cannot be done in ALL cases. In fact, you cannot even have an algorithm that tells, for each program, whether it is going to halt or run forever. This is known as the Halting Problem, which has been proven to be unsolvable.
Micro benchmarks like this are inherently flawed, and you're never going to get brilliantly accurate readings using them - especially not in the nanoseconds range. The JIT needs time to "warm up" to your code, during which time it will optimise itself around what code is being called.
If you must go down this route, then you need a big test set for your algorithm that'll take seconds to run rather than nanoseconds, and preferably a "warm up" period in there as well - then you might see some differences close to what you're expecting. You're never going to just be able to take those timings though and calculate the time complexity from them directly - you'd need to run many cases with different sizes and then plot a graph as to the time taken for each input size. Even that approach won't gain you brilliantly accurate results, but it would be enough to give an idea.
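A sketch of that approach in plain Java, with no framework (Arrays.sort stands in for the algorithm under test, and the warm-up count and sizes are arbitrary): warm up first, then time the same code at several input sizes so the growth can be plotted or eyeballed.
import java.util.Arrays;
import java.util.Random;

public class GrowthRateSketch {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        for (int i = 0; i < 50; i++) {                 // warm-up: let the JIT compile the hot paths
            Arrays.sort(rnd.ints(100_000).toArray());
        }
        for (int n = 1 << 16; n <= 1 << 22; n <<= 1) { // measure at doubling input sizes
            int[] data = rnd.ints(n).toArray();
            long t0 = System.nanoTime();
            Arrays.sort(data);
            long elapsed = System.nanoTime() - t0;
            System.out.printf("n=%d  time=%.2f ms%n", n, elapsed / 1e6);
        }
    }
}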
Your question might be related to "Can a program calculate the complexity of an algorithm?" and "Program/algorithm to find the time complexity of any given program". I think you could write a program that counts while or for loops and checks whether they are nested, but I can't see how you would calculate the complexity of some recursive functions that way.
The microbenchmark you wrote is incorrect. When you want to gather timing metrics for your code, whether for optimization, just for fun, or anything else, use JMH. It will help you a lot.
When we say that an algorithm exhibits O(nlogn) complexity, we're saying that the asymptotic upper bound for that algorithm is O(nlogn). That is, for sufficiently large values of n, the algorithm behaves like the function n log n. We're not saying that for n inputs there will be exactly n log n operations, simply that this is the class of functions your algorithm belongs to.
By taking time intervals on your system, you're actually exposing yourself to the various variables involved in the computer system. That is, you're dealing with system latency, wire resistance, CPU speed, RAM usage... etc etc. All of these things will have a measurable effect on your outcome. That is why we use asymptotics to compute the time complexity of an algorithm.
One way to check the time complexity is to run both algorithms on different input sizes n and look at the ratio between the running times. From this ratio you can estimate the time complexity.
For example, for two input sizes n1 and n2:
If the time complexity is O(n), the ratio of running times will be about n1/n2.
If the time complexity is O(n^2), the ratio will be about (n1/n2)^2.
If the time complexity is O(log(n)), the ratio will be about log(n1)/log(n2).
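A small sketch of that ratio test, using a deliberately O(n^2) routine so the expected ratio is easy to predict (the sizes and the routine itself are just for illustration): doubling n should make the time roughly 4 times longer.
import java.util.Random;

public class RatioCheck {
    static long sink = 0;   // keeps results alive so the JIT cannot discard the work

    static long countEqualPairs(int[] a) {          // deliberately O(n^2)
        long count = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                if (a[i] == a[j]) count++;
            }
        }
        return count;
    }

    static long time(int n, Random rnd) {
        int[] data = rnd.ints(n, 0, 100).toArray();
        long t0 = System.nanoTime();
        sink += countEqualPairs(data);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        for (int i = 0; i < 10; i++) time(2_000, rnd);   // warm up the JIT first
        int n1 = 10_000, n2 = 20_000;                    // n2 = 2 * n1
        double ratio = (double) time(n2, rnd) / time(n1, rnd);
        System.out.printf("t(n2)/t(n1) = %.2f (expect ~4 for O(n^2))%n", ratio);
        System.out.println("(checksum " + sink + ")");
    }
}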

How large does a graph need to be to trigger the worst-case complexity of a Fibonacci heap?

I've been trying to trigger the worst-case complexity of a Fibonacci heap by using it with Dijkstra's algorithm but apparently with no luck. I have a second implementation of Dijkstra's using a vanilla binary heap, and it ALWAYS seems to win. I was told to conduct my tests using larger datasets, which I have, as shown (copy-pasted straight from my program):
Running Dijkstra's algorithm with 3354 nodes and 8870 links...
Source node: ALL
Time using binary heap = 2167698339 ns (2167.70 ms)
versus...
Running Dijkstra's algorithm with 3354 nodes and 8870 links...
Source node: ALL
Time using Fibonacci heap = 11863138070 ns (11863.14 ms)
2 seconds, against ~12 seconds. Quite a difference alright.
Now, I have another graph with a whopping 264,000 nodes and 733,000 edges. I haven't had the chance to test it yet, but would that be enough for the theoretical advantage of Fibonacci heaps to shine?
I hope I don't need something with over a million nodes. I mean it's not the biggest issue in the world but it would be nice to see the difference in action for once.
First of all, your question's title is not quite right: the size of the input does not affect the worst-case complexity. What you need is the size of the graph at which the better asymptotic complexity of the Fibonacci heap makes up for its larger constant factor. Remember good old O(n)? O(n) means that for large enough inputs your algorithm performs approximately k*n operations, where k is a fixed number. This k is the constant I am referring to. Now, if you have an algorithm with complexity O(n) and another with complexity O(n*log(n)), that still does not mean the first one is always faster than the second. Suppose the first performs k1*n operations and the second performs k2*n*log(n) operations. If k1 = k2 * 1000, then the first algorithm will only be faster than the second when log(n) > 1000, i.e. n > 2^1000, which is quite large. What matters is whether there is a value of n at which the first algorithm overtakes the second.
Depending on the implementation of a given data structure, the constant may vary, and thus you may need a dataset several times larger to make up for it. I have seen results where a Fibonacci heap became faster than a plain old binary heap at about 500,000 edges (and about 5,000 nodes), but those hold only for that particular implementation. In your implementation the difference may show up earlier or later, depending on how efficiently you implemented both structures. What is certain is that if you implemented the data structures with the correct complexities, the difference will show for some n (though it may happen that no existing computer can handle graphs that big).
I think your Fibonacci heap is not necessarily faster just because the graph is large. What you need to increase is the number of decreaseKey operations. You get more of those when the mean degree of the nodes increases (i.e. a denser graph), or, one could say, when the graph becomes more complete (highly interconnected).

Why is the performance of these matrix multiplications so different?

I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.
Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.
When M is a Mat1 instance the algorithm used was the straightforward one:
for (int k = 0; k < M.A.length; k++)
    P[i][j] += M.A[i][k] * M.A[k][j];
In the case where M is a Mat2 I used:
for (int k = 0; k < M.A.length; k++)
    P[i][j] += M.A[i][k] * M.T[j][k];
which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words away in memory, because arrays are stored in row-major order), none of which is cached.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?
This is because of the locality of your data.
In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.
This means that if you access the element at position x,y it calculates x*row_length + y, and this is the offset used to reference the element at the specified position.
What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages RAM, by splitting it into chunks), so the correct page has to be loaded into the CPU cache whenever you access an element that is not already present.
As long as you do your multiplication contiguously you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may live in a different memory page, so it has to ask RAM for a different page for almost every single multiplication; that is why the difference is so pronounced.
(I have rather simplified the whole explanation; it's just to give you the basic idea of the problem.)
In any case, I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.
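For what it's worth, here is a standalone sketch of the two access patterns being compared (illustrative method names, not the actual Mat1/Mat2 code from the question):
// Version 1: the inner loop reads A[k][j], which jumps a whole row ahead in
// memory on every step, so it keeps missing the cache (and stressing the TLB).
static double[][] multiplyRowByColumn(double[][] A) {
    int n = A.length;
    double[][] P = new double[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                P[i][j] += A[i][k] * A[k][j];
    return P;
}

// Version 2: precomputing the transpose lets both operands be read
// sequentially along rows, which the cache (and prefetcher) reward.
static double[][] multiplyWithTranspose(double[][] A) {
    int n = A.length;
    double[][] T = new double[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            T[j][i] = A[i][j];
    double[][] P = new double[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                P[i][j] += A[i][k] * T[j][k];
    return P;
}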
The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.
Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24 MB.)
Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
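For example, something along these lines (the class name and sizes are only placeholders; -verbose:gc just makes it easy to see whether the collector is running excessively):
java -verbose:gc -Xms256m -Xmx256m MatrixBenchmark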
It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.
It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.
If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.
If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
You might try comparing performance between JDK6 and OpenJDK7, given this set of results...
