As an assignment, I implemented a custom data structure and a few test cases in order to make sure it works properly.
The code itself is not really needed for the question but you can assume it is some sort of a SortedList.
My problem is that I was also asked to test the big-O complexity, e.g. to make sure put() is O(n), etc.
I am having a lot of trouble understanding how I can write such a test.
One way that came to mind was to count the number of iterations inside the put() method with a simple counter and then check that it equals the size of the list, but that would require changing the code of the list itself to count the exact number. I would much rather do it properly, from outside the class, holding only an instance of it.
Any ideas? I would really appreciate the help!
With unit testing you test the interface of a class, but the number of iterations is not part of the interface here. You could use a timer to check the runtime behaviour with different sizes. If it's O(n) there should be a linear dependency between time and n.
Get the current time at the start of the test, call the method that you're testing a large number of times, get the current time after, and check the difference to get the execution time. Repeat with a different size input, and check that the execution time follows the expected relationship.
For an O(n) algorithm, a suitable check would be that doubling the input size approximately doubles the execution time.
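A rough sketch of that check, assuming your class is called SortedList with a put(int) method as in the question (adjust the names and element type to your actual API):

import java.util.Random;

public class PutComplexityTest {

    // Average time of a single put() call on a list that already holds roughly n elements.
    static long averagePutNanos(int n) {
        SortedList list = new SortedList();      // the class under test, from the question
        Random random = new Random(42);
        for (int i = 0; i < n; i++) {
            list.put(random.nextInt());          // build up the list; not timed
        }
        int samples = 1_000;                     // small relative to n, so the size stays ~n
        long start = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            list.put(random.nextInt());          // the calls we actually measure
        }
        return (System.nanoTime() - start) / samples;
    }

    public static void main(String[] args) {
        averagePutNanos(50_000);                 // warm-up run so the JIT doesn't skew results

        long atN  = averagePutNanos(50_000);
        long at2N = averagePutNanos(100_000);

        // If put() is O(n), doubling the list size should roughly double the per-call time.
        System.out.printf("put at n:  %d ns%nput at 2n: %d ns%nratio: %.2f (expect ~2 for O(n))%n",
                atN, at2N, (double) at2N / atN);
    }
}

Expect some noise: run it a few times and look only for the rough trend, not exact numbers.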
How does one go about determining the different types of sorting algorithms without being able to look at code?
The scenario is that our lecturer showed a .jar file with 4 sorting methods in it (sortA, sortB...) that we can call to sort a vector.
So, as far as I know, we can only measure and compare the times for the sorts. What's the best method to determine the sorts?
The main issue is that the times for the sorts don't take very long to begin with (differing by ~1000 ms), so comparing them on that basis isn't really an option, and so far all the data sets I've used (ascending, descending, nearly sorted) haven't really been giving much variation in the sort time.
I would create a data structure and do the following:
Sort the data on paper using known sorting algorithms and document the expected behavior on paper. Be very specific on what happens to your data after each pass.
Run a test in debug mode and step through the sorting process.
Observe how the elements are being sorted and compare it with your predictions (notes) obtained in step 1.
I think this should work. Basically, you are using the scientific method to determine which algorithm is being used in a given method. This way, you don't have to resort to "cheating" by decompiling code or rely on imprecise methodologies like using execution time. The three-step process I outlined relies on solid, empirical data to arrive at your conclusions.
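For step 2 you need input data to step through; here is a quick sketch of building the ascending, descending and nearly sorted vectors mentioned in the question (the sortA/sortB methods themselves live in the lecturer's .jar, so they are not shown here):

import java.util.Collections;
import java.util.Random;
import java.util.Vector;

public class SortInputs {

    // Ascending: 0, 1, 2, ..., n-1
    static Vector<Integer> ascending(int n) {
        Vector<Integer> v = new Vector<>();
        for (int i = 0; i < n; i++) v.add(i);
        return v;
    }

    // Descending: n-1, n-2, ..., 0
    static Vector<Integer> descending(int n) {
        Vector<Integer> v = ascending(n);
        Collections.reverse(v);
        return v;
    }

    // Nearly sorted: ascending order with a handful of random swaps.
    static Vector<Integer> nearlySorted(int n, int swaps) {
        Vector<Integer> v = ascending(n);
        Random r = new Random(1);
        for (int i = 0; i < swaps; i++) {
            Collections.swap(v, r.nextInt(n), r.nextInt(n));
        }
        return v;
    }
}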
I recently had to optimise an API call in a high-performance system that did some List<Integer> to int[] conversion. My initial finding was that, for my expected average number of elements, pure Java seemed A LOT faster than the currently used Apache Commons ArrayUtils - a factor of 100-200 even, and I was sort of amazed. The problem was that I only ran my test on one pair of conversions at a time, restarting my little test program for each number of elements I wanted to test. Doing that, pure Java was way faster for any number of elements up to several thousand, where it started to even out. Now I have sat down to write a nicer test program that runs and outputs the results for a number of different-sized lists in one go, and my results are quite different.
It seems ArrayUtils is only slower on the first run, then faster on all subsequent runs regardless of list size, and regardless of the number of elements on the first run.
You can find my test class here: http://pastebin.com/9EYLZQKV
Please help me pick holes in it, as is now I don't get why I get the output I get.
Many thanks!
You need to account for loading times and warmup in your code. The built-in classes may have been called earlier, so they may already be loaded and warmed up.
A loop is not optimised until it has been run at least 10,000 times. If you don't know when that happens, I suggest running the code for half a second to two seconds before starting your timing. I also suggest repeating the test for at least 2 to 10 seconds and averaging the results.
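For example, roughly along these lines (convert() is just a placeholder for whichever conversion you are measuring, not the actual code from the pastebin):

import java.util.ArrayList;
import java.util.List;

public class ConversionBenchmark {

    static long blackhole;   // consume results so the JIT can't optimise the work away

    // Stand-in for the conversion under test (pure Java or the ArrayUtils version).
    static int[] convert(List<Integer> list) {
        int[] result = new int[list.size()];
        for (int i = 0; i < list.size(); i++) result[i] = list.get(i);
        return result;
    }

    // Run conversions repeatedly for at least minMillis and return the average ns per call.
    static long run(List<Integer> input, long minMillis) {
        long start = System.currentTimeMillis();
        long iterations = 0;
        while (System.currentTimeMillis() - start < minMillis) {
            blackhole += convert(input).length;
            iterations++;
        }
        long elapsed = System.currentTimeMillis() - start;
        return elapsed * 1_000_000 / iterations;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) input.add(i);

        run(input, 2_000);   // warm-up: give the JIT time to compile the hot code
        System.out.println(run(input, 5_000) + " ns per conversion");   // the timed run
    }
}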
BTW The fastest way to convert from List<Integer> to int[] is to avoid needing to convert in the first place. If you use TIntArrayList or something like it, you can use the same structure without copying or converting.
I am creating a simulation program, and I want the code to be very optimized. Right now I have an array that gets cycled through a lot and in the various for loops I use
for (int i = 0; i < array.length; i++) {
    // do stuff with the array
}
I was wondering if it would be faster if I saved a variable in the class to specify this array length, and used that instead. Or if it matters at all.
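What I had in mind is roughly this (the array type and field names are just placeholders):

public class Simulation {
    private final double[] array = new double[1_000_000];   // element type is a placeholder
    private final int arrayLength = array.length;           // cached once when the object is created

    void step() {
        for (int i = 0; i < arrayLength; i++) {
            // do stuff with the array
        }
    }
}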
Accessing the length attribute on an array is as fast as it gets.
You'll see people recommending that you save a data structure's size in a variable before entering the loop, because otherwise it means a method call on each and every iteration.
But this is the kind of micro-optimization that seldom matters. Don't worry much about this kind of thing until you have data that tells you it's the reason for a performance issue.
You should be spending more time thinking about the algorithms you're embedding in that loop, possible parallelism, etc. That'll be far more meaningful in your quest for an optimized solution.
It really doesn't matter. When you create an array with new Type[n], the array stores that length and just returns it.
If you use an initializer like new Type[]{object1, object2, object3}, the size is calculated at creation time; in either case the value is calculated only once and stored.
length is a field, not a method. When you access it, it simply retrieves a stored value; it's not as if the length of the array is recalculated each time you use the length attribute. array.length is no slower; in fact, having to set an extra variable will actually take more time than just using length.
array.length is faster, although only by fractions of a millisecond, but in loops and such you can save a ton of time when cycling through, say, 1,000,000 items (it might be 0.1 seconds).
It's unlikely to make a difference.
The broader issue is you want your simulation to run as fast as possible.
This SourceForge project shows how you can use the running program to tell you what to optimize, as opposed to wondering about little things like .length.
Most code, as first written (other than little toy programs), has huge speedup potential, in ways that elude being guessed.
The biggest opportunities for speedup are usually diffuse, like your choice of data structure.
They aren't localized to particular routines, and measuring tools don't point them out.
The method demonstrated in that project, random pausing, does identify them.
The changes you need to make may not be easy to do, but they will yield speedup.
I don't know if the title is appropriate but this is a design question.
I am designing a Java class which has a method that does a heavy calculation, and I am wondering whether there is a clean way to avoid this calculation every time the method is called. I know that the calling code can handle this, but should it always be the responsibility of the calling code?
To elaborate: I was writing a class for thousand-dimensional vectors with a method to calculate the magnitude. So every time this method is called, it will calculate the magnitude over all the dimensions.
The concept you are looking for is called memoization.
Just cache the result in some structure internal to your class. When the method is called, it checks whether the previously calculated result is in the cache and returns it; otherwise it does the calculation and stores the result in the cache. Be careful with the memory, though.
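A minimal sketch of that idea for the magnitude case from the question (the class name and its internals are assumptions, not the actual code):

public class HighDimVector {
    private final double[] components;
    private Double cachedMagnitude;   // null means "not computed yet"

    public HighDimVector(double[] components) {
        this.components = components.clone();
    }

    public double magnitude() {
        if (cachedMagnitude == null) {            // heavy calculation happens only on the first call
            double sumOfSquares = 0;
            for (double c : components) {
                sumOfSquares += c * c;
            }
            cachedMagnitude = Math.sqrt(sumOfSquares);
        }
        return cachedMagnitude;                   // subsequent calls just return the cached value
    }

    public void set(int i, double value) {
        components[i] = value;
        cachedMagnitude = null;                   // invalidate the cache on any change
    }
}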
Use a flag to indicate whether there has been a change to your vector or not. If there has, the method should either do a full calculation or apply the calculation only to the changes, but you will need to be careful with the implementation of the rest of your class and make sure the flag is properly set every time a value is modified.
The second method is to use a cache. This is done by storing previously calculated results and looking them up before doing the calculation. However, this only works well if there isn't much variety in the key values of your objects; otherwise you will end up using a lot of memory. In particular, if your key has type double, it is possible that a key will never be found if the values aren't exactly equal.
If the "thousand dimensional vectors" are passed in the constructor, you can calculate the magnitude there and store it in a private member variable, as sketched below.
A few things to take care of:
If there are methods to add / delete elements or change the vector's contents, then you need to update the magnitude in those methods.
If your class is supposed to be thread-safe, then ensure the appropriate write methods are atomic.
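A rough sketch of that approach, keeping a running sum of squares so a single-element change only costs O(1) to maintain (the class and method names are assumptions, and beware of floating-point drift if components change very many times):

public class EagerVector {
    private final double[] components;
    private double sumOfSquares;          // maintained incrementally

    public EagerVector(double[] components) {
        this.components = components.clone();
        for (double c : this.components) {
            sumOfSquares += c * c;        // computed once, in the constructor
        }
    }

    public double magnitude() {
        return Math.sqrt(sumOfSquares);   // O(1) per call
    }

    // A write method keeps the stored sum up to date; synchronized in case thread safety is needed.
    public synchronized void set(int i, double value) {
        sumOfSquares += value * value - components[i] * components[i];
        components[i] = value;
    }
}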
How often are the magnitudes changed? Is this immutable? How much of the interface for the vector do you control? Specifically, do you have any way to identify rotations or other magnitude-preserving transformations in your 1000 dimensional space? You could just store state for the magnitude, flag when the value changes, and recalculate only when necessary. If your transformations have nice internals, you might be able to skip the calculation based on that knowledge.
All,
I have been going through a lot of sites that post about the performance of various Collection classes for various actions, i.e. adding an element, searching, and deleting. But I also notice that all of them describe different environments in which the tests were conducted, i.e. OS, memory, threads running, etc.
My question is whether there is any site/material that provides the same performance information on a best-test-environment basis, i.e. where the configuration is not an issue or a catalyst for the poor performance of any specific data structure.
[Updated]: Example: HashSet and LinkedHashSet both have a complexity of O(1) for inserting an element. However, Bruce Eckel's test claims that insertion is going to take more time for LinkedHashSet than for HashSet [http://www.artima.com/weblogs/viewpost.jsp?thread=122295]. So should I still go by the Big-O notation?
Here are my recommendations:
First of all, don't optimize :) Not that I am telling you to design crap software, but just to focus on design and code quality more than premature optimization. Assuming you've done that, and now you really need to worry about which collection is best beyond purely conceptual reasons, let's move on to point 2
Really, don't optimize yet (roughly stolen from M. A. Jackson)
Fine. So your problem is that even though you have theoretical time complexity formulas for best cases, worst cases and average cases, you've noticed that people say different things and that practical settings are a very different thing from theory. So run your own benchmarks! You can only read so much, and while you do that your code doesn't write itself. Once you're done with the theory, write your own benchmark - for your real-life application, not some irrelevant mini-application for testing purposes - and see what actually happens to your software and why. Then pick the best algorithm. It's empirical, it could be regarded as a waste of time, but it's the only way that actually works flawlessly (until you reach the next point).
Now that you've done that, you have the fastest app ever. Until the next update of the JVM. Or of some underlying component of the operating system your particular performance bottleneck depends on. Guess what? Maybe your clients have different ones. Here comes the fun part: you need to be sure that your benchmark is valid for others or in most cases (or have fun writing code for different cases). You need to collect data from users. LOTS. And then you need to do that over and over again to see what happens and whether it still holds true. And then re-write your code accordingly over and over again. (The now-terminated Engineering Windows 7 blog is actually a nice example of how user data collection helps make educated decisions to improve user experience.)
Or you can... you know... NOT optimize. Platforms and compilers will change, but a good design should - on average - perform well enough.
Other things you can also do:
Have a look at the JVM's source code. It's very educative and you discover a herd of hidden things (I'm not saying that you have to use them...)
See that other thing on your TODO list that you need to work on? Yes, the one near the top that you always skip because it's too hard or not fun enough. That one right there. Well, get to it and leave the optimization thingy alone: it's the evil child of a Pandora's box and a Moebius band. You'll never get out of it, and you'll deeply regret you tried to have your way with it.
That being said, I don't know why you need the performance boost so maybe you have a very valid reason.
And I am not saying that picking the right collection doesn't matter. Just that once you know which one to pick for a particular problem, and you've looked at the alternatives, you've done your job without having to feel guilty. Collections usually have a semantic meaning, and as long as you respect it you'll be fine.
In my opinion, all you need to know about a data structure is the Big-O of the operations on it, not subjective measures from different architectures. Different collections serve different purposes.
Maps are dictionaries
Sets assert uniqueness
Lists provide grouping and preserve iteration order
Trees provide cheap ordering and quick searches on dynamically changing contents that require constant ordering
Edited to include bwawok's statement on the use case of tree structures
Update
From the javadoc on LinkedHashSet
Hash table and linked list implementation of the Set interface, with predictable iteration order.
...
Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.
Now we have moved from the very general case of choosing an appropriate data-structure interface to the more specific case of which implementation to use. However, we still ultimately arrive at the conclusion that specific implementations are well suited for specific applications based on the unique, subtle invariant offered by each implementation.
What do you need to know about them, and why? The reason that benchmarks show a given JDK and hardware setup is so that they could (in theory) be reproduced. What you should get from benchmarks is an idea of how things will work. For an ABSOLUTE number, you will need to run it vs your own code doing your own thing.
The most important thing to know is the Big O runtime of various collections. Knowing that getting an element out of an unsorted ArrayList is O(n), but getting it out of a HashMap is O(1) is HUGE.
If you are already using the correct collection for a given job, you are 90% of the way there. The times when you need to worry about how fast you can, say, get items out of a HashMap should be pretty darn rare.
Once you leave single-threaded land and move into multi-threaded land, you will need to start worrying about things like ConcurrentHashMap vs Collections.synchronizedMap. Until you are multi-threaded, you can just not worry about this kind of stuff and focus on which collection fits which use.
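For reference, the two options mentioned look like this (a sketch, not a recommendation of one over the other):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadSafeMaps {
    public static void main(String[] args) {
        // Designed for concurrent access: reads don't block and writes use fine-grained locking.
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();

        // A plain HashMap wrapped so that every call takes one shared lock.
        Map<String, Integer> wrapped = Collections.synchronizedMap(new HashMap<>());

        concurrent.put("a", 1);
        wrapped.put("a", 1);
    }
}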
Update to HashSet vs LinkedHashSet
I haven't ever found a use case where I needed a LinkedHashSet (because if I care about order I tend to have a List, and if I care about O(1) gets I tend to use a HashSet). Realistically, most code will use ArrayList, HashMap, or HashSet. If you need anything else, you are in an "edge" case.
The different collection classes have different big-O performances, but all that tells you is how they scale as they get large. If your set is big enough the one with O(1) will outperform the one with O(N) or O(logN), but there's no way to tell what value of N is the break-even point, except by experiment.
Generally, I just use the simplest possible thing, and then if it becomes a "bottleneck", as indicated by operations on that data structure taking a large percentage of the time, I will switch to something with a better big-O rating. Quite often, either the number of items in the collection never comes near the break-even point, or there's another simple way to resolve the performance problem.
Both HashSet and LinkedHashSet have O(1) performance. Same with HashMap and LinkedHashMap (actually the former are implemented based on the latter). This only tells you how these algorithms scale, not how they actually perform. In this case, LinkedHashSet does all the same work as HashSet but also always has to update a previous and next pointer to maintain the order. This means that the constant (an important value too when talking about actual algorithm performance) for HashSet is lower than for LinkedHashSet.
Thus, since these two have the same Big-O, they scale the same essentially - that is, as n changes, both have the same performance change and with O(1) the performance, on average, does not change.
So now your choice is based on functionality and your requirements (which really should be what you consider first anyway). If you only need fast add and get operations, you should always pick HashSet. If you also need consistent ordering - such as last accessed or insertion order - then you must also use the Linked... version of the class.
I have used the "linked" class in production applications - well, LinkedHashMap. I used it in one case for a symbol-table-like structure, so I wanted quick access to the symbols and related information. But I also wanted to output the information, in at least one context, in the order that the user defined those symbols (insertion order). This makes the output friendlier for the user, since they can find things in the same order that they were defined.
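A small sketch of that pattern (the symbol names are made up):

import java.util.LinkedHashMap;
import java.util.Map;

public class SymbolTable {
    public static void main(String[] args) {
        // LinkedHashMap: O(1) lookups like HashMap, but iteration follows insertion order.
        Map<String, String> symbols = new LinkedHashMap<>();
        symbols.put("x", "first symbol the user defined");
        symbols.put("count", "second symbol");
        symbols.put("total", "third symbol");

        symbols.get("count");   // quick access by name

        // Output comes back in definition order: x, count, total.
        for (Map.Entry<String, String> e : symbols.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}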
If I had to sort millions of rows I'd try to find a different way. Maybe I could improve my SQL, improve my algorithm, or perhaps write the elements to disk and use the operating system's sort command.
I've never had a case where collections were the cause of my performance issues.
I created my own experiment with HashSets and LinkedHashSets. For add() and contains() the running time is O(1), not taking a lot of collisions into consideration. In the add() method for my linked hash set, I put the object in a user-created hash table, which is O(1), and then put the object in a separate linked list to account for the order. So, to remove an element from this linked hash set, you must find the element in the hash table and then search through the linked list that holds the order. The running times are O(1) + O(n) respectively, which is O(n) for remove().
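A sketch of the structure described above - a hash set for membership plus a separate linked list for order - which is what makes this particular remove() linear (note this is the hand-rolled structure from the experiment, not how java.util.LinkedHashSet is actually implemented):

import java.util.HashSet;
import java.util.LinkedList;

public class MyLinkedHashSet<E> {
    private final HashSet<E> table = new HashSet<>();        // O(1) membership checks on average
    private final LinkedList<E> order = new LinkedList<>();  // remembers insertion order

    public boolean add(E e) {
        if (table.add(e)) {        // O(1) on average
            order.add(e);          // O(1) append to the tail
            return true;
        }
        return false;
    }

    public boolean contains(E e) {
        return table.contains(e);  // O(1) on average
    }

    public boolean remove(E e) {
        if (table.remove(e)) {     // O(1) on average
            order.remove(e);       // O(n): has to walk the separate list to find the element
            return true;
        }
        return false;
    }
}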