I'm using Java on a large amount of data.
[I'll try to simplify the problem as much as possible.]
I have a small class (Element) containing an int KEY and a double WEIGHT (with getters & setters).
I read a lot of these objects from a file and I have to get the best (highest-weight) M objects.
Currently I'm using a PriorityQueue with a Comparator written to compare two Elements; it works, but it's too slow.
Do you know (I know you do) any faster way to do that?
Thank you
A heap-based priority queue is a good data structure for this problem. Just as a sanity check, verify that you are using the queue correctly.
If you want the highest weight items, use a min-queue—where the top of the heap is the smallest item. Adding every item to a max-queue and examining the top M items when done is not efficient.
For each item, if there are less than M items in the queue, add the current item. Otherwise, peek at the top of the heap. If it's less than the current item, discard it, and add the current item instead. Otherwise, discard the current item. When all items have been processed, the queue will contain the M highest-weight items.
Some heaps have shortcut APIs for replacing the top of the heap, but Java's Queue does not. Even so, the big-O complexity is the same.
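As a rough sketch of the above in Java (assuming the Element class from the question, with getWeight() returning the double weight, and with an iterable source of elements and a target count m defined elsewhere):

// imports: java.util.Comparator, java.util.PriorityQueue
PriorityQueue<Element> best =
        new PriorityQueue<>(m, Comparator.comparingDouble(Element::getWeight));

for (Element e : elements) {                       // stream the elements from the file
    if (best.size() < m) {
        best.offer(e);                             // fill the heap up to M elements
    } else if (e.getWeight() > best.peek().getWeight()) {
        best.poll();                               // evict the lightest of the current best M
        best.offer(e);                             // keep the heavier element instead
    }
}
// best now holds the M heaviest elements (in heap order, not sorted).

If you need the result sorted by weight, dump the queue into a list and sort those M elements at the end.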
In addition to the suggested "peek at the top of the heap" algorithm, which gives you O(n log m) complexity for getting the top-m of n items, here are two more solutions.
Solution 1: Use a Fibonacci heap.
The JDK's PriorityQueue implementation is a balanced binary heap. You should be able to squeeze more performance out of a Fibonacci heap implementation. It will have amortized constant time insert, while inserting into a binary heap has complexity Ω(log n) in the size of the heap. If you're doing that for every element, then you're at Ω(n log n). Finding the top-m of n items using a Fib heap has complexity O(n + m log n). Combine this with the suggestion to only ever insert m elements into the heap, and you have O(n + m log m), which is as close to linear time as you're going to get.
Solution 2: Traverse the list M times.
You should be able to get the kth-largest element in a set in O(n) time. Simply read everything into a list and do the following:
kthLargest(k, xs):
    Pick a random pivot element p from the list
    (the first one will do if your list is already in random order).
    Go over the list once and split it into two lists:
        Left:  elements smaller than p.
        Right: elements larger than or equal to p.
    If Right is shorter than k, return kthLargest(k - Right.size, Left).
    If Right is longer than k, return kthLargest(k, Right).
    Otherwise, return p.
That gives you O(n) time. Running that m times, you should be able to get the top-m objects in your set in time O(nm), which will be strictly less than n log n for sufficiently large n and sufficiently small m. For example, getting the top-10 over a million items will take half as long as using a binary heap priority queue, all other things being equal.
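Here is a rough Java rendering of that idea (names are illustrative; elements equal to the pivot are counted separately so that duplicate-heavy input can't cause endless recursion):

// imports: java.util.ArrayList, java.util.List, java.util.concurrent.ThreadLocalRandom
static double kthLargest(int k, List<Double> xs) {
    double pivot = xs.get(ThreadLocalRandom.current().nextInt(xs.size()));
    List<Double> greater = new ArrayList<>();   // strictly greater than the pivot
    List<Double> smaller = new ArrayList<>();   // strictly smaller than the pivot
    int equal = 0;                              // occurrences of the pivot value
    for (double x : xs) {
        if (x > pivot) greater.add(x);
        else if (x < pivot) smaller.add(x);
        else equal++;
    }
    if (k <= greater.size()) return kthLargest(k, greater);
    if (k <= greater.size() + equal) return pivot;
    return kthLargest(k - greater.size() - equal, smaller);
}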
If M is suitably small, then sorting all the elements may waste a lot of computing time. You could put only the first M objects in a priority queue (e.g. a heap, with the minimal element on top), and then iterate over the rest of the elements: every time an element is larger than the top of the heap, remove the top and push the new element onto the heap.
Alternatively, you could iterate over the whole array once to find a statistical threshold value for which you can be very sure there are more than M objects with a larger value (this will require some assumptions about the values, e.g. that they are normally distributed). You can then limit the sorting to the elements with a larger value.
#Tnay: You have a point about not performing a comparison. Unfortunately, your example code still performs one. This solves the problem:
public int compare(ListElement i, ListElement j) {
    return i.getValue() - j.getValue();
}
In addition, neither your compare method nor BigG's is strictly correct, since they never return 0. This can be a problem with some sorting algorithms, and it's a very tricky bug, since it will only appear if you switch to another implementation.
From the Java docs:
The implementor must ensure that sgn(compare(x, y)) == -sgn(compare(y, x)) for all x and y.
This may or may not perform a significant constant factor speed-up.
If you combine this with erickson's solution, it will probably be hard to do it faster (depending on the size of M). If M is very large, the most efficient solution is probably to sort all the elements in an array using Java's built-in sort and cut off one end of the array at the end.
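For reference, a comparator that satisfies the contract and also avoids the integer-overflow pitfall of the subtraction trick might look like this (Integer.compare needs Java 7+; getValue() is assumed to return an int, as in the snippet above):

public int compare(ListElement i, ListElement j) {
    // Returns a negative value, zero, or a positive value; never overflows,
    // unlike i.getValue() - j.getValue() when the two values are far apart.
    return Integer.compare(i.getValue(), j.getValue());
}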
Related
I know that given a balanced binary search tree, getting the min and max elements takes O(log N). I want to ask about their implementation in C++ and Java respectively.
C++
Take std::set for example: getting the min/max can be done by calling *set.begin() / *set.rbegin(), and it's constant time.
Java
Take TreeSet for example: getting the min/max can be done by calling TreeSet.first() and TreeSet.last(), but it's logarithmic time.
I wonder if this is because std::set does some extra trick to always keep the begin() and rbegin() pointers up to date. If so, can anyone show me that code? By the way, why didn't Java add this trick too? It seems pretty convenient to me...
EDIT:
My question might not be very clear: I want to see how std::set implements insert/erase. I'm curious how the begin() and rbegin() iterators are updated during those operations.
EDIT2:
I'm very confused now. Say I have the following code:
set<int> s;
s.insert(5);
s.insert(3);
s.insert(7);
... // say I inserted a total of n elements.
s.insert(0);
s.insert(9999);
cout<<*s.begin()<<endl; //0
cout<<*s.rbegin()<<endl; //9999
Aren't both *s.begin() and *s.rbegin() O(1) operations? Are you saying they aren't? Does s.rbegin() actually iterate to the last element?
My answer isn't language specific.
To fetch the MIN or MAX in a BST, you always need to walk to the leftmost or the rightmost node respectively. This operation is O(height), which is roughly O(log n) for a balanced tree.
Now, to optimize this retrieval, a class that implements a tree can always store two extra pointers to the leftmost and the rightmost nodes; retrieving them then becomes O(1). Of course, these pointers bring in the overhead of updating them on each insert operation on the tree.
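For illustration only (this is not a claim about how std::set is actually implemented), here is a sketch of a plain, unbalanced BST that keeps those two extra references so that min() and max() are O(1):

class MinMaxBst {
    private static class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root, leftmost, rightmost;

    void insert(int key) {
        Node node = new Node(key);
        if (root == null) {
            root = leftmost = rightmost = node;
            return;
        }
        Node cur = root;
        while (true) {                                 // ordinary (unbalanced) BST insert
            if (key < cur.key) {
                if (cur.left == null) { cur.left = node; break; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = node; break; }
                cur = cur.right;
            }
        }
        if (key < leftmost.key) leftmost = node;       // O(1) bookkeeping on every insert
        if (key >= rightmost.key) rightmost = node;
    }

    int min() { return leftmost.key; }                 // O(1); assumes the tree is non-empty
    int max() { return rightmost.key; }                // O(1); assumes the tree is non-empty
}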
begin() and rbegin() only return iterators, in constant time. Iterating them isn't constant-time.
There is no 'begin()/rbegin() pointer'. The min and max are reached by iterating to the leftmost or rightmost elements, just as in Java, only Java doesn't have the explicit iterator.
For the add method of ArrayList, the Java API states:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time.
I wonder if it is the same time complexity, linear, when using the add method of a LinkedList.
This depends on where you're adding. E.g., if you add to the front of an ArrayList, the implementation has to shift all existing items every time, so adding n elements that way runs in quadratic time.
It's similar for the linked list: the JDK implementation keeps a pointer to both the head and the tail. If you keep appending to the tail, or prepending in front of the head, the operation runs in linear time for n elements. If you insert at a different position, the implementation has to walk the linked list to find the right place, which can give you a worse runtime. Again, this depends on the insertion position; you get the worst time complexity when inserting in the middle of the list, as the maximum number of elements has to be traversed to find the insertion point.
The actual complexity also depends on whether your insertion position is constant (e.g. always at the 10th position) or a function of the number of items in the list (or determined by some arbitrary search on it). The first gives you O(n) with a slightly worse constant factor, the latter O(n^2).
In most cases, ArrayList outperforms LinkedList on the add() method, as it simply stores the element in the next free slot of its backing array and increments a counter.
If the working array is not large enough, though, ArrayList grows it by allocating a new array and copying the contents over. That is slower than adding a new node to a LinkedList; but if you keep adding elements, it only happens O(log N) times.
When we talk about "amortized" complexity, we mean the average cost per operation over a long sequence of operations, rather than the cost of any single call.
So, answering your question: it's not the same running time in practice. ArrayList.add is faster in most cases (though both are O(1)), and occasionally much slower (O(N)) when the backing array has to grow. Which is better for your case is best checked with a profiler.
If you mean the add(E) method (not the add(int, E) method), the answer is yes, the time complexity of adding a single element to a LinkedList is constant (adding n elements requires O(n) time)
As Martin Probst indicates, with different positions you get different complexities, but the add(E) operation will always append the element to the tail, resulting in a constant (amortized) time operation
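To make the positional distinction concrete, here is a small illustration (the sizes and values are arbitrary):

// imports: java.util.ArrayList, java.util.LinkedList, java.util.List
List<Integer> arrayList = new ArrayList<>();
List<Integer> linkedList = new LinkedList<>();

// Appending with add(E): amortized O(1) per call for ArrayList,
// plain O(1) for LinkedList (it keeps a tail reference).
for (int i = 0; i < 100_000; i++) {
    arrayList.add(i);
    linkedList.add(i);
}

// Inserting at the front with add(0, e): ArrayList shifts every element
// (O(n) per call), LinkedList just relinks the head (O(1) per call).
arrayList.add(0, -1);
linkedList.add(0, -1);

// Inserting in the middle: ArrayList shifts half the elements,
// LinkedList walks half the nodes -- both O(n) per call.
arrayList.add(arrayList.size() / 2, -2);
linkedList.add(linkedList.size() / 2, -2);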
I have a question about the performance of my class project.
I have about 5000 game objects created by reading a text file. I have a TreeMap (called supertree) that holds TreeMaps as its nodes (mini treemaps, I guess). These nodes/mini treemaps are action, strategy, adventure, sports, gametitle, etc. (basically game genres), and these mini trees hold the game objects. So the supertree itself will hold probably 8 nodes/treemaps.
When I insert a game object, it will determine which mini tree it belongs to and put it in there. For example, if I insert the game Super Mario World, it will check which genre it is, see that it's adventure, and so Super Mario World would be inserted into the adventure tree.
So my question is: what would the performance be if I ask it to list all the action games, since a TreeMap get is O(log n)?
First, at the supertree it will look for the action node/TreeMap, which will take O(log n).
Then, once inside the action TreeMap, it will do a get for all elements, which would be O(n log n), correct?
So the total performance would be log n * (n log n), correct? Which is worse than O(n).
[edit]
Hopefully this clarified my post a bit.
The get on the supermap is O(log n_categories), while going through the submap (using an iterator) is O(n_games). If n_categories has an upper bound of, say, 10 (because the number of categories doesn't change when adding new games), you can treat the supermap lookup as O(1).
Since the submaps can have at most n_games entries (when all games belong to the same category), listing all games of type action thus gives you O(n_games). Don't forget that in order to iterate over all entries you don't have to call get() each time. That would be like reading through a book and, instead of turning the page to get from page 100 to 101, starting to count pages from the beginning until you reach 101...
EDIT: Since the paragraph above (stating that with a fixed number of categories the category lookup can be treated as O(1)) seems to be hard to accept, let me add that even if you insist the category lookup is O(log n_categories), the total is still O(n_games), since the category lookup has to be done only once. Then you iterate over the result, which is O(n_games). This leads to O(n_games + log n_categories) = O(n_games).
Okay, first thing, your big-O isn't going to change depending on language; that's why people use big-O (asymptotic) notation.
Now, think about your whole algorithm. You take your outer tree and get each element, which is indeed O(n0 lg n0). For each of those nodes, you have O(n1 lg n1). The lg n's differ by only a constant, so they can be combined, and you get O(n0 * n1 lg n), or O(n^2 lg n).
A couple of comments regarding the OP's analysis:
I'm assuming you have already constructed your treemaps/sets and are just extracting elements from the finished (preprocessed) in-memory representation.
Let's say n is the number of genres. Let's say m is the max number of games per genre.
The complexity of getting the right 'genre map' is O(lg n) (a single get for the supertree). The complexity of iterating over the games in that genre depends on how you do it:
for (GameRef g : submap.keySet()) {
    // do something with submap.get(g)
}
This code yields O(m) 'get' operations of O(lg m) complexity each, so that's O(m lg(m)).
If you do this:
for (Map.Entry e : submap.entrySet()) {
    // do something with e.getValue()
}
then the complexity is O(m) loop iterations with constant (O(1)) time access to the value.
Using the second map iteration method, your total complexity is O(lg(n) + m)
Er, you were right until the last paragraph.
Your total complexity is O(n log n), log n to look up the type and n to list all the values in that type.
If you're talking about listing everything, it's definitely not O(n^2 log n), since getting all the values in your tree is linear. It would be O(n^2).
Doing the same thing with a flat list would be O(n log n), so you're definitely losing performance (not to mention memory) by using a tree for this.
I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion, O(1) max(), and fast removal are mutually exclusive; you can't have all three.
An O(1)-insertion collection won't have O(1) max, as the collection is unsorted. An O(1)-max collection has to be kept sorted, so the insert is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on insert; max and insert are then O(1). Removal becomes O(n), though, as you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
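For example, a max-oriented PriorityQueue gives O(log n) insertion and removal of the maximum with O(1) peek; a TreeSet gives you the max via last() in logarithmic time as well. A sketch (the Comparator-only constructor needs Java 8; on older versions pass an initial capacity too):

// imports: java.util.Collections, java.util.PriorityQueue
PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Collections.reverseOrder());
maxHeap.offer(3);                 // O(log n)
maxHeap.offer(42);
maxHeap.offer(7);
Integer max = maxHeap.peek();     // O(1) -> 42
maxHeap.poll();                   // O(log n), removes 42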
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting; it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. Insert all elements of A into D.
2. Create an empty linked list L.
3. While D is not empty:
   3.1. x <- D.max(); D.delete(x);  -- all O(1) by assumption
   3.2. L.insert_first(x)           -- O(1)
4. Return L.
We have just built a sorting algorithm that runs in O(n), which is impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow; log(n) time is practically constant for most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to the work a hash-based structure does per operation.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming effort, you can use several data structures at the same time, combining the pros of each one.
For example, I needed a sorted data structure but wanted to have O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but close when it's not too full and the hash function is good) and I got really good results.
For your specific case, I would go for a dynamic combination between a HashMap and a custom helper data structure. I have in mind something quite complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the stuff in the HashMap, and use a special field (currentMax) that only contains the max element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
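A minimal sketch of that combination, using a HashSet of ints to keep it short (the class and method names are made up for illustration):

// imports: java.util.HashSet, java.util.Set
class MaxTrackingSet {
    private final Set<Integer> items = new HashSet<>();
    private Integer currentMax;                 // null while the set is empty

    void insert(int x) {                        // O(1) expected
        items.add(x);
        if (currentMax == null || x > currentMax) currentMax = x;
    }

    Integer max() {                             // O(1)
        return currentMax;
    }

    void remove(int x) {                        // O(1) expected; O(n) when the max is removed
        if (!items.remove(x)) return;
        if (currentMax != null && currentMax == x) {
            currentMax = null;                  // rescan to find the new max
            for (int y : items) {
                if (currentMax == null || y > currentMax) currentMax = y;
            }
        }
    }
}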
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination of data structures. Do your tests, and learn how frequently the max changes. Data structures aren't easy, and you can make a difference if you really work at combining them instead of hunting for a magic one that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like approach to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
I'm seeking to display a fixed number of items on a web page according to their respective weight (represented by an Integer). The List where these items are found can be of virtually any size.
The first solution that comes to mind is to do a Collections.sort() and to get the items one by one by going through the List. Is there a more elegant solution though that could be used to prepare, say, the top eight items?
Just go for Collections.sort(..). It is efficient enough.
This algorithm offers guaranteed n log(n) performance.
You can try to implement something more efficient for your concrete case if you know some distinctive properties of your list, but that would not be justified. Furthermore, if your list comes from a database, for example, you can LIMIT it & order it there instead of in code.
Your options:
Do a linear search, maintaining the top N weights found along the way. This should be quicker than sorting a lengthy list if, for some reason, you can't reuse the sorting results between displays of the page (e.g. the list is changing quickly).
UPDATE: I stand corrected on the linear search necessarily being better than sorting. See Wikipedia article "Selection_algorithm - Selecting k smallest or largest elements" for better selection algorithms.
Manually maintain a List (the original one or a parallel one) sorted in weight order. You can use methods like Collections.binarySearch() to determine where to insert each new item.
Maintain a List (the original one or a parallel one) sorted in weight order by calling Collections.sort() after each modification, batch modifications, or just before display (possibly maintaining a modification flag to avoid sorting an already sorted list).
Use a data structure that maintains sorted weight-order for you: priority queue, tree set, etc. You could also create your own data structure.
Manually maintain a second (possibly weight-ordered) data structure of the top N items. This data structure is updated anytime the original data structure is modified. You could create your own data structure to wrap the original list and this "top N cache" together.
You could use a max-heap.
If your data originates from a database, put an index on that column and use ORDER BY and TOP or LIMIT to fetch only the records you need to display.
Or a priority queue.
using dollar:
List<Integer> topTen = $(list).sort().slice(10).toList();
Without dollar, you would sort() the list using Collections.sort(), then get the first n items using list.subList(0, n).
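For example, assuming a List<Integer> named list and that you want the largest values first:

// imports: java.util.Collections, java.util.List
Collections.sort(list, Collections.reverseOrder());
List<Integer> topTen = list.subList(0, Math.min(10, list.size()));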
Since you say the list of items from which to extract the top N may be of any size, and so may be large, I assume, I'd augment the simple sort() answers above (which are entirely appropriate for reasonably sized input) by pointing out that most of the work here is finding the top N; sorting those N afterwards is trivial. That is:
Queue<Integer> topN = new PriorityQueue<Integer>(n);   // min-heap of the best n seen so far
for (Integer item : input) {
    if (topN.size() < n) {
        topN.add(item);                // still filling the heap
    } else if (item > topN.peek()) {
        topN.add(item);                // better than the worst of the current top n:
        topN.poll();                   // add it and evict the smallest
    }
}
List<Integer> result = new ArrayList<Integer>(n);
result.addAll(topN);
Collections.sort(result, Collections.reverseOrder());
The heap here (a min-heap) is at least bounded in size. There's no real need to make a heap out of all your items.
No, not really. At least not using Java's built-in methods.
There are clever ways to get the highest (or lowest) N items from a list quicker than an O(n log n) sort, but they require you to code the solution by hand. If the number of items stays relatively low (not more than a couple of hundred), sorting with Collections.sort() and then grabbing the top N numbers is the way to go IMO.
Depends on how many. Let's define n as the total number of keys, and m as the number you wish to display.
Sorting the entire thing: O(n log n)
Scanning the array each time for the next highest number: O(n*m)
So the question is: what's the relation between n and m?
If m < log n, scanning will be more efficient.
Otherwise, m >= log n, which means sorting will be better. (For the edge case of m = log n it doesn't actually matter, but sorting will also give you the benefit of, well, a sorted array, which is always nice.)
If the size of the list is N, and the number of items to be retrieved is K, you can call heapify on the list, which converts the list (which has to be indexable, e.g. an array) into a max-priority queue. (See the heapify function in http://en.wikipedia.org/wiki/Heapsort.)
Removing the item at the top of the heap (the max item) takes O(lg N) time. So your overall time would be:
O(N + K lg N)
which is better than O(N lg N), assuming K is much smaller than N.
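With the JDK, PriorityQueue's Collection constructor gives you the linear-time heap construction (the Javadoc doesn't promise O(N), but the OpenJDK implementation heapifies in linear time). It always builds a min-heap on natural order, so the sketch below negates the values to simulate a max-heap (ignoring the Integer.MIN_VALUE edge case); alternatively, wrap your elements in a type whose compareTo is reversed:

// imports: java.util.ArrayList, java.util.List, java.util.PriorityQueue
// Assumes a List<Integer> named weights and a target count k.
List<Integer> negated = new ArrayList<>(weights.size());
for (int w : weights) negated.add(-w);                      // max-heap via negation

PriorityQueue<Integer> heap = new PriorityQueue<>(negated); // O(N) heap construction

List<Integer> topK = new ArrayList<>(k);
for (int i = 0; i < k && !heap.isEmpty(); i++) {
    topK.add(-heap.poll());                                 // each poll is O(lg N)
}
// Total: O(N + k lg N).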
If keeping a sorted array or using a different data structure is not an option, you could try something like the following. The big-O time is similar to sorting the large array, but in practice this should be more efficient.
small_array = big_array.slice( number_of_items_to_find );
small_array.sort();
least_found_value = small_array.get(0).value;

for ( item in big_array ) {   // needs to skip the first few items
    if ( item.value > least_found_value ) {
        small_array.remove(0);
        small_array.insert_sorted(item);
        least_found_value = small_array.get(0).value;
    }
}
small_array could be an Object[] and the inner loop could be done with swapping instead of actually removing and inserting into an array.
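Here is a rough Java rendering of that pseudocode, using an ArrayList kept in ascending order and Collections.binarySearch for the sorted insert (Item and getValue() are stand-ins for your element type; it assumes the big list has at least k elements):

// imports: java.util.ArrayList, java.util.Collections, java.util.Comparator, java.util.List
static List<Item> topK(List<Item> big, int k) {
    Comparator<Item> byValue = Comparator.comparingDouble(Item::getValue);
    List<Item> small = new ArrayList<>(big.subList(0, k));
    small.sort(byValue);                                   // ascending: smallest kept value first
    double leastFound = small.get(0).getValue();

    for (Item item : big.subList(k, big.size())) {         // skip the first k items
        if (item.getValue() > leastFound) {
            small.remove(0);                               // drop the current smallest
            int pos = Collections.binarySearch(small, item, byValue);
            if (pos < 0) pos = -pos - 1;                   // convert to an insertion point
            small.add(pos, item);
            leastFound = small.get(0).getValue();
        }
    }
    return small;                                          // the k largest, in ascending order
}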