Getting max and min from BST (C++ vs Java) - java

I know that given a balanced binary search tree, getting the min and max elements takes O(log N). I want to ask about their implementations in C++ and Java respectively.
C++
Take std::set for example: getting the min/max can be done by calling *set.begin() / *set.rbegin(), and it's constant time.
Java
Take TreeSet for example: getting the min/max can be done by calling TreeSet.first() and TreeSet.last(), but it's logarithmic time.
I wonder if this is because std::set has done some extra trick to always keep the begin() and rbegin() pointers up to date. If so, can anyone show me that code? Btw, why didn't Java add this trick too? It seems pretty convenient to me...
EDIT:
My question might not be very clear. I want to see how std::set implements insert/erase; I'm curious to see how the begin() and rbegin() iterators are updated during those operations.
EDIT2:
I'm very confused now. Say I have following code:
set<int> s;
s.insert(5);
s.insert(3);
s.insert(7);
... // say I inserted a total of n elements.
s.insert(0);
s.insert(9999);
cout<<*s.begin()<<endl; //0
cout<<*s.rbegin()<<endl; //9999
Aren't both *s.begin() and *s.rbegin() O(1) operations? Are you saying they aren't? Does s.rbegin() actually iterate to the last element?

My answer isn't language specific.
To fetch the MIN or MAX in a BST, you always have to walk down to the leftmost or the rightmost node respectively. This operation is O(height), which is roughly O(log n) for a balanced tree.
Now, to optimize this retrieval, a class that implements a tree can store two extra pointers to the leftmost and the rightmost node, and then retrieving them becomes O(1). Of course, these pointers bring in the overhead of updating them with each insert (and delete) operation on the tree.
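As an illustration, here is a minimal Java sketch of that idea (the class and its fields are hypothetical, not taken from any library): an ordinary unbalanced BST that caches its leftmost and rightmost nodes so min()/max() are O(1), with the cache updated on every insert.

class CachedMinMaxBst {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;
    private Node leftmost;   // smallest key inserted so far
    private Node rightmost;  // largest key inserted so far

    public void insert(int key) {
        Node node = new Node(key);
        if (root == null) {
            root = leftmost = rightmost = node;
            return;
        }
        Node cur = root;                       // plain BST insert, O(height)
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = node; break; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = node; break; }
                cur = cur.right;
            }
        }
        if (key < leftmost.key)  leftmost = node;   // O(1) cache maintenance
        if (key > rightmost.key) rightmost = node;
    }

    public int min() { return leftmost.key; }   // O(1); assumes the tree is non-empty
    public int max() { return rightmost.key; }  // O(1); assumes the tree is non-empty
}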

begin() and rbegin() only return iterators, in constant time. Iterating them isn't constant-time.
There is no 'begin()/rbegin() pointer'. The min and max are reached by iterating to the leftmost or rightmost elements, just as in Java, only Java doesn't have the explicit iterator.

Related

How to remove maximum value from collection with O(log n) time complexity?

I have a collection, I don't know which data structure to use yet for this.
I have two functions, add and remove.
Both of the functions need to have similar complexities because they both are as frequently used.
Either the add function will be as simple as O(1) and removeMax will be O(log n), or both will be O(1), or one of them will be O(log n) and the other O(n).
removeMax should remove the maximum value and return it, and it should be usable multiple times, so the next time you call it, it removes the new maximum value.
Is there a way to do both in O(1), or at least O(log n) for remove?
If it's a sorted structure (such as TreeSet), both add and remove would require O(logN).
If it's not sorted, add can be implemented in O(1) but removeMax would take O(N), since you must check all the elements to find the maximum in an unsorted data structure.
If you need a data structure that does both add() and removeMax() in O(log n), a sorted structure works. With a sorted array you can use binary search to find the target position in O(log n) (for remove, the max is simply the last element; for add, you find the largest value smaller than the one you want to insert and insert after it). Bear in mind, though, that inserting into an array means shifting elements, which is O(n); a balanced tree such as TreeSet keeps both operations at O(log n).
Max heaps are probably what you are looking for; the amortized complexity of their remove operation is O(log n). A Fibonacci heap (see this great animation to see how it works) seems like a suitable data structure for you, as it has O(1) amortized insert (extracting the max is still O(log n) amortized). Sadly, its implementation is not part of the standard Java libraries, but there are tons of implementations to be found (for instance, see the answer in the comment from @Lino).
Guava's implementation of min-max heap
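For a JDK-only sketch of the max-heap suggestion (no third-party library assumed), java.util.PriorityQueue with a reversed comparator gives add in O(log n) and removeMax in O(log n):

import java.util.Collections;
import java.util.PriorityQueue;

public class RemoveMaxDemo {
    public static void main(String[] args) {
        // Max-heap: the largest element sits at the head of the queue.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());

        maxHeap.add(5);      // O(log n)
        maxHeap.add(42);
        maxHeap.add(17);

        System.out.println(maxHeap.poll()); // 42 - removes and returns the current max, O(log n)
        System.out.println(maxHeap.poll()); // 17 - the next max
    }
}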

Traversal of Giant LinkedList

For a project I am required to write a method that times the traversal of a LinkedList filled with 5 million random Integers using a listIterator, then with LinkedList's get(index) method.
I had no problem traversing it with the listIterator and it completed in around 75ms. HOWEVER, after trying the get method traversal on 5 million Integers, I just stopped the run at around 1.5 hours.
The getTraverse method I used is something like the code below for example (however mine was grouped with other methods in a class and was non-static, but works the same way).
public static long getTraverse(LinkedList<Integer> list) {
    long start = System.currentTimeMillis();
    for (int i = 0; i < list.size(); i++) {
        list.get(i);
    }
    long stop = System.currentTimeMillis();
    return stop - start;
}
This worked perfectly fine for LinkedLists of Integers of sizes 50, 500, 5000, 50000, and took quite a while but completed for 500000.
My professor tends to be extremely vague with instructions and very unhelpful when approached with questions. So, I don't know if my code is broken, or if he got carried away with the Integers in the guidelines. Any input is appreciated.
Think about how a LinkedList is implemented - as a chain of nodes - and you'll see that to get to a particular node you have to start at the head and traverse to that node.
You're calling .get() on a LinkedList n times, which requires traversing the list until it reaches that index. This means your getTraverse() method takes O(n^2) (or quadratic) time, because for each element it has to traverse (part of) the list.
As Elliott Frisch said, I suspect you're discovering exactly what your instructor wanted you to discover - that different algorithms can have drastically different runtimes, even if in principle they do the same thing.
A LinkedList is optimised for insertion (at either end, or at a position you already hold an iterator to), which is a constant time operation.
Searching a LinkedList requires you to iterate over every element to find the one you want. You provide the index to the get method, but under the covers it is traversing the list to that index.
If you add some print statements, you'll probably see that the first X elements are retrieved pretty fast and it slows down over time as you index elements further down the list.
An ArrayList (backed by an array) is optimised for retrieval and can index the desired element in constant time. Try changing your code to use an ArrayList and see how much faster get runs.
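For contrast, here is a sketch of the iterator-based traversal the answers recommend, using the same timing scheme as the question's getTraverse; the for-each loop advances the list's iterator one node at a time, so the whole pass is O(n):

import java.util.LinkedList;

public class IteratorTraverseDemo {
    public static long iteratorTraverse(LinkedList<Integer> list) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (int value : list) {   // uses the list's iterator under the covers
            sum += value;          // touch each element so the loop does real work
        }
        long stop = System.currentTimeMillis();
        return stop - start;
    }
}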

Find K max values from an N-element List

I've got these requirements:
1. I have random values in a List/Array and I need to find the 3 max values.
2. I have a pool of values and this pool gets updated, maybe every 5 seconds. Every time after the update, I need to find the 3 max values from the pool.
I thought of using Math.max three times on the list, but I don't think it's a very optimized approach.
Won't any sorting mechanism be costly, since I only care about the top 3 max values? Why sort all of them?
Please suggest the best way to do it in Java.
Sort the list, get the 3 max values. If you don't want the expense of the sort, iterate and maintain the n largest values.
Maintain the pool as a sorted collection.
Update: FYI Guava has an Ordering class with a greatestOf method to get the n max elements in a collection. You might want to check out the implementation.
Ordering.greatestOf
Traverse the list once, keeping an ordered array of three largest elements seen so far. This is trivial to update whenever you see a new element, and instantly gives you the answer you're looking for.
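A minimal sketch of that single-pass idea, assuming plain int values (top[0] >= top[1] >= top[2] at all times):

static int[] topThree(int[] values) {
    int[] top = { Integer.MIN_VALUE, Integer.MIN_VALUE, Integer.MIN_VALUE };
    for (int v : values) {
        if (v > top[0]) {            // new overall maximum
            top[2] = top[1];
            top[1] = top[0];
            top[0] = v;
        } else if (v > top[1]) {     // new second-largest
            top[2] = top[1];
            top[1] = v;
        } else if (v > top[2]) {     // new third-largest
            top[2] = v;
        }
    }
    return top;
}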
A priority queue should be the data structure you need in this case.
First, it would be wise to never say again, "I don't think it's a very optimized approach." You will not know which part of your code is slowing you down until you put a profiler on it.
Second, the easiest way to do what you're trying to do -- and what will be most clear to someone later if they are trying to see what your code does -- is to use Collections.sort() and pick off the last three elements. Then anyone who sees the code will know, "oh, this code takes the three largest elements." There is so much value in clear code that it will likely outweigh any optimization that you might have done. It will also keep you from writing bugs, like giving a natural meaning to what happens when someone puts the same number into the list twice, or giving a useful error message when there are only two elements in the list.
Third, if you really get data which is so large that O(n log n) operations are too slow, you should rewrite the data structure which holds the data in the first place. java.util.NavigableSet, for example, offers a descendingIterator() method which you can probe for its first three elements; those are the three maximum numbers. If you really want, a Heap data structure can be used, and you can pull off the top 3 elements with something like one comparison each, at the cost of making add an O(log n) operation.
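A short sketch of the NavigableSet approach (note a TreeSet ignores duplicate values, which may or may not matter for the pool):

import java.util.Iterator;
import java.util.NavigableSet;
import java.util.TreeSet;

public class TopThreeFromSet {
    public static void main(String[] args) {
        NavigableSet<Integer> pool = new TreeSet<>();
        pool.add(7);
        pool.add(42);
        pool.add(3);
        pool.add(19);

        // descendingIterator() walks the set from largest to smallest,
        // so the first three elements it yields are the three maxima.
        Iterator<Integer> it = pool.descendingIterator();
        for (int i = 0; i < 3 && it.hasNext(); i++) {
            System.out.println(it.next()); // prints 42, 19, 7
        }
    }
}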

Which is the appropriate data structure?

I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive once you add the fast-removal requirement.
An O(1)-insertion collection won't have O(1) max, as the collection is unsorted. An O(1)-max collection has to be sorted, thus the insert is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could have a variable saving the current highest element, compare on insert with that variable, max and insert should be O(1) then. Removal will be O(n) then though, as you have to find a new highest element in the cases where the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
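As a quick sketch of that trade-off with a TreeSet (strictly speaking, last() also walks the rightmost path, so it is O(log n) rather than O(1), which is usually fast enough in practice):

import java.util.TreeSet;

public class MaxTrackingSet {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();

        set.add(10);                     // O(log n)
        set.add(99);
        set.add(42);

        System.out.println(set.last()); // 99 - walks the rightmost path, O(log n)
        set.remove(99);                 // O(log n)
        System.out.println(set.last()); // 42
    }
}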
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kind of heap-based priority queue will do it. See the comparison table for the different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting, it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could; let's call this data structure D.
Given an array A:
1. insert all elements of A into D
2. create an empty linked list L
3. while D is not empty:
3.1. x <- D.max(); D.delete(x) -- all O(1) by assumption
3.2. L.insert_first(x) -- O(1)
4. return L
Here we have built a sorting algorithm that runs in O(n), which is impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's O(log n) insertion and deletion are too slow. O(log n) time is practically constant with respect to most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform about log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to what a good hash function would take to compute.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others pointed this.
But you can go beyond, if you don't care making all of this a bit more complex.
If you can "waste" some memory and some programming efforts, you can use, at the same time, different data structures, combining the pro's of each one.
For example, I needed a sorted data structure but wanted to have O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but it is almost O(1) when it's not too full and the hashing function is good) and I got really good results.
For your specific case, I would go for a dynamic combination between a HashMap and a custom helper data structure. I have in mind something very complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the stuff in the HashMap, and then use a special field (currentMax) that only contains the max element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it in the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
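A minimal sketch of that combination, using a HashSet of ints for brevity instead of a full HashMap (the class and method names here are made up for illustration):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class MaxCachingSet {
    private final Set<Integer> elements = new HashSet<>();
    private Integer currentMax = null;

    public void insert(int value) {              // O(1)
        elements.add(value);
        if (currentMax == null || value > currentMax) {
            currentMax = value;
        }
    }

    public Integer max() {                       // O(1)
        return currentMax;
    }

    public void remove(int value) {
        elements.remove(value);                  // O(1)
        if (currentMax != null && value == currentMax) {
            // The max was removed: rescan for the new one, O(n).
            currentMax = elements.isEmpty() ? null : Collections.max(elements);
        }
    }
}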
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination between different data structures. Do your tests, learn how frequently max changes. Data structures aren't easy, and you can make a difference if you really work combining them instead of finding a magic one, that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like structure to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.

Java - Looking for something faster than PriorityQueue

I'm using Java on a big amount of data.
[I'll try to simplify the problem as much as possible]
I have a small class (Element) containing an int KEY and a double WEIGHT (with getters & setters).
I read a lot of these objects from a file and I have to get the best (highest-weight) M objects.
Currently I'm using a PriorityQueue with a Comparator written to compare two Elements, and it works, but it's too slow.
Do you know (I know you do) any faster way to do that?
Thank you
A heap-based priority queue is a good data structure for this problem. Just as a sanity check, verify that you are using the queue correctly.
If you want the highest weight items, use a min-queue—where the top of the heap is the smallest item. Adding every item to a max-queue and examining the top M items when done is not efficient.
For each item, if there are less than M items in the queue, add the current item. Otherwise, peek at the top of the heap. If it's less than the current item, discard it, and add the current item instead. Otherwise, discard the current item. When all items have been processed, the queue will contain the M highest-weight items.
Some heaps have shortcut APIs for replacing the top of the heap, but Java's Queue does not. Even so, the big-O complexity is the same.
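A sketch of that filtering loop, with the question's Element class simplified to plain double weights:

import java.util.PriorityQueue;

public class TopM {
    // Returns a min-heap holding the M largest weights from the input; assumes m > 0.
    static PriorityQueue<Double> topM(double[] weights, int m) {
        PriorityQueue<Double> best = new PriorityQueue<>();   // natural order: smallest kept item on top
        for (double w : weights) {
            if (best.size() < m) {
                best.offer(w);                 // still filling up: just add, O(log m)
            } else if (best.peek() < w) {
                best.poll();                   // drop the smallest of the current best M
                best.offer(w);                 // and keep the better item instead
            }
            // otherwise w is not among the M largest seen so far, so discard it
        }
        return best;
    }
}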
In addition to the suggested "peek at the top of the heap" algorithm, which gives you O(n log m) complexity for getting the top-m of n items, here are two more solutions.
Solution 1: Use a Fibonacci heap.
The JDK's PriorityQueue implementation is a balanced binary heap. You should be able to squeeze more performance out of a Fibonacci heap implementation. It has amortized constant-time insert, while inserting into a binary heap costs O(log n) in the size of the heap in the worst case. If you're doing that for every element, then you're at O(n log n). Finding the top-m of n items using a Fib heap has complexity O(n + m log n). Combine this with the suggestion to only ever insert m elements into the heap, and you have O(n + m log m), which is as close to linear time as you're going to get.
Solution 2: Traverse the list M times.
You should be able to get the kth-largest element in a set in O(n) time. Simply read everything into a list and do the following:
kthLargest(k, xs)
    Pick a random pivot element p from the list
    (the first one will do if your list is already in random order).
    Go over the list once and partition it into two lists:
        Left:  elements smaller than p.
        Right: elements larger than or equal to p.
    If the Right list is shorter than k, return kthLargest(k - right.size, Left).
    If the Right list is longer than k, return kthLargest(k, Right).
    Otherwise, return p.
That gives you O(n) time. Running that m times, you should be able to get the top-m objects in your set in time O(nm), which will be strictly less than n log n for sufficiently large n and sufficiently small m. For example, getting the top-10 over a million items will take half as long as using a binary heap priority queue, all other things being equal.
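Here is a sketch of that selection step in Java, assuming plain int values for simplicity; unlike the pseudocode above, it counts elements equal to the pivot separately so that duplicates cannot cause infinite recursion (expected O(n) time with a random pivot):

import java.util.ArrayList;
import java.util.List;

public class KthLargest {
    // Returns the k-th largest element (k >= 1) of xs.
    static int kthLargest(int k, List<Integer> xs) {
        int pivot = xs.get(0);                    // fine if the list is already in random order
        List<Integer> left = new ArrayList<>();   // strictly smaller than the pivot
        List<Integer> right = new ArrayList<>();  // strictly larger than the pivot
        int pivotCount = 0;                       // elements equal to the pivot
        for (int x : xs) {
            if (x < pivot) left.add(x);
            else if (x > pivot) right.add(x);
            else pivotCount++;
        }
        if (k <= right.size()) return kthLargest(k, right);
        if (k <= right.size() + pivotCount) return pivot;
        return kthLargest(k - right.size() - pivotCount, left);
    }
}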
If M is suitably small, then sorting all elements may waste a lot of computing time. You could put only the first M objects in a priority queue (e.g. a heap with the minimal element on top), and then iterate over the rest of the elements: every time an element is larger than the top of the heap, remove the top and push the new element into the heap.
Alternatively, you could iterate over the whole array once to find a statistical threshold value for which you can be very sure there are more than M objects with a larger value (will require some assumptions regarding the values, e.g. if they are normally distributed). You can then limit sorting to all elements with a larger value.
@Tnay: You have a point about not performing a comparison. Unfortunately, your example code still performs one. This solves the problem:
public int compare(ListElement i, ListElement j) {
return i.getValue() - j.getValue();
}
In addition, neither yours nor BigG's compare method is strictly correct, since they never return 0. This may be a problem with some sorting algorithms, and it's a very tricky bug, since it will only appear if you switch to another implementation.
From the Java docs:
The implementor must ensure that sgn(compare(x, y)) == -sgn(compare(y, x)) for all x and y.
This may or may not perform a significant constant factor speed-up.
If you combine this with erickson's solution, it will probably be hard to do it faster (depending on the size of M). If M is very large, the most efficient solution is probably to sort all the elements using Java's built-in qsort on an array and cut off one end of the array in the end.
