I'm giving an example that mirrors my use case:
I have a histogram say in the range [0, 10000]. I want to efficiently support queries of the type:
int j = maxYInXRange(20, 70);
This should return the maximum Y value in the given X range.
I've come across a Data structure called "Priority Search Tree" used in Computer Graphics but there are no easily understandable resources on this topic.
I believe you are trying to solve the range minimum/maximum query problem. There are many ways you can achieve sublinear time per query, if you spend some more time precomputing information at the beginning. There is a good tutorial on several efficient approaches here.
For example, if your histogram doesn't change, you can answer queries with a sparse table in O(1), with precomputation using O(N log N) time and memory, where N is the number of elements in the histogram. If your histogram changes frequently, a segment tree can be used for O(log N) updates and queries, with O(N) time and memory for a one-time precomputation in the beginning.
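As a rough illustration of the sparse table idea for a histogram that doesn't change, here is a minimal sketch. The class and method names (SparseTableMax, maxInRange) are made up for this example, and it assumes the histogram is a plain int array indexed by X.

```java
import java.util.Arrays;

// Sparse table for static range-maximum queries (illustrative sketch).
class SparseTableMax {
    // table[k][i] holds the maximum of values[i .. i + 2^k - 1].
    private final int[][] table;

    SparseTableMax(int[] values) {
        int n = values.length;
        // Number of levels is floor(log2(n)) + 1 (one level for an empty array).
        int levels = 32 - Integer.numberOfLeadingZeros(Math.max(n, 1));
        table = new int[levels][];
        table[0] = Arrays.copyOf(values, n);
        for (int k = 1; k < levels; k++) {
            int span = 1 << k;
            table[k] = new int[n - span + 1];
            for (int i = 0; i + span <= n; i++) {
                // A block of length 2^k is the union of two blocks of length 2^(k-1).
                table[k][i] = Math.max(table[k - 1][i], table[k - 1][i + span / 2]);
            }
        }
    }

    // Maximum of values[lo..hi], both inclusive; requires 0 <= lo <= hi < n.
    int maxInRange(int lo, int hi) {
        int k = 31 - Integer.numberOfLeadingZeros(hi - lo + 1); // floor(log2(length))
        // Two (possibly overlapping) power-of-two blocks cover the whole range.
        return Math.max(table[k][lo], table[k][hi - (1 << k) + 1]);
    }
}
```

With this, maxYInXRange(20, 70) would amount to a single call like table.maxInRange(20, 70), at the cost of O(N log N) memory. If the histogram is updated often, a segment tree is the better fit, as noted above.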
What about the standard TreeMap, using the subMap(K,boolean,K,boolean) method?
TreeMap<Integer, Integer> histogram = ...
return histogram.subMap(20, true, 70, true).values().stream().max(Integer::compare).get();
The lookup of the boundaries will be O(log n). Finding the maximum will be O(m), where m is the number of entries in the queried range. I don't think you can find a better data structure unless you precompute everything, which would take O(n²) in both computing and storage size, I suppose.
You could sort the histogram indices by value, highest to lowest. And then, for a given range iterate over it as this:
List<Entry> histogramEntries = ... // sorted by value, highest first
for (Entry entry : histogramEntries) {
    if (range.contains(entry.index)) {
        return entry.value;
    }
}
This will work faster for larger ranges, since it's more likely to contain one of the higher values that are in the beginning of the list.
In my COMP class last night we learned about hashing and how it generally works when trying to find an element x in a hash table.
Our scenario was that we have a dataset of 1000 elements inside our table and we want to know if x is contained within that table.
Our professor drew up a Java array of 100 and said that to store these 1000 elements that each position of the array would contain a pointer to a Linked List where we would keep our elements.
Assuming the hashing function perfectly mapped each of the 1000 elements to a value between 0 and 99 and stored the element at the position in the array, there would be 1000/100 = 10 elements contained within each linked list.
Now to know whether x is in the table, we simply hash x, find its hash value, look up that slot in the array, and iterate over our linked list to check whether x is in the table.
My professor concluded by saying that the expected complexity of finding whether x is in the table is O(10) which is really just O(1). I cannot understand how this is the case. In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table. Isn't this by definition not constant time, because if we scale up the data set the time will still increase?
I've looked through Stack Overflow and online and everyone says hashing is expected time complexity of O(1) with some caveats. I have read people discuss chaining to reduce these caveats. Maybe I am missing something fundamental about determining time complexity.
TLDR: Why does it take O(1) time to look up a value in a hash table when it seems to still be determined by how large your dataset is (therefore a function of N, therefore not constant)?
In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table.
This is a misconception, as hashing simply requires you to calculate the correct bucket (in this case, array index) that the object should be stored in. This calculation will not become any more complex if the size of the data set changes.
These caveats that you speak of are most likely hash collisions, where multiple objects share the same hashCode; these can be reduced with a better hash function.
The complexity of a hashed collection for lookups is O(1) because the size of lists (or in Java's case, red-black trees) for each bucket is not dependent on N. Worst-case performance for HashMap if you have a very bad hash function is O(log N), but as the Javadocs point out, you get O(1) performance "assuming the hash function disperses the elements properly among the buckets". With proper dispersion the size of each bucket's collection is more-or-less fixed, and also small enough that constant factors generally overwhelm the polynomial factors.
There are multiple issues here, so I will address them one by one:
Worst case analysis vs amortized analysis:
Worst case analysis refers to the absolute worst case scenario that your algorithm can be given relative to running time. As an example, if I am given an array of unordered elements and told to find an element in it, my best case scenario is when the element is at index [0]. The worst possible thing that I can be given is when the element is at the end of the array, in which case, if my data set has n elements, I run n steps before finding the element. In the average case, however, the element is anywhere in the array, so I will run n-k steps (where k is the number of elements after the element I am looking for in the array).
Worst case analysis of Hashtables:
There exists only one kind of hashtable that has guaranteed constant time access O(1) to its elements: arrays. (And even then it's not actually true, due to paging and the way OSes handle memory.) The worst possible case that I could give you for a hash table is a data set where every element hashes to the same index. So, for example, if every single element hashes to index 1, then due to collisions the worst case running time for accessing a value is O(n). This is unavoidable; hashtables always have this behaviour.
Average and best case scenario of hashtables:
You will rarely be given a set that gives you the worst possible case scenario. In general you can expect objects to be hashed to different indexes in your hashtable. Ideally the hash function hashes things in a very spread out manner so that objects get hashed to different indexes in the hash table.
In the specific example your teacher gave you, if 2 things get hashed to the same index, they get put in a linked list. So this is more or less how the table got constructed:
get element E
use the hashing function hash(E) to find the index i in the hash table
add E to the linked list in hashTable[i].
repeat for all the elements in the data set
So now, let's say I want to find whether element E is on the table. Then:
do hash(E) to find the index i where E is potentially hashed
go to hashTable[i] and iterate through the linked list (up to 10 iterations)
If E is found, then E is in the hash table; if E is not found, then E is not in the table.
The reason why we can guarantee that E is not in the table if we can't find it, is because if it was, it would have been hashed to hashTable[i] so it HAS to be there, if it's on the table.
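To make the construction and lookup steps above concrete, here is a minimal sketch of a chained hash table for ints. The class name ChainedIntSet and its methods are invented for this example, and the bucket count is fixed at construction, as in the professor's array of 100.

```java
import java.util.LinkedList;

// Minimal separate-chaining hash set of ints (illustrative, fixed bucket count).
class ChainedIntSet {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedIntSet(int bucketCount) {
        buckets = new LinkedList[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // hash(E) -> index i in the table; floorMod keeps the index non-negative.
    private int indexFor(int element) {
        return Math.floorMod(Integer.hashCode(element), buckets.length);
    }

    // Add E to the linked list in hashTable[i] (skipping duplicates).
    void add(int element) {
        LinkedList<Integer> chain = buckets[indexFor(element)];
        if (!chain.contains(element)) {
            chain.add(element);
        }
    }

    // Only the one chain that E could be in is scanned.
    boolean contains(int element) {
        return buckets[indexFor(element)].contains(element);
    }

    // Length of the chain that a lookup for this element would walk.
    int chainLength(int element) {
        return buckets[indexFor(element)].size();
    }
}
```

With 100 buckets and the values 0 to 999, every chain ends up with 10 entries, so a lookup walks at most 10 nodes. Note that this is only constant if the number of buckets grows along with the data; with the bucket count held fixed, the chains grow linearly with N.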
I can't figure out how to implement a special hash table.
The idea would be that the hash table gives an approximate
match. So a perfect hash table (such as found in java.util)
just gives a map, such that:
Hashtable h = new Hashtable();
...
x = h.get(y);
If x is the result of applying the map h to the argument y,
i.e. basically in mathematics it would be a function
namely x = h(y). Now for the approximate match, what about a
data structure that gives me quickly:
x = h(k) where k=max { z<=y | h(z)!=null }
The problem is k can be very far away from the given y. For example
y could be 2000, and the next occupied slot k could be 1000. Some
linear search would be costly, the data structure should do the job
more quickly.
I know how to do it with a tree(*), but something with a hash, can this
also work? Or maybe combine some tree and hash properties in the sort
of data structure? Some data structure that tends toward O(1) access?
Bye
(*) You can use a tree ordered by y, and find something next below or equal y.
This is known as Spatial hashing. Keep in mind it has to be tailored for your specific domain.
It can be used when the hash tells you something about logical arrangement of objects. So when |hash(a)-hash(b)| < |hash(a)-hash(c)| means b is closer/more similar to a than c is.
Then the basic idea is that you divide the space into buckets (e.g. drop the least significant digits of the hash -- the naive approach) and your spatial hash is this bucket ID. You have to take care of the edge cases, when the objects are very near to each other, while being on the boundary of the buckets (e.g. h(1999) = 1 but h(2000)=2). You can solve this by two overlapping hashes and having two separate hash maps for them and querying both of them, or looking to the neighboring buckets etc...
As I said at the beginning, this has to be thought through very well.
The tree (e.g. KD-tree for higher dimensions) isn't so demanding in the design phase and is generally a more convenient approach to nearest neighbor(s) querying.
The specific formula you give suggests you want a set that can retrieve the greatest item less than a given input.
One simple approach to achieving that would be to keep a sorted list of the items, and perform a binary search to locate the position in the list at which the given element would be inserted, then return the element equal to or less than that element.
As always, any set can be converted into a map by using a pair object to wrap the key-value pair, or by maintaining a parallel data structure for the values.
For an array-based approach, the runtime will be O(log n) for retrieval and O(n) for insertion of a single element. If 'add all' sorts the added elements and then merges them, it can be O(n log n).
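Here is a small sketch of that lookup, i.e. finding k = max { z <= y } with a binary search over sorted keys. The class and method names are placeholders. For comparison, it also shows the tree-based equivalent mentioned in the question, using TreeMap.floorEntry.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class FloorLookup {
    // Index of the greatest key <= y in sortedKeys, or -1 if every key is larger.
    static int floorIndex(int[] sortedKeys, int y) {
        int pos = Arrays.binarySearch(sortedKeys, y);
        if (pos >= 0) {
            return pos; // exact match
        }
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        int insertionPoint = -pos - 1;
        return insertionPoint - 1; // element just before where y would be inserted
    }

    // Tree-based equivalent: value under the greatest key <= y, or null if none.
    static <V> V floorValue(TreeMap<Integer, V> map, int y) {
        Map.Entry<Integer, V> entry = map.floorEntry(y);
        return entry == null ? null : entry.getValue();
    }
}
```

For the example in the question, with keys {10, 1000, 3000} and y = 2000, floorIndex returns the position of 1000. Both variants take O(log n) per query.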
It's not possible [1] to have a constant-time algorithm that can answer what the first element less than a given element is using a hashing approach. A good hashing algorithm spreads out similar (but non-equal) items, to avoid having many similar items fall into the same bucket and destroy the desired constant-time retrieval behavior. This means the elements of a hash set (or map) are very deliberately not even remotely close to sorted order; they are as close to randomly distributed as we could achieve while using an efficient repeatable hashing algorithm.
1. Of course, proving that it's not possible is difficult, since one can't easily prove that there isn't a simple repeatable constant-time request that will reliably convince an oracle (or God, if God were that easy to manipulate) to give you the answer to the question you want, but it seems unlikely.
I have a question about the performance of my class project.
I have about 5000 game objects formed from reading a text file. I have a TreeMap (called supertree) that holds as its nodes TreeMaps (mini treemaps, I guess). These nodes/mini treemaps are action, strategy, adventure, sports, gametitle, etc.: basically game genres, and these mini trees will hold game objects. So the supertree itself will hold probably 8 nodes/treemaps.
When I insert a game object, it will determine which mini tree it will go to and put it in there. For example if I insert the game Super Mario World, it will check which genre it is and see that it's adventure,so Super Mario World would be inserted into the adventure tree.
So my question is: what would the performance be if the query lists all the action games, given that a TreeMap get is O(log n)?
First at the super tree it will look for the Action Node/Treemap, which will take O(log n).
Then once inside the Action treemap, it will do a get for all elements, which would be O(n log n), correct?
So the total performance of log n * (n * log n) is correct? Which is worse than O(n).
[edit]
Hopefully this clarified my post a bit.
The get on the supermap is O(log n_categories), and going through the other map (using an iterator) should be O(n_games). If n_categories has an upper bound of, say, 10 (because the number of categories doesn't change when adding new games), you can assume the supermap lookup to be O(1).
Since the submaps can have at most n_games entries (when all belong to the same category), listing all games of type action thus gives you O(n_games). Don't forget that in order to iterate over all entries you don't have to call get() each time. That would be like reading through a book and instead of turning the page to get from page 100 to 101, start counting at the beginning and count to 101...
EDIT: Since the above paragraph, stating that if the number of categories is fixed one can assume the category lookup to be O(1), seems to be hard to accept, let me say that even if you insist category lookup is O(log n_categories), that still gives O(n_games), since the category lookup has to be done only once. Then, you iterate over the result, which is O(n_games). This leads to O(n_games + log n_categories) = O(n_games).
Okay, first thing, your big-O isn't going to change depending on language; that's why people use big-O (asymptotic) notation.
Now, think about your whole algorithm. You take your outer tree and get each element, which is indeed O(n0 lg n0). For each of those nodes, you have O(n1 lg n1). The lg n's differ by only a constant, so they can be combined, and you get O(n0 × n1 lg n), or O(n^2 lg n).
A couple of comments regarding the OP's analysis:
I'm assuming you have already constructed your treemaps/sets and are just extracting elements from the finished (preprocessed) in-memory representation.
Let's say n is the number of genres. Let's say m is the max number of games per genre.
The complexity of getting the right 'genre map' is O(lg n) (a single get for the supertree). The complexity of iterating over the games in that genre depends on how you do it:
for (GameRef g : submap.keySet()) {
    // do something with submap.get(g)
}
This code yields O(m) 'get' operations of O(lg m) complexity each, so that's O(m lg(m)).
If you do this:
for (Map.Entry e : submap.entrySet()) {
    // do something with e.getValue()
}
then the complexity is O(m) loop iterations with constant (O(1)) time access to the value.
Using the second map iteration method, your total complexity is O(lg(n) + m)
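Putting the pieces together, a nested-map layout like the one in the question might look roughly as follows. GenreIndex, its methods, and the title-to-id mapping are all invented for this sketch; the point is only that listing a genre costs one lookup plus one traversal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical "supertree": genre -> (title -> game id).
class GenreIndex {
    private final TreeMap<String, TreeMap<String, Integer>> byGenre = new TreeMap<>();

    // O(lg n) to find the genre map, then O(lg m) to insert into it.
    void insert(String genre, String title, int gameId) {
        byGenre.computeIfAbsent(genre, g -> new TreeMap<>()).put(title, gameId);
    }

    // One O(lg n) get on the outer map, then an O(m) walk over the entries.
    List<String> listGenre(String genre) {
        List<String> titles = new ArrayList<>();
        TreeMap<String, Integer> games = byGenre.get(genre);
        if (games != null) {
            for (Map.Entry<String, Integer> e : games.entrySet()) {
                titles.add(e.getKey()); // no per-element get() needed
            }
        }
        return titles;
    }
}
```

The titles come back in sorted order because the inner map is a TreeMap.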
Er, you were right until the last paragraph.
Your total complexity is O(n logn), logn to look up the type and n to list all the values in that type.
If you're talking about listing everything, it's definitely not O(n^2 logn), since getting all the values in your tree is linear. It would be O(n^2).
Doing the same thing with a flat list would be O(n logn), so you're definitely losing performance (not to mention memory) by using a tree for this.
I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive together with the fast removal point.
An O(1) insertion collection won't have O(1) max, as the collection is unsorted. An O(1) max collection has to be sorted, thus the insert is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could have a variable saving the current highest element, compare on insert with that variable, max and insert should be O(1) then. Removal will be O(n) then though, as you have to find a new highest element in the cases where the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting, it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
you cannot have O(1) removal+insertion+max
proof:
assume you could, let's call this data structure D
given an array A:
1. insert all elements in A to D.
2. create empty linked list L
3. while D is not empty:
3.1. x<-D.max(); D.delete(x); --all is O(1) - assumption
3.2 L.insert_first(x) -- O(1)
4. return L
Here we have created a sorting algorithm that is O(n), but that is proven to be impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow--log(n) time is practically constant with respect to most real applications. Even with 1,000,000,000 elements in your tree, if it's balanced well you will only perform log(2, 1000000000) = ~30 comparisons per insertion or removal, which is comparable to what a hash function would take.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming effort, you can use different data structures at the same time, combining the pros of each one.
For example, I needed a sorted data structure but wanted to have O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but it almost is when it's not too full and the hashing function is good) and I got really good results.
For your specific case, I would go for a dynamic combination of a HashMap and a custom helper data structure. I have in mind something very complex (hash map + variable length priority queue), but I'll go for a simple example. Just keep all the stuff in the HashMap, and then use a special field (currentMax) that only contains the max element in the map. When you insert() in your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it in the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination between different data structures. Do your tests, learn how frequently max changes. Data structures aren't easy, and you can make a difference if you really work combining them instead of finding a magic one, that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
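For what it's worth, the simple version of this wrapper (one tracked maximum, without the list-of-five generalization) could be sketched like this. MaxTrackingSet is a made-up name, and the sketch assumes Integer elements.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.NoSuchElementException;
import java.util.Set;

// HashSet wrapper that remembers the current maximum (illustrative sketch).
class MaxTrackingSet {
    private final Set<Integer> elements = new HashSet<>();
    private Integer currentMax = null; // null while the set is empty

    // Expected O(1): one hash insert plus one comparison.
    void insert(int value) {
        elements.add(value);
        if (currentMax == null || value > currentMax) {
            currentMax = value;
        }
    }

    // O(1): just read the cached field.
    int max() {
        if (currentMax == null) {
            throw new NoSuchElementException("set is empty");
        }
        return currentMax;
    }

    // Expected O(1), except O(n) when the removed element was the maximum.
    boolean remove(int value) {
        boolean removed = elements.remove(value);
        if (removed && value == currentMax) {
            currentMax = elements.isEmpty() ? null : Collections.max(elements);
        }
        return removed;
    }
}
```

Whether this pays off depends on how often the maximum itself gets removed, so it is worth measuring on your real workload.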
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting sort-like algorithm to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
I'm using Java on a large amount of data.
[I'll try to simplify the problem as much as possible]
I have a small class (Element) containing an int KEY and a double WEIGHT (with getters & setters).
I read a lot of these objects from a file and I have to get the best (highest weight) M objects.
Currently I'm using a PriorityQueue with a Comparator written to compare two Elements, and it works, but it's too slow.
Do you know (I know you do) any faster way to do that?
Thank you
A heap-based priority queue is a good data structure for this problem. Just as a sanity check, verify that you are using the queue correctly.
If you want the highest weight items, use a min-queue—where the top of the heap is the smallest item. Adding every item to a max-queue and examining the top M items when done is not efficient.
For each item, if there are fewer than M items in the queue, add the current item. Otherwise, peek at the top of the heap. If it's less than the current item, discard it, and add the current item instead. Otherwise, discard the current item. When all items have been processed, the queue will contain the M highest-weight items.
Some heaps have shortcut APIs for replacing the top of the heap, but Java's Queue does not. Even so, the big-O complexity is the same.
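The procedure described above can be sketched with java.util.PriorityQueue as follows. The Element class here is a hypothetical stand-in for the one in the question (public fields instead of getters), and TopM.heaviest is an invented name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopM {
    // Hypothetical stand-in for the question's Element class.
    static final class Element {
        final int key;
        final double weight;

        Element(int key, double weight) {
            this.key = key;
            this.weight = weight;
        }
    }

    // Returns the m heaviest elements, heaviest first, in O(n log m).
    static List<Element> heaviest(Iterable<Element> items, int m) {
        List<Element> result = new ArrayList<>();
        if (m <= 0) {
            return result;
        }
        // Min-queue: the lightest of the current top-m candidates sits on top.
        PriorityQueue<Element> heap =
                new PriorityQueue<>(m, Comparator.comparingDouble((Element e) -> e.weight));
        for (Element item : items) {
            if (heap.size() < m) {
                heap.add(item);
            } else if (heap.peek().weight < item.weight) {
                heap.poll();    // discard the lightest candidate
                heap.add(item); // and keep the heavier newcomer
            }
        }
        result.addAll(heap);
        result.sort(Comparator.comparingDouble((Element e) -> e.weight).reversed());
        return result;
    }
}
```

The heap never holds more than m elements, so memory stays at O(m) even when the input is streamed from a file.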
In addition to the suggested "peek at the top of the heap" algorithm, which gives you O(n log m) complexity for getting the top-m of n items, here are two more solutions.
Solution 1: Use a Fibonacci heap.
The JDK's PriorityQueue implementation is a balanced binary heap. You should be able to squeeze more performance out of a Fibonacci heap implementation. It will have amortized constant time insert, while inserting into a binary heap has complexity Ω(log n) in the size of the heap. If you're doing that for every element, then you're at Ω(n log n). Finding the top-m of n items using a Fib heap has complexity O(n + m log n). Combine this with the suggestion to only ever insert m elements into the heap, and you have O(n + m log m), which is as close to linear time as you're going to get.
Solution 2: Traverse the list M times.
You should be able to get the kth-largest element in a set in O(n) time. Simply read everything into a list and do the following:
kthLargest(k, xs)
    Pick a random pivot element p from the list
    (the first one will do if your list is already random).
    Go over the set once and group it into two lists.
        Left: smaller than p.
        Right: larger than or equal to p.
    If the Right list is shorter than k, return kthLargest(k - Right.size, Left)
    If the Right list is longer than k, return kthLargest(k, Right)
    Otherwise, return p.
That gives you expected O(n) time. Running that m times, you should be able to get the top-m objects in your set in time O(nm), which will be strictly less than n log n for sufficiently large n and sufficiently small m. For example, getting the top-10 over a million items will take roughly half as long as using a binary heap priority queue (about 10n steps versus about 20n, since log2 of a million is about 20), all other things being equal.
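A possible Java rendering of the selection routine is sketched below. It is not a literal transcription: it uses a loop instead of recursion and a three-way split (larger / equal / smaller) instead of the two lists above, so that inputs with many duplicate values still make progress. The names are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class QuickSelect {
    private static final Random RANDOM = new Random();

    // Returns the k-th largest value (k = 1 is the maximum).
    // Requires 1 <= k <= xs.size(); expected O(n) time.
    static double kthLargest(int k, List<Double> xs) {
        List<Double> current = xs;
        while (true) {
            double pivot = current.get(RANDOM.nextInt(current.size()));
            List<Double> larger = new ArrayList<>();
            List<Double> smaller = new ArrayList<>();
            int equal = 0; // how many elements equal the pivot
            for (double x : current) {
                if (x > pivot) {
                    larger.add(x);
                } else if (x < pivot) {
                    smaller.add(x);
                } else {
                    equal++;
                }
            }
            if (k <= larger.size()) {
                current = larger;                 // answer is among the larger values
            } else if (k <= larger.size() + equal) {
                return pivot;                     // answer is the pivot itself
            } else {
                k -= larger.size() + equal;       // skip everything >= pivot
                current = smaller;
            }
        }
    }
}
```

Calling this once with k = m gives the weight threshold for the top m, after which a single extra pass can collect every element at or above that threshold.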
If M is suitably small, then sorting all elements may waste a lot of computing time. You could put only the first M objects in a priority queue (e.g. a heap, minimal element on top), and then iterate over the rest of the elements: every time an element is larger than the top of the heap, remove the top and push the new element into the heap.
Alternatively, you could iterate over the whole array once to find a statistical threshold value for which you can be very sure there are more than M objects with a larger value (will require some assumptions regarding the values, e.g. if they are normally distributed). You can then limit sorting to all elements with a larger value.
@Tnay: You have a point about not performing a comparison. Unfortunately, your example code still performs one. This solves the problem:
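public int compare(ListElement i, ListElement j) {
    // Subtraction avoids an explicit comparison, but it can overflow
    // when the two values have large magnitudes and opposite signs.
    return i.getValue() - j.getValue();
}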
In addition, neither your compare method nor BigG's is strictly correct, since they never return 0. This may be a problem with some sorting algorithms, and it is a very tricky bug, since it will only appear if you switch to another implementation.
From the Java docs:
The implementor must ensure that sgn(compare(x, y)) == -sgn(compare(y, x)) for all x and y.
This may or may not perform a significant constant factor speed-up.
If you combine this with erickson's solution, it will probably be hard to do it faster (depending on the size of M). If M is very large, the most efficient solution is probably to sort all the elements using Java's built-in sort (e.g. Arrays.sort) on an array and cut off one end of the array at the end.
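On the comparator contract quoted above: here is a small sketch contrasting a subtraction-based comparator with Integer.compare. It assumes int values (as the subtraction snippet implies), and the class and constant names are invented for the example.

```java
import java.util.Comparator;

class ValueComparators {
    // Subtraction-based comparison: fast, but the difference can overflow.
    static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

    // Overflow-safe: returns a negative, zero, or positive result as expected.
    static final Comparator<Integer> SAFE = Integer::compare;
}
```

For instance, BY_SUBTRACTION.compare(Integer.MIN_VALUE, 1) comes out positive because the difference wraps around, so the smallest int is reported as larger than 1. SAFE returns a negative value there and 0 for equal inputs. For double weights, Double.compare plays the same role.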