I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive, especially once you add the fast-removal requirement.
A collection with O(1) insertion won't have O(1) max, because the collection is unsorted. A collection with O(1) max has to keep its elements ordered somehow, so insertion can no longer be O(1). You'll have to bite the bullet and choose between the two. In both cases, however, removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on insert; max and insert are then O(1). Removal becomes O(n), though, because you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
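For example, a minimal sketch (my own illustration, with arbitrary values): a TreeSet keeps add and remove within O(log n), and last() hands back the current maximum.

    import java.util.TreeSet;

    public class TreeSetMaxSketch {
        public static void main(String[] args) {
            TreeSet<Integer> set = new TreeSet<>();
            set.add(42);                     // O(log n)
            set.add(7);
            set.add(99);
            System.out.println(set.last());  // 99, the current maximum
            set.remove(99);                  // O(log n)
            System.out.println(set.last());  // 42, the new maximum
        }
    }

(With a PriorityQueue you would use peek() instead, which returns the head of the queue in O(1) but gives up fast remove(Object).)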
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting: it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. Insert all elements of A into D.
2. Create an empty linked list L.
3. While D is not empty:
3.1. x <- D.max(); D.delete(x)   -- all O(1), by assumption
3.2. L.insert_first(x)           -- O(1)
4. Return L.
We have just built a sorting algorithm that runs in O(n), which is impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow; log(n) time is practically constant for most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to the work a hash function would do anyway.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming effort, you can use several data structures at the same time, combining the pros of each one.
For example, I needed a sorted data structure but wanted O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but almost, when it's not too full and the hash function is good) and I got really good results.
For your specific case, I would go for a dynamic combination of a HashMap and a custom helper data structure. I have something quite complex in mind (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the stuff in the HashMap, and then use a special field (currentMax) that only holds the max element in the map. When you insert() into your combined data structure, if the element you're inserting is greater than the current max, you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check whether it equals currentMax; if it does, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
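A minimal sketch of that combination (the class and field names are mine, and I'm storing plain Integers in a HashSet to keep the example short):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: insert() and max() are O(1); remove() only pays O(n) when the
    // element being removed happens to be the current maximum.
    public class MaxTrackingSet {
        private final Set<Integer> elements = new HashSet<>();
        private Integer currentMax = null;

        public void insert(Integer x) {
            if (currentMax == null || x > currentMax) {
                currentMax = x;              // keep the running maximum up to date
            }
            elements.add(x);
        }

        public Integer max() {
            return currentMax;               // O(1), may be null if the set is empty
        }

        public void remove(Integer x) {
            elements.remove(x);
            if (x.equals(currentMax)) {      // the expensive case: rescan for the new max
                currentMax = elements.stream().max(Integer::compare).orElse(null);
            }
        }
    }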
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to experiment a bit to find the best combination of data structures. Do your tests, learn how frequently the max changes. Data structures aren't easy, and you can make a real difference by combining them instead of looking for a magic one that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like approach to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
Related
I have a collection, I don't know which data structure to use yet for this.
I have two functions, add and remove.
Both functions need to have similar complexities, because they are used equally often.
Either add will be as simple as O(1) and removeMax will be O(log n), or both will be O(1), or one of them will be O(log n) and the other O(n).
removeMax should remove the maximum value and return it, and it should be usable repeatedly, so the next time you call it, it removes the new maximum value.
Is there a way to do both in O(1), or at least O(log n) for remove?
If it's a sorted structure (such as TreeSet), both add and remove would require O(logN).
If it's not sorted, add can be implemented in O(1) but removeMax would take O(N), since you must check all the elements to find the maximum in an unsorted data structure.
If you need a data structure that does both add() and removeMax() reasonably fast, a sorted array is one option. removeMax() just takes the last element, and for add() you binary-search for the insertion point (the largest value smaller than the one you want to insert) and place the new value right after it. Finding the position is O(log n); be aware, though, that physically inserting into or deleting from an array shifts the remaining elements, which is O(n) in the worst case.
Max heaps are probably what you are looking for; the amortized complexity of their remove operation is O(log n). A Fibonacci heap (see this great animation to see how it works) may suit you even better, as it has O(1) amortized insert and find-max (extracting the max is still O(log n) amortized). Sadly, its implementation is not part of the standard Java libraries, but there are plenty of implementations to be found (for instance, see the answer in the comment from @Lino).
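If an ordinary binary heap is good enough (O(log n) insert instead of the Fibonacci heap's O(1)), the JDK's own PriorityQueue can be turned into a max heap with a reversed comparator; a small sketch:

    import java.util.Collections;
    import java.util.PriorityQueue;

    public class MaxHeapSketch {
        public static void main(String[] args) {
            // reverseOrder() turns the default min heap into a max heap
            PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
            maxHeap.add(3);                      // O(log n)
            maxHeap.add(17);
            maxHeap.add(9);
            System.out.println(maxHeap.peek());  // 17, look at the max without removing it
            System.out.println(maxHeap.poll());  // 17, removeMax in O(log n)
            System.out.println(maxHeap.poll());  // 9
        }
    }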
Guava's implementation of min-max heap
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 int
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally; that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints: any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, order of O(1), and very quick to retrieve next neighbor. No apparent way to get previous neighbor without reverse sorting the set. Boxed Integer objects only.
int[]: very easy on memory because no boxing required, very quick to get previous and next neighbor, retrieval of an item is O(n) though because index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap (roughly sketched after this list):
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
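Roughly, that combination looks like this (a minimal sketch; the names are mine, and bounds checking at the ends of the order is left out):

    import java.util.HashMap;
    import java.util.Map;

    public class NeighborLookup {
        private final int[] order;                               // the externally supplied order
        private final Map<Integer, Integer> indexOf = new HashMap<>();

        public NeighborLookup(int[] externallyOrderedInts) {
            this.order = externallyOrderedInts.clone();
            for (int i = 0; i < order.length; i++) {
                indexOf.put(order[i], i);                        // value -> position for O(1) lookups
            }
        }

        public int previous(int value) {                         // undefined for the first element
            return order[indexOf.get(value) - 1];
        }

        public int next(int value) {                             // undefined for the last element
            return order[indexOf.get(value) + 1];
        }
    }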
What I like:
neighbor lookup is just two O(1) operations
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
I'd be curious to hear of better solutions.
One solution is to sort the array when you add elements. That way, the previous element is always i-1 and to locate a value, you can use a binary search which is O(log(N)).
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't span the whole 32-bit range, you can make the lookup faster by having a second array where each value is an index into the first array, and the index is the value you're looking for.
More options: You could look at bit sets but that again depends on the range which the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, the List is O(n) for unsorted lookup, while the Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which actually gives worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.
Basically, I'm looking for the best data structure in Java in which I can store pairs and retrieve the top N elements by value. I'd like to do this in O(n) time, where n is the number of entries in the data structure.
example input would be,
<"john", 32>
<"dave", 3>
<"brian", 15>
<"jenna", 23>
<"rachael", 41>
and if N=3, I should be able to return rachael, john, jenna if I wanted descending order.
If I use some kind of HashMap, insertion is fast, but retrieving the entries in order gets expensive.
If I use some data structure that keeps things ordered, then insertion becomes expensive while retrieving is cheaper. I was not able to find the best data structure that can do both very well and very fast.
Any input is appreciated. Thanks.
[updated]
Let me ask the question another way, if that makes it clearer.
I know I can insert into a HashMap in constant O(1) time.
Now, how can I retrieve the elements sorted by value in O(n) time, where n = number of entries in the data structure? Hope that makes sense.
If you want to sort, you have to give up constant O(1) time.
That is because unlike inserting an unsorted key / value pair, sorting will minimally require you to compare the new entry to something, and odds are to a number of somethings. Once you have an algorithm that will require more time with more entries (due to more comparisons) you have overshot "constant" time.
If you can do better, then by all means, do so! There is a Dijkstra Prize awaiting you, if not a Fields Medal to boot.
Don't despair, you can still do the key part as a HashMap and the sorting part with a tree-like implementation; that will give you O(log n). TreeMap is probably what you desire.
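One way to read that suggestion (my own interpretation, not code from the answer): keep the name-to-value lookups in a HashMap and mirror the entries in a TreeMap keyed by value, so the top N can be read straight off the descending view.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;

    public class TopNByValue {
        private final Map<String, Integer> valueByName = new HashMap<>();           // O(1) lookups by name
        private final TreeMap<Integer, Set<String>> namesByValue = new TreeMap<>(); // kept sorted by value

        public void put(String name, int value) {                                   // O(log n)
            Integer old = valueByName.put(name, value);
            if (old != null) {                                                       // drop the stale entry
                Set<String> names = namesByValue.get(old);
                names.remove(name);
                if (names.isEmpty()) namesByValue.remove(old);
            }
            namesByValue.computeIfAbsent(value, v -> new HashSet<>()).add(name);
        }

        public List<String> topN(int n) {                                            // walks out at most n entries
            List<String> result = new ArrayList<>();
            for (Set<String> names : namesByValue.descendingMap().values()) {
                for (String name : names) {
                    if (result.size() == n) return result;
                    result.add(name);
                }
            }
            return result;
        }
    }

With the sample data above, topN(3) would return rachael, john, jenna.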
--- Update to match your update ---
No, you cannot iterate over a HashMap in sorted order in O(n) time. Doing so would assume that you had a list, and that list would have to already be sorted. With a raw HashMap, you would have to search the entire map for the next "lower" value. Searching only part of the map would not do, because the one element you didn't check could be the correct value.
Now, there are some data structures that make a lot of trade offs which might get you closer. If you want to roll your own, perhaps a custom Fibonacci heap can give you an amortized performance close to what you wish, but it cannot guarantee a worst-case performance. In any case, some operations (like extract-min) will still require O(log n) performance.
Seems they both let you retrieve the minimum, which is what I need for Prim's algorithm, and force me to remove and reinsert a key to update its value. Is there any advantage of using one over the other, not just for this example, but generally speaking?
Generally speaking, it is less work to track only the minimum element, using a heap.
A tree is more organized, and it requires more computation to maintain that organization. But if you need to access any key, and not just the minimum, a heap will not suffice, and the extra overhead of the tree is justified.
There are 2 differences I would like to point out (and this may be more relevant to Difference between PriorityQueue and TreeSet in Java? as that question is deemed a dup of this question).
(1) PriorityQueue can have duplicates, whereas TreeSet cannot. So in a TreeSet, if your comparator deems two elements equal, the TreeSet will keep only one of those two elements and throw away the other.
(2) A TreeSet iterator traverses the collection in sorted order, whereas a PriorityQueue iterator does NOT traverse in sorted order. For a PriorityQueue, if you want to get the items in sorted order, you have to drain the queue by calling remove() repeatedly.
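A tiny demonstration of both points (the PriorityQueue's internal layout shown in the comments is typical, but only the poll() order is guaranteed):

    import java.util.PriorityQueue;
    import java.util.TreeSet;

    public class IterationOrderDemo {
        public static void main(String[] args) {
            PriorityQueue<Integer> pq = new PriorityQueue<>();
            TreeSet<Integer> ts = new TreeSet<>();
            for (int x : new int[] {3, 1, 4, 1, 5}) {
                pq.add(x);
                ts.add(x);
            }

            System.out.println(pq);   // e.g. [1, 1, 4, 3, 5] - heap order, not sorted, duplicates kept
            System.out.println(ts);   // [1, 3, 4, 5] - sorted, and the duplicate 1 was dropped

            while (!pq.isEmpty()) {
                System.out.print(pq.poll() + " ");   // 1 1 3 4 5 - sorted only by draining the queue
            }
        }
    }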
I am assuming that the discussion is limited to Java's implementation of these data structures.
Totally agree with Erickson that a priority queue only gives you the minimum/maximum element.
In addition, because the priority queue is less powerful at maintaining the total order of the data, it has an advantage in some special cases. If you want to track the biggest M elements in an array of N, the time complexity is O(N log M) and the space complexity is O(M). If you do it with a map, the time complexity is O(N log N) and the space is O(N). This is quite fundamental, and it is why a priority queue is the right choice in some cases, for example when M is just a constant like 10.
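A sketch of that O(N log M) / O(M) approach, keeping the M largest values in a min heap of size M (the sample data and names are made up):

    import java.util.PriorityQueue;

    public class TopMOfN {
        // Returns a heap containing the M largest values of data, using O(M) extra space.
        static PriorityQueue<Integer> topM(int[] data, int m) {
            PriorityQueue<Integer> keep = new PriorityQueue<>();   // min heap of the current top M
            for (int x : data) {                                   // N iterations...
                keep.add(x);                                       // ...each O(log M)
                if (keep.size() > m) {
                    keep.poll();                                   // evict the smallest of the candidates
                }
            }
            return keep;
        }

        public static void main(String[] args) {
            System.out.println(topM(new int[] {32, 3, 15, 23, 41}, 3));   // contains 23, 32, 41
        }
    }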
Rule of thumb about it is:
TreeMap maintains all elements in order. (So, intuitively, it takes more time to construct and maintain.)
PriorityQueue only guarantees min or max. It's less expensive but less powerful.
It all depends on what you want to achieve. Here are the main points to consider before you choose one over the other.
PriorityQueue allows duplicates (i.e. elements with the same priority), while TreeMap doesn't allow duplicate keys.
Insertion into a PriorityQueue is O(log n), with an occasional O(n) hit when it has to grow its backing array; insertion into a TreeMap is O(log n) (it is based on a red-black tree).
PriorityQueue is backed by an array, while TreeMap's nodes are linked to each other, so PriorityQueue's contains() method takes O(n) time while TreeMap's takes O(log n).
One of the differences is that remove(Object) and contains(Object) are linear O(N) in a normal heap based PriorityQueue (like Oracle's), but O(log(N)) for a TreeSet/Map.
So if you have a large number of elements and do a lot of remove(Object) or contains(Object), then a TreeSet/Map may be faster.
I may be late to this answer but still.
They have their own use-cases, in which either one of them is a clear winner.
For Example:
1: https://leetcode.com/problems/my-calendar-i - TreeMap is the one you are looking for here.
2: https://leetcode.com/problems/top-k-frequent-words - you don't need the overhead of keys and values.
So my answer would be: look at the use-case and see whether it can be done without keys and values. If yes, go for a PriorityQueue; otherwise move to a TreeMap.
It depends on how you implement your priority queue. According to Cormen's book (2nd ed.), the fastest result is with a Fibonacci heap.
I find TreeMap to be useful when there is a need to do something like the following (a short sketch follows the list):
find the minimal/least key that is greater than or equal to some value, using ceilingKey()
find the maximum/greatest key that is less than or equal to some value, using floorKey()
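A quick sketch of those two lookups (the map contents are arbitrary):

    import java.util.TreeMap;

    public class CeilingFloorDemo {
        public static void main(String[] args) {
            TreeMap<Integer, String> map = new TreeMap<>();
            map.put(10, "a");
            map.put(20, "b");
            map.put(30, "c");
            System.out.println(map.ceilingKey(15));   // 20, the least key >= 15
            System.out.println(map.floorKey(15));     // 10, the greatest key <= 15
        }
    }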
If the above is not required, and it's mostly about having a quick option to retrieve the min/max - PriorityQueue might be preferred.
Their difference on time complexity is stated clearly in Erickson's answer.
On space complexity, although a heap and a TreeMap both take O(n) space, building them in an actual program takes different amounts of space and effort.
Say you have an array of numbers: you can build a heap in place in O(n) time with constant extra space. If you build a TreeMap from the same array, you need O(n log n) time and O(n) extra space.
One more thing to take into consideration: PriorityQueue offers an API (peek()) that returns the max/min value without removing it in O(1), while for a TreeMap this will still cost you O(log n).
This can be a clear advantage in read-only scenarios where you are only interested in the value at the top.
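For instance (a trivial sketch, assuming natural ordering so the "top" is the minimum):

    import java.util.PriorityQueue;
    import java.util.TreeMap;

    public class PeekVsFirstKey {
        public static void main(String[] args) {
            PriorityQueue<Integer> pq = new PriorityQueue<>();
            TreeMap<Integer, Integer> tm = new TreeMap<>();
            for (int x : new int[] {8, 2, 5}) {
                pq.add(x);
                tm.put(x, x);
            }
            System.out.println(pq.peek());       // 2: current head, O(1), nothing is removed
            System.out.println(tm.firstKey());   // 2: same value, reached by an O(log n) walk down the tree
        }
    }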
I'm looking for a collection that offers list semantics, but also allows array semantics. Say I have a list with the following items:
apple orange carrot pear
then my container array would:
container[0] == apple
container[1] == orange
container[2] == carrot
Then say I delete the orange element:
container[0] == apple
container[1] == carrot
I want to collapse gaps in the array without having to do an explicit resize, i.e. if I delete container[0], then the container collapses, so that container[1] is now mapped as container[0], container[2] as container[1], etc. I still need to access the list with array semantics, and null values aren't allowed (in my particular use case).
EDIT:
To answer some questions: I know O(1) is impossible, but I don't want a container with array semantics approaching O(log N). That sort of defeats the purpose; I could just iterate the list.
I originally had some verbiage here on sort order; I'm not sure what I was thinking at the time (Friday beer-o-clock most likely). One of the use-cases is a Qt list that contains images: deleting an image from the list should collapse the list, not necessarily take the last item from the list and throw it in its place. In this case, then, I do want to preserve list semantics.
The key differences I see as separating list and array:
Array - constant-time access
List - arbitrary insertion
I'm also not overly concerned if rebalancing invalidates iterators.
You could do an ArrayList/Vector (Java/C++) and when you delete, instead swap the last element with the deleted element first. So if you have A B C D E, and you delete C, you'll end up with A B E D. Note that references to E will have to look at 2 instead of 4 now (assuming 0 indexed) but you said sort order isn't a problem.
I don't know if it handles this automatically (optimized for removing from the end easily), but if not, you could easily write your own array-wrapper class.
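In Java the wrapper is only a few lines; a sketch (the class name is mine):

    import java.util.ArrayList;
    import java.util.List;

    public class SwapRemoveList<E> {
        private final List<E> items = new ArrayList<>();

        public void add(E e)  { items.add(e); }          // amortized O(1)
        public E get(int i)   { return items.get(i); }   // O(1)
        public int size()     { return items.size(); }

        // O(1) removal: overwrite the removed slot with the last element, then drop the tail.
        // As noted above, this changes the index of whatever element used to be last.
        public void removeAt(int i) {
            int last = items.size() - 1;
            items.set(i, items.get(last));
            items.remove(last);
        }
    }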
O(1) might be too much to ask for.
Is O(logn) insert/delete/access time ok? Then you can have a balanced red-black tree with order statistics: http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-seven/
It allows you to insert/delete/access elements by position.
As Micheal was kind enough to point out, Java's TreeMap supports it: http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeMap.html
Also, not sure why you think O(log N) will be as bad as iterating the list!
From my comments to you on some other answer:
For 1 million items, using balanced red-black trees, the worst case is 2 log(n+1), i.e. ~40. You need to do no more than 40 compares to find your element, and that is the absolute worst case. Red-black trees also cater to the hole/gap disappearing. This is miles ahead of iterating the list (~1/2 million comparisons on average!).
With AVL trees instead of red-black trees, the worst-case guarantee is even better: 1.44 log(n+1), which is ~29 for a million items.
You should use a HashMap; then you will have O(1) expected insertion time. Just map from integers to whatever.
If the order isn't important, then a vector will be fine. Access is O(1), as is insertion using push_back, and removal like this:
swap(container[victim], container.back());
container.pop_back();
EDIT: just noticed the question is tagged C++ and Java. This answer is for C++ only.
I'm not aware of any data structure that provides O(1) random access, insertion, and deletion, so I suspect you'll have to accept some tradeoffs.
LinkedList in Java provides O(1) insertion/deletion at the head or tail of the list, but random access is O(n).
ArrayList provides O(1) random access, but insertion/deletion is only O(1) at the tail of the list. If you insert/delete in the middle, it has to move the remaining elements around. On the bright side, it uses System.arraycopy to move them, which copies whole blocks of memory instead of processing each element individually; that is still linear in the number of elements moved, but the constant factor is very small on modern architectures.
Since you seem to want to insert at arbitrary positions quickly, a std::deque is worth a look in C++. Unlike std::vector, a deque (double-ended queue) is implemented as a list of memory pages, i.e. a chunked vector. Insertion and deletion at either end are constant-time operations; in the middle they are still linear, but typically cheaper than in a vector, because only the shorter side needs to be shifted. The data structure also provides random access (“array access”) in near-constant time: it has to find the correct page, but that is a very fast operation in practice.
Java’s standard container library doesn’t offer anything similar but the implementation is straightforward.
Does the data structure described at http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html do anything like what you want?
What about ConcurrentSkipListMap?
It does O(log N), doesn't it?