Reading this Oracle tutorial, I came across this explanation of the difference between the range-view operations of a List and the ones provided by the SortedSet interface.
Here is the bit I'm interested in:
The range-view operations are somewhat analogous to those provided by
the List interface, but there is one big difference. Range views of a
sorted set remain valid even if the backing sorted set is modified
directly. This is feasible because the endpoints of a range view of a
sorted set are absolute points in the element space rather than
specific elements in the backing collection, as is the case for lists.
Can anybody explain the bold part in, let's say, other words?
Thanks in advance.
Let's say you have a list and a set both containing the integers 11, 13, 15 and 17.
You could write set.subSet(12, 15) to construct a view, and then insert 12 into the original set. If you do this, 12 will appear in the view.
This is not possible with the list. Even though you can construct a view, the moment you modify the original list structurally (e.g. insert an element), the view becomes invalid.
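To make this concrete, here's a minimal sketch (class and variable names are mine; List.of needs Java 9+):

import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class RangeViewDemo {
    public static void main(String[] args) {
        SortedSet<Integer> set = new TreeSet<>(List.of(11, 13, 15, 17));
        SortedSet<Integer> setView = set.subSet(12, 15); // endpoints are values, not positions
        set.add(12);                                     // modify the backing set directly
        System.out.println(setView);                     // [12, 13] -- the view picked it up

        List<Integer> list = new ArrayList<>(List.of(11, 13, 15, 17));
        List<Integer> listView = list.subList(1, 3);     // endpoints are positions
        list.add(12);                                    // structural modification
        // listView.get(0);  // would now throw ConcurrentModificationException
    }
}

The subSet endpoints 12 and 15 describe points in the element space, so the view reflects whatever the backing set currently holds in that range; the subList endpoints describe positions, which a structural change invalidates.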
The short answer is that a sorted set's range view is defined by values, whereas a list's range view is defined by, essentially, pointers (indexes). Changes to the underlying list shift those indexes, which makes holding views of the list for long problematic. Since the set is sorted, the range boundaries are fixed values in the element space rather than references to specific objects in the collection. This means the view can't become invalid when an insertion or deletion occurs within the range while you hold it.
More technically, the definition of range in this context:
A range, sometimes known as an interval, is a convex (contiguous) portion of a particular domain. Convexity means that for any a <= b <= c, range.contains(a) && range.contains(c) implies that range.contains(b). Ranges may extend to infinity; for example, the range "x > 3" contains arbitrarily large values -- or may be finitely constrained, for example "2 <= x < 5".
Having read this question and its answers, I came to the conclusion that there are no standard implementations of those two algorithms. Some background first, though:
Most of us are familiar with binarySearch. The idea is, given a sorted array (or Collection, if using the search facilities from that class), it efficiently (in logarithmic, O(log n), time) finds the position of a given element in the array/collection. The particular link I provided contains the following documentation:
[...] If the array contains multiple elements with the specified value, there is no guarantee which one will be found.
Sometimes, we do not care whether we found (or failed to find) the first or the last occurrence of the element we're interested in. But what if we do care?
If we do care, we use variations of the binary search called lower bound and upper bound. They return the first and the last¹ occurrence of the given element, respectively.
I come from C++ background and I really love the fact that I can use std::lower_bound and std::upper_bound (and their member function versions for containers that maintain the ordering, e.g. std::map or std::set) on containers.
The simplest use case is, given a sorted collection, to determine how many elements equal to some x there are. This answer from the question I originally linked contains the following:
[After performing a binarySearch] Then you continue iterating linearly until you hit to the end of the equal range.
The problem is that this operation is linear and, for collections with random-access, we can do much better - we can use lower bound and upper bound, then subtract the returned indexes and we get the result in logarithmic, rather than linear, time.
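For a random-access sorted list, such bounds are easy to sketch by hand. The helper names lowerBound and upperBound below are mine, mirroring the C++ conventions; this is not a standard Java API:

import java.util.List;

public final class Bounds {
    // Index of the first element >= key (std::lower_bound convention).
    static <T extends Comparable<? super T>> int lowerBound(List<T> list, T key) {
        int lo = 0, hi = list.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (list.get(mid).compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Index of the first element > key (std::upper_bound convention).
    static <T extends Comparable<? super T>> int upperBound(List<T> list, T key) {
        int lo = 0, hi = list.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (list.get(mid).compareTo(key) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}

upperBound(list, x) - lowerBound(list, x) then gives the number of occurrences of x in logarithmic time.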
Essentially, it amazes me that there could be no upper- and lower-bound algorithms implemented in Java. I get that I can implement them easily myself, but, for example, what if my data is stored in a TreeMap or TreeSet? They are not random-access, but given their implementation, upper and lower bounds could easily be implemented as their methods.
Finally, my question is - are there implementations of upper bound and/or lower bound in Java, preferably efficient regarding TreeSet and TreeMap?
¹ That depends on the convention, though. In C++, upper bound returns the first element that is greater than the element looked for.
Aren't TreeSet.floor() and TreeSet.ceiling() what you're asking for?
Or, alternatively, higher() and lower(), if you wish to exclude equality.
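A quick sketch of how those methods line up with the C++ conventions:

import java.util.List;
import java.util.TreeSet;

public class NavigableDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>(List.of(11, 13, 15, 17));
        System.out.println(set.ceiling(13)); // 13 -- smallest element >= 13 (lower-bound style)
        System.out.println(set.higher(13));  // 15 -- smallest element >  13 (upper-bound style)
        System.out.println(set.floor(12));   // 11 -- largest element <= 12
        System.out.println(set.lower(13));   // 11 -- largest element <  13
    }
}

Since a set holds no duplicates, the counting use case collapses to contains(); for a TreeMap, the corresponding methods are ceilingKey(), higherKey(), floorKey() and lowerKey().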
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 int
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, order of O(1), and very quick to retrieve next neighbor. No apparent way to get previous neighbor without reverse sorting the set. Boxed Integer objects only.
int[]: very easy on memory because no boxing required, very quick to get previous and next neighbor, retrieval of an item is O(n) though because index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap (sketched below):
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
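For concreteness, a rough sketch of that combo (names are illustrative, not from any library):

import java.util.HashMap;
import java.util.Map;

final class NeighborIndex {
    private final int[] order;                                     // externally supplied order
    private final Map<Integer, Integer> indexOf = new HashMap<>(); // value -> position

    NeighborIndex(int[] externallyOrdered) {
        this.order = externallyOrdered.clone();
        for (int i = 0; i < order.length; i++) {
            indexOf.put(order[i], i); // boxes both key and value
        }
    }

    /** Preceding neighbor; throws if the value is unknown or first. */
    int previous(int value) { return order[indexOf.get(value) - 1]; }

    /** Following neighbor; throws if the value is unknown or last. */
    int next(int value) { return order[indexOf.get(value) + 1]; }
}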
What I like:
neighbor lookup is two O(1) operations (one hash lookup, one array access)
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
I'd be curious to hear of better solutions.
One solution is to sort the array when you add elements. That way, the previous element is always at i-1, and to locate a value you can use a binary search, which is O(log N).
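A minimal sketch of that option, assuming a natural ordering is acceptable (which, given the external-ordering requirement above, it may not be):

import java.util.Arrays;

class SortedNeighbors {
    public static void main(String[] args) {
        int[] values = {3, 8, 15, 42};           // kept sorted as elements are added
        int i = Arrays.binarySearch(values, 15); // O(log N); non-negative when present
        System.out.println(values[i - 1]);       // previous neighbor: 8
        System.out.println(values[i + 1]);       // next neighbor: 42
    }
}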
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't span the full 32-bit range, you can make the lookup faster by having a second array where each value is the index into the first array, and the index is the value you're looking for.
More options: You could look at bit sets but that again depends on the range which the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, a List is O(n) for unsorted lookup, while a Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which actually gives worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.
Suppose that I have a collection of 50 million different strings in a Java ArrayList. Let foo be a set of 40 million arbitrarily chosen (but fixed) strings from the previous collection. I want to know the index of every string in foo in the ArrayList.
An obvious way to do this would be to iterate through the whole ArrayList until we find a match for the first string in foo, then for the second one, and so on. However, this solution would take an extremely long time (consider also that 50 million was an arbitrarily large number I picked for the example; the collection could be on the order of hundreds of millions or even billions, but it is given from the beginning and remains constant).
I thought then of using a Hashtable of fixed size 50 million in order to determine the index of a given string in foo using someStringInFoo.hashCode(). However, from my understanding of Java's Hashtable, it seems that this will fail if there are collisions as calling hashCode() will produce the same index for two different strings.
Lastly, I thought about first sorting the ArrayList with the sort(List<T> list) in Java's Collections and then using binarySearch(List<? extends T> list,T key,Comparator<? super T> c) to obtain the index of the term. Is there a more efficient solution than this or is this as good as it gets?
You need an additional data structure that is optimized for searching strings; it will map each string to its index. The idea is that you iterate over your original list, populating that data structure, and then iterate over your set, performing searches in it.
What structure should you choose?
There are three options worth considering:
Java's HashMap
TRIE
Java's IdentityHashMap
The first option is simple to implement but does not provide the best possible performance. Still, its population time of O(N * R) beats sorting the list, which is O(R * N * log N), and its search time beats searching a sorted String list (amortized O(R) compared to O(R * log N)), where R is the average length of your strings.
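A minimal sketch of the first option (the helper name is mine):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringIndexer {
    // Maps each string to its index in the original list;
    // for duplicates, the first occurrence wins.
    static Map<String, Integer> buildIndex(List<String> all) {
        Map<String, Integer> index = new HashMap<>(all.size() * 2); // pre-size to avoid rehashing
        for (int i = 0; i < all.size(); i++) {
            index.putIfAbsent(all.get(i), i);
        }
        return index;
    }
}

For each string s in foo, index.get(s) then returns its position in amortized O(R).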
The second option is always good for maps of strings, providing a guaranteed population time of O(R * N) for your case and a guaranteed worst-case search time of O(R). Its only disadvantage is that there is no out-of-the-box implementation in the Java standard libraries.
The third option is a bit tricky and suitable only for your case. In order to make it work, you need to ensure that the strings from the first list are literally reused in the second list (are the same objects). Using IdentityHashMap eliminates String's equals cost (the R above), as IdentityHashMap compares strings by address, taking only O(1). The population cost will be amortized O(N) and the search cost amortized O(1). So this solution provides the best performance and an out-of-the-box implementation. Note, however, that it will only work if there are no duplicates in the original list.
If you have any questions please let me know.
You can use a Java Hashtable with no problems. According to the Java Documentation "in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially."
I think you have a misconception about how hash tables work. Hash collisions do NOT ruin the implementation. A hash table is simply an array of linked lists. Each key goes through a hash function to determine the index in the array at which the element will be placed. If a hash collision occurs, the element is placed at the end of the linked list at that index of the hash-table array.
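As a toy illustration of that bucket idea (a sketch only, not how java.util.HashMap is actually implemented):

import java.util.LinkedList;

class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    void put(K key, V value) {
        for (Entry<K, V> e : bucket(key)) {
            if (e.key.equals(key)) { e.value = value; return; } // key already present: replace
        }
        bucket(key).add(new Entry<>(key, value)); // collision: chain at the same index
    }

    V get(K key) {
        for (Entry<K, V> e : bucket(key)) { // sequential search within the bucket
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    private LinkedList<Entry<K, V>> bucket(K key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }
}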
I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 numbers that do not have to be unique, with integers ranging from 1 to 100.
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case, as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
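For reference, that encode-and-check approach might look like this (names are mine):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

class SetRegistry {
    private final Set<String> seen = new HashSet<>(200_000); // pre-sized for ~100k sets

    /** Returns true if this set of numbers has NOT been seen before. */
    boolean addIfNew(int[] numbers) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);                      // order doesn't matter, so normalize
        StringJoiner key = new StringJoiner("-"); // e.g. "4-23-67-67-71"
        for (int n : sorted) key.add(Integer.toString(n));
        return seen.add(key.toString());          // HashSet.add returns false on duplicates
    }
}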
Thanks!
if you break it into pieces it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
this is already optimal as you have to touch every element once anyways :)
you might run into problems depending on the range of your random numbers.
e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
one tip: if you know the number of generated elements beforehand, it would be wise to initialize the hashset with the correct capacity (i.e. new HashSet<String>(100000))
p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language-specific tricks), your overall approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue already present
This way you will avoid sorting and conversion/formatting of your sets.
Just use a java.util.BitSet for each set, adding integers to it with the set(int bitIndex) method; you don't have to sort anything. Then check a HashMap for an already existing BitSet before adding a new BitSet to it; it will be really very fast. Don't ever sort values and toString them for that purpose if speed is important.
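A sketch of that idea, using a HashSet for the membership check. One caveat worth noting: duplicates within one set collapse to a single bit, so {4, 4, 7} and {4, 7} come out equal, which matters here because the generated numbers need not be unique:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

class BitSetRegistry {
    private final Set<BitSet> seen = new HashSet<>();

    /** Returns true if this set of numbers has NOT been seen before. */
    boolean addIfNew(int[] numbers) {
        BitSet bits = new BitSet(101);     // values range from 1 to 100
        for (int n : numbers) bits.set(n); // no sorting, no string building
        return seen.add(bits);             // BitSet has content-based equals/hashCode
    }
}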
I'm looking for a collection that offers list semantics, but also allows array semantics. Say I have a list with the following items:
apple orange carrot pear
then my container array would:
container[0] == apple
container[1] == orange
container[2] == carrot
Then say I delete the orange element:
container[0] == apple
container[1] == carrot
I want to collapse gaps in the array without having to do an explicit resizing, i.e. if I delete container[0], then the container collapses, so that container[1] is now mapped to container[0], container[2] to container[1], etc. I still need to access the list with array semantics, and null values aren't allowed (in my particular use case).
EDIT:
To answer some questions - I know O(1) is impossible, but I don't want a container with array semantics approaching O(log N); that sort of defeats the purpose, since I could just iterate the list.
I originally had some verbiage here on sort order; I'm not sure what I was thinking at the time (Friday beer-o-clock, most likely). One of the use cases is a Qt list that contains images - deleting an image from the list should collapse the list, not take the last item from the list and throw it in its place. In this case, yes, I do want to preserve list semantics.
The key differences I see as separating list and array:
Array - constant-time access
List - arbitrary insertion
I'm also not overly concerned if rebalancing invalidates iterators.
You could do an ArrayList/Vector (Java/C++) and when you delete, instead swap the last element with the deleted element first. So if you have A B C D E, and you delete C, you'll end up with A B E D. Note that references to E will have to look at 2 instead of 4 now (assuming 0 indexed) but you said sort order isn't a problem.
I don't know if it handles this automatically (optimized for removing from the end easily) but if it's not you could easily write your own array-wrapper class.
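A sketch of such a wrapper's removal operation:

import java.util.List;

final class SwapRemove {
    // O(1) delete at the cost of element order: A B C D E minus C becomes A B E D.
    static <T> void swapRemove(List<T> list, int index) {
        int last = list.size() - 1;
        list.set(index, list.get(last)); // overwrite the victim with the last element
        list.remove(last);               // removing the tail shifts nothing
    }
}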
O(1) might be too much to ask for.
Is O(logn) insert/delete/access time ok? Then you can have a balanced red-black tree with order statistics: http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-seven/
It allows you to insert/delete/access elements by position.
As Micheal was kind enough to point out, Java Treemap supports it: http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeMap.html
Also, not sure why you think O(log N) will be as bad as iterating the list!
From my comments to you on some other answer:
For 1 million items, using balanced red-black trees, the worst case is 2 log(n+1), i.e. ~40. You need to do no more than 40 compares to find your element, and that is the absolute worst case. Red-black trees also cater to the hole/gap disappearing. This is miles ahead of iterating the list (~1/2 million compares on average!).
With AVL trees instead of red-black trees, the worst-case guarantee is even better: 1.44 log(n+1), which is ~29 for a million items.
You should use a HashMap; then you will have O(1) expected insertion time. Just map from integers to whatever.
If the order isn't important, then a vector will be fine. Access is O(1), as is insertion using push_back, and removal like this:
swap(container[victim], container.back());
container.pop_back();
EDIT: just noticed the question is tagged C++ and Java. This answer is for C++ only.
I'm not aware of any data structure that provides O(1) random access, insertion, and deletion, so I suspect you'll have to accept some tradeoffs.
LinkedList in Java provides O(1) insertion/deletion at the head or tail of the list, but random access is O(n).
ArrayList provides O(1) random access, but insertion/deletion is only O(1) at the tail of the list. If you insert/delete in the middle of the list, it has to move the remaining elements around. On the bright side, it uses System.arraycopy to move them; that is still O(n) element moves, but as a single block copy it is very fast in practice on modern architectures, since it copies whole blocks of memory around instead of processing each element individually.
Since you seem to want to insert at arbitrary positions in (near) constant time, I think using a std::deque is your best bet in C++. Unlike std::vector, a deque (double-ended queue) is implemented as a list of memory pages, i.e. a chunked vector. Insertion and deletion at either end are constant-time operations, and insertion or deletion at an arbitrary position only has to shift elements toward the nearer end, so in practice it does about half the work of a vector. The data structure also provides random access (“array access”) in near-constant time: it does have to locate the correct page, but that is a very cheap operation.
Java’s standard container library doesn’t offer anything similar but the implementation is straightforward.
Does the data structure described at http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html do anything like what you want?
What about ConcurrentSkipListMap? It does O(log N), doesn't it?