I can't figure out how to implement a special kind of hash table.
The idea is that the hash table gives an approximate
match. An ordinary hash table (such as the one found in java.util)
just gives an exact map, such that:
Hashtable h = new Hashtable();
...
x = h.get(y);
Here x is the result of applying the map h to the argument y,
i.e. in mathematical terms it is simply a function,
x = h(y). Now for the approximate match, what about a
data structure that quickly gives me:
x = h(k) where k = max { z <= y | h(z) != null }
The problem is that k can be very far away from the given y. For example
y could be 2000, and the next occupied slot k could be 1000. A
linear search would be costly; the data structure should do the job
more quickly.
I know how to do it with a tree (*), but can something hash-based
also work? Or maybe combine some tree and hash properties in the sought-after
data structure? Some data structure that tends toward O(1) access?
Bye
(*) You can use a tree ordered by y and find the next element below or equal to y.
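For reference, this tree approach maps directly onto java.util.TreeMap, whose floorEntry() returns the mapping for the greatest key less than or equal to a given key in O(log n); a minimal sketch (keys and values are illustrative):

import java.util.Map;
import java.util.TreeMap;

public class FloorLookup {
    public static void main(String[] args) {
        TreeMap<Integer, String> h = new TreeMap<>();
        h.put(500, "at 500");
        h.put(1000, "at 1000");

        int y = 2000;
        // Greatest key z <= y that has a mapping, found in O(log n).
        Map.Entry<Integer, String> e = h.floorEntry(y);
        String x = (e != null) ? e.getValue() : null;   // "at 1000"
        System.out.println(x);
    }
}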
What you describe is known as spatial hashing. Keep in mind it has to be tailored to your specific domain.
It can be used when the hash tells you something about the logical arrangement of the objects, i.e. when |hash(a) - hash(b)| < |hash(a) - hash(c)| means b is closer/more similar to a than c is.
The basic idea is that you divide the space into buckets (e.g. drop the least significant digits of the hash -- the naive approach) and your spatial hash is this bucket ID. You have to take care of the edge cases where objects are very near each other but lie on opposite sides of a bucket boundary (e.g. h(1999) = 1 but h(2000) = 2). You can solve this with two overlapping hashes, keeping two separate hash maps and querying both of them, or by looking into the neighboring buckets, etc.
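A minimal sketch of the naive bucket idea (integer keys assumed; the bucket size of 1000 and the class/method names are illustrative, and a real implementation needs the boundary handling discussed above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpatialBuckets {
    private static final int BUCKET_SIZE = 1000;   // domain-specific choice
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();

    void add(int key) {
        // Bucket ID = key with the least significant digits dropped.
        buckets.computeIfAbsent(key / BUCKET_SIZE, b -> new ArrayList<>()).add(key);
    }

    // Candidate "near" keys: the query's own bucket plus its two neighbors.
    List<Integer> near(int query) {
        List<Integer> result = new ArrayList<>();
        int b = query / BUCKET_SIZE;
        for (int i = b - 1; i <= b + 1; i++) {
            result.addAll(buckets.getOrDefault(i, Collections.emptyList()));
        }
        return result;
    }
}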
As I said in the beginning, this has to be thought through very well.
The tree (e.g. KD-tree for higher dimensions) isn't so demanding in the design phase and is generally a more convenient approach to nearest neighbor(s) querying.
The specific formula you give suggests you want a set that can retrieve the greatest item less than or equal to a given input.
One simple approach to achieving that would be to keep a sorted list of the items, perform a binary search to locate the position at which the given element would be inserted, and then return the element at or immediately before that position.
As always, any set can be converted into a map by using a pair object to wrap the key-value pair, or by maintaining a parallel data structure for the values.
For an array-based approach, the runtime will be O(log n) for retrieval and O(n) for insertion of a single element. If 'add all' sorts the added elements and then merges them, it can be O(n log n).
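A sketch of that binary-search approach over a sorted int[] (the Integer.MIN_VALUE sentinel for "no such element" is an arbitrary choice here); java.util.Arrays.binarySearch returns -(insertionPoint) - 1 on a miss:

import java.util.Arrays;

public class SortedFloor {
    // Greatest element <= y in a sorted array, or Integer.MIN_VALUE if there is none.
    static int floor(int[] sorted, int y) {
        int i = Arrays.binarySearch(sorted, y);
        if (i >= 0) return sorted[i];                 // exact hit
        int insertionPoint = -i - 1;                  // index of the first element > y
        return insertionPoint > 0 ? sorted[insertionPoint - 1] : Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        int[] keys = {10, 500, 1000, 3000};
        System.out.println(floor(keys, 2000));        // prints 1000
    }
}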
It's not possible [1] to have a constant-time algorithm that can answer what the first element less than a given element is using a hashing approach. A good hashing algorithm spreads out similar (but non-equal) items, to avoid having many similar items fall into the same bucket and destroy the desired constant-time retrieval behavior. This means the elements of a hash set (or map) are very deliberately not even remotely close to sorted order; they are as close to randomly distributed as we can achieve while using an efficient, repeatable hashing algorithm.
[1] Of course, proving that it's not possible is difficult, since one can't easily prove that there isn't a simple repeatable constant-time request that will reliably convince an oracle (or God, if God were that easy to manipulate) to give you the answer to the question you want, but it seems unlikely.
Related
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 int
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, O(1), and very quick to retrieve the next neighbor. No apparent way to get the previous neighbor without reverse sorting the set. Boxed Integer objects only.
int[]: very easy on memory because no boxing is required, very quick to get the previous and next neighbor, but retrieval of an item is O(n) because the index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap:
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
What I like:
neighbor lookup is two operations (one hash lookup plus one array access), effectively O(1)
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
I'd be curious to hear of better solutions.
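For reference, a minimal sketch of the int[] + HashMap combination described above (class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class NeighborLookup {
    private final int[] ordered;                                    // the externally supplied order
    private final Map<Integer, Integer> indexOf = new HashMap<>();  // value -> position (boxes twice, as noted)

    NeighborLookup(int[] ordered) {
        this.ordered = ordered;
        for (int i = 0; i < ordered.length; i++) {
            indexOf.put(ordered[i], i);
        }
    }

    // Previous neighbor in the external order, or null at the start / if absent.
    Integer previous(int value) {
        Integer i = indexOf.get(value);
        return (i == null || i == 0) ? null : ordered[i - 1];
    }

    // Next neighbor in the external order, or null at the end / if absent.
    Integer next(int value) {
        Integer i = indexOf.get(value);
        return (i == null || i == ordered.length - 1) ? null : ordered[i + 1];
    }
}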
One solution is to sort the array when you add elements. That way, the previous element is always at i-1, and to locate a value you can use a binary search, which is O(log N).
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't span the full 32-bit range, you can make the lookup faster by having a second array where each value is the index into the first array and the index is the value you're looking for.
More options: You could look at bit sets but that again depends on the range which the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, a List is O(n) for unsorted lookup, while a Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which can actually give worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.
Basically I'm looking for the best data structure in Java in which I can store pairs and retrieve the top N elements by value. I'd like to do this in O(n) time, where n is the number of entries in the data structure.
Example input would be:
<"john", 32>
<"dave", 3>
<"brian", 15>
<"jenna", 23>
<"rachael", 41>
and if N=3, I should be able to return rachael, john, jenna in descending order.
If I use some kind of HashMap, insertion is fast, but retrieving entries by order gets expensive.
If I use some data structure that keeps things ordered, then insertion becomes expensive while retrieval is cheaper. I was not able to find a data structure that does both very well and very fast.
Any input is appreciated. Thanks.
[updated]
Let me ask the question another way, if that makes it clearer.
I know I can insert into a HashMap in constant O(1) time.
Now, how can I retrieve the elements sorted by value in O(n) time, where n = number of entries in the data structure? Hope that makes sense.
If you want to sort, you have to give up constant O(1) time.
That is because, unlike inserting an unsorted key/value pair, sorting will minimally require you to compare the new entry to something, and odds are to a number of somethings. Once you have an algorithm that requires more time with more entries (due to more comparisons), you have overshot "constant" time.
If you can do better, then by all means, do so! There is a Dijkstra Prize awaiting you, if not a Fields Medal to boot.
Don't despair, you can still do the key part with a HashMap and the sorting part with a tree-like implementation; that will give you O(log n). TreeMap is probably what you want.
--- Update to match your update ---
No, you cannot iterate over a HashMap in sorted order in O(n) time. To do so would assume that you had a list, but that list would have to already be sorted. With a raw HashMap, you would have to search the entire map for the next "lower" value. Searching only part of the map would not do, because the one element you didn't check could be the correct value.
Now, there are some data structures that make a lot of trade offs which might get you closer. If you want to roll your own, perhaps a custom Fibonacci heap can give you an amortized performance close to what you wish, but it cannot guarantee a worst-case performance. In any case, some operations (like extract-min) will still require O(log n) performance.
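As a concrete illustration of the heap idea (using java.util.PriorityQueue as a bounded min-heap rather than a custom Fibonacci heap), the top N entries by value can be collected in O(n log N), which is close to linear when N is small:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class TopN {
    // Keeps the N entries with the largest values; the result comes out in heap order, not sorted.
    static PriorityQueue<Map.Entry<String, Integer>> topN(Map<String, Integer> m, int n) {
        Comparator<Map.Entry<String, Integer>> byValue = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byValue);  // min-heap
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();   // evict the current smallest
        }
        return heap;
    }

    public static void main(String[] args) {
        Map<String, Integer> pairs = new HashMap<>();
        pairs.put("john", 32); pairs.put("dave", 3); pairs.put("brian", 15);
        pairs.put("jenna", 23); pairs.put("rachael", 41);
        System.out.println(topN(pairs, 3));   // contains rachael, john and jenna, in heap order
    }
}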
I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive, together with the fast-removal requirement.
An O(1)-insertion collection won't have an O(1) max, as the collection is unsorted. An O(1)-max collection has to be sorted, so the insert is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on insert; max and insert would then be O(1). Removal will be O(n) though, as you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
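For example, a minimal sketch of both options (each with O(log n) insertion; TreeSet.last() and PriorityQueue.peek() return the maximum):

import java.util.Collections;
import java.util.PriorityQueue;
import java.util.TreeSet;

public class MaxStructures {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        Collections.addAll(set, 5, 42, 17);
        System.out.println(set.last());      // 42; insert/remove are O(log n)

        // Max-heap: offer/poll are O(log n), peek at the maximum is O(1).
        PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
        Collections.addAll(heap, 5, 42, 17);
        System.out.println(heap.peek());     // 42
    }
}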
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting; it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. Insert all elements of A into D.
2. Create an empty linked list L.
3. While D is not empty:
3.1. x <- D.max(); D.delete(x) -- all O(1) by assumption
3.2. L.insert_first(x) -- O(1)
4. Return L.
We have just created a sorting algorithm that runs in O(n), which is proven to be impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's O(log n) insertion and deletion are too slow; O(log n) time is practically constant for most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to the work a hash-based approach spends computing the hash.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go beyond, if you don't care making all of this a bit more complex.
If you can "waste" some memory and some programming efforts, you can use, at the same time, different data structures, combining the pro's of each one.
For example, I needed a sorted data structure but wanted O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but is almost O(1) when it's not too full and the hash function is good) and I got really good results.
For your specific case, I would go for a dynamic combination of a HashMap and a custom helper data structure. I have in mind something quite complex (hash map + variable-length priority queue), but I'll go with a simple example. Just keep all the stuff in the HashMap, and use a special field (currentMax) that only holds the maximum element in the map. When you insert() into your combined data structure, if the element you're inserting is greater than the current max, you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check whether it is equal to currentMax; if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
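A minimal sketch of that combination (using a HashSet of ints for brevity; the same idea applies to a HashMap's keys or values, and the names are illustrative):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class MaxTrackingSet {
    private final Set<Integer> items = new HashSet<>();
    private Integer currentMax;                          // null when empty

    void insert(int x) {                                 // O(1)
        items.add(x);
        if (currentMax == null || x > currentMax) currentMax = x;
    }

    Integer max() {                                      // O(1)
        return currentMax;
    }

    void remove(int x) {                                 // O(1) unless we removed the max
        items.remove(x);
        if (currentMax != null && currentMax == x) {
            currentMax = items.isEmpty() ? null : Collections.max(items);  // O(n) rescan
        }
    }
}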
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination of data structures. Do your tests, and learn how frequently the max changes. Data structures aren't easy, and you can make a difference if you really work at combining them instead of looking for a magic one that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast enough, then the following will work. Make a class that wraps a HashSet and maintain a reference to the maximum element on insertion. This gives you the two constant-time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like structure to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly says how many individual operations you need to invoke a certain method (be it get or set, for example). For example, to get an element of an ArrayList by index you need exactly 1 operation, which is O(1); if you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations, so in this case get has complexity O(n). The same applies to key-value stores such as a hashtable. There are implementations that give you O(log n) to get a value by its key, whereas a hashtable copes in O(1). Basically, getting a value from a hashtable by its key is really cheap.
Basically, hashtables have performance characteristics similar to arrays with numerical indices (cheap lookup, cheap appending; hashtables are unordered, and adding to them is cheap partly because of this), but are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers; you can't do that with, say, strings, at least not directly. Hashes allow reducing any object (that implements them) to a number, and then you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation this is not much of an issue.
So you can now associate any (hashable) object with any value. This has countless uses (although I have to admit I can't think of one that's applicable to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
Very simple. Hashtables are often called "associative arrays." Arrays let you access your data by index. Hash tables let you access your data by any other identifier, e.g. a name. For example:
"one" is associated with 1
"two" is associated with 2
So, when you have the word "one", you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping, from index to value.
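In code, that word-to-number mapping might look like this (a trivial illustration):

import java.util.HashMap;
import java.util.Map;

Map<String, Integer> numbers = new HashMap<>();
numbers.put("one", 1);
numbers.put("two", 2);
int value = numbers.get("one");   // 1 -- looked up by the key "one", not by a numeric index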
For n data elements:
Hashtables allow O(k) searches, where k usually depends only on the hashing function. This is better than the O(log n) of binary search (which requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, hashtables tend to take roughly 3n space.
Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:
Hello -> 0
World -> 1
Foo -> 2
Bar -> 3
Spam -> 4
Eggs -> 5
etc.
There are at least two straightforward ways to do it. With a hashmap:
HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.
Or with a list:
List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.
Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses the hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just a few elements (there must be some overhead in calculating the hash, right?).
Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?
A third option, and possibly my favorite, would be to use a trie.
I bet it beats the HashMap in performance (no collisions, plus the fact that computing the hash code is O(length of the string) anyway), and possibly also the List approach in some cases (such as when your strings have long common prefixes, since indexOf would waste a lot of time in the equals method).
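A minimal sketch of such a trie mapping String to int (restricted to ASCII keys for brevity, with -1 as the "no value" sentinel; both are assumptions for illustration):

public class StringIntTrie {
    private static final int NO_VALUE = -1;

    private static final class Node {
        final Node[] children = new Node[128];   // ASCII only, for brevity
        int value = NO_VALUE;
    }

    private final Node root = new Node();

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (node.children[c] == null) node.children[c] = new Node();
            node = node.children[c];
        }
        node.value = value;
    }

    // Returns the mapped value, or -1 if absent; O(length of key), no hashing, no collisions.
    int get(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            Node next = node.children[key.charAt(i)];
            if (next == null) return NO_VALUE;
            node = next;
        }
        return node.value;
    }
}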
When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:
Readability
The Map interface simply provides a more intuitive interface for this use case.
Optimization in the right place
I'd say that if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottleneck is.
A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.
A fifth option is to encapsulate the decision in a separate class altogether. In this case you could even implement it to change strategy at runtime as the list grows.
You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough so that the time to calculate the hash didn't swamp the List linear search.
I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.
Your question is totally correct on all points:
HashMaps are better (they use a hash)
Benchmarking Java code is hard
But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases but the benchmarking will give you the answer if it is or not.
One more option: a TreeMap is another map data structure, which uses a tree rather than a hash to access the entries. If you are doing benchmarking, you might as well benchmark that too.
Regarding benchmarking, one of the main problems is the garbage collector. However, if your test doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then just write a loop that gets N random elements, and time it; that should be reasonably reproducible and therefore informative.
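A rough sketch of that kind of timing loop (purely illustrative; the sizes are arbitrary, and for serious numbers a benchmark harness such as JMH is the better tool):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class LookupTiming {
    public static void main(String[] args) {
        int size = 1000, lookups = 1_000_000;
        List<String> list = new ArrayList<>();
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < size; i++) {
            String key = "key" + i;
            list.add(key);
            map.put(key, i);
        }

        // Pre-generate the probe keys so the timed loops allocate nothing.
        Random rnd = new Random(42);
        String[] probes = new String[lookups];
        for (int i = 0; i < lookups; i++) probes[i] = "key" + rnd.nextInt(size);

        long t0 = System.nanoTime();
        long sink = 0;                                    // prevents dead-code elimination
        for (String p : probes) sink += list.indexOf(p);
        long t1 = System.nanoTime();
        for (String p : probes) sink += map.get(p);
        long t2 = System.nanoTime();

        System.out.printf("list: %d ms, map: %d ms (%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}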
Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.
In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, and is therefore dependent on the key String. The equals operation will typically examine all of the characters when it returns true and considerably fewer when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equals for a "hit" and maybe up to 3 for a "miss".
In the case of a List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.
In addition to the trie alternative suggested by #aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to hash function and one call to String.equals. (And if you can assume that supplied key will always be one of the mapped strings, you can dispense with the call to equals.)
The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).
(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)
From what I can remember, the list method will be O(n), but it is quick to add items, as no computation occurs. You could get this down to O(log n) if you implemented a binary search or another searching algorithm. The hash is O(1), but it's slower to insert, since the hash needs to be computed every time you add an element.
I know that in .NET there's a special collection called HybridDictionary that does exactly this: it uses a list up to a point, then a hash. I think the crossover is around 10 elements, so this may be a good line in the sand.
I would say you're correct in your statement above, though I'm not 100% sure whether a list would be faster for small sets, or where the crossover point is.
I think a HashMap will always be better. If you have n strings each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).
When you do List#indexOf it iterates through the list (O(n)) and performs a comparison on each element (O(l)), to give O(nl) performance.
Java's HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements) doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).
As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1 and the HashMap will be faster for all greater values of r.
Note that you can set r when you construct the HashMap using this constructor (set the initialCapacity to r and the loadFactor argument to n/r for your given n and chosen r).
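For example (a small sketch; the values of n and r are purely illustrative, and note that HashMap rounds the requested capacity up to a power of two internally):

import java.util.HashMap;

int n = 10_000;   // expected number of entries
int r = 256;      // chosen number of buckets
// HashMap(initialCapacity, loadFactor): the resize threshold is initialCapacity * loadFactor, here roughly n.
HashMap<String, Integer> map = new HashMap<>(r, (float) n / r);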