I have a Java class which contains two Strings, for example the name of a person and the name of the group.
I also have a list of groups (about 10) and a list of persons (about 100). The list of my data objects is larger; it can exceed 10,000 items.
Now I would like to search through my data objects such that I find all objects having a person from the person list and a group from the group list.
My question is: what is the best data structure for the person and group list?
I could use an ArrayList and simply iterate until I find a match, but that is obviously inefficient. A HashSet or HashMap would be much better.
Are there even more efficient ways to solve this? Please advise.
Every data structure has pros and cons.
A Map is used to retrieve data in O(1) if you have an access key.
A List is used to maintain an order between elements, but accessing an element by key is not possible; you need to loop over the whole list, which takes O(n).
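For the lookup problem in the question, a minimal sketch using HashSet membership tests; the DataObject record and the sample names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Filter {
    // Hypothetical data class holding a person name and a group name.
    record DataObject(String person, String group) {}

    static List<DataObject> findMatches(List<DataObject> data,
                                        Set<String> persons,
                                        Set<String> groups) {
        List<DataObject> result = new ArrayList<>();
        for (DataObject d : data) {
            // Both contains() calls are O(1) on average, so one O(n) scan suffices.
            if (persons.contains(d.person()) && groups.contains(d.group())) {
                result.add(d);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> persons = new HashSet<>(List.of("alice", "bob"));
        Set<String> groups = new HashSet<>(List.of("admins"));
        List<DataObject> data = List.of(
            new DataObject("alice", "admins"),
            new DataObject("bob", "users"),
            new DataObject("carol", "admins"));
        System.out.println(findMatches(data, persons, groups).size()); // prints 1
    }
}
```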
A good data structure for storing and looking up strings is a Trie:
It's essentially a tree structure which uses characters or substrings to denote paths to follow.
Advantages over hash-maps (quote from Wikipedia):
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
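For illustration (a sketch, not code from the quoted material), a bare-bones trie supporting insert and lookup could look like:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: insertion and lookup are both O(m) in the key
// length m, with no hash function and no key collisions involved.
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    public void insert(String key) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            // Follow the path for c, creating missing nodes along the way.
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    public boolean contains(String key) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false; // path breaks off: key absent
        }
        return node.isWord;
    }
}
```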
I agree with @Davide's answer. If we want fast lookup and also need to maintain order, we can go for the LinkedHashMap implementation of Map.
By using it, we can have both things:
Data retrieval in O(1), if we have the access key.
It maintains insertion order, so while iterating we will get the data in the same order in which it was inserted.
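A quick sketch of both properties (the entries are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order; lookups stay O(1) on average.
        Map<String, String> byPerson = new LinkedHashMap<>();
        byPerson.put("charlie", "groupA");
        byPerson.put("alice", "groupB");
        byPerson.put("bob", "groupA");
        System.out.println(byPerson.get("alice"));  // prints groupB
        System.out.println(byPerson.keySet());      // prints [charlie, alice, bob]
    }
}
```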
Depending on the scenario (If you have the data before receiving lists of groups/people), preprocessing the data would save you time.
Comparing the data to the groups/people lists will require at least 10,000+ lookups. Comparing the groups/people lists to the data will require at most 10 * 100 = 1,000 lookups, and fewer if you compare against each group one at a time (10 + 100 = 110 lookups).
Related
I have coded a standard Hash Table class in java. It has a large array of buckets, and to insert, retrieve or delete elements, I simply calculate the hash of the element and look at the appropriate index in the array to get the right bucket.
However, I would like to implement some sort of iterator. Is there another way than looping through all the indices in the array and ignoring those that are empty? My hash table might contain hundreds of empty entries and only a few elements that have been hashed and inserted. Is there an O(n) way to iterate instead of O(size of table) when n << size of table?
To implement findMin, I could simply save the smallest element each time I insert a new one, but I want to use the iterator approach.
Thanks!
You can maintain a linked list of the map entries, like LinkedHashMap does in the standard library.
Or you can make your hash table ensure that the capacity is always at most kn, for some suitable value of k. This will ensure iteration is linear in n.
You could store a sorted list of the non-empty buckets, and insert a bucket's id into the list (if it's not already there) when you insert something in the hash table.
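That suggestion might be sketched as follows, assuming a simple chained table (class and field names are invented): a TreeSet keeps the occupied bucket ids sorted, so iteration skips empty buckets entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SparseTable<E> {
    private final List<List<E>> buckets;
    // Sorted set of the indices of non-empty buckets.
    private final TreeSet<Integer> occupied = new TreeSet<>();

    public SparseTable(int capacity) {
        buckets = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    public void insert(E element) {
        int index = Math.floorMod(element.hashCode(), buckets.size());
        buckets.get(index).add(element);
        occupied.add(index); // O(log n); a no-op if the id is already present
    }

    // Visits only occupied buckets, so a full pass is proportional to the
    // number of stored elements rather than to the table size.
    public List<E> elements() {
        List<E> all = new ArrayList<>();
        for (int index : occupied) {
            all.addAll(buckets.get(index));
        }
        return all;
    }
}
```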
But maybe it's not too expensive to search through a few hundred empty buckets, if it's not buried too deep inside a loop. A little inefficiency might be better than a more complex design.
If order is important to you, you should consider using a Binary Search Tree (a left-leaning red-black tree, for example) or a Skip List to implement your Dictionary. They are better suited for the job in these cases.
Suppose that I have a collection of 50 million different strings in a Java ArrayList. Let foo be a set of 40 million arbitrarily chosen (but fixed) strings from the previous collection. I want to know the index of every string in foo in the ArrayList.
An obvious way to do this would be to iterate through the whole ArrayList until we found a match for the first string in foo, then for the second one and so on. However, this solution would take an extremely long time (considering also that 50 million was an arbitrary large number that I picked for the example, the collection could be in the order of hundreds of millions or even billions but this is given from the beginning and remains constant).
I thought then of using a Hashtable of fixed size 50 million in order to determine the index of a given string in foo using someStringInFoo.hashCode(). However, from my understanding of Java's Hashtable, it seems that this will fail if there are collisions as calling hashCode() will produce the same index for two different strings.
Lastly, I thought about first sorting the ArrayList with the sort(List<T> list) in Java's Collections and then using binarySearch(List<? extends T> list,T key,Comparator<? super T> c) to obtain the index of the term. Is there a more efficient solution than this or is this as good as it gets?
You need an additional data structure that is optimized for searching strings. It will map each string to its index. The idea is that you iterate over your original list to populate the data structure, and then iterate over your set, performing searches in that data structure.
What structure should you choose?
There are three options worth considering:
Java's HashMap
Trie
Java's IdentityHashMap
The first option is simple to implement but does not provide the best possible performance. Still, its population time of O(N * R) is better than sorting the list, which is O(R * N * log N), and its search time is better than in a sorted String list (amortized O(R) compared to O(R * log N)), where R is the average length of your strings.
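The first option could be sketched like this (the sample strings are invented):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexLookup {
    // Build a string -> index map in one O(N * R) pass; afterwards each
    // query costs amortized O(R) instead of a scan over the whole list.
    static Map<String, Integer> indexOf(List<String> collection) {
        Map<String, Integer> index = new HashMap<>(collection.size() * 2);
        for (int i = 0; i < collection.size(); i++) {
            index.putIfAbsent(collection.get(i), i); // keep the first occurrence
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> collection = List.of("foo", "bar", "baz");
        Map<String, Integer> index = indexOf(collection);
        System.out.println(index.get("baz")); // prints 2
    }
}
```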
The second option is always good for maps of strings, providing guaranteed population time for your case of O(R * N) and guaranteed worst-case searching time of O(R). The only disadvantage of it is that there is no out-of-box implementation in Java standard libraries.
The third option is a bit tricky and suitable only for your case. In order to make it work, you need to ensure that the strings from the first list are literally used in the second list (are the same objects). Using IdentityHashMap eliminates String's equals cost (the R above), as IdentityHashMap compares strings by reference, taking only O(1). Population cost will be amortized O(N) and search cost amortized O(1). So this solution provides the best performance and an out-of-the-box implementation. However, please note that it will work only if there are no duplicates in the original list.
If you have any questions please let me know.
You can use a Java Hashtable with no problems. According to the Java Documentation "in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially."
I think you have a misconception on how hash tables work. Hash collisions do NOT ruin the implementation. A hash table is simply an array of linked-lists. Each key goes through a hash function to determine the index in the array which the element will be placed. If a hash collision occurs, the element will be placed at the end of the linked-list at the index in the hash-table array. See link below for diagram.
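A stripped-down illustration of that array-of-linked-lists structure (a sketch, not a production hash table):

```java
import java.util.LinkedList;

public class ChainedSet<E> {
    // Each bucket is a linked list; colliding keys share a bucket.
    private final LinkedList<E>[] table;

    @SuppressWarnings("unchecked")
    public ChainedSet(int capacity) {
        table = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) table[i] = new LinkedList<>();
    }

    private int bucket(E e) {
        // Map the hash code to an array index.
        return Math.floorMod(e.hashCode(), table.length);
    }

    public void add(E e) {
        LinkedList<E> chain = table[bucket(e)];
        // On a collision the element is simply appended to the same chain.
        if (!chain.contains(e)) chain.add(e);
    }

    public boolean contains(E e) {
        // Walk the (usually short) chain sequentially.
        return table[bucket(e)].contains(e);
    }
}
```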
I'm looking at this website that lists Big O complexities for various operations. For Dynamic Arrays, the removal complexity is O(n), while for Hash Tables it's O(1).
For Dynamic Arrays like ArrayLists to be O(n), that must mean the operation of removing some value from the center and then shifting each index over one to keep the block of data contiguous. Because if we're just deleting the value stored at index k and not shifting, it's O(1).
But in Hash Tables with linear probing, deletion is the same thing, you just run your value through the Hash function, go to the Dynamic Array holding your data, and delete the value stored in it.
So why do Hash Tables get O(1) credit while Dynamic Arrays get O(n)?
This is explained here. The key is that the number of values per Dynamic Array is kept under a constant value.
Edit: As Dukeling pointed out, my answer explains why a Hash Table with separate chaining has O(1) removal complexity. I should add that, on the website you were looking at, Hash Tables are credited with O(1) removal complexity because they analyse a Hash Table with separate chaining and not linear probing.
The point of hash tables is that they keep close to the best case, where the best case means a single entry per bucket. Clearly, you have no trouble accepting that to remove the sole entry from a bucket takes O(1) time.
When there are many hash conflicts, you certainly need to do a lot of shifting when using linear probing.
But the complexity bounds for hash tables are derived under the assumption of Simple Uniform Hashing, meaning that we assume there will be a minimal number of hash conflicts.
When this happens, we only need to delete some value and shift either no values or a small (essentially constant) amount of values.
When you talk about the complexity of an algorithm, you actually need to discuss a concrete implementation.
There is no Java class called a "Hash Table" (obviously!) or "HashTable".
There are Java classes called HashMap and Hashtable, and these do indeed have O(1) deletion.
But they don't work the way that you seem to think (all?) hash tables work. Specifically, HashMap and Hashtable are organized as an array of pointers to "chains".
This means that deletion consists of finding the appropriate chain, and then traversing the chain to find the entry to remove. The first step is constant time (including the time to calculate the hash code). The second step is proportional to the length of the hash chain. But assuming that the hash function is good, the average length of the hash chain is a small constant. Hence the total time for deletion is O(1) on average.
The reason that the hash chains are short on average is that the HashMap and Hashtable classes automatically resize the main hash array when the "load factor" (the ratio of the number of entries to the array size) exceeds a predetermined value. Assuming that the hash function distributes the (actual) keys pretty evenly, you will find that the chains are roughly the same length. Assuming that the array size is proportional to the total number of entries, the actual load factor will be the average hash chain length.
This reasoning breaks down if the hash function does not distribute the keys evenly. This leads to a situation where you get a lot of hash collisions. Indeed, the worst-case behaviour is when all keys have the same hash value, and they all end up on a single hash chain with all N entries. In that case, deletion involves searching a chain with N entries ... and that makes it O(N).
It turns out that the same reasoning can be applied to other forms of hash table, including those where the entries are stored in the hash array itself and collisions are handled by rehashing or scanning (open addressing). (Once again, the "trick" is to expand the hash table when the load factor gets too high.)
Basically, I'm looking for the best data structure in Java in which I can store pairs and retrieve the top N elements by value. I'd like to do this in O(n) time, where n is the number of entries in the data structure.
example input would be,
<"john", 32>
<"dave", 3>
<"brian", 15>
<"jenna", 23>
<"rachael", 41>
And if N = 3, I should be able to return rachael, john, jenna if I wanted descending order.
If I use some kind of HashMap, insertion is fast, but retrieving the entries in order gets expensive.
If I use some data structure that keeps things ordered, then insertion becomes expensive while retrieval is cheaper. I was not able to find a data structure that can do both very well and very fast.
Any input is appreciated. Thanks.
[updated]
Let me ask the question another way, if that makes it clearer.
I know I can insert into a HashMap in constant O(1) time.
Now, how can I retrieve the elements sorted by value in O(n) time, where n = number of entries in the data structure? Hope that makes sense.
If you want to sort, you have to give up constant O(1) time.
That is because unlike inserting an unsorted key / value pair, sorting will minimally require you to compare the new entry to something, and odds are to a number of somethings. Once you have an algorithm that will require more time with more entries (due to more comparisons) you have overshot "constant" time.
If you can do better, then by all means, do so! There is a Dijkstra Prize awaiting you, if not a Fields Medal to boot.
Don't despair: you can still do the key part as a HashMap, and the sorting part with a tree-like implementation, which will give you O(log n). TreeMap is probably what you desire.
--- Update to match your update ---
No, you cannot iterate over a HashMap in sorted order in O(n) time. To do so would assume that you had a list, but that list would have to already be sorted. With a raw HashMap, you would have to search the entire map for the next "lower" value. Searching part of the map would not do, because the one element you didn't check could be the correct value.
Now, there are some data structures that make a lot of trade offs which might get you closer. If you want to roll your own, perhaps a custom Fibonacci heap can give you an amortized performance close to what you wish, but it cannot guarantee a worst-case performance. In any case, some operations (like extract-min) will still require O(log n) performance.
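As one concrete compromise (my sketch, not part of the answer above): a HashMap gives O(1) inserts, and a size-bounded min-heap at query time retrieves the top N in O(n log N), which beats a full sort when N is small:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopN {
    // Scan the map once, keeping a min-heap of at most n entries by value.
    static List<String> topN(Map<String, Integer> scores, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the current minimum
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey()); // ascending
        Collections.reverse(result); // flip to descending by value
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new HashMap<>();
        scores.put("john", 32); scores.put("dave", 3); scores.put("brian", 15);
        scores.put("jenna", 23); scores.put("rachael", 41);
        System.out.println(topN(scores, 3)); // prints [rachael, john, jenna]
    }
}
```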
Suppose we wish to repeatedly search a linked list of length N elements, each of which contains a very long string key. How might we take advantage of the hash value when searching the list for an element with a given key?
Insert the keys into a hash table. Then you can search in (theoretically) constant time.
You need to have the hashes prepared before searching the list and you need to be able to access hash of a string in constant time. Then you can first compare the hashes and only compare strings when hashes match. You can use a hashtable instead of a linked list.
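That idea might be sketched by caching each key's hash in the list node (class names are invented):

```java
import java.util.LinkedList;

public class HashedList {
    // Each node stores the key together with its precomputed hash, so a
    // search compares cheap ints first and only falls back to the expensive
    // String.equals when the hashes match.
    static final class Node {
        final String key;
        final int hash;
        Node(String key) { this.key = key; this.hash = key.hashCode(); }
    }

    private final LinkedList<Node> nodes = new LinkedList<>();

    public void add(String key) {
        nodes.add(new Node(key));
    }

    public boolean contains(String key) {
        int h = key.hashCode(); // computed once per search
        for (Node n : nodes) {
            if (n.hash == h && n.key.equals(key)) return true;
        }
        return false;
    }
}
```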
The hash value of a String (hashCode) is a bit like an ID for the string. Not completely unique, but usually pretty unique. You can use a HashMap to store the String keys and their values. HashMap, as its name suggests, uses the Strings' hash values to efficiently store and retrieve values.
Not sure what constraints you're working under here (i.e. this might take way too much memory depending on how big the strings are), but if you have to maintain the linked list you could create a HashMap that maps the strings to their position in the list which would allow you to retrieve any string from the list with 2 constant time operations.
Put it in HashSet. The search algorithms will use the hash for each String value inserted.