Data structure with two-way O(1) lookup. Hashtable? - java

I'm implementing a system where I have a list of names and each person has 1 phone number. I need to be able to take a name and look up the phone number, or take a phone number and look up a name.
I know I could do this by having two hashtables - one which goes from names to phone numbers and one which goes from phone numbers to names. Then I can look up in either direction in O(1) time. However this seems like I am storing too much data - every name and every phone number is stored twice.
Is there any way to do this more efficiently? What data structure should I use to store the names and phone numbers?
I am coding in Java if that is relevant.
Many thanks!

Java does not provide a two-way hash table out-of-the-box. Your solution that relies on two hash tables is as good as it gets, unless you are willing to go with third-party libraries (which would hide the two hash tables for you) or re-implement a significant portion of HashMap<K,V>.
Then I can look up in either direction in O(1) time. However this seems like I am storing too much data - every name and every phone number is stored twice.
Not necessarily: you can use the same object that represents the phone number, in which case there would be a single object for the phone number, with two references to it stored from two hash tables.
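For illustration, here is a minimal sketch of the two-map approach (all names are mine, not from any library). Both maps hold references to the same String objects, so each name and each number exists once on the heap; only the references are doubled:

    import java.util.HashMap;
    import java.util.Map;

    public final class PhoneDirectory {
        private final Map<String, String> numberByName = new HashMap<>();
        private final Map<String, String> nameByNumber = new HashMap<>();

        public void add(String name, String number) {
            // the same two String objects are referenced from both maps
            numberByName.put(name, number);
            nameByNumber.put(number, name);
        }

        public String numberOf(String name)  { return numberByName.get(name); }
        public String nameOf(String number)  { return nameByNumber.get(number); }
    }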

Consider using Guava's HashBiMap, which is basically two HashMaps linked together behind the scenes. See also the BiMap interface and its related article.
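Usage looks roughly like this (a sketch; see the Guava docs for the authoritative API):

    import com.google.common.collect.BiMap;
    import com.google.common.collect.HashBiMap;

    BiMap<String, String> numberByName = HashBiMap.create();
    numberByName.put("Alice", "555-1234");

    String number = numberByName.get("Alice");              // name -> number
    String name = numberByName.inverse().get("555-1234");   // number -> name

Note that a BiMap requires values to be unique as well as keys: put throws if the number is already mapped to another name (there is a forcePut for overwriting).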

Remember that the object itself is stored only once, not in both maps. You only need double the number of references - so it might not be that bad.
You can use Guava's HashBiMap, which offers that functionality (it implements the BiMap interface).

Is 'hashing' more efficient than 'linear' search?

I decided to review the Java collections framework, so I started with the internal implementations. One question came to mind which I can't resolve; I hope someone can give a clear explanation of the following.
ArrayList uses linear or binary search (both have pros/cons), but we can do anything with them! My question is: why do all the 'hashing' classes (like HashMap, for example) use the hashing principle? Couldn't they settle for linear or binary search instead, and just store each key/value pair inside an array? And the opposite: why isn't, for example, an ArrayList stored in a hash table?
The intention of the collections framework is that the programmer will choose the data structure appropriate to the use case. Depending on what you're using it for, different data structures are appropriate.
Hashing classes use the hashing principle, as you put it, because if you choose them, then that's what you want to use. (Hashing is generally the best choice for simple, straightforward lookups.) A screwdriver uses the screwing principle because if you pick up a screwdriver, you want to screw something in; if you had a nail you needed to put in, you would have picked up the hammer instead.
But if you're not going to be performing lookups, or if linear search is good enough for you, then an ArrayList is what you want. It's not worth adding a hash table to a collection that's never going to use it, and it costs CPU and memory to do things you aren't going to need.
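To make the contrast concrete, a small sketch (variable names are mine):

    import java.util.*;

    List<String> names = new ArrayList<>(List.of("alice", "bob", "carol"));
    Map<String, String> phoneByName = new HashMap<>(Map.of("alice", "555-1", "bob", "555-2"));

    // ArrayList: contains() walks the backing array - O(n) comparisons
    boolean inList = names.contains("carol");

    // HashMap: get() hashes the key and jumps straight to its bucket - expected O(1)
    String phone = phoneByName.get("bob");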
I had a large hashmap of values (about 1,500 entries). The nature of the code was that once the hashmap was loaded, it would never be altered. The hashmap was accessed many times per web page, and I wondered if access could be sped up for faster page loading.
One day I had some time, so I did a series of timing tests (using System.nanoTime()). I then reworked the hashmap usage over to an array. Not an ArrayList, but an actual array[]. I stored the index in the key class that had been used to compute the hash value.
There was a difference: the array lookup was faster. I calculated that over a day's worth of activity I would have saved almost a full second!
So yes, using an array is faster than using a hash, YMMV :-)
And I reverted my code back to using a hashmap, as it was easier to maintain...

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible?
On the spot I could not think of an exact answer, so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s)?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they gave me an actual name, and explain why it is not possible to pick a Java algorithm/library without one. For all you know, the data structure could've been a String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a prerequisite, and not knowing it could've made the task impossible. For example, a binary search implementation will look very different if you're working on an array vs. a binary search tree, even though both offer O(log n) time complexity.
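For instance (a sketch), the JDK already ships both forms of O(log n) search, and their implementations look nothing alike - index arithmetic over a flat array vs. pointer chasing through a tree:

    import java.util.*;

    String[] sorted = {"apple", "kiwi", "pear"};   // must already be sorted
    boolean inArray = Arrays.binarySearch(sorted, "kiwi") >= 0;

    NavigableSet<String> tree = new TreeSet<>(List.of("apple", "kiwi", "pear"));
    boolean inTree = tree.contains("kiwi");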
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS but couldn't think of a specific method name at the time, I also think it should be considered reasonable to mention the interface, or at least a relevant package, and say that further details can be checked in the Java documentation if you're pressed for more specificity - given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
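A sketch of that shape:

    import java.util.HashSet;
    import java.util.Set;

    Set<String> words = new HashSet<>();
    words.add("hello");
    words.add("world");
    words.add("hello");                            // duplicate, silently ignored

    boolean found = words.contains("hello");       // expected O(1)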
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which lets you search character by character - the lookup cost grows with the length of the word rather than the number of entries. If the hash function is expensive or there are many collisions, this could be a good alternative!
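A minimal trie sketch (children kept in a HashMap; all names here are mine) showing that character-by-character lookup - contains() does work proportional to the word's length, independent of how many words are stored:

    import java.util.HashMap;
    import java.util.Map;

    final class Trie {
        private static final class Node {
            final Map<Character, Node> next = new HashMap<>();
            boolean terminal;                      // a stored word ends here
        }
        private final Node root = new Node();

        void add(String word) {
            Node n = root;
            for (char c : word.toCharArray())
                n = n.next.computeIfAbsent(c, k -> new Node());
            n.terminal = true;
        }

        boolean contains(String word) {
            Node n = root;
            for (char c : word.toCharArray()) {
                n = n.next.get(c);
                if (n == null) return false;       // no stored word continues this way
            }
            return n.terminal;
        }
    }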
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If there were only thousands of entries, a simple HashMap would be sufficient, because there would be no need for memory management. If all of the billion entries are single words, a database could be a slightly better choice.
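If the search-index route fits, indexing and querying with Lucene looks roughly like this. This is a sketch: Lucene's API shifts between major versions, and the field name "body" and the index path are my own choices:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            // index one document
            try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body", "the quick brown fox", Field.Store.YES));
                writer.addDocument(doc);
            }
            // search it for a word
            try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new QueryParser("body", new StandardAnalyzer()).parse("fox"), 10);
                System.out.println(hits.totalHits);
            }
        }
    }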
The hashmap solution is reasonable, as others have stated, but there are doubts with respect to scalability.
Here is a possible approach to the problem, as discussed in the post referenced below:
1. Sub-string match. If your entry blob is a single string or word (without any whitespace) and you need to search for arbitrary sub-strings within it, then you need to parse every entry to find the best entries that match. For this, one uses algorithms like the Boyer-Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar techniques internally.
2. Indexed search. Here you assume that each entry contains a set of words and that the search is limited to fixed words. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open-source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., such as:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, approach 2 will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry takes 1000 GB of main memory).
While storing the data in main memory offers very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, though there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB bit set), you won't want to store it in main memory, but on disk.
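As a sketch of that compact exception: java.util.BitSet tops out at about 2^31 bits, so covering every possible int takes a raw long[] - 2^26 longs = 2^32 bits = the 512 MB mentioned above:

    public final class IntMembership {
        private final long[] bits = new long[1 << 26];   // 2^26 * 64 bits = 2^32 bits

        private static long index(int key) {
            return key & 0xFFFFFFFFL;                    // treat the int as unsigned
        }

        public void add(int key) {
            long i = index(key);
            bits[(int) (i >>> 6)] |= 1L << (i & 63);
        }

        public boolean contains(int key) {
            long i = index(key);
            return (bits[(int) (i >>> 6)] & (1L << (i & 63))) != 0;
        }
    }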
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
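A minimal JDBC sketch of such a lookup (the table, connection URL, and credentials are hypothetical; the primary-key index is what keeps the query fast at this scale):

    import java.sql.*;

    // assumes a table created as: CREATE TABLE words (word TEXT PRIMARY KEY)
    static boolean containsWord(String word) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT 1 FROM words WHERE word = ?")) {
            ps.setString(1, word);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }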
You could also implement persistence from scratch if you wanted to (i.e. if the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-tree; that's what many databases use internally :-)

What data structure should I use if I want to search an object in different ways according to different attributes?

I'm programming in java, but that's just a detail.
I have a Person class with attributes like name, age, weight... I need to store people in my application and be able to search for them. I can search by name, age, weight - any of the person's attributes. What's the best data structure/implementation that allows me to do this efficiently?
A K-D tree is a good choice for this. It partitions multi-dimensional data (any object with multiple attributes) and enables binary-search-tree-like O(log n) search complexity. However, it requires a few modifications to the standard variant.
If you don't know about it yet, go read about it first. So, now you know: a K-D tree doesn't exactly allow "given name="John Doe", find the guy" kinds of queries. Instead, what it allows is "given this entire John Doe guy, find whoever is *closest* to him".
At every level of the tree, it chooses the left or right subtree based on the dimension corresponding to that level. But for the first kind of query, your data for all dimensions except one is null. So, to search, you create an input object anyway, with special dummy data for all but that one dimension. In your search function, when you encounter that special data, you carry the search on into both subtrees. And instead of closeness, unlike the standard K-D tree, you check for an exact match.
You are unlikely to see the effect of this data structure if you are dealing with a small amount of data. Interestingly, though, when you search against more than one attribute, like "given age=20 and name="John", find the guy(s)", the search will be a lot faster.
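Here is a minimal two-dimensional sketch of that modified search (Person and every other name here is hypothetical). Levels alternate between the two attributes; a null attribute in the query plays the role of the special dummy data, sending the search down both subtrees, and matching is by equality rather than closeness:

    import java.util.ArrayList;
    import java.util.List;

    final class Person {
        final String name; final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    final class PersonKdTree {
        private static final class Node {
            final Person p; Node left, right;
            Node(Person p) { this.p = p; }
        }
        private Node root;

        // even levels compare by name, odd levels by age
        private static int cmp(Person a, Person b, int level) {
            return level % 2 == 0 ? a.name.compareTo(b.name)
                                  : Integer.compare(a.age, b.age);
        }

        void insert(Person p) { root = insert(root, p, 0); }
        private Node insert(Node n, Person p, int level) {
            if (n == null) return new Node(p);
            if (cmp(p, n.p, level) < 0) n.left = insert(n.left, p, level + 1);
            else n.right = insert(n.right, p, level + 1);        // ties go right
            return n;
        }

        // name == null or age == null means "match any value of that attribute"
        List<Person> search(String name, Integer age) {
            List<Person> out = new ArrayList<>();
            search(root, name, age, 0, out);
            return out;
        }
        private void search(Node n, String name, Integer age, int level, List<Person> out) {
            if (n == null) return;
            if ((name == null || name.equals(n.p.name))
                    && (age == null || age == n.p.age)) out.add(n.p);
            boolean byName = level % 2 == 0;
            if ((byName ? name : age) == null) {                 // wildcard: explore both sides
                search(n.left, name, age, level + 1, out);
                search(n.right, name, age, level + 1, out);
            } else {
                int c = byName ? name.compareTo(n.p.name) : Integer.compare(age, n.p.age);
                if (c < 0) search(n.left, name, age, level + 1, out);
                else search(n.right, name, age, level + 1, out); // equal keys went right
            }
        }
    }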
What's your end-goal? If you just want to see trends between different people ad hoc, I'd recommend just using R or Python's pandas. That way you can quickly look up, compare, and visualize groups/individuals based on different attributes.
If you want to make an application in Java with multiple searchable attributes, and you don't care too much about space, I'd use multiple hash tables, each hash table corresponding to a different attribute, with the values being references to the Person objects. https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
K is your attribute (age, sex, etc.) and V holds references to people.
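A rough sketch of that layout, reusing the Person shape from the sketch above (all names mine):

    import java.util.*;

    // one index per searchable attribute; each maps a value to the people having it
    Map<String, List<Person>> byName = new HashMap<>();
    Map<Integer, List<Person>> byAge = new HashMap<>();

    for (Person p : people) {                                // given some List<Person> people
        byName.computeIfAbsent(p.name, k -> new ArrayList<>()).add(p);
        byAge.computeIfAbsent(p.age, k -> new ArrayList<>()).add(p);
    }

    List<Person> johns = byName.getOrDefault("John", List.of());
    List<Person> age20 = byAge.getOrDefault(20, List.of());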

Need help storing/retrieving data

A user has an ArrayList of numbers. Those numbers correspond to the applications that they have access to. Certain applications have different entitlement numbers (e.g. 3, 72, etc.). What I want to do is store this data in a map so that when the user logs in, they can input their numbers into the Map and quickly get the applications that they are entitled to. However, some applications require 2 to 3 entitlement numbers. For example, one entry could be: ("101 and 234", "Application 1"). I was wondering if there is an ideal way of retrieving all of the values from the map that the user's entitlement numbers satisfy.
How I currently have it, the program compares the user's data to each application and confirms or denies access. This seems inefficient. Any help is greatly appreciated!
Note: I am reading the applications and their numbers in from an XML, so I can store them as I wish.
If there is a large number of numbers required per application, the best approach is to use set intersection. If the numbers are contiguous, or at least dense, you can optimize this into a bit set. For only one or two numbers, though, I'd recommend just testing each number individually, since that is likely to be faster than full set operations.
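For instance, with the JDK's built-in sets (a sketch):

    import java.util.List;
    import java.util.Set;

    // set intersection via containsAll: the app is granted only if the user's
    // entitlements cover every number the app requires
    Set<Integer> userEntitlements = Set.of(3, 72, 101, 234);
    List<Integer> requiredByApp = List.of(101, 234);
    boolean granted = userEntitlements.containsAll(requiredByApp);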
The solution:
Define a class for each application (let's call it App). The class contains the name of the application, and a (sorted) List/array of entitlements.
Use a Map to map from String to App: Map<String, App> for all single-entitlement apps (you can use HashMap or TreeMap - your choice). If there are multiple apps that need one and the same entitlement, consider Map<String, List<App>>. Exclude the apps that need multiple entitlements from the Map, and store them in a separate List/array instead.
When you are given a list of entitlements to retrieve apps for, loop through the list and just grab everything that the Map maps each String to. For those that need multiple entitlements, check individually (you can speed up the checking a bit by sorting the given list of entitlements, and storing the entitlements of each App in sorted order - but it may not even matter, since the size is small).
The solution reduces the time complexity of the operation. However, a few hundred apps times around 10 entitlements each is quite small in my opinion, unless you call this many times. It's better to time your original approach and this approach and compare - since the overhead might overshadow any improvement in time.
A bit of further improvement (or not) is:
Use Map<String, List<App>> and include even the apps that need multiple entitlements (those apps will be mapped to by several entitlements).
When we search for apps, we will use a Map<App, Integer> to keep track of how many entitlements we have confirmed so far for each multiple-entitlement app. So the flow will be like:
    // Runnable Java version of that flow (keeping the names from the sketch;
    // App is the class defined above, with its list of required entitlements,
    // and should implement equals/hashCode for use as a map key):
    Map<App, Integer> mapAppInteger = new HashMap<>();
    List<App> output = new ArrayList<>();
    for (String entitlement : inputListOfEntitlement) {
        List<App> listOfApps = mapStringAppList.get(entitlement);
        if (listOfApps == null) continue;            // no app uses this entitlement
        for (App app : listOfApps) {
            if (app.entitlements.size() == 1) {      // needs a single entitlement
                output.add(app);
            } else {                                 // needs multiple
                int count = mapAppInteger.merge(app, 1, Integer::sum);
                if (count == app.entitlements.size()) {
                    output.add(app);                 // all requirements confirmed
                    mapAppInteger.remove(app);
                }
            }
        }
    }

Memory Friendly Fast Key-Value Access Solution for Android

I have an Android application that iterates through an array of thousands of integers, using them as keys to access pairs of integers (let us call them ids) in order to make calculations with them. It needs to do this as fast as possible, and in the end it returns a result which is crucial to the application.
I tried loading a HashMap into memory for fast access to those numbers, but it resulted in an OOM exception. I also tried writing those ids to a RandomAccessFile and storing their offsets in the file in another HashMap, but it was way too slow. Also, the new HashMap that only stores the offsets still occupies a lot of memory.
Now I am considering SQLite, but I am not sure if it will be any faster. Are there any structures or libraries that could help me with that?
EDIT: The number of keys is more than 20 million, whereas I only need to access thousands of them. I do not know beforehand which ones I will access, because that changes with user input.
You could use Trove's TIntLongHashMap to map primitive ints to primitive longs (which store the ints of your value pair). This saves you the object overhead of a plain vanilla Map, which forces you to use wrapper types.
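A sketch of that, packing the two ints of each value pair into a single long:

    import gnu.trove.map.hash.TIntLongHashMap;

    TIntLongHashMap map = new TIntLongHashMap();             // no Integer/Long boxing
    int key = 42, firstId = 7, secondId = 13;
    map.put(key, ((long) firstId << 32) | (secondId & 0xFFFFFFFFL));

    long packed = map.get(key);
    int a = (int) (packed >>> 32);                           // firstId
    int b = (int) packed;                                    // secondId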
EDIT
Since your update states you have more than 20 million mappings, there will likely be more space-efficient structures than a hash map. An approach that partitions your keys into buckets, combined with some sub-key compression, will likely save you half the memory over even the most efficient hash map implementation.
SQLite is an embedded relational database, which uses indexes. I would bet it is much faster than using RandomAccessFile. You can give it a try.
My suggestion is to rearrange the keys into buckets. What I mean is: identify (more or less) the distribution of your keys, then create files corresponding to ranges of keys (the point is that every file must contain only as many integers as can fit in memory, and no more). Then, when you search for a key, you just read the corresponding file into memory and look for it there.
For example, assuming the distribution of the keys is uniform: store the 500k values corresponding to key values 0-500k in one file, the 500k values corresponding to keys 500k-1M in the next, and so on...
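A rough sketch of the lookup side, assuming each bucket file holds fixed-width (int key, long value) records for one 500k-wide key range (the file layout and names are hypothetical):

    import java.io.*;
    import java.nio.file.*;
    import java.util.NoSuchElementException;

    static long lookup(int key) throws IOException {
        Path file = Paths.get("buckets", "bucket_" + (key / 500_000));
        long records = Files.size(file) / 12;        // 4-byte key + 8-byte value each
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (long i = 0; i < records; i++) {
                int k = in.readInt();
                long v = in.readLong();
                if (k == key) return v;              // sequential scan of one small bucket
            }
        }
        throw new NoSuchElementException("key not found: " + key);
    }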
EDIT: if you tried this approach and it still went slow, I still have some tricks up my sleeve:
First, make sure that your division is actually close to equal across all the buckets.
Try to make the buckets smaller, by making more buckets.
The idea behind dividing into buckets by ranges correctly is that when you search for a key, you go to the corresponding range bucket, and the key is either in it or not in the whole collection - so there is no point in concurrently reading another bucket.
I have never done this, because I'm not sure concurrency helps with I/O, but it may be helpful to read the whole file with two threads, one from top to bottom and the other from bottom to top, until they meet (or something like that).
While you read the whole bucket into memory, split it into 3-4 ArrayLists and run 3-4 worker threads to search for your key, one per array; the search should then finish much faster.
