Suggestions for data structure for keyword search

Suggestions for data structure for keyword search - java

I need to put together a data structure that will efficiently provide keyword search facilities.
My metrics are:
Circa 500,000 products.
Circa 20+ keywords per product (a guess).
Products are identified by an ID of about 10 digits but may be any ASCII codes going forward.
I would like to try to fit the data structure in memory if possible. I will be on a server so I can assume some significant memory availability.
Speed is important. Using LIKE database queries will not be an acceptable solution.
Any ideas for a data structure?
My thoughts:
TrieMap
Very efficient for the keywords but there would need to be a list of product IDs hanging off any leaf so seriously memory hungry. Any ideas that could help with that?
Compression
Various compression schemes come to mind but none jump out as of significant value.
Has anyone else put something like this together? Could you share your experiences?
The data may change but not often. It would be reasonable to rebuild the structure on a daily basis to accommodate changes.

Have you thought about using lucene either in memory or as a file system index?
It is quite fast and has lots of room for further requirements that might arise in the future.

Related

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?

After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS, but couldn't think of a specific method name or such at the time, I also think it should be considered reasonable to mention the interface or at least a relevant package and say that further details can be checked on the the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.

There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), then it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which would allow you to search by character (which I believe would be O(log n)). If the hash function is expensive or there are many collisions, this could be a good alternative!
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.

You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If it were only multiple of thousands of entries, a simple HashMap would be sufficient because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.

The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
Sub-string match If your entry blob is a single sting or word (without any white space) and you need to search arbitrary sub-string within it. In such cases you need to parse every entry to find best possible entries that matches. One uses algorithms like Boyer Moor algorithm. See this and this for details. This is also equivalent to grep - because grep uses similar stuff inside
Indexed search. Here you are assuming that entry contains set of words and search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "Full Text search". There are number of algorithms to do this and number of open source projects that can be used directly. Many of them, also support wild card search, approximate search etc. as below :
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely if you need "fixed words" as queries, the approach two will be very fast and effective
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa

Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offers 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java Heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)

Store and search sets (with many possible values) in a database (from Java)

The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the possible values for the items numbers in the tens of thousands (and is expanding).
Associated with each item is a value say from 10 (like) to 0 (neutral) to -10 (dislike).
So given a user with a particular set, how to find users with similar sets (say a percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use key/value or column-store for this, and walking relational table of items for each user would seem to consume too many resources. Making the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMS, NoSQL, ElasticSearch and related tools and databases. Any suggestions?

Ok this seems like the actual storage isn’t the problem, but you want to make a suggestion system based on the likes/dislikes.
The point is that you can store things however you want, even in SQL, most SQL RDBMS will be good enough for your data store, but you can of course also use anything else you want. The point, is that no SQL solution (which I know of) will give you good results with this. The thing you are looking for is a suggestion system based on artificial intelligence, and the best one for distributed systems, where they have many libraries implemented, is Apache Mahout.
According to what I’ve learned about it so far, it can do what you need basically out of the box. I know that it’s based on Hadoop and Yarn but I’m not sure if you can import data from anywhere you want, or need to have it in HDFS.
Other option would be to implement a machine learning algorithm on your own, which would run only on one machine, but you just won’t get the results you want with a simple query in any sql system.
The reason you need machine learning algorithms and a query with some numbers won’t be enough in most of the cases, is the diversity of users you are facing… What if you have a user B which liked / disliked everything he has in common with user A the same way - but the coverage is only 15%. On the other hand you have user C which is pretty similar to A (while not at 100%, the directions are pretty much the same) and C has marked over 90% of the things, which A also marked. In this scenario C is much closer to A than B would be, but B has 100% coverage. There are many other scenarios where most simple percentages won’t be enough, and that’s why many companies which have suggestion systems (Amazon, Netflix, Spotify, …) use Apache Mahout and similar systems to get those done.

How B-Tree works in term of serialisation?

In Java, I know that if you are going to build a B-Tree index on Hard Disk, you probably should use serialisation were the B-Tree structure has to be written from RAM to HD. My question is, if later I'd like to query the value of a key out of the index, is it possible to deserialise just part of the B-Tree back to RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index to RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone provides some code. How DBMSs are doing this, either in Java or C?
Thanks in advance.

you probably should use serialisation were the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.

You generally split the B-tree into several "pages," each with some of they key-value pairs, etc. Then you only need to load one page into memory at a time.

For inspiration of how rdbms are doing it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSql, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server. - the rdbms code is part of the application's classpath - and the memory footprint is modest.
IF that is a possibility for you of course...
If the tree can easily fit into memory, I'd strongly advise to keep it there. The difference in performance will be huge. Not to mention the difficulties to keep changes in sync on disk, reorganizing, etc...
When at some point you'll need to store it, check Externalizable instead of the regular serialization. Serializing is notoriously slow and extensive. While Externalizable allows you to control each byte being written to disk. Not to mention the difference in performance when reading the index back into memory.
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of memory caching. Such that often accessed items come out of memory nonetheless. But then you'll need to take updates to the index into account. You'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch. But rather use the code that's out there. :-)

Sort a list with SQL or as a collection?

I have some entries with dates in my database. What is best?:
Fetch them with a sql statement and also apply order by.
Get the list with sql, and order them within the application with collection.sort or so?
Thanks

This a very broad question that is very difficult to answer, and it depends a lot on what you mean by best?
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask your self: "where does the knowledge that the data is sorted belong?" and "What would happen if I where to change from a relational database storage to something different".

To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…

Database management systems (DMBS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP or (other scripting language), it might be slower to perform that task using a script. You might also reach a memory limit allowed to be used by PHP if you sort the array using a script.
I don't mean to raise a question of performance of different programming languages, just want to point out that it is a very good practice to rely on the DMBS whenever you can.

This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the db. Their rational for this is that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.

What database to use?

I'm new to databases, but I think I finally have a situation where flat files won't work.
I'm writing a program to analyze the outcomes of multiplayer games, where each game could have any number of players grouped into any number of teams. I want to allow players can win, tie, or leave partway through the game (and win/lose based on team performance).
I also might want to store historical player ratings (unless it's faster to just recompute that from their game history), so I don't know if that means storing each player's rating alongside each game played, or having a separate table for each player, or what.

I don't see any criteria that impacts database choice, but I'll list the free ones:
PostgreSQL
MySQL
SQL Server Express
Oracle Express
I don't recommend an embedded database like SQLite, because embedded databases make trade-offs in features to accommodate space & size concerns. I don't agree with their belief that data typing should be relaxed - it's lead to numerous questions on SO about about to deal with date/time filtration, among others...
You'll want to learn about normalization, getting data to Third Normal Form (3NF) because it enforces referential integrity, which also minimizes data redundancy. For example, your player stats would not be stored in the database - they'd be calculated at the time of the request based on the data onhand.

You didn't mention any need for locking mechanisms where multiple users may be competing to write the same data to the same resource (a database record or file in the case of flat files) simultaneously. What I would suggest is get a good book on database design and try to understand normalization rules in depth. Distributing data across separate tables have a performance impact, but they also have an effect on the ease-of-use of query construction. This is a very involving topic, and there's no simple answer to it. That's why companies hire database administrators to keep their data structures optimized.
You might want to look at SQLite, if you need a lightweight database engine.

Some good options were mentioned already, but I really think that on Java platform, H2 is a very good choice. It is perfect for testing (in-memory test database), but works very well also for embedded use cases and as stand-alone "real database". Plus it is easy to export as dump file, import from that, to move around. And works efficiently too.
It is developed by a very good Java DB guy, and is not his first take, and you can see this from maturity of the project. On top of this it is still being actively developed as well as supported.

A word on why nobody even mentions any of the "NoSQL" databases while you have used it as a tag:
Non-SQL databases are getting a lot of attention (or even outright hype) recently, because of some high-profile usecases, because they're new (and therefore interesting), and because their promise of incredible scalability (which is "sexy" to programmers). However, only a very few very big players actually need that kind of scalability - and you certainly don't.
Another factor is that SQL databases require you to define your DB schema (the structure of tables and columns) beforehand, and changing it is somewhat problematic (especially if you already have a very large database). Non-SQL databases are more flexible in that regard, but you pay for it with more complex code (e.g. after you introduce a new field, your code needs to be able to deal with elements where it's not yet present). It doesn't sound like you need this kind of flexibility either.

Try also OrientDB. It's free (Apache 2 license), run everywhere, supports SQL and it's really fast. Can insert 1,000,000 of records in 6 seconds on common hw.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.