AppEngine Approximate Partial String Matching Algorithm


So, I realize that this covers a wide array of topics and pieces of them have been covered before on StackOverflow, such as this question. Similarly, Partial String Matching and Approximate String Matching are popular algorithmic discussions, it seems. However, using these ideas in conjunction to suit a problem where both need to be addressed seems highly inefficient. I'm looking for a way to combine the two problems into one solution, efficiently.
Right now, I'm using AppEngine with Java and the Persistent DataStore. This is somewhat annoying, since it doesn't seem to support arithmetic in queries to make things easier, so I'm currently considering doing some precalculation and storing it as an extra field in the database. Essentially, this is the idea that a friend and I had on how to possibly implement a system for matching, and I was more or less hoping for suggestions on how to make it more efficient. If it needs to be scrapped in favor of something better that already exists, I can handle that as well.
Let's start off with a basic example of what I'd like to search for. Consider the following nonsense sentence:
The isolating layer rackets the principal beneath your hypocritical rubbish.
If a user does a search for
isalatig pri
I would think that this would be a fairly good starting match for the string, and the value should be returned. The current method that we are considering using basically assigns a value to test divisibility. Essentially, there is a table with the following data
A: 2 B: 3 C: 5
D: 7 E: 11 F: 13
...
with each character being mapped to a prime number (multiple characters don't make a difference, only one character is needed). And if the query string divides the string in the database, then the value is returned as a possible match.
After this, keywords that aren't listed as stopwords are compared from the search string to see if they are starting substrings of words in the possible match under a given threshold of an edit distance (currently using the Levenshtein distance).
distance("isalatig", "isolating") == 2
distance("pri", "principal") == 0 // since principal has a starting
// substring of pri it passes
The total distance for each query is then ranked in ascending order and the top n values are then returned back to the person doing the querying.
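The second stage described above (edit distance against the starting substrings of words, summed per query) could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are made up, stopword filtering and the threshold are left out, and words are split on non-word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixMatcher {

    /** Smallest Levenshtein distance between query and any prefix of word. */
    static int prefixDistance(String query, String word) {
        int[] prev = new int[word.length() + 1];
        int[] curr = new int[word.length() + 1];
        for (int j = 0; j <= word.length(); j++) {
            prev[j] = j; // empty query vs. prefix of length j
        }
        for (int i = 1; i <= query.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= word.length(); j++) {
                int cost = query.charAt(i - 1) == word.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        // prev now holds the distance from the whole query to every prefix of word.
        int best = prev[0];
        for (int d : prev) {
            best = Math.min(best, d);
        }
        return best;
    }

    /** Sum over the query terms of the best prefix distance; -1 if the candidate has no words. */
    static int totalDistance(String query, String candidate) {
        List<String> words = new ArrayList<>();
        for (String w : candidate.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        if (words.isEmpty()) {
            return -1;
        }
        int total = 0;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            int best = Integer.MAX_VALUE;
            for (String w : words) {
                best = Math.min(best, prefixDistance(term, w));
            }
            total += best;
        }
        return total;
    }
}
```

Candidates would then be sorted by totalDistance in ascending order and cut off at the top n.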
This is the basic idea behind the algorithm, though since this is my first time dealing with such a scenario, I realize that I'm probably missing something very important (or my entire idea may be wrong). What is the best way to handle the situation that I'm trying to implement? Similarly, if there are any utilities that AppEngine currently offers for what I'm trying to do, please let me know.

First off, a clarification: App Engine doesn't allow arithmetic in queries because there's no efficient way to query on the result of an arbitrary arithmetic expression. When you do this in an SQL database, the planner is forced to select an inefficient query plan, which usually involves scanning all the candidate records one by one.
Your scheme will not work for the same reason: There's no way to index an integer such that you can efficiently query for all numbers that are divisible by your target number. Other potential issues include words that translate into numbers that are too large to store in a fixed length integer, and being unable to distinguish between 'rental', 'learnt' and 'antler'.
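To make the two side issues concrete, here is a small sketch of the prime-product signature from the question. The class name and the use of BigInteger are assumptions for illustration; the question does not say how the product is stored.

```java
import java.math.BigInteger;
import java.util.Locale;

final class PrimeSignature {
    // First 26 primes, one per letter a..z, as in the table from the question.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    /** Product of the primes of the distinct letters in the text (repeats are ignored). */
    static BigInteger signature(String text) {
        boolean[] seen = new boolean[26];
        BigInteger product = BigInteger.ONE;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c < 'a' || c > 'z' || seen[c - 'a']) {
                continue;
            }
            seen[c - 'a'] = true;
            product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
        }
        return product;
    }

    /** The divisibility test from the question: every letter of the query occurs in the stored text. */
    static boolean mayMatch(String query, String stored) {
        return signature(stored).mod(signature(query)).signum() == 0;
    }
}
```

Anagrams such as 'rental', 'learnt' and 'antler' produce the same signature, and a text that uses enough distinct letters produces a product that no longer fits in a 64-bit long.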
If we discard for the moment your requirement for matching arbitrary prefixes of strings, what you are searching for is full-text indexing, which is typically implemented using an inverted index and stemming. Support for fulltext search is on the App Engine roadmap but hasn't been released yet; in the meantime your best option appears to be SearchableModel, or using an external search engine such as Google Site Search.
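For readers unfamiliar with the term, an inverted index can be sketched in a few lines. This is a toy in-memory version with made-up names; it lower-cases and splits on non-alphanumeric characters, does no stemming, and is not how SearchableModel is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InvertedIndex {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    /** Indexes the text and returns the id assigned to it. */
    int add(String text) {
        int id = nextId++;
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Ids of the documents that contain every token of the query. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> ids = postings.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Each lookup touches only the posting lists of the query tokens, which is what makes this indexable where the divisibility test is not.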

Related

Using self-built approaches in Lucene search engine

I'm looking for an appropriate search engine in which I can use my own similarity measure and tokenization approaches. The Lucene search engine is often recommended for this purpose, but I have no experience with it. I searched the internet for tutorials on recent versions of Lucene, but most of the pages are from a few years ago. Some of my questions are as follows:
Is it possible to change the similarity measure, tokenization and stemming approaches and use self-built classes in Lucene? If yes, how?
Is there any difference between how we index the text for keyword search and for phrasal search? Should I make two different indexes for keyword search and phrasal search? (I think that removing stop words will affect the results of phrasal search, and not removing them will affect the results of keyword search, won't it?)
Any information about this topic is appreciated.

This is possible, yes, and we do it on a couple of solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, which is a search server built on top of Lucene. To answer your questions directly:
Yes, there is a way to do this by overriding interfaces and providing your own implementation (see tutorial). Tokenization can be done without needing to override classes within Solr's default configuration, depending on how funky you need to get with Tokenization.
Yes, making an index that will return accurate results is a matter of understanding how your users will be searching the index. That having been said, a large part of the complexity in how queries search comes from people wanting matching results to float to the top of the results list, which is done via scoring. Given that it sounds like you're looking to override the scoring, it may not matter for you. You should note though that by default, Lucene will rank hits that match across multiple columns higher than a single exact match on a single column. That means that if you store data across many columns (and you search by default across many columns) your search will get less and less "accurate".
Full text search against a single column tends to be pretty accurate phrase vs words, but you'll end up with a pretty large index.

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?

After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity (for the tree, only as long as it stays balanced).
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS, but couldn't think of a specific method name or such at the time, I also think it should be considered reasonable to mention the interface or at least a relevant package and say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and expresses the intent of storing singular values, rather than key/value pairs. (Note that Java's HashSet is itself backed by a HashMap, so in practice it does not save memory over a map.)
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.

There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which would allow you to search character by character. A lookup then takes time proportional to the length of the word, O(m) for a word of m characters, regardless of how many entries are stored. If the hash function is expensive or there are many collisions, this could be a good alternative!
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
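As a rough illustration of the trie alternative mentioned above, here is a minimal sketch. The class is hypothetical, keeps one HashMap per node for simplicity, and makes no attempt at the memory efficiency that billions of entries would require.

```java
import java.util.HashMap;
import java.util.Map;

final class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character where needed. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    /** True if exactly this word was inserted. */
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.endOfWord;
    }

    /** True if any inserted word starts with the prefix. */
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Walks one node per character, so the cost depends on the length of s only.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

Unlike a HashSet, the same structure also answers prefix queries, which a hash lookup cannot do.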

You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If it were only several thousand entries, a simple HashMap would be sufficient because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.

The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
1. Sub-string match. If your entry blob is a single string or word (without any white space) and you need to search for an arbitrary sub-string within it, you need to parse every entry to find the best possible matching entries. One uses algorithms like the Boyer-Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar techniques internally.
2. Indexed search. Here you are assuming that each entry contains a set of words and the search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "Full Text search". There are a number of algorithms to do this and a number of open source projects that can be used directly. Many of them also support wildcard search, approximate search etc., as below:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, the second approach will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa

Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
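The 512 MB figure follows from one bit per possible int value: 2^32 bits / 8 = 2^29 bytes. A minimal sketch of such a bitmap is below. The class is made up for illustration, and it uses a plain long[] because java.util.BitSet only accepts non-negative indices.

```java
final class IntBitmap {
    private final long[] words;

    /** Bytes needed for one bit per 32-bit int value: 2^32 / 8. */
    static long bytesForAllInts() {
        return (1L << 32) / 8;
    }

    /**
     * Creates a bitmap for the unsigned values 0 .. capacityBits - 1.
     * Pass 1L << 32 to cover every int; values outside the capacity
     * cause an ArrayIndexOutOfBoundsException in add and contains.
     */
    IntBitmap(long capacityBits) {
        this.words = new long[(int) ((capacityBits + 63) / 64)];
    }

    void add(int value) {
        long index = value & 0xFFFFFFFFL; // treat the int as unsigned
        words[(int) (index >>> 6)] |= 1L << (index & 63);
    }

    boolean contains(int value) {
        long index = value & 0xFFFFFFFFL;
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }
}
```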
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)

What does the TopScoreDocCollector in Lucene by default use for Scoring?

I want to use Lucene to process several millions of news data. I'm quite new to Lucene, so I'm trying to learn more and more about how it is working.
By several tutorials throughout the web I found the class TopScoreDocCollector to be of high relevance for querying the Lucene index.
You create it like this
int hitsPerPage = 10000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
and it will later collect the results of your query (only the amount you defined in hitsPerPage). I initially thought the results taken in would be just randomly distributed or something (like you have 100.000 documents that match your query and just get random 10.000). I think now that I was wrong.
After reading more and more about Lucene I came to the javadoc of the class (please see here). Here it says
A Collector implementation that collects the top-scoring hits,
returning them as a TopDocs
So for me it now seems that Lucene is using some very smart technology to somehow return the top-scored documents for my input query. But how does that Scorer work? What does it take into account? I've extended my research on this topic but so far could not find an answer I completely understood.
Can you explain to me how the Scorer in TopScoreDocCollector scores my news documents and whether this can be of use to me?

Lucene uses an inverted index to produce an iterator over the list of doc ids that matches your query.
It then goes through each one of them and computes a score. By default that score is based on so-called tf-idf. In a nutshell, it takes into account how many times the terms of your query appear in the document, and also the number of documents that contain each term.
The idea is that if you look for (warehouse work), having the word work many times is not as significant as having the word warehouse many times.
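The warehouse/work intuition can be reproduced with the textbook form of tf-idf. The sketch below is a simplification with made-up names: it sums tf × log(N / df) over the query terms and is not Lucene's actual scoring formula, which includes further normalization factors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class TfIdf {
    private final List<List<String>> docs = new ArrayList<>();

    TfIdf(List<String> documents) {
        for (String d : documents) {
            docs.add(tokenize(d));
        }
    }

    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /** Number of documents that contain the term at least once. */
    int documentFrequency(String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    /** Sum over query terms of (occurrences in the document) * log(N / df). */
    double score(String query, int docId) {
        double score = 0.0;
        for (String term : tokenize(query)) {
            int df = documentFrequency(term);
            if (df == 0) {
                continue; // term occurs nowhere, so it cannot contribute
            }
            int tf = Collections.frequency(docs.get(docId), term);
            score += tf * Math.log((double) docs.size() / df);
        }
        return score;
    }
}
```

With the documents "warehouse work", "work work work" and "work at home", the term work occurs everywhere and gets an idf of log(3/3) = 0, so only the single occurrence of warehouse contributes to the score.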
Then, rather than sorting the whole set of matching documents, Lucene takes into account the fact that you really only need the top K documents or so. Using a heap (or priority queue), one can compute these top K with a complexity of O(N log K) instead of O(N log N).
That's the role of the TopScoreDocCollector.
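The heap technique itself is easy to show outside Lucene. The following is a generic sketch with hypothetical names, not Lucene's collector code: a min-heap of size K keeps the best documents seen so far, and each new document only has to beat the current minimum.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopK {
    /** Ids (array indices) of the k highest scores, best first, in O(N log K). */
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // Min-heap: the weakest of the current top k sits at the head.
        Comparator<Integer> byScore = Comparator.comparingDouble(id -> scores[id]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int id = 0; id < scores.length; id++) {
            if (heap.size() < k) {
                heap.add(id);
            } else if (scores[id] > scores[heap.peek()]) {
                heap.poll();
                heap.add(id);
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll()); // comes out weakest first
        }
        Collections.reverse(result);
        return result;
    }
}
```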
You can implement your own logic for a scorer (assign a score to a document), or a collector (aggregates results).

This might not be the best answer, since sooner or later someone will definitely be available to explain the internal behaviour of Lucene, but based on my days as a student there are two sides to "information retrieval": one is taking advantage of existing solutions such as Lucene, the other is the whole theory behind it.
If you are interested in the latter, I recommend taking http://en.wikipedia.org/wiki/Information_retrieval as a starting point to get an overview and dig into the whole subject.
I personally think it is one of the most interesting fields, with huge potential, yet I never had the "academic hard skills" to really get into it.
To parametrize the available solutions it is crucial to have at least an overview of the theory. There are, for example, "challenges" in which information has been manually indexed/rated as a reference, so that the quality of a programmatic solution can be compared against it.
Based on such a challenge, we managed to achieve a slightly higher quality than "Lucene out of the box" after we fed Lucene with 4 different information bases (sorry, it's a few years back and I can barely remember, hence the missing key words), all of which were the result of Lucene itself but with different parameters.
To come back to your question: I can't answer it directly, but I hope to give you a basis for deciding whether you really need/want to know what's behind Lucene or would rather just use it as a black box (and/or make it a grey box through parameterization).
Sorry if I got you totally wrong.

Is there a fast Java library to search for a string and its position in file?

I need to search a big number of files (i.e. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different format (i.e. EDI, XML, CSV) and contain sometimes pretty random data (i.e. numerical IDs etc.). This is why I preliminarily ruled out an index-based searching engine.
The files will be searched multiple times for similar but different strings (i.e. for IDs which might have similar length and format, but they will usually be different).
EDIT END
Any ideas?

600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
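A brute-force scan along those lines might look like the sketch below. The names are hypothetical, and it makes two simplifying assumptions: each file is read whole (reasonable at 0.5 MB), and bytes are decoded as ISO-8859-1 so that character offsets equal byte offsets, which only finds the needle reliably if it is plain ASCII.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileScanner {

    /** Offsets of every (possibly overlapping) occurrence of needle in text. */
    static List<Integer> positions(String text, String needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    /** Scans every regular file under dir; returns file -> offsets for files with at least one hit. */
    static Map<Path, List<Integer>> search(Path dir, String needle) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<Path, List<Integer>> result = new TreeMap<>();
        for (Path file : files) {
            // One byte becomes one char, so offsets are byte offsets in the file.
            String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            List<Integer> found = positions(text, needle);
            if (!found.isEmpty()) {
                result.put(file, found);
            }
        }
        return result;
    }
}
```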
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would not work as is, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything other than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Does it contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
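The term->file:position variant could be sketched as follows. Everything here is an assumption for illustration: the class names, tokens defined by the regex \w+, case-sensitive matching, and a naive nested loop for the cross-reference.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PositionalIndex {
    static final class Hit {
        final String file;
        final int position;

        Hit(String file, int position) {
            this.file = file;
            this.position = position;
        }
    }

    private static final Pattern TOKEN = Pattern.compile("\\w+");
    private final Map<String, List<Hit>> index = new HashMap<>();

    /** Records the character offset of every token in the file content. */
    void addFile(String file, String content) {
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            index.computeIfAbsent(m.group(), t -> new ArrayList<>()).add(new Hit(file, m.start()));
        }
    }

    /** Occurrences of first that are followed by second in the same file within maxGap characters. */
    List<Hit> candidates(String first, String second, int maxGap) {
        List<Hit> result = new ArrayList<>();
        for (Hit a : index.getOrDefault(first, Collections.<Hit>emptyList())) {
            for (Hit b : index.getOrDefault(second, Collections.<Hit>emptyList())) {
                int gap = b.position - a.position;
                if (a.file.equals(b.file) && gap > 0 && gap <= maxGap) {
                    result.add(a);
                    break;
                }
            }
        }
        return result;
    }
}
```

The returned hits are only candidates; the actual file still has to be checked at each position for the verbatim search string.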
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned if there is any kind of flexibility in your search string. Do you expect being able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?

Unless you have an SSD, your main bottleneck will be all the file accesses. It's going to take about 10 seconds to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.

Usage examples of binary search

I just realized that in my 4+ years of Java programming (mostly desktop apps) I never used the binary search methods in the Arrays class for anything practical. Not even once. Some reasons I can think of:
100% of the time you can get away with linear search, maps or something else that isn't binary search.
The incoming data is almost never sorted, and making it sorted requires an extra sorting step.
So I wonder if it's just me, or do a lot of people never use binary search? And what are some good, practical usage examples of binary search?

On the desktop, you're probably just dealing with the user's data, which might not be all that big. If you are querying over very large datasets, shared by many users, then it can be a different matter. A lot of people don't necessarily deal with binary search directly, but anyone using a database is probably using it implicitly. If you use AppEngine, for example, datastore queries almost certainly use binary search.

I would say it boils down to this:
If we are going to do a binary search, we're going to have a key to search by. If we have a key, we're probably using a map instead of an array.
There's another very important thing to keep in mind:
Binary search is a clear-cut example of how thinking like a good programmer is very different than thinking like a normal person. It's one of those cognitive leaps that makes you really think about taking operations that are traditionally done (when done by humans) in order-n time and taking it down to order-lg-n time. And that makes it very, very useful even if it's never used in production code.

I hardly ever, if ever, use a binary search.
But I would if:
I needed to search the same list multiple times
the list was long enough to have a performance problem (although I'm often guilty of micro-optimization)
However, I often use hash tables / dictionaries for fast lookups.

For production code on my day job, a Set or Map is always good enough so far.
For algorithmic problems that I solve for fun, binary search is a very useful technique. For starters, if the set of elements never changes (i.e. you are never going to insert or delete elements in the set being queried) a Map/Set has no advantage over binary search - and a binary search over a simple array avoids a lot of the overhead associated with querying a more complex data structure. In many cases I have seen it to be actually faster than a HashMap.
Binary search is also a more general technique than just querying for membership in a set. Binary search can be performed on any monotone function to find a value for which the function satisfies a certain criterion. You can find a more detailed explanation here. But as I said, my line of work does not bring up enough computationally involved problems for this to be applicable.
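As a small example of binary search on a monotone function rather than on an array, the sketch below finds the first value for which a predicate becomes true and uses it to compute an integer square root. The helper names are made up; the predicate is assumed to be false up to some point and true from there on.

```java
import java.util.function.LongPredicate;

final class MonotoneSearch {

    /**
     * Smallest x in [lo, hi] with p(x) true, assuming p is monotone
     * (false ... false true ... true). Returns hi + 1 if p is never true.
     */
    static long firstTrue(long lo, long hi, LongPredicate p) {
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (p.test(mid)) {
                hi = mid - 1; // mid works, look for something smaller
            } else {
                lo = mid + 1; // everything up to mid fails
            }
        }
        return lo;
    }

    /** Largest x with x * x <= n, for n >= 0. */
    static long isqrt(long n) {
        // 3037000499 is the largest long whose square does not overflow.
        return firstTrue(0, 3_037_000_499L, x -> x * x > n) - 1;
    }
}
```

No sorted collection exists here at all: the search space is the range of possible answers, and the predicate x * x > n plays the role of the comparison.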

Assume you have to search an element in a list.
You could use linear search, you’ll get O(n).
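Alternatively, you could sort it with the fastest comparison sort (O(n log n)) and then binary search (O(log n)). You’ll get O(n log n + log n) for the first search, and O(log n) for each search after that.
That means binary search is better when you search a large list many times; for a single search, the sorting step alone costs more than a linear scan. It also depends on the data structure of the list. If the list is a linked list, binary search is bad practice, since reaching the middle element is not a constant-time operation.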
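In Java this sort-once, search-many pattern maps directly onto Arrays.sort and Arrays.binarySearch. The wrapper class below is a made-up illustration; the two library calls are the standard ones from java.util.Arrays.

```java
import java.util.Arrays;

final class SortedLookup {
    private final String[] sorted;

    /** Pays the O(n log n) sorting cost once, on a private copy. */
    SortedLookup(String[] words) {
        this.sorted = words.clone();
        Arrays.sort(this.sorted);
    }

    /** O(log n) per call; binarySearch returns a negative value when the key is absent. */
    boolean contains(String word) {
        return Arrays.binarySearch(sorted, word) >= 0;
    }
}
```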
