Efficient way of storing and matching names against large data sets

For a data-loss-prevention-like tool, I need to look up different types of data such as driver's license numbers, social security numbers, names, etc. Most of this is pattern based and can be found with regular expressions, but names are a very broad category: virtually any set of characters could form a name. To make the lookup meaningful, I think I should only match against a defined dictionary of names. Here is what I am thinking.
Provide a dictionary of names as a configuration item. This seems sensible because the names may vary by use case and by geographic region. I am looking for best practices for doing this in Java. Basically, these are the questions:
What is a good data structure to store the names? Set comes to mind as the first option; are there better options, like in-memory databases?
How should I go about searching for these names in the large data sets? These data sets are really large, and I can only read them row by row.
Any other options?

Take a look at concurrent-trees and CQEngine projects.

You can do it with full text indexing or online search.
I would prefer full text indexing, e.g. with Lucene. You will have to define how the indexer finds tokens in the text (by defining the token patterns and the don't-care patterns).
Known patterns (e.g. license numbers) should be annotated at indexing time with their type. Querying the index for an annotated type (e.g. license number) will return all contained license numbers.
Flexible patterns (like names) should be indexed as tokens. You can then iterate over the collection of legal names and query the index with each one.
This approach is not the most flexible, but it is very robust to changes to the set of data files (simply add the new file to the index) or to the set of names (simply query the new name in the index).
In this approach, how you store the set of names is not really relevant to performance.
The other approach would be to search for multiple strings (names). Note that there are special search algorithms for multiple strings and that most algorithms have a preferred range of params (pattern size, alphabet size, number of patterns to search). You can get some impressions at StringBench.
This approach allows you more flexible string patterns.
However it is not robust to modifications to the set of names (then the complete search has to be repeated).
Multi-string search algorithms usually accept a set of strings to search for, but they store this set in an algorithm-specific way (most use a trie).
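As a rough illustration of the multi-string approach, here is a minimal Aho-Corasick sketch: the names are stored in a trie with failure links, and each row is scanned once regardless of how many names there are. The class name NameMatcher and the case-insensitive matching are assumptions made for this example, not something taken from the libraries mentioned here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

/** Minimal Aho-Corasick matcher (hypothetical helper): finds all dictionary names in a row in one pass. */
final class NameMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest proper suffix that is also a trie prefix
        final List<String> hits = new ArrayList<>(); // names that end at this node
    }

    private final Node root = new Node();

    NameMatcher(Iterable<String> names) {
        // Build the trie from the lower-cased names
        for (String name : names) {
            if (name.isEmpty()) continue;
            Node node = root;
            for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.hits.add(name);
        }
        // Breadth-first pass to compute failure links
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.hits.addAll(child.fail.hits); // inherit names that end at the suffix
                queue.add(child);
            }
        }
    }

    /** Returns the dictionary names occurring anywhere in the row, in order of their end position. */
    List<String> find(String row) {
        List<String> found = new ArrayList<>();
        Node node = root;
        for (char c : row.toLowerCase(Locale.ROOT).toCharArray()) {
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            found.addAll(node.hits);
        }
        return found;
    }
}
```

Each row read from the data set would be passed to find(). Note that this sketch matches substrings, so a short name can match inside a longer word; a real tool would probably add a word-boundary check on each hit.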
edit:
Efficient search for multiple patterns/strings can be done with DFA-based automata.
The first time I wanted to search efficiently in text I chose dk.brics.automaton. Its automaton is very efficient, yet it is optimized for matching, not for searching (search is done in a naive way).
I then shifted to my own implementation rexlex. It is DFA-based, but slightly slower than brics. The search algorithm is not as naive as in brics, but adds some overhead.
You find a link to a benchmark comparing both. The benchmark visualizes the problem of DFA-based regexes - the time to compile such a DFA can get very expensive if the regex is large.
I currently favor the stringandchars implementation of multi-string/pattern search. It is focused on search performance, yet I do not know how it compares to the solutions above. The most common case of searching for multiple regex patterns in text should be much more performant than with the solutions above.

Related

Using self-built approaches in Lucene search engine

I'm looking for a search engine in which I can use my own similarity measure and tokenization approaches. Lucene has been suggested as a good fit for this purpose, but I know nothing about it. I searched the internet for tutorials on recent versions of Lucene, but most of the pages are a few years old. Some of my questions are as follows:
Is it possible to change the similarity measure, tokenization, and stemming approaches and use self-built classes in Lucene? If yes, how?
Is there any difference in how text is indexed for keyword search versus phrase search? Should I build two different indexes for keyword search and phrase search? (I think removing stop words will affect the results of phrase search, and not removing them will affect the results of keyword search, won't it?)
Any information about this topic is appreciated.

This is possible, yes, and we do it on a couple of solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, which is a search server built on Lucene. To answer your questions directly:
Yes, there is a way to do this by overriding interfaces and providing your own implementation (see tutorial). Tokenization can be done without needing to override classes within Solr's default configuration, depending on how funky you need to get with tokenization.
Yes, making an index that returns accurate results is a matter of understanding how your users will be searching the index. That having been said, a large part of the complexity in how queries search comes from people wanting matching results to float to the top of the results list, which is done via scoring. Given that it sounds like you're looking to override the scoring, it may not matter for you. You should note, though, that by default Lucene will score hits across multiple columns higher than a single exact match on a single column. That means that if you store data across many columns (and you search by default across many columns), your search will get less and less "accurate".
Full text search against a single column tends to be pretty accurate phrase vs words, but you'll end up with a pretty large index.

500,000 street names - what data structure to use to implement a fast search?

So we have many street names. They come in a file. I'd probably cache them when booting the server up in production. The search should work like autocomplete, e.g. you type 'lang' and you would get maybe 8 hits: langstr, langestr, etc.

What you are looking for is some sort of compressed trie representation. You might want to look into succinct tries or DAWGs as a starting point, as they give excellent efficiency and very good space usage.
Hope this helps!
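To make the idea concrete, here is a minimal sketch of prefix completion on a plain, uncompressed trie. It is not a succinct trie or a DAWG; those store the same set of strings far more compactly, which would matter at 500,000 entries. The class name StreetTrie and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain prefix tree (hypothetical helper); a DAWG or succinct trie would be the compact variant. */
final class StreetTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so completions come out in alphabetical order
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a stored name ends here
    }

    private final Node root = new Node();

    void add(String street) {
        Node node = root;
        for (char c : street.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns up to limit stored names that start with prefix. */
    List<String> complete(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return out; // no stored name has this prefix
        }
        collect(node, new StringBuilder(prefix), limit, out);
        return out;
    }

    // Depth-first walk below the prefix node, stopping once limit results are collected
    private void collect(Node node, StringBuilder path, int limit, List<String> out) {
        if (out.size() >= limit) return;
        if (node.terminal) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            if (out.size() >= limit) return;
            path.append(e.getKey());
            collect(e.getValue(), path, limit, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

The lookup cost depends on the prefix length and the number of results returned, not on the total number of stored names.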

Autocomplete is usually implemented using one of the following:
Trees. By indexing the searchable text in a tree structure (prefix tree, suffix tree, DAWG, etc.) one can execute very fast searches at the expense of memory storage. The tree traversal can be adapted for approximate matching.
Pattern Partitioning. By partitioning the text into tokens (ngrams) one can execute searches for pattern occurrences using a simple hashing scheme.
Filtering. Find a set of potential matches and then apply a sequential algorithm to check each candidate.
Take a look at completely, a Java autocomplete library that implements some of the latter concepts.
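The pattern partitioning and filtering ideas can be combined in a few lines. The following is only a sketch under assumed names (TrigramIndex, a fixed n-gram size of 3) and is not how the completely library is implemented: a hash lookup on trigrams proposes candidates, and a sequential check confirms them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Trigram index (hypothetical helper): hashing proposes candidates, a sequential check confirms them. */
final class TrigramIndex {
    private static final int N = 3; // n-gram size, an arbitrary choice for this sketch
    private final List<String> entries = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String entry) {
        int id = entries.size();
        entries.add(entry);
        for (int i = 0; i + N <= entry.length(); i++) {
            postings.computeIfAbsent(entry.substring(i, i + N), k -> new HashSet<>()).add(id);
        }
    }

    /** Returns entries containing query as a substring; query must have at least N characters. */
    List<String> search(String query) {
        if (query.length() < N) {
            throw new IllegalArgumentException("query must have at least " + N + " characters");
        }
        // Partitioning: intersect the posting sets of every trigram in the query
        Set<Integer> candidates = null;
        for (int i = 0; i + N <= query.length(); i++) {
            Set<Integer> ids = postings.get(query.substring(i, i + N));
            if (ids == null) return new ArrayList<>();
            if (candidates == null) candidates = new TreeSet<>(ids);
            else candidates.retainAll(ids);
        }
        // Filtering: sharing all trigrams does not guarantee a match, so verify each candidate
        List<String> out = new ArrayList<>();
        for (int id : candidates) {
            if (entries.get(id).contains(query)) out.add(entries.get(id));
        }
        return out;
    }
}
```

Unlike a prefix tree, this finds the query anywhere inside an entry, at the cost of one posting set per distinct trigram.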

Is there a fast Java library to search for a string and its position in file?

I need to search a large number of files (e.g. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different formats (e.g. EDI, XML, CSV) and sometimes contain pretty random data (e.g. numerical IDs). This is why I preliminarily ruled out an index-based search engine.
The files will be searched multiple times for similar but different strings (e.g. for IDs which might have similar length and format, but which will usually be different).
EDIT END
Any ideas?

600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
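A minimal sketch of that brute-force baseline might look as follows. It assumes the files are UTF-8 text and that the search string never spans a line break; the class name FileScanner and the Hit record are invented for this example. Positions are reported as line number and column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Brute-force scan (hypothetical helper): reads each file line by line and records every hit. */
final class FileScanner {
    /** One hit: file, 1-based line number, 0-based column within that line. */
    static final class Hit {
        final Path file;
        final int line;
        final int column;

        Hit(Path file, int line, int column) {
            this.file = file;
            this.line = line;
            this.column = column;
        }

        @Override
        public String toString() {
            return file + ":" + line + ":" + column;
        }
    }

    static List<Hit> search(List<Path> files, String needle) throws IOException {
        if (needle.isEmpty()) throw new IllegalArgumentException("needle must not be empty");
        List<Hit> hits = new ArrayList<>();
        for (Path file : files) {
            // Assumption: files decode as UTF-8; malformed input makes the reader throw
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String text;
                int lineNo = 0;
                while ((text = reader.readLine()) != null) {
                    lineNo++;
                    // Report every occurrence in the line, including overlapping ones
                    for (int at = text.indexOf(needle); at >= 0; at = text.indexOf(needle, at + 1)) {
                        hits.add(new Hit(file, lineNo, at));
                    }
                }
            }
        }
        return hits;
    }
}
```

Timing this on the real files is probably the quickest way to find out whether anything more elaborate is needed.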
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text of the kind that word-based indexing handles well, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything other than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Do they contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example, indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
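The term->file:position idea can be sketched in memory as follows. The tokenizer (runs of word characters), the class name PositionIndex, and the nested-loop cross-reference are simplifying assumptions for this example; a real index would live in a DB and use sorted posting lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** term -> (file, character offset) index with a simple proximity query (hypothetical helper). */
final class PositionIndex {
    static final class Posting {
        final String file;
        final int offset;

        Posting(String file, int offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    // Assumption: a term is a run of letters, digits, or underscores
    private static final Pattern TOKEN = Pattern.compile("\\w+");
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void addFile(String file, String content) {
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            postings.computeIfAbsent(m.group(), k -> new ArrayList<>())
                    .add(new Posting(file, m.start()));
        }
    }

    /** Positions of first that are followed by second, in the same file, within maxGap characters. */
    List<Posting> candidates(String first, String second, int maxGap) {
        List<Posting> out = new ArrayList<>();
        List<Posting> seconds = postings.getOrDefault(second, Collections.emptyList());
        for (Posting a : postings.getOrDefault(first, Collections.emptyList())) {
            for (Posting b : seconds) {
                int gap = b.offset - a.offset;
                if (a.file.equals(b.file) && gap > 0 && gap <= maxGap) {
                    out.add(a);
                    break; // one confirmation per starting position is enough
                }
            }
        }
        return out;
    }
}
```

The returned postings are only candidates; each one still has to be verified against the actual file content.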
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned if there is any kind of flexibility in your search string. Do you expect being able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?

Unless you have an SSD, your main bottleneck will be all the file accesses. It's going to take about 10 seconds to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.

How to search for multiple strings in a text file

I am working with text files and want to implement a search algorithm in Java. I have a text file that I need to search.
If I want to find one word, I can do it by putting all the text into a hashmap and storing each word's occurrences. But is there an algorithm if I want to search for two strings (or maybe more)? Should I hash the strings in pairs?

It depends a lot on the size of the text file. There are usually several cases you should consider:
Lots of queries on very short documents (web pages, texts of essay length, etc.) with a text distribution like normal language. A simple O(n^2) algorithm is fine. For a query of length n, just take a window of length n and slide it over the text. Compare and move the window until you find a match. This algorithm does not care about words, so you just treat the whole search as one big string (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort, since the O(n^2) case is very rare.
Lots of queries on one large document. Preprocess your document and store it preprocessed. Common storage options are suffix trees and inverted lists. If you have multiple documents, you can build one document from them by concatenating them and storing the document boundaries separately. This is the way to go for document databases where the collection is almost constant.
If you have several documents with high redundancy and your collection changes often, use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data and you often get new sequences to find as well as new DNA from experiments, the O(n^2) part of the naive algorithm would kill your time.
There are probably lots more possibilities that need different algorithms and data structures, so you should figure out which one is best in your case.
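For the case where KMP is suggested above, a compact sketch is shown here. It is a textbook formulation with made-up names (Kmp, findAll), included only to show why the worst case stays linear: the failure table lets the search resume without re-reading text characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Knuth-Morris-Pratt search (textbook sketch, hypothetical helper names). */
final class Kmp {
    /** Returns the start index of every occurrence of pattern in text. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) return hits;

        // fail[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }

        // Scan the text once; k is the number of pattern characters currently matched
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == m) {
                hits.add(i - m + 1);
                k = fail[k - 1]; // continue so that overlapping matches are found too
            }
        }
        return hits;
    }
}
```

For example, Kmp.findAll("GATTACAGATTACA", "ATTA") returns the start positions 1 and 8.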

Some more detail is required before suggesting an approach:
Are you searching for whole words only or any substring?
Are you going to search for many different words in the same unchanged file?
Do you know the words you want to search for all at once?
There are many efficient (linear) search algorithms for strings. If possible I'd suggest using one that's already been written for you.
http://en.wikipedia.org/wiki/String_searching_algorithm
One simple idea is to use a sliding window hash with the window the same size as the search string. Then in a single pass you can quickly check to see where the window hash matches the hash of your search string. Where it matches you double check to see if you've got a real match.
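The sliding window hash described above is usually called Rabin-Karp. A minimal sketch, with an arbitrarily chosen base and modulus and made-up names, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Sliding-window (Rabin-Karp style) search; BASE and MOD are arbitrary choices for this sketch. */
final class RollingHashSearch {
    private static final long BASE = 257;
    private static final long MOD = 1_000_000_007L;

    /** Returns the start index of every occurrence of pattern in text. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) return hits;

        // high = BASE^(m-1) mod MOD, the weight of the character that leaves the window
        long high = 1;
        for (int i = 1; i < m; i++) high = high * BASE % MOD;

        // Hash of the pattern and of the first window
        long target = 0;
        long window = 0;
        for (int i = 0; i < m; i++) {
            target = (target * BASE + pattern.charAt(i)) % MOD;
            window = (window * BASE + text.charAt(i)) % MOD;
        }

        for (int start = 0; ; start++) {
            // Equal hashes can be a collision, so double check the actual characters
            if (window == target && text.regionMatches(start, pattern, 0, m)) hits.add(start);
            if (start + m >= n) break;
            // Slide the window: drop text[start], append text[start + m]
            window = (window - text.charAt(start) * high % MOD + MOD) % MOD;
            window = (window * BASE + text.charAt(start + m)) % MOD;
        }
        return hits;
    }
}
```

To search for several strings of the same length in one pass, the single target hash could be replaced by a set of hashes, which is one way to approach the two-string case from the question.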

how to approach phrase queries and term grouping

I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather
large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets,
insert it into an index, and query it just fine using StandardAnalyzer on both
ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the
booklets. Some place names use notational variants. For instance, in the body text
it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and add the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would
result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along
the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and
Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically ([X],[creek] is [X creek]),
but I also have an exhaustive list of places mentioned in the text if that helps.
Thanks for any help you can provide.

You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
I'm not aware of any existing functionality to solve your "bonus" problem.
