I have a large document containing strings like this, basically non-delimited text:
mynameisjohnsmith
I also have a collection of names, which could be really large; assume a million records. What I intend to do is check whether the document contains any name from the collection. One way to do it is to index the document, iterate over the collection, and for each entry search the index for the name. This could be really inefficient when the names are not in the document (1 million iterations).
I am wondering if there are better ways of doing it, something like indexing both the document and the names and finding an intersection.
Thanks.
The Aho-Corasick string searching algorithm uses a finite state machine to search for a large number of strings simultaneously in a document. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. It's how virus scanning software is able to efficiently search for a large number of virus signatures in files in reasonable time.
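To make the idea concrete, here is a minimal sketch of an Aho-Corasick automaton in Java. The class name AhoCorasickSketch and the method findAll are made up for illustration, and the per-node HashMap is chosen for brevity rather than for the memory footprint a million names would need; treat it as a starting point under those assumptions, not a tuned implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not a library API.
class AhoCorasickSketch {
    // One automaton state: outgoing edges, failure link, names ending here.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> outputs = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(Iterable<String> names) {
        // Phase 1: build a trie containing every name.
        for (String name : names) {
            Node node = root;
            for (char c : name.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(name);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                char c = edge.getKey();
                Node child = edge.getValue();
                Node f = current.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                // Inherit names that end at the failure target (shorter suffix matches).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every name that occurs somewhere in it.
    Set<String> findAll(String text) {
        Set<String> found = new TreeSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(Arrays.asList("john", "smith", "mary"));
        System.out.println(ac.findAll("mynameisjohnsmith"));
    }
}
```

The automaton is built once from the name collection; after that, each document is scanned in a single pass regardless of how many names there are.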
Related
I have a very large list of strings stored in a NoSQL DB. An incoming query is a string, and I want to check whether it is in the list. For an exact match this is very simple: the NoSQL DB can use the string as the primary key, and I just check whether a record with that key exists. But I need to check for fuzzy matches as well.
One approach is to traverse every string in the list and compute the Levenshtein distance between the input and each entry, but this is O(n) per query, and the list is very large (10 million entries) and may grow. This approach would add too much latency to my solution.
Is there a better way to solve this problem?
Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.
The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.
So basically, if you have a word "stack" you break it into n-grams (commonly trigrams) such as "s", "st", "sta", "tac", "ack", "ck", "k" (the shorter ones come from padding the ends of the word). You index those in your database against the database row. You then do the same for the input and look for the database rows that share matching n-grams.
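As a rough illustration of the n-gram step, the following Java sketch produces padded trigrams and counts how many two words share. The class and method names are hypothetical, and padding with two spaces on each side is one common convention, assumed here rather than taken from any particular database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class, not a library API.
class TrigramSketch {
    // Returns the trigrams of a word after padding it with two spaces on each side.
    static List<String> trigrams(String word) {
        String padded = "  " + word + "  ";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    // Counts the distinct trigrams two words have in common.
    static int shared(String a, String b) {
        Set<String> common = new HashSet<>(trigrams(a));
        common.retainAll(new HashSet<>(trigrams(b)));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(trigrams("stack"));
        System.out.println(shared("stack", "stock"));
    }
}
```

In an index, each trigram would map to the rows containing it; rows sharing enough trigrams with the query become the candidates for an exact distance calculation.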
This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:
Return only results that match enough NGrams with Solr
Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:
Sybase n-gram text index
Unfortunately, discussions of n-grams would be a long post and I don't have time. Probably it is discussed elsewhere on stackoverflow and other sites. I suggest Googling the term and reading up about it.
First of all, if searching is what you're doing, then you should use a search engine (Elasticsearch is pretty much the default). They are good at this, and you avoid reinventing the wheel.
Second, the technique you are looking for is called stemming. Along with the original String, save a normalized string in your DB. Normalize the search query with the same mechanism. That way you will get much better search results. Obviously, this is one of the techniques a search engine uses under the hood.
Could using Solr (or Lucene) be a suitable solution for you?
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:
roam~0.8
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a, c, d, etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can extend your hash set approach to the entire word list file. String comparisons are expensive, so it can be faster to build a HashSet of Integer hash codes. Read the word list once in its entirety (assuming it will not grow from 30,000 to something like 3 million words) and store each word's hash code in an Integer HashSet. When adding to the set, use:
wordListHashSet.add(mycurrentword.hashCode());
You mentioned that you have a set of 10 words that must be checked against the word list. Again, instead of a String set, create an Integer hash set from their hash codes.
Create an iterator of this Integer Hash Set.
Iterator<Integer> it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, the word is very likely in the word list. Two different strings can share a hash code, so confirm with a String comparison if false positives matter.
Using an Integer hash set is a good idea when performance is what you are after. Java caches each String's hash code after it is first computed, so repeated lookups are very fast: roughly O(1) per lookup, compared with O(log n) for a binary search of the word list.
Hope that helps!
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, like say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using Random Access Files.
Obviously, each search step would require you to first find the beginning of the word (or of the next word, depending on the implementation), which makes it a lot more difficult, and handling all the corner cases exceeds the amount of code one could provide here. But it could still be done and would surely be faster than reading through all 300,000,000 words once.
You might consider iterating through your 10-word set (maybe parse it from the file into an array) and, for each entry, using a binary search to see whether it is contained in the larger list. Binary search takes O(log N) comparisons, so in this case about log2(30,000) ≈ 15, which is significantly faster than 30,000 steps.
Since you'll repeat this step once for every word in your set, it should take about 10 * log2(30,000) ≈ 150 comparisons in total.
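A minimal sketch of that lookup in Java, assuming the word list has already been read into a List<String> that is sorted in String.compareTo order (a file that is "alphabetical" but mixes upper and lower case may need re-sorting first). The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not a library API.
class SortedLookupSketch {
    // Returns the wanted words that occur in the sorted word list.
    static Set<String> present(List<String> sortedWords, Set<String> wanted) {
        Set<String> hits = new TreeSet<>();
        for (String word : wanted) {
            // binarySearch returns a non-negative index only when the word is found.
            if (Collections.binarySearch(sortedWords, word) >= 0) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> wordList = Arrays.asList("apple", "banana", "cherry", "date");
        Set<String> wanted = new HashSet<>(Arrays.asList("banana", "zebra"));
        System.out.println(present(wordList, wanted));
    }
}
```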
You can make some improvements depending on your needs.
If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file into another Set (a HashSet). Now you just need to search for a match in this new Set. This way each lookup is O(1) on average.
I want to count the number of occurrences of a particular phrase in a document, for example "stackoverflow forums". Suppose D represents the set of documents that contain both terms.
Now, suppose I have the following data structure:
A[numTerms][numMatchedDocuments][numOccurInADocument]
where numMatchedDocuments is the size of D and numOccurInADocument is the number of occurrences a particular term occurs in a particular document, for example:
A[stackoverflow][document1][occurance1]=3;
means the term "stackoverflow" occurs in document "document1" and its first occurrence is at position "3".
Then I pick the term that occurs the least and loop over all its positions to check whether "forums" occurs at position+1 relative to the current "stackoverflow" position. In other words, if "stackoverflow" is at position 3 and I find "forums" at position 4, then that is a phrase and I've found a match for it.
The matching is straightforward per document and runs reasonably fast, but when the number of documents exceeds 2,000,000 it gets very slow. I've distributed it over cores and it gets faster, of course, but I wonder if there is an algorithmically better way of doing this.
thanks,
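Pseudo-code:
int numOfTerms = 2; // term 0 is "stackoverflow", term 1 is "forums"
int phraseCount = 0; // total phrase matches over all documents
for (int d = 0; d < D.size(); d++) { // D is the set of matched documents
    int minId = getTheLeastOccuringTerm(); // index of the rarest term
    for (int i = 0; i < A[minId][d].length; i++) { // every position of the rarest term
        boolean docPhrase = true; // reset for each candidate position
        for (int t = 0; t < numOfTerms && docPhrase; t++) { // every term of the phrase
            // term t must sit at (position of the rarest term) - minId + t
            int id = BinarySearch(A[t][d], A[minId][d][i] - minId + t);
            if (id < 0) docPhrase = false;
        }
        if (docPhrase) phraseCount++; // all terms lined up: one phrase occurrence
    }
}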
As I mentioned in the comments, a Suffix Array can solve this sort of problem. I answered a similar question ( Fastest way to search a list of names in C# ) with a simple C# implementation of a Suffix Array.
The basic idea is that you have an array of index pairs, each pointing to a document index and a position within that document. The index pair represents the string that starts at that point in the document and continues to the end of the document. But the actual documents and their contents exist only once, in your original store. The Suffix Array is just an array of these index pairs, with a pair for every position in every document. You then sort the Suffix Array in the order of the text the pairs point to. Once it is sorted, you can very quickly find any phrase among any of the documents by doing a simple binary search on the Suffix Array. Constructing (mainly sorting) the Suffix Array can be time-consuming. But once constructed, it is very fast to search. It's fairly easy on memory since the actual document contents only exist once.
It would be trivial to extend it to returning counts of phrase matches within each document.
This is a little different from the classic description of a Suffix Array, which usually operates over one single, very large string. But the changes needed to make it work for an array of strings/documents are not that large, although they can increase the amount of memory consumed by the Suffix Array depending on the maximum number of documents, the maximum document length, and how you encode the index pairs.
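The description above can be sketched in Java roughly as follows. This is not the C# implementation the answer refers to; the class name, the int[] {document, position} encoding, and the naive comparison-based sort are all assumptions made to keep the example short. It works on characters, so a phrase is matched as a plain substring.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not a library API.
class SuffixArraySketch {
    private final String[] docs;
    // Each entry is {document index, start position}; the text itself is never copied.
    private final List<int[]> suffixes = new ArrayList<>();

    SuffixArraySketch(String[] docs) {
        this.docs = docs;
        for (int d = 0; d < docs.length; d++) {
            for (int p = 0; p < docs[d].length(); p++) {
                suffixes.add(new int[] {d, p});
            }
        }
        // Naive construction: sort by the text each pair points to.
        suffixes.sort(this::compareSuffixes);
    }

    private int compareSuffixes(int[] a, int[] b) {
        String s = docs[a[0]], t = docs[b[0]];
        int i = a[1], j = b[1];
        while (i < s.length() && j < t.length()) {
            int diff = s.charAt(i++) - t.charAt(j++);
            if (diff != 0) return diff;
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return (s.length() - a[1]) - (t.length() - b[1]);
    }

    // Compares only the first phrase.length() characters of the suffix with the phrase.
    private int comparePrefix(int[] suffix, String phrase) {
        String s = docs[suffix[0]];
        for (int k = 0; k < phrase.length(); k++) {
            int i = suffix[1] + k;
            if (i >= s.length()) return -1; // suffix ended early, so it sorts before the phrase
            int diff = s.charAt(i) - phrase.charAt(k);
            if (diff != 0) return diff;
        }
        return 0; // the suffix starts with the phrase
    }

    // Returns, for each document, how many times the phrase occurs in it.
    int[] countPerDocument(String phrase) {
        int lo = 0, hi = suffixes.size();
        while (lo < hi) { // binary search for the first suffix that is not below the phrase
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(suffixes.get(mid), phrase) < 0) lo = mid + 1; else hi = mid;
        }
        int[] counts = new int[docs.length];
        // All suffixes starting with the phrase are adjacent in the sorted order.
        for (int i = lo; i < suffixes.size() && comparePrefix(suffixes.get(i), phrase) == 0; i++) {
            counts[suffixes.get(i)[0]]++;
        }
        return counts;
    }
}
```

For millions of documents the naive sort would dominate the build time, so a dedicated suffix array construction algorithm and a more compact encoding than int[] pairs would be worth considering.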
I have a text with about 300 - 500 words. I also have about 200k keywords, and I want to know for each keyword whether it is contained in the text. String.contains is quite slow; is there some way to preprocess the text?
I thought about using a suffix tree, but I'm not sure this is the best choice.
Also, are there any good libraries for this task? semanticdiscoverytoolkit, for example, has a suffix tree implementation, but after adding the string I can't figure out how to look up whether a string is contained in the tree.
Greetings,
Nico
You can try the Rabin-Karp string search algorithm. Since you are doing mostly hash (integer) comparisons, the performance is much better than with string comparisons.
1. Compute the hash of the keyword.
2. Compute the rolling hash of the text.
3. Compare these two hashes. If they match, perform the actual string comparison.
4. Advance the position by 1 character and repeat from step 2 until you reach the end of the text.
As an analogy, the rolling hash is like a "sliding window" that scrolls along the text. The hash comparison is done using the hash of the substring in the "sliding window" against the hash of the keyword.
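The steps above can be sketched in Java as follows for a single keyword. The class name, the base 256, and the modulus 1,000,000,007 are illustrative choices, not prescribed by the algorithm.

```java
// Hypothetical illustration class, not a library API.
class RabinKarpSketch {
    private static final long BASE = 256;           // assumed polynomial base
    private static final long MOD = 1_000_000_007L; // assumed prime modulus

    // Returns the index of the first occurrence of keyword in text, or -1 if there is none.
    static int indexOf(String text, String keyword) {
        int n = text.length(), m = keyword.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1; // BASE^(m-1) mod MOD, the weight of the window's first character
        for (int i = 1; i < m; i++) high = (high * BASE) % MOD;

        // Step 1 and the first window of step 2: hash the keyword and the first m characters.
        long keyHash = 0, windowHash = 0;
        for (int i = 0; i < m; i++) {
            keyHash = (keyHash * BASE + keyword.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        for (int start = 0; ; start++) {
            // Step 3: compare characters only when the hashes agree.
            if (keyHash == windowHash && text.regionMatches(start, keyword, 0, m)) {
                return start;
            }
            if (start + m >= n) return -1;
            // Step 4: slide the window by dropping the first character and appending the next.
            windowHash = (windowHash - text.charAt(start) * high % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + text.charAt(start + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("mynameisjohnsmith", "john"));
    }
}
```

Note that this scans the text once per keyword; with 200k keywords of different lengths, each length needs its own window, so it is worth profiling against the other approaches in this thread.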
You can use StringTokenizer to get each of the words and then populate a HashMap, which you check afterwards. This requires going through each list only once. Lookup times should then be very fast, which is important given the number of keywords you have.
It may be worth it to profile this method against something like Lucene.
Let's say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have following text that is about 85KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using the Apache Commons Lang method:
String[] words = // all 500 words
String[] masks = // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words, masks);
Is there another way to do this that could be more efficient when it comes to memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and which words were replaced, and for each word how many times it was replaced?
I wouldn't care much about CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the number of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word has been replaced.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
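A minimal Java sketch of that process follows. It deviates from the steps in two labeled ways: it finds words with a regular expression ([A-Za-z]+) instead of StringTokenizer so that punctuation such as the period after "Dunam." is preserved, and it writes one dash per letter rather than a fixed "----". The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not a library API.
class WordMaskSketch {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+"); // assumed definition of a word

    // Masks every word that is a key of counts and increments that word's counter.
    static String mask(String text, Map<String, Integer> counts) {
        StringBuilder out = new StringBuilder(text.length()); // pre-sized to avoid reallocations
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()); // copy spaces and punctuation unchanged
            String word = m.group();
            Integer seen = counts.get(word);
            if (seen != null) {
                counts.put(word, seen + 1);
                for (int i = 0; i < word.length(); i++) out.append('-');
            } else {
                out.append(word);
            }
            last = m.end();
        }
        out.append(text, last, text.length()); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Hopa", 0);
        counts.put("Dunam", 0);
        System.out.println(mask("from Hopa store with his best friend Dunam.", counts));
        System.out.println(counts);
    }
}
```

After the call, the map doubles as the statistics asked for: entries with a count above zero are the words that were replaced, and the value is how often.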
I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or StringTokenizer). For every word, you need to know whether it is in the set of 500 words, and if so, replace it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 670 (the JDK default load factor is 0.75, and 500 / 0.75 ≈ 667 keeps 500 entries below the resize threshold). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck (HashMap) you get is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) in array and even O(log(n)) if you do binary search on sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85KB of text.
Return the StringBuffer.toString() from your method and you are done! Regards, - M.S.
PS If you are building the map on a server and doing the filtering somewhere else (on a client) and need to transport the dictionary, note that both HashMap and Hashtable implement Serializable, so either can be sent. HashMap is usually the better choice because it avoids Hashtable's synchronization overhead. Later, - M.S.