I am working on a small project that will essentially search for a user-given word in multiple text files. I plan on accomplishing this by hashing each file into a large hash table prior to the search, then hashing the user's choice of word and comparing it to the hashtable.
My issue is that I would like to exclude certain common words like "the" from my hashing. The two ways I have thought of to do this are as follows:
Create a regex which is essentially "\bword1\b|\bword2\b|" and so on and so forth, and then do a String.split(regex,"") to remove those words from the text before I start hashing
As I process each word, do a String.matches(regex) to check to see if that word falls into my regex of excluded words. If it does, simply skip to the next word.
I feel like these two solutions are very similar, and am wondering if there might be a more efficient way of doing this.
I would suggest maintaining a HashSet of stopwords (that's the official term in the field of Information Retrieval). You just check stopwords.contains(word).
Let me also suggest a technique used to quickly search for words in documents: the inverted index. Don't maintain a hashmap per file; maintain a single hashmap where keys are words and values are sets of document IDs containing the word.
Then, if you want to search for all documents containing two given words, you'll be able to serve that request by just fetching two sets and calculating their intersection.
Related
I am working on a text classification program. My training data is 700+ text categories and each categories contains 1-350 text phrases. 16k+ phrases total. The data that needs to be classified are text phrases. I want to classify the data so it gives me 5 most similar categories. The training data shares a lot of common words.
My first attempt was using Naive Bayes Theorem using this library on github because this library was very easy to use and allowed me to load my training data as strings. But other users reported issues and when I tried to classify my data, my input is either classified wrong or not classified.
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier
So I think the library was the issue, so Im going try different libraries and look into k means clustering since my data is high variance.
So when I looking at other libraries, they all require input and training data as a vector matrix. I looked at word2vec and td-idf to convert text vectors. I understand tf-idf, and that I can get the weight of the word compared to the rest of the documents. But how can I use it classify my input data to categories? Would each category be a document? Or would all categories be a single document?
edit:data sample
SEE_BILL-see bill
SEE_BILL-bill balance
SEE_BILL-wheres my bill
SEE_BILL-cant find bill
PAY_BILL-pay bill
PAY_BILL-make payment
PAY_BILL-lower balance
PAY_BILL-remove balance
PAST_BILL-last bill
PAST_BILL-previous bill
PAST_BILL-historical bill
PAST_BILL-bill last year
First of all, the end of your question doesn't make any sense because you didn't say what are the classes you want to to classify the text phrases to. Now, I can help you with the vectorization of the text phrases.
Tf-idf is pretty good but you have to have a good preprocessing to do it. Also, you would have to create the vectors yourself. The problem with it is that you will give the vector of length of all the distinct words in your dataset, even the same words in different forms in which they occur in the dataset. So if you have the word go in your dataset, it's likely that there will be several forms of that word including going, Go, gone, went and so on. That's why you have to have a good preprocessing tho put all of those forms of word go to it's root form. Also, you have to lowercase the whole dataset because words go and Go are not the same. But even if you do all of that and make a perfect preprocessing pipeline, you will get the vector of length 20k+. You would then have to manually select the features (words) you want to leave in the vector and delete the others. That means, if you want to have vector of size 300 you would have to delete the 19 700 words from the vector. Of course, you would be left with the 300 best distinctive. If you want to dive into it deeper and see how exactly it works, you can check it out here
On the other hand, word2vec maps any word to a 300 dimensional vector. Of course, you would have to do some preprocessing, similar to the tf-idf, but this method is much less sensitive. You can find how word2vec works here
In conclusion, I would recommend you go with word2vec because it's much easier to start with. There is pretrained model from Google which you can download here
The two most popular approaches would be to:
represent each phrase/sentence as a bag of words where you basically one-hot encode each word of the phrase and the dimension of the encoding is the dimension of your vocabulary (total number of words)
use embeddings based on popular models, like word2vec which put each word in a X-dimensional vector space (e.g. 300-dimensional) so each of your phrases/sentences would be a sequence of X-dimensional vectors
An even more extreme approach would be to embed whole sentences using models like universal-sentence-encoder. In short: it's similar to word2vec but instead of words converts whole sentences to (512-dimensional) vector space. Than it's easier to find "similar" sentences.
For a data loss prevention like tool, I have a requirement where I need to lookup different types of data such as driver's license number, social security number, names etc. While most of this is pattern based and hence could be looked up using pattern matching with regular expressions, name happens to be a very broad category. There could be virtually any set of characters that could form a name. However, to make it a meaningful lookup, I think I should only lookup them against a defined dictionary of names. Here is what I am thinking.
Provide a dictionary of names as a configuration item. This looks more sensible as for each use case, the names might vary from different geographic regions. I am looking for best practices for doing this in Java. Basically these are the questions-
What is a good data structure to store the names. Set comes to mind as the first option, are there better options like in memory databases.
How should I go about searching these names in the large data sets. These data sets are really large and I only have the facility to read them row by row.
Any other option?
Take a look at concurrent-trees and CQEngine projects.
You can do it with full text indexing or online search.
I would prefer full text indexing, e.g. with Lucene. You will have to define how the indexer finds tokens in the text (by defining the token patterns and the dont-care-patterns).
Known patterns (e.g. license numbers) should be annotated at indexing time with their type. Querying the index for an annotated type (e.g. license number) will return you all contained license numbers.
Flexible patterns (like names) should be index as tokens. You can then iterate over the collection of legal names and query the index with it.
This approach is not the most flexible, but it is very robust to changes to the set of data files (simply put the new file to the index) or to the set of names (simply query the new name in the index).
In this approach it is not really performance relevant how you store the set of names
The other approach would be to search for multiple strings (names). Note that there are special search algorithms for multiple strings and that most algorithms have a preferred range of params (pattern size, alphabet size, number of patterns to search). You can get some impressions at StringBench.
This approach allows you more flexible string patterns.
However it is not robust to modifications to the set of names (then the complete search has to be repeated).
Multi-string usually would accept a set of strings to search, but they will store this set in a algorithm-specific way (most use a trie)
edit:
Efficient search for multiple patterns/strings can be done with DFA-based automata.
The first time I wanted to search efficiently in text I chose dk.brics.automaton. Its automaton is very efficient, yet it is optimized for matching not for searching (search is done in naive way).
I then shifted to my own implementation rexlex. It is DFA-based, but slightly slower than brics. The search algorithm is not as naive as in brics, but adds some overhead.
You find a link to a benchmark comparing both. The benchmark visualizes the problem of DFA-based regexes - the time to compile such a DFA can get very expensive if the regex is large.
I currently favor the stringandchars implementation of multi-string/pattern-search. It is focused on search performance, yet I do not know how it compares to the solutions above. The most common case of searching multiple regex patterns in text will be much more performant as in the above solutions.
I'm creating a mini search engine in Java which basically grabs all of the RSS feeds that a user specifies and then allows him or her to choose a single word to search for. Since the RSS feed documents are fairly limited in number, I'm thinking about processing the documents first before the user enters his or her search term. I want to process them by creating hashmaps linking certain keywords to a collection of records which contain the articles themselves and the number of times the word appears in the article. But, how would I determine the keywords? How can I tell which words are meaningless and which aren't?
The concept of "what words should I ignore?" is generally named stopwords. The best search engines do not use stopwords. If I am a fan of the band "The The", I would be bummed if your search engine couldn't find them. Also, searching for exact phrases can be screwed up by a naive stopwords implementation.
By the way, the hashmap you're talking about is called an inverted index. I recommend reading this (free, online) book to get an introduction to how search engines are built: http://nlp.stanford.edu/IR-book/information-retrieval-book.html
In Solr, I believe these are called 'stopwords'.
I believe they just use a text file to define all the words that they will not search on.
A small extract re. stopwords from NLTK from Ch. 2:
There is also a corpus of stopwords, that is, high-frequency words
like the, to and also that we sometimes want to filter out of a
document before further processing. Stopwords usually have little
lexical content, and their presence in a text fails to distinguish it
from other texts.
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]
Stopwords are one thing you should use. Lots of stopword lists are available on the web.
However I'm writing an answer because the previous ones didn't mention TF-IDF which is a metric for how important a word is in the context of your corpus of documents.
A word is more likely to be a keyword foe a document if it appears a lot in it (term frequency) and doesn't appear frequently in other documents (inverse document frequency). This way words like a, the, where, are naturally ignored, because they appear in every document.
P.S. On a related topic, you'll probably be interested in other lists, i.e. swearwords :)
P.P.S. Hashmaps are a good thing, but you should also check suffix trees for your task.
i am working in text files. I want to implement a search algorithm in Java. I have a text files i need to search.
If I want to find one word I can do it by just putting all the text into the hashmap and store each word's occurrence. But is there any algorithm if i want to search for two strings (or may be more)? Should i hash the strings in pair of two ?
It depends a lot on the size of the text file. There are usually several cases you should consider:
Lot's of queries on very short documents (web pages, texts of essay length etc). Text distribution like normal language. A simple O(n^2) algorithm is fine. For a query of length n just take a window of length n and slide it over. Compare and move the window until you find a match. This algorithm does not care about words, so you just see the whole search as a big string (including spaces). This is probably what most browsers does. KMP or Boyer Moore is not worth the effort, since the O(n^2) case is very rare.
Lot's of queries on one large document. Preprocess your document and store it preprocessed. Common storage options are suffix trees and inverted lists. If you have multiple documents you can build one document from when by concatenating them and storing the end of documents seperately. This is the way to go for document databases where the collection is almost constant.
If you have several documents where you have a high redundancy and your collections changes often, use KMP or Boyer Moore. For example if you want to find certain sequences in DNA data and you often get new sequences to find as well new DNA from experiments, the O(n^2) part of the naive algorithm would kill your time.
There are probably lot's of more possibilities that need different algorithms and data structures, so you should figure out which one is the best in your case.
Some more detail is required before suggesting an approach:
Are you searching for whole words only or any substring?
Are you going to search for many different words in the same unchanged file?
Do you know the words you want to search for all at once?
There are many efficient (linear) search algorithms for strings. If possible I'd suggest using one that's already been written for you.
http://en.wikipedia.org/wiki/String_searching_algorithm
One simple idea is to use a sliding window hash with the window the same size as the search string. Then in a single pass you can quickly check to see where the window hash matches the hash of your search string. Where it matches you double check to see if you've got a real match.
I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather
large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets,
insert it into an index, and query it just fine using StandardAnalyzer on both
ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the
booklets. Some place names use notational variants. For instance, in the body text
it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and add the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would
result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along
the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and
Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically [],[creek] is [ creek]
but I also have an exhaustive list of places mentioned in the text if that helps.
Thanks for any help you can provide.
You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
I'm not aware of any existing functionality to solve your "bonus" problem.