create and query an n-gram index with lucene - java

I would like to build an index containing n-grams of each line from my input file, which looks like this:
Segeln bei den Olympischen Sommerspielen
Erdmond
Olympische Spiele
Turnen bei den Olympischen Sommerspielen
Tennis bei den Olympischen Sommerspielen
Geschichte der Astronomie
I need the n-grams because I would like to search in the index but I have to assume that there are many typing errors in the search term. For example, I would like to find "Geschichte der Astronomie" if I search with the term "schichte astrologie". It would be even better if it could give me a list of the best possible matches, let's say the best 10 matches, no matter how bad they may be.
I hope you can point me in the right direction if there is a better way to achieve this than with n-grams, or give me a hint on how to create the index and how to query it. I would be very happy to have an example that helps me understand how to do it.
I currently use Lucene 4.3.1. I would prefer to implement it in Java and not build the index on the command line.

There are a lot of different ways to approach this problem, and Lucene has a lot of tools to help with them. N-grams are probably not the best approach in this situation, to my mind. Some alternatives:
Stemmers reduce terms to their root, based on linguistic rules (e.g. matching "fishing", "fished", and "fish"). (I don't claim to know how GermanStemmer handles the "ge" prefix, but that would be a good example of something a stemmer might deal with.)
A synonym filter can handle specific known synonyms you want to recognize (e.g. "astrology" = "astronomy").
Fuzzy queries can be used to obtain matches within a small edit distance (see the sketch after this list).
Among other possibilities.
As far as implementing n-grams goes, NGramTokenizer would be the correct tokenizer for that.
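A minimal sketch of the fuzzy-query route, assuming Lucene 4.3 with the analyzers-common module on the classpath; the field name "title", the RAMDirectory, and the sample data are just choices for the example, and each query word is allowed the maximum of 2 edits that FuzzyQuery supports:

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FuzzySearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_43, analyzer));

        // One document per line of the input file
        List<String> lines = Arrays.asList(
                "Segeln bei den Olympischen Sommerspielen",
                "Erdmond",
                "Geschichte der Astronomie");
        for (String line : lines) {
            Document doc = new Document();
            doc.add(new TextField("title", line, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        // One fuzzy clause per query word, each allowing up to 2 edits
        BooleanQuery query = new BooleanQuery();
        for (String word : "schichte astrologie".toLowerCase().split("\\s+")) {
            query.add(new FuzzyQuery(new Term("title", word), 2), BooleanClause.Occur.SHOULD);
        }

        // Always return the 10 best matches, however weak they are
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs top = searcher.search(query, 10);
        for (ScoreDoc sd : top.scoreDocs) {
            System.out.println(sd.score + "  " + searcher.doc(sd.doc).get("title"));
        }
    }
}

A real index would of course live on disk (FSDirectory) and read the lines from your input file; because the clauses are SHOULD, the 10 best documents come back even when no clause matches perfectly.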

Related

Best algorithm for analyzing unique sentences and filtering them?

I am in the middle of writing some code to filter sentences into different groups.
The sentences are formed from the descriptions of incident tickets that my service desk has processed.
I have to filter them into 5 categories: Laptop, Telephony, Network, Printer, Application.
An example of a description from the Application category is: "Please can you install CMS on XXXX YYYYYYY laptop"
I understand that it is impossible to get this perfect. But I was wondering, what is the best way to tackle this? As you can see from the example, it falls into the Application category but contains the keyword "laptop".
If there's any more information I can provide you with, please let me know. Every little helps. Thanks
Maintain different lists or queues for different categories.
When you receive a sentence, check for keyword occurrences in that sentence and add/push it to the appropriate list/queue.
You can maintain a map which tells you which list/queue belongs to which keyword, as sketched below.
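A minimal sketch of that map-based routing; the keyword-to-category pairs below are invented for the example:

import java.util.*;

public class KeywordRouter {
    public static void main(String[] args) {
        // Which keyword routes to which category (assumed mapping for the example)
        Map<String, String> keywordToCategory = new HashMap<String, String>();
        keywordToCategory.put("install", "Application");
        keywordToCategory.put("laptop", "Laptop");
        keywordToCategory.put("printer", "Printer");
        keywordToCategory.put("vpn", "Network");
        keywordToCategory.put("phone", "Telephony");

        // One queue per category
        Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
        for (String category : new HashSet<String>(keywordToCategory.values())) {
            queues.put(category, new LinkedList<String>());
        }

        String sentence = "Please can you install CMS on this laptop";
        for (String word : sentence.toLowerCase().split("\\W+")) {
            String category = keywordToCategory.get(word);
            if (category != null) {
                queues.get(category).add(sentence);
                break; // push to the first matching category only
            }
        }
        System.out.println(queues);
    }
}

As the other answers point out, first-match routing like this is exactly where the ambiguity problem shows up when one sentence contains keywords from several categories.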
Interesting question! As seen in your example, there can be multiple keywords within the same sentence, making it difficult to decipher which category the sentence will belong to.
In order to get around this, I would suggest possibly using a separate priority queue for each category, containing keywords for each category in order of priority.
For example, you would have a priority queue of keywords for the Application category, and (within that priority queue) "install" would be of higher priority than "laptop" or "computer", because "install" is more closely related to applications than "laptop".
In your algorithm for choosing which category a sentence is part of, I would do a round-robin search through all five priority queues until a match is found - the highest priority match out of all five categories takes the sentence. This is one possible solution I can think of.
NOTE: For this to work properly, of course it is important to pick and choose carefully which keywords go into which categories; for example, in the Laptop category, it may seem natural to have "laptop" be the highest priority keyword - however, this would cause lots of collisions because laptop will probably be a very commonly used word in sentences. You should have very specific keywords pertaining to each category, rather than having broad/surface level keywords like "laptop" (or have "laptop" be a very low priority keyword).
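A rough sketch of the priority idea, keeping an ordered keyword list per category (highest priority first); rather than a literal round-robin, this scans each category's list and keeps the best-ranked match, and the keywords and their order are invented just to illustrate the lookup:

import java.util.*;

public class PriorityKeywordClassifier {
    // Keywords per category, highest priority first (assumed lists for the example)
    private static final Map<String, List<String>> CATEGORY_KEYWORDS = new LinkedHashMap<String, List<String>>();
    static {
        CATEGORY_KEYWORDS.put("Application", Arrays.asList("install", "cms", "software"));
        CATEGORY_KEYWORDS.put("Laptop", Arrays.asList("screen", "battery", "laptop"));
        CATEGORY_KEYWORDS.put("Printer", Arrays.asList("toner", "cartridge", "printer"));
        CATEGORY_KEYWORDS.put("Network", Arrays.asList("vpn", "wifi", "network"));
        CATEGORY_KEYWORDS.put("Telephony", Arrays.asList("voicemail", "handset", "phone"));
    }

    /** Pick the category whose highest-priority keyword appears in the sentence. */
    public static String classify(String sentence) {
        String text = sentence.toLowerCase();
        String best = "Unknown";
        int bestRank = Integer.MAX_VALUE; // lower rank = higher priority
        for (Map.Entry<String, List<String>> entry : CATEGORY_KEYWORDS.entrySet()) {
            List<String> keywords = entry.getValue();
            for (int rank = 0; rank < keywords.size(); rank++) {
                if (text.contains(keywords.get(rank)) && rank < bestRank) {
                    bestRank = rank;
                    best = entry.getKey();
                    break; // only the best keyword of this category matters
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("Please can you install CMS on XXXX YYYYYYY laptop"));
        // prints Application, because "install" (rank 0) beats "laptop" (rank 2)
    }
}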
This is actually a machine learning problem (text categorization) that you could solve using several algorithms: support vector machines, multinomial logistic regression, naive Bayes, and more.
There are many libraries that will help you; here is one (Java):
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
Python also has a very good library:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
If you want to take this approach, you are going to need a training dataset, meaning that you need to manually label a set of documents that the algorithm will use to automatically learn which keywords are important.
Hope it helps!
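If you would rather experiment without a library first, a very small multinomial naive Bayes classifier can be written in plain Java; the categories and training sentences below are made up, and a real training set would need many more labelled examples:

import java.util.*;

public class NaiveBayesSketch {

    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Count the words of one manually labelled sentence. */
    public void train(String category, String sentence) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : sentence.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            wordsPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Return the category with the highest log prior + log likelihood. */
    public String classify(String sentence) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = wordsPerCategory.getOrDefault(category, 0);
            for (String word : sentence.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // add-one smoothing so unseen words do not zero out the score
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Application", "please install CMS on the laptop");
        nb.train("Laptop", "laptop screen is broken and will not turn on");
        nb.train("Network", "cannot connect to the office network");
        System.out.println(nb.classify("install new software on my laptop")); // Application
    }
}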
If all you can do is receive these sentences and apply some logic to them,
why not just filter them by regex?
See for example,
Regex to find a specific word in a string in java
e.g.
List<String> laptopList = new ArrayList<String>();
for (String item : sentenceList) {
    if (item.matches(".*\\blaptop\\b.*")) {
        laptopList.add(item);
    }
}
You are looking at the keyword "laptop". But there is also the keyword "install", which primarily indicates the installation of some application.
So you can try something like:
if ( sentence.contains("install") || (sentence.contains("install") && sentence.contains("laptop")) )
{
    applicationTickets.add(sentence);
}
else if ( sentence.contains("laptop") || other conditions )
{
    laptopTickets.add(sentence);
}
else if ( )
..........
else if ( )
..........
If you observe the code, the Application category is checked first because its sentences can also match Laptop terms; checking it first keeps such a sentence from falling into the Laptop category.
You can use loops to check all the conditions. The keywords can be added to a specific list for every category.

Handling large search queries on relatively small index documents in Lucene

I'm working on a project where we index relatively small documents/sentences, and we want to search these indexes using large documents as the query. Here is a relatively simple example:
I'm indexing this document:
docId : 1
text: "back to black"
And I want to query using the following input:
"Released on 25 July 1980, Back in Black was the first AC/DC album recorded without former lead singer Bon Scott, who died on 19 February at the age of 33, and was dedicated to him."
What is the best approach for this in Lucene? For simple examples, where the text I want to find is exactly the input query, I get better results using my own analyzer + a PhraseQuery than using QueryParser.parse(QueryParser.escape(...my large input...)) - which ends up creating a big Boolean/Term query.
But I can't use a PhraseQuery approach for a real-world example; I think I have to use a word n-gram approach like the ShingleAnalyzerWrapper, but as my input documents can be quite large, the combinatorics will become hard to handle...
In other words, I'm stuck and any idea would be greatly appreciated :)
P.S. I didn't mention it, but one of the annoying things with indexing small documents is also that, due to the "norms" value (a float) being encoded on only 1 byte, all 3-4 word sentences get the same norm value, so searching for sentences like "A B C" makes results "A B C" and "A B C D" show up with the same score.
Thanks !
I don't know how many sentences you have, but you may want to invert the problem: store your sentences as queries, index each incoming document in a transient in-memory index, and run all your queries against it to find the matching ones.
(Note: this is how Elasticsearch's percolator works.)
Edit (2013-06-21):
If you have a very large number of sentences, it might still be better to store the sentences in an index. But instead of using phrase queries, you could try indexing with Lucene's ShingleFilter. At query time, your approach of building the query manually instead of using QueryParser is a good one, but if you index shingles, you can just build a pure boolean query where each clause matches a shingle instead of using a phrase query.
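A rough sketch of that shingle idea, assuming the documents were indexed with the same ShingleAnalyzerWrapper and a field called "text"; the shingle sizes (2 to 3 words) are an arbitrary choice for the example:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class ShingleQueryBuilder {

    // Wraps StandardAnalyzer so that 2- and 3-word shingles are emitted as terms
    private static final Analyzer SHINGLES =
            new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_43), 2, 3);

    /** Turn a large piece of query text into one SHOULD clause per shingle. */
    public static BooleanQuery buildQuery(String field, String text) throws Exception {
        BooleanQuery query = new BooleanQuery();
        TokenStream stream = SHINGLES.tokenStream(field, new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            query.add(new TermQuery(new Term(field, term.toString())), BooleanClause.Occur.SHOULD);
        }
        stream.end();
        stream.close();
        return query;
    }
}

Note that BooleanQuery has a default limit of 1024 clauses, so a very large input may need BooleanQuery.setMaxClauseCount(...) or some pruning of the shingles.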

Which is the best choice to indexing a Boolean value in lucene?

Indexing a Boolean value (true/false) in Lucene (no need to store it).
I want lower disk space usage and higher search performance.
doc.add(new Field("boolean","true",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new Field("boolean","1",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new NumericField("boolean",Integer.MAX_VALUE,Field.Store.NO,true).setIntValue(1));
Which should I choose? Or is there any other, better way?
Thanks a lot
An interesting question!
I don't think the third option (NumericField) is a good choice for a boolean field. I can't think of any use case for this.
The Lucene search index (leaving to one side stored data, which you aren't using anyway) is stored as an inverted index: each distinct term is recorded once per field along with the list of documents containing it, so whether the term text is "true" or "1" makes virtually no difference to index size or lookup speed.
That leaves your first and second options as (theoretically) identical.
If I were faced with this, I think I would choose option one ("true" and "false" terms), for what it's worth.
Your choice of NOT_ANALYZED_NO_NORMS looks good, I think.
Lucene jumps through an elaborate set of hoops to make NumericField searchable by NumericRangeQuery, so definitely avoid it in all cases where your values don't represent quantities. For example, even if you index an integer, but only as a unique ID, you would still want to use a plain String field. Using "true"/"false" is the most natural way to index a boolean, while using "1"/"0" gives just a slight advantage by avoiding the possibility of a case mismatch or typo. I'd say this advantage is not worth much, and I would go for true/false.
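For what it's worth, on the 4.x API the first option can be expressed with StringField, which is likewise indexed as a single un-analyzed term with norms omitted; a small sketch (the field name and the combined query are just examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BooleanFieldSketch {

    /** Index the flag as a plain, un-analyzed term: either "true" or "false". */
    public static void addFlag(Document doc, boolean value) {
        doc.add(new StringField("boolean", value ? "true" : "false", Field.Store.NO));
    }

    /** Restrict some other query to documents where the flag is true. */
    public static Query withFlag(Query mainQuery) {
        BooleanQuery query = new BooleanQuery();
        query.add(mainQuery, BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("boolean", "true")), BooleanClause.Occur.MUST);
        return query;
    }
}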
Use Solr (a flavour of Lucene) - it indexes all basic Java types natively.
I've used it and it rocks.

comparing "the likes" smartly

Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
The task is to pick the closest name to compare against later. Unfortunately, an identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only about 2 times closer, whereas in reality we can look at the two files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out similarity from characters being in the same places, but also test whether the resulting weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope I am being clear.
Please share your experience and thoughts on this.
(Whatever approach you suggest should not depend on the number of characters the filename consists of.)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with the empty string).
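A small sketch of that suggestion: normalize both names first (lowercase, strip spaces and underscores), then take the plain Levenshtein distance, so "_m_y_f_i_l_e.txt" ends up at distance 0 from "_myFile.txt" while "somethingReallyClever.txt" stays far away:

public class FileNameDistance {

    /** Lowercase and strip spaces/underscores before comparing. */
    static String normalize(String name) {
        return name.toLowerCase().replaceAll("[\\s_]", "");
    }

    /** Plain dynamic-programming Levenshtein distance (lower = closer). */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String target = normalize("_myFile.txt");
        System.out.println(levenshtein(target, normalize("_m_y_f_i_l_e.txt")));          // 0
        System.out.println(levenshtein(target, normalize("somethingReallyClever.txt"))); // much larger
    }
}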

how to perform word clustering using k-means algorithm in java

Please help me with how to perform word clustering using the k-means algorithm in Java. From the set of documents, I get each word and its frequency count. But then I don't know how to start the clustering. I have already searched Google, but have no idea. Please tell me the steps to perform word clustering. It's very much needed now. Thanks in advance.
"Programming Collective Intelligence" by Toby Segaran has a wonderful chapter on how to do this. The examples are in Python, but they should be easy to port to Java.
In clustering, the most important thing is to build a method which checks how "close" two things are to each other. E.g. if you are interested in strings of similar length, this could be something like:
int calculateDistance(String s1, String s2) {
    return Math.abs(s1.length() - s2.length());
}
Then, I'm not entirely sure, but it can go like this:
1. choose (can be randomly) the first k strings as cluster centers,
2. iterate over all strings and assign each one to its "nearest" center.
Then you can do something like picking the "middle" of every cluster as its new center, and start again. I don't remember it 100%, but I think it is a good way to start.
And remember, the most important thing is the method calculateDistance()!
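To make the assign/re-center loop concrete, here is a tiny one-dimensional k-means sketch that clusters words by their frequency counts (which is what the question says is available); the words and counts are invented, and a real solution would use richer features than a single number:

import java.util.*;

public class WordFrequencyKMeans {

    public static void main(String[] args) {
        // Made-up word -> frequency counts standing in for the real data
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        freq.put("the", 120); freq.put("and", 90); freq.put("lucene", 12);
        freq.put("index", 10); freq.put("shingle", 2); freq.put("tokenizer", 3);

        int k = 2;
        List<String> words = new ArrayList<String>(freq.keySet());

        // 1. pick the first k words' frequencies as the initial centers
        double[] centers = new double[k];
        for (int c = 0; c < k; c++) centers[c] = freq.get(words.get(c));

        int[] assignment = new int[words.size()];
        for (int iter = 0; iter < 20; iter++) {
            // 2. assign every word to its nearest center
            for (int i = 0; i < words.size(); i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(freq.get(words.get(i)) - centers[c])
                            < Math.abs(freq.get(words.get(i)) - centers[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // 3. move each center to the mean of its cluster
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < words.size(); i++) {
                sum[assignment[i]] += freq.get(words.get(i));
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }

        // Print the resulting clusters
        for (int c = 0; c < k; c++) {
            System.out.print("Cluster " + c + ":");
            for (int i = 0; i < words.size(); i++) {
                if (assignment[i] == c) System.out.print(" " + words.get(i));
            }
            System.out.println();
        }
    }
}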
