Lucene Analyzer for a simple direct field search


I tried many Lucene analyzers and found KeywordAnalyzer to be the best match for my requirement. I am using the same KeywordAnalyzer both for updating the document and for searching it with QueryParser.
I want to search for the values with wildcard support.
For example, if a field "country" contains the value "india",
I can search for the same field as "ind*", "ndi", "india", etc.
I am getting a match for all other searches except the exact match,
i.e. when I search for the exact word (country:india), I do not get any match.
If I change the query to "country:india*" or "country:indi?", I get the match.
I also have another question: if there is a country with the name "not", how can I search for it?
I tried "country:"not"" and "country:\not", but both failed.
What is actually happening in both these cases?
Please help.

I suspect you have some whitespace or other extraneous characters after the country name. You could either trim your input before adding it to Lucene, or implement a custom keyword analyzer, and add a TrimFilter, something like:
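import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.miscellaneous.TrimFilter;
import org.apache.lucene.util.Version;

public final class CustomKeywordAnalyzer extends Analyzer {
    public CustomKeywordAnalyzer() {
    }
    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        // Emit the whole field value as a single token, then trim surrounding whitespace
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        TokenStream filter = new TrimFilter(Version.LUCENE_43, tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}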
As for searching for "not", being lowercase should be enough to keep it from being interpreted as a boolean operator (the AND, OR, and NOT operators must be uppercase, per the documentation). Those words will get caught by a standard English StopFilter though, such as the one used by StandardAnalyzer. Are you sure you are using only a KeywordAnalyzer when querying?
Barring that, though, the sure way to avoid query parser reserved words would be to just bypass the query parser entirely, and construct the query yourself:
Query query = new TermQuery(new Term("country", userQuery));

Related

Mimic Elasticsearch MatchQuery

I'm writing a program that currently uses Elasticsearch as a back-end database/search index. I'd like to mimic the functionality of the /_search endpoint, which currently uses a match query:
{
"query": {
"match" : {
"message" : "Neural Disruptor"
}
}
}
Running some sample queries against a massive World of Warcraft database yielded the following results:
Search Term        Search Result
------------------ -----------------------
Neural Disruptor   Neural Needler
Lovly bracelet     Ruby Bracelet
Lovely bracelet    Lovely Charm Bracelet
After looking through elasticsearch's documentation, I found that the match query is fairly complex. What's the easiest way that I can simulate a match query with just lucene in java? (It appears to be doing some fuzzy matching, as well as looking for terms)
Importing elasticsearch code for MatchQuery (I believe org.elasticsearch.index.search.MatchQuery) doesn't seem to be that easy. It's heavily embedded into Elasticsearch, and doesn't look like something that can be easily pulled out.
I don't need a foolproof "must match exactly what Elasticsearch matches" solution; I just need something close, or something that can fuzzy match and find the best match.

Whatever is sent to the q= parameter of the _search endpoint is used as is by the query_string query (not org.elasticsearch.index.search.MatchQuery) which understands the Lucene expression syntax.
The query parser syntax is defined in the Lucene project using JavaCC and the grammar can be found here if you wish to have a look. The end-product is a class called QueryParser (see below).
The class inside the ES source code that is responsible for parsing the query string is QueryStringQueryParser which delegates to Lucene's QueryParser class (generated by JavaCC).
So basically, if you get an equivalent query string as what gets passed to _search?q=..., then you can use that query string with QueryParser.parse("query-string-goes-here") and run the reified Query using just Lucene.

It's been a while since I've worked directly with Lucene, but what you want should be, initially, fairly straightforward. The base behavior of a Lucene query is very similar to the match query (query_string is exactly equivalent to Lucene, but match is very close). I put together a small example that works with just Lucene (7.2.1) if you want to try it out. The main code is as follows:
public static void main(String[] args) throws Exception {
    // Create the in-memory Lucene index
    RAMDirectory ramDir = new RAMDirectory();
    // Create the analyzer (has default stop words)
    Analyzer analyzer = new StandardAnalyzer();
    // Create a set of documents to work with
    createDocs(ramDir, analyzer);
    // Query the set of documents
    queryDocs(ramDir, analyzer);
}

private static void createDocs(RAMDirectory ramDir, Analyzer analyzer)
        throws IOException {
    // Set up the configuration for the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // IndexWriter creates and maintains the index
    IndexWriter writer = new IndexWriter(ramDir, config);
    // Create the documents
    indexDoc(writer, "document-1", "hello planet mercury");
    indexDoc(writer, "document-2", "hi PLANET venus");
    indexDoc(writer, "document-3", "howdy Planet Earth");
    indexDoc(writer, "document-4", "hey planet MARS");
    indexDoc(writer, "document-5", "ayee Planet jupiter");
    // Close down the writer
    writer.close();
}

private static void indexDoc(IndexWriter writer, String name, String content)
        throws IOException {
    Document document = new Document();
    document.add(new TextField("name", name, Field.Store.YES));
    document.add(new TextField("body", content, Field.Store.YES));
    writer.addDocument(document);
}

private static void queryDocs(RAMDirectory ramDir, Analyzer analyzer)
        throws IOException, ParseException {
    // IndexReader maintains access to the index
    IndexReader reader = DirectoryReader.open(ramDir);
    // IndexSearcher handles searching of an IndexReader
    IndexSearcher searcher = new IndexSearcher(reader);
    // Set up a query
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("hey earth");
    // Search the index
    TopDocs foundDocs = searcher.search(query, 10);
    System.out.println("Total Hits: " + foundDocs.totalHits);
    for (ScoreDoc scoreDoc : foundDocs.scoreDocs) {
        // Get the doc from the index by id
        Document document = searcher.doc(scoreDoc.doc);
        System.out.println("Name: " + document.get("name")
                + " - Body: " + document.get("body")
                + " - Score: " + scoreDoc.score);
    }
    // Close down the reader
    reader.close();
}
The important parts for extending this are going to be the analyzer and an understanding of the Lucene query parser syntax.
The Analyzer is used by both indexing and queries to tell them how to parse text, so they can think about the text in the same way. It sets up how to tokenize (what to split on, whether to toLower(), etc). The StandardAnalyzer splits on spaces and a few other characters (I don't have the list handy) and also applies toLower().
The QueryParser is going to do some of the work for you. In my example above I do two things: I tell the parser what the default field is, and I pass it the string hey earth. The parser is going to turn this into a query that looks like body:hey body:earth. This will look for documents that have either hey or earth in the body. Two documents will be found.
If we were to pass hey AND earth the query is parsed to look like +body:hey +body:earth which will require docs to have both terms. Zero documents will be found.
To apply fuzzy options you add a ~ to the terms you want to be fuzzy. So if the query is hey~ earth it will apply fuzziness to hey and the query will look like body:hey~2 body:earth. Three documents will be found.
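To see why hey~ brings in a third document, it can help to look at edit distances. The sketch below is plain Levenshtein distance in standalone Java; it is not Lucene's implementation (as far as I know FuzzyQuery works with automata and, by default, also counts a transposition as a single edit), and the class and method names are made up for illustration. It compares hey with the first token of each sample document.

```java
// Illustrative only: plain Levenshtein distance, not Lucene's FuzzyQuery code.
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance between two strings.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // The row just computed becomes the previous row for the next pass
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // First token of each sample document body, already lowercased
        String[] terms = {"hello", "hi", "howdy", "hey", "ayee"};
        for (String term : terms) {
            System.out.println("hey -> " + term + ": " + distance("hey", term));
        }
    }
}
```

With a maximum of two edits, hey (distance 0) and hi (distance 2) are within range, while hello, howdy, and ayee are each three edits away. Together with the exact match on earth, that is consistent with the three hits reported above.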
You can write the queries more directly and the parser still handles things. So if you pass it hey name:\"document-1\" it will create a query like body:hey name:"document 1". Two documents will be returned, as it looks for the phrase document 1 (since the analyzer still splits tokens on the -). Whereas if I did hey name:document-1, it writes body:hey (name:document name:1), which returns all documents since they all have document as a term. There is some nuance to understand here.
I'll try to cover a bit more on how they are similar, referencing the match query documentation. Elastic says the main difference is, "It does not support field name prefixes, wildcard characters, or other "advanced" features." These would probably stand out more going the other direction.
Both the match query and the Lucene query, when working with an analyzed field, will take the query string and apply the analyzer to it (tokenize it, toLower, etc). So they will both turn HEY Earth into a query that looks for the terms hey or earth.
A match query can set the operator by providing "operator" : "and". This changes our query to look for hey and earth. The analogue in Lucene is something like parser.setDefaultOperator(QueryParser.Operator.AND);
The next thing is fuzziness. Both are working with the same settings. I believe Elastic's "fuzziness": "AUTO" is equivalent to Lucene's auto when applying ~ to a query (though I think you have to add it to each term yourself, which is a little cumbersome).
Zero terms query appears to be an Elastic construct. If you wanted the ALL setting you would have to replicate the match all query if the query parser removed all tokens from the query.
Cutoff frequency looks to be related to the CommonTermsQuery. I've not used this, so you may have some digging to do if you want to use it.
Lucene has a synonym filter that can be applied to an analyzer, but you may need to build the map yourself.
The differences you may find will probably be in scoring. When I run the query hey earth against Lucene, I get document-3 and document-4 both returned with a score of 1.3862944. When I run the query in the form of:
curl -XPOST http://localhost:9200/index/_search?pretty -d '{
"query" : {
"match" : {
"body" : "hey earth"
}
}
}'
I get the same documents, but with a score of 1.219939. You can run an explain on both of them. In lucene by printing each document with
System.out.println(searcher.explain(query, scoreDoc.doc));
And in elastic by querying each document like
curl -XPOST http://localhost:9200/index/docs/3/_explain?pretty -d '{
"query" : {
"match" : {
"body" : "hey earth"
}
}
}'
I get some differences, but I cannot exactly explain them. I do actually get a value for the doc of 1.3862944 but the fieldLength is different and that affects the weight.

Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following : When adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer if I am not mistaken.
What I would like to implement is the following : Before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it, otherwise I discard it).
How should I proceed ?
Here is (in Python) my custom implementation of the Analyzer :
class CustomAnalyzer(PythonAnalyzer):
    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)
        ts.reset()
        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()
        # How to store the terms in the index now ?
        return ????
Thank you for your guidance in advance !
EDIT 1 : After digging into Lucene's documentation, I figured it had something to do with the TokenStreamComponents. It returns a TokenStream with which you can iterate through the Token list of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. Or more precisely, I can read the tokens, but have no idea how should I proceed afterward.
EDIT 2 : I found this post where they mention the use of CharTermAttribute. However (in Python though) I cannot access or get a CharTermAttribute. Any thoughts ?
EDIT 3 : I can now access each term; see the updated code snippet. What is left to be done now is actually storing the desired terms...

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter extending PythonFilteringTokenFilter, I can make use of the function accept() (as the one used in the StopFilter for instance).
Here is the corresponding code snippet :
class MyFilter(PythonFilteringTokenFilter):
    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)

BASIC Lexer with regex written in Java

I have to code a Lexer in Java for a dialect of BASIC.
I group all the TokenType in Enum
public enum TokenType {
INT("-?[0-9]+"),
BOOLEAN("(TRUE|FALSE)"),
PLUS("\\+"),
MINUS("\\-"),
//others.....
}
The name is the TokenType name, and the regex in the parentheses is what I use to match that type.
If I want to match the INT type, I use "-?[0-9]+".
But now I have a problem. I put all the TokenType regexes into a StringBuffer with this:
private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for(TokenType token : TokenType.values())
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now I use this string to create a Pattern
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.matcher("line of code");
Now I want to match all the TokenTypes and group them into an ArrayList of Token. If the code syntax is correct, it returns an ArrayList of Token (token name, value).
But I don't know how to exit the while loop if the syntax is incorrect and then print an error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;
    //Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
    for(String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while(match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for(TokenType token : TokenType.values()) {
                counter++;
                if(match.group(token.name()) != null) {
                    tokens.add(new Token(token , match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if(counter==tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}
The code doesn't break when the for-loop iterates over all the TokenTypes without matching any of their regexes. How can I return an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?

All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the alternatives in the combined regex are tried in order, INVALID will only match if no other token matched. You then check to see if the last token in your list was the INVALID token.
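A minimal sketch of that idea, under a few assumptions: the enum is cut down to a handful of token types, a WHITESPACE type is added so that blanks between tokens are not reported as errors, and tokens are returned as plain "TYPE:text" strings instead of the question's Token class. The class and method names are made up for the example, and it throws on the first INVALID match rather than inspecting the list afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerSketch {

    // Reduced token set; WHITESPACE and INVALID are additions for this sketch.
    enum TokenType {
        INT("-?[0-9]+"),
        BOOLEAN("TRUE|FALSE"),
        PLUS("\\+"),
        MINUS("-"),
        WHITESPACE("\\s+"),
        INVALID(".+"); // must stay last: only tried when nothing else matches

        final String pattern;

        TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    // One alternation with a named group per token type, in enum order.
    static final Pattern TOKENS = buildPattern();

    static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (TokenType type : TokenType.values()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(type.name()).append('>')
              .append(type.pattern).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns tokens as "TYPE:text" strings; throws on the first invalid input.
    static List<String> lex(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKENS.matcher(line);
        while (m.find()) {
            for (TokenType type : TokenType.values()) {
                String text = m.group(type.name());
                if (text == null) {
                    continue; // this alternative did not take part in the match
                }
                if (type == TokenType.INVALID) {
                    throw new IllegalArgumentException(
                            "Syntax error at column " + m.start() + ": " + text);
                }
                if (type != TokenType.WHITESPACE) {
                    tokens.add(type.name() + ":" + text);
                }
                break; // only one top-level alternative matches per find()
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("10 + TRUE"));
        try {
            lex("10 @ 20");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because INVALID matches any non-empty run of characters, find() never silently skips input on a single line: every character ends up in some token, and anything unrecognized surfaces as an error at a known column.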

If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.

If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which lets you write grammars in pure Java.

Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases, the scanning phase and the parsing phase. You can combine them into one single pass (regex matching) but you'll find that a single pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. It would have helped if you had included an example of the text you were trying to parse. But Wiki has a great example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces (this should almost always be the first action) and then you're going to tokenize even further based on other tokens, such as the ones you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with each token depending on the business logic, syntax rules etc., whatever you set it. This could be expressing some sort of math function to perform (eg. max(3,2)), or a more common example is for query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDE's for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.

Regular expression keyword match in URL

I have a list of URLs in a large file (20 MB), and I have a set of keywords. If any of the keywords matches a URL, then I want to extract that URL.
Example:keyword="contact"
URL:http://www.365media.com/offices-and-contact.html
I need a regular expression to match the keywords with my list of URLs.
My Java code:
public class FileRead {
    public static void main(String[] ags) throws FileNotFoundException
    {
        Scanner in=new Scanner(new File("D:\\Log\\Links.txt"));
        String input;
        String[] reg=new String[]{".*About.*",".*Available.*",".*Author.*",".*Blog.*",".*Business.*",
            ".*Career.*",".*category.*",".*City.*",".*Company.*",".*Contain.*",".*Contact.*",".*Download.*",
            ".*Email.*"};
        while(in.hasNext())
        {
            input=in.nextLine();
            //for(String s:reg)
            patternFind(input,".*email.*");
        }
    }
    public static void patternFind(String input,String reg)
    {
        Pattern p=Pattern.compile(reg);
        Matcher m=p.matcher(input);
        while(m.find())
            System.out.println(m.group());
    }
}

If you only want to check whether any keyword occurs in the current line, you can simply use
for (String s: reg) {
if (input.contains(s)) {
// do something
}
}
instead of
patternFind(input,".*email.*");
Anyways, a regular expression equivalent to match any of the words would be:
.*(About|Available|Author|And|So|On...).*
I'm not sure which one is faster. String.contains() is simpler, but a Pattern is compiled once, which could perform better when it is applied many times, as is the case here.
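One caveat with the contains() loop: it needs bare keywords rather than the .*Word.* patterns from the question, and it is case-sensitive. A minimal sketch, assuming a shortened list of lowercase keywords and a hypothetical helper named containsAny:

```java
import java.util.Locale;

final class KeywordContains {

    // Bare keywords in lowercase; the list is shortened for the example.
    static final String[] KEYWORDS = {"about", "blog", "contact", "email"};

    // True if the URL contains any keyword, ignoring case.
    static boolean containsAny(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsAny("http://www.365media.com/offices-and-contact.html"));
        System.out.println(containsAny("http://www.365media.com/index.html"));
    }
}
```

Lowercasing each line once keeps the loop simple; whether it beats a single precompiled regex on a 20 MB file would need measuring.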

Why can't you do this:
For every line (URL) in the file, check whether any of your patterns matches the URL.
The code is pretty obvious.

I'm going to give a bit general solution. I think you should be able to adapt the idea to your code.
Suppose you have a list of bare keywords in a file and you read it into a String[], or you hard-code the list of keywords in a String[], for example:
String keywords[] = {"about", "available", "email"};
For all the keywords, use Pattern.quote() to make sure they are treated as literal strings. (You can skip this step if you are sure the keywords contain no regex metacharacters, or look at the keywords yourself and write the regex by hand without the \Q and \E quoting.) Then concatenate the keywords with the bar character | as separator (OR), and surround everything with parentheses (). The end result will look like this:
(\Qabout\E|\Qavailable\E|\Qemail\E)
Add .* to both ends so that it matches the rest of the URL, plus (?i) at the beginning to enable case-insensitive matching.
(?i).*(\Qabout\E|\Qavailable\E|\Qemail\E).*
Then you can compile the Pattern and call matcher(inputString).matches() on each line of input to check whether the URL has the keyword.
If a keyword is too common in URLs, such as "com", "net", or "www", and you want to make the search more fine-grained, more tweaking must be done.
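The steps above can be sketched as follows. The class name, the build helper, and the example.com URLs are made up for illustration; only Pattern.quote(), Pattern.compile(), and matcher(...).matches() come from the description.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

final class KeywordPattern {

    // Builds (?i).*(\Qk1\E|\Qk2\E|...).* from bare keywords.
    static Pattern build(String... keywords) {
        StringJoiner alternatives = new StringJoiner("|", "(?i).*(", ").*");
        for (String keyword : keywords) {
            alternatives.add(Pattern.quote(keyword)); // treat the keyword literally
        }
        return Pattern.compile(alternatives.toString());
    }

    public static void main(String[] args) {
        // Compile once, then reuse the same Pattern for every line of input
        Pattern pattern = build("about", "available", "email");
        String[] urls = {
            "http://www.example.com/About-us.html",
            "http://www.example.com/index.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + pattern.matcher(url).matches());
        }
    }
}
```

Note that calling build with no keywords would produce an empty group that matches every line, so a real version would probably want to reject an empty list.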

Lucene Highlighter Isn't Match Prefixes

I'm using Lucene's Highlighter to highlight parts of a string. The code below seems to work fine for finding the stemmed words but not for prefix matching.
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Query query = parser.parse(pQuery);
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 40);
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
String[] frags = highlighter.getBestFragments(analyzer, "", pText, 4);
I've read in a few different places that I need to call Query.rewrite to get the prefix matching to work. That method takes an IndexReader argument though, and I'm not sure how to get one. None of the examples I've found that call Query.rewrite show where the IndexReader came from. I'll add that this is the only Lucene code I'm using. I'm not using Lucene to do the searching itself, just for the highlighting.
How do I create an IndexReader, and is it even possible to create one given the way I'm using Lucene? Or perhaps there's a different way to get it to highlight the prefix matches? I'm very new to Lucene and I'm not sure what all of these pieces do or whether they're all necessary. I've just copied them from various examples I've found online. So if I'm doing anything else wrong, please let me know. Thanks.

Suppose you have a query field:abc*. What query.rewrite basically does is this: it reads the index (this is why you need an IndexReader), finds all terms that start with abc, and changes your query to, for example, field:abc1 field:abc2 field:abc3. If you know the location of the index, you can use IndexReader.open to get an IndexReader. If you don't have an index at all, you should search your pText, find all words that start with abc, and update your query accordingly.
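If there is no index, the manual expansion described above could look roughly like the following sketch. It is plain Java with no Lucene calls; the class and method names are hypothetical, it splits on non-letter/non-digit characters instead of running the analyzer, and it ignores stemming, so with EnglishAnalyzer the collected words would still be stemmed when the expanded string is parsed.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

final class PrefixExpansionSketch {

    // Collects the distinct lowercase words in text that start with prefix.
    static Set<String> wordsWithPrefix(String text, String prefix) {
        Set<String> words = new LinkedHashSet<>();
        String lowerPrefix = prefix.toLowerCase(Locale.ROOT);
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && word.startsWith(lowerPrefix)) {
                words.add(word);
            }
        }
        return words;
    }

    // Rewrites field:prefix* into field:word1 field:word2 ... for the given text.
    static String expand(String field, String prefix, String text) {
        StringJoiner query = new StringJoiner(" ");
        for (String word : wordsWithPrefix(text, prefix)) {
            query.add(field + ":" + word);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String text = "Abcde and abc1 differ; abcde again, but xabc does not count.";
        System.out.println(expand("field", "abc", text));
    }
}
```

The expanded string can then be passed to parser.parse() in place of the original prefix query before building the QueryScorer. In the question the default field is the empty string, so there the words could probably be joined without a field prefix.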
