Lucene sorted search with spaces and hyphens

Lucene 8.0.0
I have a field that I am indexing like this:
doc.add(new SortedDocValuesField("nameS", new BytesRef(name.toLowerCase())));
doc.add(new StoredField("name", name));
The values of name look something like this:
London-UK
Bristol-UK
Bristol-AUS
New York-USA
Washington-USA
So they have spaces and hyphens in them. However, I can't seem to get my search to behave: it seems to give up when it gets to a space or hyphen.
The analyzer I'm using for indexing and search is the StandardAnalyzer.
I search using code like this:
String escapedSearch = QueryParserUtil.escape(search.toLowerCase());
Query query = qp.parse("nameS:" + escapedSearch + "*");
TopFieldDocs results = searcher.search(query, 100, new Sort(new SortField("nameS", Type.STRING)));
What am I doing wrong that stops the search from working with spaces and hyphens? As an aside, I've called .toLowerCase() on the name when indexing and searching; is that standard practice?

Related

Searching for Terms with whitespace using Lucene

I'm trying to use Lucene to add a search feature but can't seem to get an index to work with significant whitespace. I've got the following test case setup:
RAMDirectory directory = new RAMDirectory();
KeywordAnalyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("content", analyzer);
parser.setSplitOnWhitespace(false);
Query query = parser.parse("Bill E");
TopDocs docs = searcher.search(query, 1);
assertTrue(docs.totalHits > 0);
I'm using Lucene 6.6.0 and from what I understand the KeywordAnalyzer is what I'm looking for:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
But I can't seem to get any matching documents that contain whitespace.
Any ideas on how to solve this?

When you index, you have a single document with a single field containing a single term with the value Bill Evans.
When you search, the TermQuery produced by QueryParser looks for a term with the value Bill E. That term doesn't exist in the index, so you get zero hits.
If you replace your search string with Bill Evans, you will get results.
Please refer to this question too.
First, you need to separate your indexing and searching concerns. You can only search what is indexed. If you index full text without breaking it into tokens, then at search time you need to produce a WildcardQuery, FuzzyQuery, PhraseQuery etc. whenever the input string differs from what is indexed. TermQuery searches for exact term values.
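To make the difference between an exact term lookup and a prefix lookup concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: the sorted set of strings is only a stand-in for the index's term dictionary, and the class and method names (TermLookupSketch, exactMatch, prefixMatch) are made up for illustration.

```java
import java.util.TreeSet;

final class TermLookupSketch {

    // Exact lookup: conceptually what a term query does. The whole term must be present.
    static boolean exactMatch(TreeSet<String> terms, String term) {
        return terms.contains(term);
    }

    // Prefix lookup: conceptually what a prefix or trailing-wildcard query does.
    // The smallest term >= prefix is the only candidate that needs checking.
    static boolean prefixMatch(TreeSet<String> terms, String prefix) {
        String candidate = terms.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("Bill Evans"); // one un-tokenized term, as a keyword-style analyzer would produce

        System.out.println(exactMatch(terms, "Bill E"));     // false
        System.out.println(exactMatch(terms, "Bill Evans")); // true
        System.out.println(prefixMatch(terms, "Bill E"));    // true
    }
}
```

The lookup is case-sensitive on purpose: a term indexed as "Bill Evans" is not found by the prefix "bill e" in this model.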
My suggestion would be to store the full text value (without tokens; StringField would do that) as well as generate additional tokens by breaking on spaces, using something like SimpleAnalyzer.
So something like:
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));
With the above code and SimpleAnalyzer, you will now have the terms bill and evans (as well as the full text as a stored field). If you now search with the same analyzer, your query will be content:bill content:e and you will get a result.
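The effect of tokenizing both the indexed text and the query can be sketched in plain Java. This is only a rough model, assuming an analyzer that splits on non-letter characters and lowercases (which is how SimpleAnalyzer is usually described); it is not Lucene code, and SimpleTokenizeSketch, tokenize and anyTokenMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SimpleTokenizeSketch {

    // Split on runs of non-letters and lowercase each piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Models an OR query over the query tokens: one matching token is enough.
    static boolean anyTokenMatches(String indexedText, String queryText) {
        List<String> indexed = tokenize(indexedText);
        for (String queryToken : tokenize(queryText)) {
            if (indexed.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Bill Evans"));                  // [bill, evans]
        System.out.println(tokenize("Bill E"));                      // [bill, e]
        System.out.println(anyTokenMatches("Bill Evans", "Bill E")); // true
    }
}
```

In this model the hit comes from the token bill alone; the token e matches nothing, which is why a prefix-style query would still be needed to match on a partial surname.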
All in all, the system is working the way you have coded it :)
So first work out your requirements: what you wish to index and what kinds of queries you wish to run against that index.

Lucene Phrase Query not working

I have a String address = "456 SOME STREET";
which I have to search in Lucene, I have created the index for this
StringField address = new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES);
And I am using Phrase Query to search this String using below Code
String[] tokens = address.split("\\s+");
PhraseQuery addressQuery = new PhraseQuery(Constants.ORGANIZATION_ADDRESS, tokens);
finalQuery.add(addressQuery, BooleanClause.Occur.MUST);
But it's not giving me any results. I have tried TermQuery as well, but that is also not working. I would really appreciate any help, because I have tried many options now and I am unable to figure out what's wrong.
I have also tried below
For Indexing :
doc.add(new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES));
Search using Term Query :
fullAddressExact= fullAddressExact.toLowerCase();
TermQuery tq = new TermQuery(new Term(Constants.ORGANIZATION_ADDRESS,fullAddressExact));
finalQuery.add(tq, BooleanClause.Occur.MUST);
Even this doesn't give any result. My intention is to get an exact match.

You should probably use TextField, not StringField, when indexing the documents.
StringField indexes the string as is, without breaking it into tokens, so in your example the index will contain the single term "456 SOME STREET". Only a TermQuery with exactly this term, including its case, (or a PrefixQuery) will retrieve it.
TextField is the standard field for indexing text. It splits the text into tokens (using a Tokenizer) and indexes the words separately, so in your example 456, SOME and STREET can each be used to find the document.
Read more about it here (a bit old, but relevant).

Lucene: Multiple words in a single term

Let's say I have docs like
stringfield:123456
textfield:name website stackoverflow
and I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result; I will have to tokenize "name website" and add each token to the PhraseQuery.
Is there any default way in the Query API to tokenize the way the parser does when parsing a query string?
Sure, I can do that myself, but why reinvent the wheel if it's already implemented?

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query has only one term, "name website", which matches none of those.
The correct way to use a PhraseQuery is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));
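Why a single two-word term cannot match is easier to see in a small plain-Java model of phrase matching. This sketch is not Lucene code: it assumes whitespace tokenization with lowercasing as a stand-in for analysis, treats a phrase as a run of consecutive tokens, and uses made-up names (PhraseMatchSketch, tokenize, phraseMatches).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PhraseMatchSketch {

    // Stand-in for analysis: lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // A phrase matches when its terms appear as consecutive tokens, in order.
    static boolean phraseMatches(List<String> docTokens, List<String> phraseTerms) {
        if (phraseTerms.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phraseTerms.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phraseTerms.size()).equals(phraseTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("name website stackoverflow");

        // One term containing a space: no indexed token equals "name website".
        System.out.println(phraseMatches(doc, List.of("name website")));    // false

        // Two separate terms in order: matches.
        System.out.println(phraseMatches(doc, List.of("name", "website"))); // true
    }
}
```

Passing the user's input through the same tokenize step before building the phrase, for example phraseMatches(doc, tokenize("name website")), is the model's equivalent of analyzing the query text before adding terms.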

When you call:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website" and get 2 terms.
When you call:
new Term("textfield","name website")
Lucene will not tokenize the string "name website"; it uses the whole string as a single term.
So, for the behavior you described, the field textfield must be indexed and tokenized when you index the document.

Remove Lucene document by query of not analyzed text field

I'm 99% sure I had this working in the past, maybe I'm wrong.
Anyway, I'd like to delete a Lucene document by a Field which is stored but not analyzed and contains text.
So the problem, it seems, is that calling luceneWriter.deleteDocuments(query) doesn't delete the document unless the field referenced in query is Field.Index.ANALYZED or a simple number.
Some code:
Integer myId = 1234;
Document doc = new Document();
Field field = new Field("MyIdField", String.valueOf(myId), Field.Store.YES, Field.Index.ANALYZED);
doc.add(field);
indexWriter.addDocument(doc);
indexWriter.commit();
...
QueryParser parser = new QueryParser(VERSION, "MyIdField", ANALYZER);
Query query = parser.parse("MyIdField:1234");
indexWriter.deleteDocuments(query);
indexWriter.commit();
Everything works!
Sweet.. what if the field is not analyzed?
Field field = new Field("MyIdField", String.valueOf(myId), Field.Store.YES, Field.Index.NOT_ANALYZED);
Still works!
Awesome, what if it's not just a number?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
Doesn't work!.. darn.
The query doesn't match the document and so it isn't deleted.
What if we do index it?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
It works again!
Ok, so if the field is not analyzed it can still be queried if it only contains a number? Am I missing something?
Thanks for taking some time.
Note:
Technically, there are two fields, making it an AND query. As such, I'd prefer to delete the documents with a Query rather than a Term. I'm not sure if that makes a difference but wanted to emphasize I would like to stick with a solution using a Query.

According to this question, you have to use a PhraseQuery to search a not analyzed field. Your code
Query query = parser.parse("MyIdField:ID1234");
would yield a TermQuery instead, and thus won't match.
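I recommend trying a KeywordAnalyzer for the query parser instead. Remember that, even if your field isn't analyzed, the query parser still analyzes your query string, so the match can fail anyway. If ANALYZER is a lowercasing analyzer such as StandardAnalyzer, ID1234 becomes id1234, which no longer equals the indexed term ID1234, whereas a purely numeric value passes through unchanged.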
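The four cases in the question can be reproduced with a small plain-Java model. It is a sketch under one loud assumption: that the ANALYZER in the question lowercases terms (as StandardAnalyzer does). Nothing here is Lucene API; AnalyzedQuerySketch and its methods are hypothetical names.

```java
import java.util.Locale;

final class AnalyzedQuerySketch {

    // Stand-in for a lowercasing analyzer (assumption: ANALYZER behaves like this).
    static String analyze(String raw) {
        return raw.toLowerCase(Locale.ROOT);
    }

    // Field indexed without analysis: the index holds the raw value,
    // but the query parser still analyzes the query string.
    static boolean matchesNotAnalyzedField(String indexedRaw, String queryRaw) {
        return indexedRaw.equals(analyze(queryRaw));
    }

    // Field indexed with analysis: both sides go through the same analyzer.
    static boolean matchesAnalyzedField(String indexedRaw, String queryRaw) {
        return analyze(indexedRaw).equals(analyze(queryRaw));
    }

    public static void main(String[] args) {
        System.out.println(matchesAnalyzedField("1234", "1234"));        // true
        System.out.println(matchesNotAnalyzedField("1234", "1234"));     // true, digits are unaffected
        System.out.println(matchesNotAnalyzedField("ID1234", "ID1234")); // false, query became id1234
        System.out.println(matchesAnalyzedField("ID1234", "ID1234"));    // true
    }
}
```

In this model the mismatch disappears as soon as the query side stops altering the term, which is what switching the parser to a keyword-style analyzer or building the term query directly would achieve.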

Lucene: delete from index, based on multiple fields

I need to perform deletion of the document from lucene search index. Standard approach :
indexReader.deleteDocuments(new Term("field_name", "field value"));
Won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Are there any constructs for that?

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.
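As a rough illustration of what deleting by a conjunction of terms means, here is a plain-Java model. It is not Lucene code: documents are modelled as field-to-value maps, every required field must match exactly (the equivalent of all clauses being MUST), and ConjunctiveDeleteSketch and deleteMatching are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ConjunctiveDeleteSketch {

    // Removes every document in which ALL required field/value pairs match exactly.
    // Returns the number of documents removed.
    static int deleteMatching(List<Map<String, String>> docs, Map<String, String> requiredTerms) {
        int before = docs.size();
        docs.removeIf(doc -> requiredTerms.entrySet().stream()
                .allMatch(term -> term.getValue().equals(doc.get(term.getKey()))));
        return before - docs.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(Map.of("field_name1", "field value 1", "field_name2", "field value 2"));
        docs.add(Map.of("field_name1", "field value 1", "field_name2", "other"));
        docs.add(Map.of("field_name1", "other", "field_name2", "field value 2"));

        int removed = deleteMatching(docs,
                Map.of("field_name1", "field value 1", "field_name2", "field value 2"));

        System.out.println(removed);     // 1
        System.out.println(docs.size()); // 2
    }
}
```

Only the document matching both fields is removed; documents that match just one of the two terms are kept.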

Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post about the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post about overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial on creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
When the key fields are numeric, you can't use a TermQuery, but instead must use a NumericRangeQuery.
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
