Lucene: delete from index, based on multiple fields

Lucene: delete from index, based on multiple fields - java

I need to perform deletion of the document from lucene search index. Standard approach :
indexReader.deleteDocuments(new Term("field_name", "field value"));
Won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Is there any constructs for that?

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.

Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
When the key fields are numeric, you can't use a TermQuery, but instead must use a NumericRangeQuery.
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);

Related

Apache Lucene 6 QueryParser range query is not working with IntPoint

I'm using Lucene 6 new IntPoint and I want to do some range search
Using IntPoint.newRangeQuery the search works and the correct documents are returned, however when I'm using QueryParser (classic) or the new StandardQueryParser nothing is returned.
// This works
Query query = IntPoint.newRangeQuery("duration",1,20);
System.out.println(query);
//This doesn't work
QueryParser parser = new QueryParser("name", analyzer);
Query query = parser.parse("duration:[1 TO 20]");
System.out.println(query);
//This doesn't work
StandardQueryParser queryParserHelper = new StandardQueryParser();
Query query = queryParserHelper.parse("timestamp:[1 TO 20]", "timestamp");
System.out.println(query);
// In all 3 cases it prints: timestamp:[1 TO 20]
Is this a bug or am I missing something?

It's not a bug, and I wouldn't say you are missing anything, really. QueryParser doesn't have any support for IntPoint fields, or any other numeric (PointValues) field types. Range queries in QueryParser syntax will always generate a TermRangeQuery, which will search for that field based on lexicographic order in the inverted index, which will not be work for searching PointValues fields. Generating these using IntPoint.newRangeQuery and similar methods is the correct thing to do.

Lucene Phrase Query not working

I have a String address = "456 SOME STREET";
which I have to search in Lucene, I have created the index for this
StringField address = new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES);
And I am using Phrase Query to search this String using below Code
String[] tokens = address.split("\\s+");
PhraseQuery addressQuery = new PhraseQuery(Constants.ORGANIZATION_ADDRESS, tokens);
finalQuery.add(addressQuery, BooleanClause.Occur.MUST);
But its not giving me any result,I have tried TermQuery as well but that is also not working. Would really appreciate any help because I have tried many options now and I am unable to figure out whats wrong
I have also tried below
For Indexing :
doc.add(new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES));
Search using Term Query :
fullAddressExact= fullAddressExact.toLowerCase();
TermQuery tq = new TermQuery(new Term(Constants.ORGANIZATION_ADDRESS,fullAddressExact));
finalQuery.add(tq, BooleanClause.Occur.MUST);
Even this doesnt give any result. My intention to get the exact match

You should probably use TextField, not StringField when indexing the documents.
StringField stores the string as is, without breaking it into tokens, so in your example the index will contain "456 SOME STREET". Only a TermQuery with this term will retrieve it (or a PrefixQuery).
TextField is the standard field when indexing text, it splits the text into tokens (using a Tokenizer) and indexes the words separately, in your example, 456, SOME, STREET can all be used to find the document.
Read more about it here (a bit old, but relevant).

Lucene: Multiple words in a single term

Let's say I have a docs like
stringfield:123456
textfield:name website stackoverflow
and If I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result, i will have to tokenize "name website" and add each token in phrasequery.
Is there any default way in QueryAPI to tokenize as it does while parsing a String Query.
Sure I can do that myself but reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query only has one term, which matches none of those name website.
The correct way to use a PhraseQuery, is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you:
new Term("textfield","name website")
Lucene will not tokenize the string "name website", instead use the whole as a term.
As the result what you said, when you index the document, the field textfield MUST be Indexed and Tokenized.

Remove Lucene document by query of not analyzed text field

I'm 99% sure I had this working in the past, maybe I'm wrong.
Anyway, I'd like to delete a Lucene document by a Field which is stored but not analyzed and contains text.
So the problem, it seems, is that calling luceneWriter.deleteDocuments(query) doesn't delete the document unless the field referenced in query is Field.Index.ANALYZED or a simple number.
Some code:
Integer myId = 1234;
Document doc = new Document();
Field field = new Field("MyIdField", myId, Field.Store.YES, Field.Index.ANALYZED);
doc.add(field);
indexWriter.add(doc);
indexWriter.commit();
...
QueryParser parser = new QueryParser(VERSION, "MyIdField", ANALYZER);
Query query = parser.parse("MyIdField:1234");
indexWriter.deleteDocuments(query);
indexWriter.commit();
Everything works!
Sweet.. what if the field is not analyzed?
Field field = new Field("MyIdField", myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
Still works!
Awesome, what if it's not just a number?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
Doesn't work!.. darn.
The query doesn't match the document and so it isn't deleted.
What if we do index it?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
It works again!
Ok, so if the field is not analyzed it can still be queried if it only contains a number? Am I missing something?
Thanks for taking some time.
Note:
Technically, there are two fields, making it an AND query. As such, I'd prefer to delete the documents with a Query rather than a Term. I'm not sure if that makes a difference but wanted to emphasize I would like to stick with a solution using a Query.

According to this question, you have to use a PhraseQuery to search a not analyzed field. Your code
Query query = parser.parse("MyIdField:ID1234");
would yield a TermQuery instead, and thus won't match.
I recommend you to try a KeywordAnalyzer instead (remember that, even if your field isn't analyzed, the query parser could still analyze your query string and therefore your match could fail anyway).

Lucene 4.0 IndexWriter updateDocument for Numeric Term

I just wanted to know how it is possible to to update (delete/insert) a document based on a numeric field.
So far I did this:
LuceneManager.updateDocument(writer, new Term("id", NumericUtils.intToPrefixCoded(sentenceId)), newDoc);
But now with Lucene 4.0 the NumericUtils class has changed to this which I don't really understand.
Any help?

With Lucene 5.x, this could be solved by code below:
int id = 1;
BytesRefBuilder brb = new BytesRefBuilder();
NumericUtils.intToPrefixCodedBytes(id, 0, brb);
Term term = new Term("id", brb.get());
indexWriter.updateDocument(term, doc); // or indexWriter.deleteDocument(term);

You can use it this way:
First you must set the FieldType's numeric type:
FieldType TYPE_ID = new FieldType();
...
TYPE_ID.setNumericType(NumericType.INT);
TYPE_ID.freeze();
and then:
int idTerm = 10;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(id, 0, bytes);
Term idTerm = new Term("id", bytes);
and now you'll be able to use idTerm to update the doc.

I would recommend, if possible, it would be better to store an ID as a keyword string, rather than a number. If it is simply a unique identifier, indexing as a keyword makes much more sense. This removes any need to mess with numeric formatting.
If it is actually being used as a number, then you might need to perform the update manually. That is, search for and fetch the document you wish to update, delete the old document with tryDeleteDocument, and then add the updated version with addDocument. This is basically what updateDocument does anyway, to my knowledge.
The first option would certainly be the better way, though. A non-numeric field to use as an update ID would make life easier.

With Lucene 4, you can now create IntField, LongField, FloatField or DoubleField like this:
document.add(new IntField("id", 6, Field.Store.NO));
To write the document once you modified it, it's still:
indexWriter.updateDocument(new Term("pk", "<pk value>"), document);
EDIT:
And here is a way to make a query including this numeric field:
// Query <=> id <= 7
Query query = NumericRangeQuery.newIntRange("id", Integer.MIN_VALUE, 7, true, true);
TopDocs topDocs = indexSearcher.search(query, 10);

According to the documentation of Lucene 4.0.0, the ID field must to be used with StringField class:
"A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache."
I had the same problem as you and I solved it by making this change. After that, my update and delete worked perfectly.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Lucene: delete from index, based on multiple fields - java

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.

Related

Apache Lucene 6 QueryParser range query is not working with IntPoint

Lucene Phrase Query not working

Lucene: Multiple words in a single term

Remove Lucene document by query of not analyzed text field

Lucene 4.0 IndexWriter updateDocument for Numeric Term

Categories

Resources