Lucene 4.0 IndexWriter updateDocument for Numeric Term

I would like to know how to update (delete and re-insert) a document based on a numeric field.
So far I did this:
LuceneManager.updateDocument(writer, new Term("id", NumericUtils.intToPrefixCoded(sentenceId)), newDoc);
But in Lucene 4.0 the NumericUtils class has changed to this, which I don't really understand.
Any help?

With Lucene 5.x, this can be solved with the code below:
int id = 1;
BytesRefBuilder brb = new BytesRefBuilder();
NumericUtils.intToPrefixCodedBytes(id, 0, brb); // shift 0 encodes the full-precision value
Term term = new Term("id", brb.get());
indexWriter.updateDocument(term, doc); // or indexWriter.deleteDocuments(term);

You can do it this way.
First, set the numeric type on the FieldType:
FieldType TYPE_ID = new FieldType();
...
TYPE_ID.setNumericType(NumericType.INT);
TYPE_ID.freeze();
and then:
int id = 10;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(id, 0, bytes);
Term idTerm = new Term("id", bytes);
Now you can use idTerm to update the document.

If possible, I would recommend storing the ID as a keyword string rather than a number. If it is simply a unique identifier, indexing it as a keyword makes much more sense and removes any need to deal with numeric encoding.
If it is actually being used as a number, then you might need to perform the update manually. That is, search for and fetch the document you wish to update, delete the old document with tryDeleteDocument, and then add the updated version with addDocument. This is basically what updateDocument does anyway, to my knowledge.
The first option is certainly the better one, though. A non-numeric field used as the update ID makes life easier.
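To make the "delete, then add" idea concrete, here is a minimal sketch using a toy in-memory index in plain Java. It is not Lucene code: ToyIndex and its methods are hypothetical names that only mirror the shape of addDocument, deleteDocuments and updateDocument, and the ID is assumed to be stored as an exact string, as suggested above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for an index; NOT Lucene code. All names are hypothetical.
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Stores a copy of the document (field name -> exact string value).
    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Removes every document whose field holds exactly this value; returns how many were removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // Update = delete everything matching the key term, then add the new version.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    int size() {
        return docs.size();
    }

    // Returns the first document whose field equals the value, or null if there is none.
    Map<String, String> find(String field, String value) {
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                return d;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(Map.of("id", "42", "body", "old text"));
        index.updateDocument("id", "42", Map.of("id", "42", "body", "new text"));
        System.out.println(index.size() + " " + index.find("id", "42").get("body"));
    }
}
```

Because the key is matched as an exact string, no numeric encoding is involved, and the index still holds a single document with that ID after the update.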

With Lucene 4, you can now create IntField, LongField, FloatField or DoubleField like this:
document.add(new IntField("id", 6, Field.Store.NO));
To write the document once you have modified it, the call is still:
indexWriter.updateDocument(new Term("pk", "<pk value>"), document);
EDIT:
And here is a way to make a query including this numeric field:
// Query <=> id <= 7
Query query = NumericRangeQuery.newIntRange("id", Integer.MIN_VALUE, 7, true, true);
TopDocs topDocs = indexSearcher.search(query, 10);

According to the documentation of Lucene 4.0.0, an ID field should be indexed with the StringField class:
"A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache."
I had the same problem as you and I solved it by making this change. After that, my update and delete worked perfectly.

Related

Java Couchbase Querying to find a document's ID?

I'm new to Couchbase and I'm using Java. I'm trying to remove a document from a bucket by looking up its ID with query parameters (assuming the ID is unknown).
Let's say I have a bucket called test-data. In that bucket I have a document with an ID of 555 and content of {"name":"bob","num":"10"}
I want to be able to remove that document by querying on 'name' and 'num'.
So far I have this (hardcoded):
String statement = "SELECT META(`test-data`).id from `test-data` WHERE name = \"bob\" and num = \"10\"";
N1qlQuery query = N1qlQuery.simple(statement);
N1qlQueryResult result = bucket.query(query);
List<N1qlQueryRow> row = result.allRows();
N1qlQueryRow res1 = row.get(0);
System.out.println(res1);
//output: {"id":"555"}
So I'm getting a JSON object that has the document's ID in it. What would be the best way to extract that ID so that I can then remove the queried document from the bucket using its ID? Am I doing too many steps? Is there a better way to extract the document's ID?
bucket.remove(docID)
Ideally I'd like to use something like an N1qlQueryResult to get this going, but I'm not sure how to set that up.
N1qlQueryResult result = bucket.query(select("META.id").fromCurrentBucket().where((x("num").eq("\""+num+"\"")).and(x("name").eq("\""+name+"\""))));
But that isn't working at the moment.
Any help or direction would be appreciated. Thanks.

There might be a better way, which is to run a query like this:
delete from `test-data` use keys '00000874a09e749ab6f199c0622c5cb0' returning raw META(`test-data`).id
or, if your fields are indexed:
delete from `test-data` where name='bob' and num='10' returning raw META(`test-data`).id
The first query deletes the document with the given document key (which is meta.id); the second deletes the documents matching the WHERE clause. Both return the ID of each deleted document, or an empty result if no documents were deleted.
You can implement this query with the Couchbase SDK as follows:
Statement statement = deleteFrom("test-data")
.where(x("name").eq(s("bob")).and(x("num").eq(s("10"))))
.returningRaw(meta(i("test-data")).get("id"));
You can parameterize this statement or execute it as is.

How to get internal doc id set by lucene

I've indexed some documents in the index module. As far as I understand, Lucene assigns an ID to every indexed document. These IDs may not follow a specific order, though. Concretely, the first doc ID is set to 127, the second one to 133, and so on.
In the search module, I have the document that I want to process, and I'm trying to get the doc ID that Lucene assigned to it at index time. See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally found a solution.
As far as I can tell, Lucene does not let you look up the internal document ID from a Document object. However, each search hit carries it as ScoreDoc.doc, so we can iterate over the hits and fetch the term vector of each document. This was the only way I found to get the term vectors. I'm using the code below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

Lucene: Multiple words in a single term

Let's say I have a document like
stringfield:123456
textfield:name website stackoverflow
and I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result; I have to tokenize "name website" myself and add each token to the PhraseQuery.
Is there a default way in the query API to tokenize the text, as the query parser does when parsing a query string?
Sure, I can do that myself, but I'd rather not reinvent the wheel if it's already implemented.

You are adding the entire phrase as a single term to your PhraseQuery. You are on the right track, but when tokenized, that text is not a single term but two. That is, your index has the terms name, website, and stackoverflow, but your query has only one term, "name website", which matches none of them.
The correct way to use a PhraseQuery is to add each term to it separately:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you call:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene tokenizes the string "name website" and gets two terms.
When you create:
new Term("textfield","name website")
Lucene does not tokenize the string "name website"; it uses the whole string as a single term.
For the behavior you describe, the field textfield must be indexed and tokenized when you index the document.
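The difference can be illustrated with a small toy model in plain Java. This is not Lucene code and not how Lucene is implemented: tokenize is a hypothetical stand-in for an analyzer (lowercase, split on whitespace), and matchesPhrase is a hypothetical check that the phrase terms appear at consecutive positions among the indexed terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model of analysis and phrase matching; NOT Lucene code. All names are hypothetical.
class ToyPhraseMatch {

    // Stand-in for an analyzer: lowercase the text, then split it on whitespace.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // True if the phrase terms occur among the indexed terms at consecutive positions.
    static boolean matchesPhrase(List<String> indexedTerms, List<String> phraseTerms) {
        if (phraseTerms.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(indexedTerms, phraseTerms) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = tokenize("name website stackoverflow");
        // A single term containing a space: no indexed term equals it.
        System.out.println(matchesPhrase(indexed, List.of("name website")));
        // One term per token, which is what a query parser would produce.
        System.out.println(matchesPhrase(indexed, tokenize("name website")));
    }
}
```

The first call prints false and the second prints true, which mirrors the behavior described in the question.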

Remove Lucene document by query of not analyzed text field

I'm 99% sure I had this working in the past, maybe I'm wrong.
Anyway, I'd like to delete a Lucene document by a Field which is stored but not analyzed and contains text.
So the problem, it seems, is that calling luceneWriter.deleteDocuments(query) doesn't delete the document unless the field referenced in query is Field.Index.ANALYZED or a simple number.
Some code:
Integer myId = 1234;
Document doc = new Document();
Field field = new Field("MyIdField", myId.toString(), Field.Store.YES, Field.Index.ANALYZED);
doc.add(field);
indexWriter.addDocument(doc);
indexWriter.commit();
...
QueryParser parser = new QueryParser(VERSION, "MyIdField", ANALYZER);
Query query = parser.parse("MyIdField:1234");
indexWriter.deleteDocuments(query);
indexWriter.commit();
Everything works!
Sweet.. what if the field is not analyzed?
Field field = new Field("MyIdField", myId.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED);
Still works!
Awesome, what if it's not just a number?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
Doesn't work!.. darn.
The query doesn't match the document and so it isn't deleted.
What if we do analyze it?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
It works again!
Ok, so if the field is not analyzed it can still be queried if it only contains a number? Am I missing something?
Thanks for taking some time.
Note:
Technically, there are two fields, making it an AND query. As such, I'd prefer to delete the documents with a Query rather than a Term. I'm not sure if that makes a difference but wanted to emphasize I would like to stick with a solution using a Query.

According to this question, you have to use a PhraseQuery to search a not-analyzed field. Your code
Query query = parser.parse("MyIdField:ID1234");
would yield a TermQuery instead, and thus won't match.
I recommend trying a KeywordAnalyzer instead. Remember that even if your field isn't analyzed, the query parser still analyzes your query string, so the match can fail anyway.
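One possible explanation for the four cases in the question, assuming the ANALYZER used there lowercases its input the way StandardAnalyzer does, can be sketched with a toy model in plain Java. This is not Lucene code; analyze and parsedQueryMatches are hypothetical names. The assumption is that a not-analyzed field is indexed verbatim, while the query text always goes through the analyzer.

```java
import java.util.Locale;

// Toy model of index-time vs. query-time analysis; NOT Lucene code.
// Class and method names are hypothetical.
class ToyAnalyzerMismatch {

    // Stand-in for an analyzer that lowercases its input.
    static String analyze(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // The indexed term is the analyzed value for an analyzed field and the
    // verbatim value otherwise; the query text is analyzed in both cases.
    static boolean parsedQueryMatches(String fieldValue, boolean fieldAnalyzed, String queryText) {
        String indexedTerm = fieldAnalyzed ? analyze(fieldValue) : fieldValue;
        return indexedTerm.equals(analyze(queryText));
    }

    public static void main(String[] args) {
        // Digits are unchanged by lowercasing, so the not-analyzed number still matches.
        System.out.println(parsedQueryMatches("1234", false, "1234"));
        // The query becomes "id1234", but the index holds "ID1234".
        System.out.println(parsedQueryMatches("ID1234", false, "ID1234"));
        // Both sides are lowercased, so they agree again.
        System.out.println(parsedQueryMatches("ID1234", true, "ID1234"));
    }
}
```

Under that assumption, a KeywordAnalyzer on the query side would leave "ID1234" untouched, so the query term would equal the stored term.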

Lucene: delete from index, based on multiple fields

I need to delete a document from a Lucene search index. The standard approach:
indexReader.deleteDocuments(new Term("field_name", "field value"));
won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Are there any constructs for that?

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.
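As a rough illustration of what "conjunction of terms" means for a delete, here is a toy in-memory sketch in plain Java. It is not Lucene code: deleteMatchingAll is a hypothetical helper, documents are modeled as simple field-to-value maps, and a document is removed only if every required field/value pair matches exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of deleting by a conjunction of exact terms; NOT Lucene code.
// Class and method names are hypothetical.
class ToyConjunctionDelete {

    // Removes the documents for which every required field/value pair matches.
    // Returns the number of removed documents; an empty requirement removes nothing.
    static int deleteMatchingAll(List<Map<String, String>> docs, Map<String, String> required) {
        if (required.isEmpty()) {
            return 0;
        }
        int before = docs.size();
        docs.removeIf(doc -> required.entrySet().stream()
                .allMatch(e -> e.getValue().equals(doc.get(e.getKey()))));
        return before - docs.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(Map.of("field_name1", "field value 1", "field_name2", "field value 2"));
        docs.add(Map.of("field_name1", "field value 1", "field_name2", "other"));
        int removed = deleteMatchingAll(docs,
                Map.of("field_name1", "field value 1", "field_name2", "field value 2"));
        System.out.println(removed + " removed, " + docs.size() + " left");
    }
}
```

Only the first document satisfies both conditions, so the second one, which matches on a single field, is kept.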

Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while before realising that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer.
See this post about the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post about overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial about creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
When the key fields are numeric, you can't use a TermQuery, but instead must use a NumericRangeQuery.
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
