How to get internal doc id set by lucene

How to get internal doc id set by lucene - java

I've indexed some documents in the index module. Intuitively, Lucene set IDs for any indexed document. These IDs may not have a specific order though. Concretely, the first doc ID is set to 127, the second one is set to 133 and so on...
In the search module, I have the document (which I want to process), But I'm trying to get these already-set docIDs (that was set by Lucene in index time) See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally could end up finding the solution.
I found out that lucene does not allow access to its internal document IDs. However, we can iterate through the documents and get their TermVector. Seems that it's the only possible way to get term vectors. I'm using the script below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

Related

Java Couchbase Querying to find a document's ID?

I'm new to couchbase. I'm using Java for this. I'm trying to remove a document from a bucket by looking up its ID with query parameters(assuming the ID is unknown).
Lets say I have a bucket called test-data. In that bucked I have a document with ID of 555 and Content of {"name":"bob","num":"10"}
I want to be able to remove that document by querying using 'name' and 'num'.
So far I have this (hardcoded):
String statement = "SELECT META(`test-data`).id from `test-data` WHERE name = \"bob\" and num = \"10\"";
N1qlQuery query = N1qlQuery.simple(statement);
N1qlQueryResult result = bucket.query(query);
List<N1qlQueryRow> row = result.allRows();
N1qlQueryRow res1 = row.get(0);
System.out.println(res1);
//output: {"id":"555"}
So I'm getting a json that has the document's ID in it. What would be the best way to extract that ID so that I can then remove the queryed document from the bucket using its ID? Am I doing to many steps? Is there a better way to extract the document's ID?
bucket.remove(docID)
Ideally I'd like to use something like a N1q1QueryResult to get this going but I'm not sure how to set that up.
N1qlQueryResult result = bucket.query(select("META.id").fromCurrentBucket().where((x("num").eq("\""+num+"\"")).and(x("name").eq("\""+name+"\""))));
But that isn't working at the moment.
Any help or direction would be appreciated. Thanks.

There might be a better way which is running this kind of query:
delete from `test-data` use keys '00000874a09e749ab6f199c0622c5cb0' returning raw META(`test-data`).id
or if your fields has index:
delete from `test-data` where name='bob' and num='10' returning raw META(`test-data`).id
This query deletes the specified document with given document key (which is meta.id) and returns document id of deleted document if it deletes any document. Returns empty if no documents deleted.
You can implement this query with couchbase sdk as follows:
Statement statement = deleteFrom("test-data")
.where(x("name").eq(s("bob")).and(x("num").eq(s("10"))))
.returningRaw(meta(i("test-data")).get("id"));
You can make this statement parameterized or just execute like that.

Java method for MongoDB collection.save()

I'm having a problem with MongoDB using Java when I try adding documents with customized _id field. And when I insert new document to that collection, I want to ignore the document if it's _id has already existed.
In Mongo shell, collection.save() can be used in this case but I cannot find the equivalent method to work with MongoDB java driver.
Just to add an example:
I have a collection of documents containing websites' information
with the URLs as _id field (which is unique)
I want to add some more documents. In those new documents, some might be existing in the current collection. So I want to keep adding all the new documents except for the duplicate ones.
This can be achieve by collection.save() in Mongo Shell but using MongoDB Java Driver, I can't find the equivalent method.
Hopefully someone can share the solution. Thanks in advance!

In the MongoDB Java driver, you could try using the BulkWriteOperation object with the initializeOrderedBulkOperation() method of the DBCollection object (the one that contains your collection). This is used as follows:
MongoClient mongo = new MongoClient("localhost", port_number);
DB db = mongo.getDB("db_name");
ArrayList<DBObject> objectList; // Fill this list with your objects to insert
BulkWriteOperation operation = col.initializeOrderedBulkOperation();
for (int i = 0; i < objectList.size(); i++) {
operation.insert(objectList.get(i));
}
BulkWriteResult result = operation.execute();
With this method, your documents will be inserted one at a time with error handling on each insert, so documents that have a duplicated id will throw an error as usual, but the operation will still continue with the rest of the documents. In the end, you can use the getInsertedCount() method of the BulkWriteResult object to know how many documents were really inserted.
This can prove to be a bit ineffective if lots of data is inserted this way, though. This is just sample code (that was found on journaldev.com and edited to fit your situation.). You may need to edit it so it fits your current configuration. It is also untested.

I guess save is doing something like this.
fun save(doc: Document, col: MongoCollection<Document>) {
if (doc.getObjectId("_id") != null) {
doc.put("_id", ObjectId()) // generate a new id
}
col.replaceOne(Document("_id", doc.getObjectId("_id")), doc)
}
Maybe they removed save so you decide how to generate the new id.

Lucene: Multiple words in a single term

Let's say I have a docs like
stringfield:123456
textfield:name website stackoverflow
and If I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result, i will have to tokenize "name website" and add each token in phrasequery.
Is there any default way in QueryAPI to tokenize as it does while parsing a String Query.
Sure I can do that myself but reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query only has one term, which matches none of those name website.
The correct way to use a PhraseQuery, is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you:
new Term("textfield","name website")
Lucene will not tokenize the string "name website", instead use the whole as a term.
As the result what you said, when you index the document, the field textfield MUST be Indexed and Tokenized.

Lucene - get document ids from term

In Lucene 4.1, I see you can use DirectoryReader.docFreq() to get the number of documents in an index containing a given term. Is there a way to actually get those documents? Either the objects or id numbers would be fine. I think AtomicReader.termDocsEnum() would be useful, but I'm not sure if I can use AtomicReader - I don't see how to create an AtomicReader instance on a given directory.

Why not just search for it?
IndexSearcher searcher = new IndexSearcher(directoryReader);
TermQuery query = new TermQuery(new Term("field", "term"));
TopDocs topdocs = searcher.query(query, numberToReturn);

Lucene 4.0 IndexWriter updateDocument for Numeric Term

I just wanted to know how it is possible to to update (delete/insert) a document based on a numeric field.
So far I did this:
LuceneManager.updateDocument(writer, new Term("id", NumericUtils.intToPrefixCoded(sentenceId)), newDoc);
But now with Lucene 4.0 the NumericUtils class has changed to this which I don't really understand.
Any help?

With Lucene 5.x, this could be solved by code below:
int id = 1;
BytesRefBuilder brb = new BytesRefBuilder();
NumericUtils.intToPrefixCodedBytes(id, 0, brb);
Term term = new Term("id", brb.get());
indexWriter.updateDocument(term, doc); // or indexWriter.deleteDocument(term);

You can use it this way:
First you must set the FieldType's numeric type:
FieldType TYPE_ID = new FieldType();
...
TYPE_ID.setNumericType(NumericType.INT);
TYPE_ID.freeze();
and then:
int idTerm = 10;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(id, 0, bytes);
Term idTerm = new Term("id", bytes);
and now you'll be able to use idTerm to update the doc.

I would recommend, if possible, it would be better to store an ID as a keyword string, rather than a number. If it is simply a unique identifier, indexing as a keyword makes much more sense. This removes any need to mess with numeric formatting.
If it is actually being used as a number, then you might need to perform the update manually. That is, search for and fetch the document you wish to update, delete the old document with tryDeleteDocument, and then add the updated version with addDocument. This is basically what updateDocument does anyway, to my knowledge.
The first option would certainly be the better way, though. A non-numeric field to use as an update ID would make life easier.

With Lucene 4, you can now create IntField, LongField, FloatField or DoubleField like this:
document.add(new IntField("id", 6, Field.Store.NO));
To write the document once you modified it, it's still:
indexWriter.updateDocument(new Term("pk", "<pk value>"), document);
EDIT:
And here is a way to make a query including this numeric field:
// Query <=> id <= 7
Query query = NumericRangeQuery.newIntRange("id", Integer.MIN_VALUE, 7, true, true);
TopDocs topDocs = indexSearcher.search(query, 10);

According to the documentation of Lucene 4.0.0, the ID field must to be used with StringField class:
"A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache."
I had the same problem as you and I solved it by making this change. After that, my update and delete worked perfectly.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to get internal doc id set by lucene - java

Related

Java Couchbase Querying to find a document's ID?

Java method for MongoDB collection.save()

Lucene: Multiple words in a single term

Lucene - get document ids from term

Lucene 4.0 IndexWriter updateDocument for Numeric Term

Categories

Resources