Lucene 6 How to avoid duplicate entries

Lucene 6 How to avoid duplicate entries - java

Story:
I need to search for a list of transactionIds be a given username query e.g "Peter M*".
Question: How is it possible to keep the stored transactionIds unique?
I have populated my index with following documents:
Document doc = new Document();
doc.add(new StoredField(TRANSACTION_ID, data.getTransactionId()));
doc.add(new TextField(MARCHANT_NAME, data.getName(), Store.NO));
I have tried allready two strategies (to avoid duplicate entries) to add a new entry.
IndexWriter.updateDocument with a Term holding the transactionId to store.
Search for the current transactionId, delete it and store it:

You are using a StoredField for the TRANSACTION_ID field. That means it can be retrieved from the index, but is not indexed and can't be searched, and as such, it can't be used as a key to updateDocument. Use a StringField, instead.

Related

How to avoid empty documents in a Firestore query?

I'm trying to display some data which is hosted under a subcollection coll1 that exist in doc1 using this code:
val query = db!!.collection("coll1").document(doc1)
.collection("coll2").orderBy("field1", Query.Direction.ASCENDING)
The problem is when I use this code, the doc1 variable isn't generated yet so I get this error:
Invalid document reference. Document references must have an even number of segments
How can I avoid this error till I generate doc1?

You need to know the exact names of all collections and documents in order to perform a query, so your query should only be performed after you have the string value for doc1 that identified the document where the subcollection has been organized.

Java Couchbase Querying to find a document's ID?

I'm new to couchbase. I'm using Java for this. I'm trying to remove a document from a bucket by looking up its ID with query parameters(assuming the ID is unknown).
Lets say I have a bucket called test-data. In that bucked I have a document with ID of 555 and Content of {"name":"bob","num":"10"}
I want to be able to remove that document by querying using 'name' and 'num'.
So far I have this (hardcoded):
String statement = "SELECT META(`test-data`).id from `test-data` WHERE name = \"bob\" and num = \"10\"";
N1qlQuery query = N1qlQuery.simple(statement);
N1qlQueryResult result = bucket.query(query);
List<N1qlQueryRow> row = result.allRows();
N1qlQueryRow res1 = row.get(0);
System.out.println(res1);
//output: {"id":"555"}
So I'm getting a json that has the document's ID in it. What would be the best way to extract that ID so that I can then remove the queryed document from the bucket using its ID? Am I doing to many steps? Is there a better way to extract the document's ID?
bucket.remove(docID)
Ideally I'd like to use something like a N1q1QueryResult to get this going but I'm not sure how to set that up.
N1qlQueryResult result = bucket.query(select("META.id").fromCurrentBucket().where((x("num").eq("\""+num+"\"")).and(x("name").eq("\""+name+"\""))));
But that isn't working at the moment.
Any help or direction would be appreciated. Thanks.

There might be a better way which is running this kind of query:
delete from `test-data` use keys '00000874a09e749ab6f199c0622c5cb0' returning raw META(`test-data`).id
or if your fields has index:
delete from `test-data` where name='bob' and num='10' returning raw META(`test-data`).id
This query deletes the specified document with given document key (which is meta.id) and returns document id of deleted document if it deletes any document. Returns empty if no documents deleted.
You can implement this query with couchbase sdk as follows:
Statement statement = deleteFrom("test-data")
.where(x("name").eq(s("bob")).and(x("num").eq(s("10"))))
.returningRaw(meta(i("test-data")).get("id"));
You can make this statement parameterized or just execute like that.

How to get internal doc id set by lucene

I've indexed some documents in the index module. Intuitively, Lucene set IDs for any indexed document. These IDs may not have a specific order though. Concretely, the first doc ID is set to 127, the second one is set to 133 and so on...
In the search module, I have the document (which I want to process), But I'm trying to get these already-set docIDs (that was set by Lucene in index time) See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally could end up finding the solution.
I found out that lucene does not allow access to its internal document IDs. However, we can iterate through the documents and get their TermVector. Seems that it's the only possible way to get term vectors. I'm using the script below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

How to STORE numeric values in lucene documents?

I am writing a code in java and use lucene 3.4 to index the text documents. Each document has an id and some other numerical values as well as content and title.
I add each document to the index according to the following code:
Document doc = new Document();
doc.add(new NumericField("id").setIntValue(writer.numDocs()));
doc.add(new NumericField("year").setIntValue(1988));
doc.add(new Field("content", new FileReader(file)));
writer.addDocument(doc);
writer.close();
But when I search and want to get the results, it returns null for these fields. I know that whenever I add a field and set the Field.Store.NO, it returns null, but why it happens right now? What should I do to get the value of these fields?
doc.get("id"); //why it returns null? what should I do?

Numeric fields are by default not stored.
Use the NumericField(String, Field.Store, boolean) constructor to specify that it should be stored if you would like to retrieve it later.

How to index date field in lucene

I am new to lucene. I have to index date field.
i am using Following IndexWriter constructor in lucene 3.0.0.
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED)
my point is:
Why it needs a analyzer when date fields are not analyzed,while indexing I used Field.Index.NOT_ANALYZED.

You can store date field in this fashion..
Document doc = new Document();
doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
Field.Store.YES, Field.Index.NOT_ANALYZED));
where f is a file object...
Now use the above document for indexwriter...
checkout the sample code comes with lucene... and the following link...
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/DateTools.html
UPDATE
Field.Index NOT_ANALYZED
Index the field's value without using
an Analyzer, so it can be searched. As
no analyzer is used the value will be
stored as a single term. This is
useful for unique Ids like product
numbers.
As per lucene javadoc you don't need analyzer for fields using Field.Index NOT_ANALYZED but i think by design the IndexWriter expects an analyzer as indexing the exact replica of data is not efficient in terms of storage and searching.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Lucene 6 How to avoid duplicate entries - java

You are using a StoredField for the TRANSACTION_ID field. That means it can be retrieved from the index, but is not indexed and can't be searched, and as such, it can't be used as a key to updateDocument. Use a StringField, instead.

Related

How to avoid empty documents in a Firestore query?

Java Couchbase Querying to find a document's ID?

How to get internal doc id set by lucene

How to STORE numeric values in lucene documents?

How to index date field in lucene

Categories

Resources