How to index date field in lucene

How to index date field in lucene - java

I am new to lucene. I have to index date field.
i am using Following IndexWriter constructor in lucene 3.0.0.
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED)
my point is:
Why it needs a analyzer when date fields are not analyzed,while indexing I used Field.Index.NOT_ANALYZED.

You can store date field in this fashion..
Document doc = new Document();
doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
Field.Store.YES, Field.Index.NOT_ANALYZED));
where f is a file object...
Now use the above document for indexwriter...
checkout the sample code comes with lucene... and the following link...
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/DateTools.html
UPDATE
Field.Index NOT_ANALYZED
Index the field's value without using
an Analyzer, so it can be searched. As
no analyzer is used the value will be
stored as a single term. This is
useful for unique Ids like product
numbers.
As per lucene javadoc you don't need analyzer for fields using Field.Index NOT_ANALYZED but i think by design the IndexWriter expects an analyzer as indexing the exact replica of data is not efficient in terms of storage and searching.

Related

Searching for Terms with whitespace using Lucene

I'm trying to use Lucene to add a search feature but can't seem to get an index to work with significant whitespace. I've got the following test case setup:
RAMDirectory directory = new RAMDirectory();
KeywordAnalyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("content", analyzer);
parser.setSplitOnWhitespace(false);
Query query = parser.parse("Bill E");
TopDocs docs = searcher.search(query, 1);
assertTrue(docs.totalHits > 0);
I'm using Lucene 6.6.0 and from what I understand the KeywordAnalyzer is what I'm looking for:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
But I can't seem to get any matching documents that contain whitespace.
Any ideas on how to solve this?

When you index, you have a single document with a single field and with a single term with value - Bill Evans
When you are going to search, TermQuery produced by QueryParser tries to search with term value - Bill E and that term obviously doesn't exist in index so you get zero hits.
if you replace your search string with - Bill Evans , you will get results.
Please refer this question too
First , you need to separate your indexing and searching concerns. You can only search what is indexed. If you are indexing full texts without breaking into tokens then at search times - you need to produce WildCardQuery , FuzzyQuery , PhraseQuery etc if your input string at search time is different than what in indexed. TermQuery searches for exact term values.
My suggestion would to be to store full text value ( without tokens - StringField would do that ) as well as generate additional tokens breaking on space using something like - SimpleAnalyzer .
So Something like,
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));
Above code with SimpleAnalyzer , you will now have terms - bill & evans
( as well as full text as stored field ) and if you now search with same analyzer , your query would be like - content:bill content:e & you will get a result.
All in all - system is working the way you have coded it :)
So understand your requirements first as what you wish to index and what kind of queries you wish to perform on that index.

Lucene 6 How to avoid duplicate entries

Story:
I need to search for a list of transactionIds be a given username query e.g "Peter M*".
Question: How is it possible to keep the stored transactionIds unique?
I have populated my index with following documents:
Document doc = new Document();
doc.add(new StoredField(TRANSACTION_ID, data.getTransactionId()));
doc.add(new TextField(MARCHANT_NAME, data.getName(), Store.NO));
I have tried allready two strategies (to avoid duplicate entries) to add a new entry.
IndexWriter.updateDocument with a Term holding the transactionId to store.
Search for the current transactionId, delete it and store it:

You are using a StoredField for the TRANSACTION_ID field. That means it can be retrieved from the index, but is not indexed and can't be searched, and as such, it can't be used as a key to updateDocument. Use a StringField, instead.

How to get internal doc id set by lucene

I've indexed some documents in the index module. Intuitively, Lucene set IDs for any indexed document. These IDs may not have a specific order though. Concretely, the first doc ID is set to 127, the second one is set to 133 and so on...
In the search module, I have the document (which I want to process), But I'm trying to get these already-set docIDs (that was set by Lucene in index time) See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally could end up finding the solution.
I found out that lucene does not allow access to its internal document IDs. However, we can iterate through the documents and get their TermVector. Seems that it's the only possible way to get term vectors. I'm using the script below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

How to STORE numeric values in lucene documents?

I am writing a code in java and use lucene 3.4 to index the text documents. Each document has an id and some other numerical values as well as content and title.
I add each document to the index according to the following code:
Document doc = new Document();
doc.add(new NumericField("id").setIntValue(writer.numDocs()));
doc.add(new NumericField("year").setIntValue(1988));
doc.add(new Field("content", new FileReader(file)));
writer.addDocument(doc);
writer.close();
But when I search and want to get the results, it returns null for these fields. I know that whenever I add a field and set the Field.Store.NO, it returns null, but why it happens right now? What should I do to get the value of these fields?
doc.get("id"); //why it returns null? what should I do?

Numeric fields are by default not stored.
Use the NumericField(String, Field.Store, boolean) constructor to specify that it should be stored if you would like to retrieve it later.

Remove Lucene document by query of not analyzed text field

I'm 99% sure I had this working in the past, maybe I'm wrong.
Anyway, I'd like to delete a Lucene document by a Field which is stored but not analyzed and contains text.
So the problem, it seems, is that calling luceneWriter.deleteDocuments(query) doesn't delete the document unless the field referenced in query is Field.Index.ANALYZED or a simple number.
Some code:
Integer myId = 1234;
Document doc = new Document();
Field field = new Field("MyIdField", myId, Field.Store.YES, Field.Index.ANALYZED);
doc.add(field);
indexWriter.add(doc);
indexWriter.commit();
...
QueryParser parser = new QueryParser(VERSION, "MyIdField", ANALYZER);
Query query = parser.parse("MyIdField:1234");
indexWriter.deleteDocuments(query);
indexWriter.commit();
Everything works!
Sweet.. what if the field is not analyzed?
Field field = new Field("MyIdField", myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
Still works!
Awesome, what if it's not just a number?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
Doesn't work!.. darn.
The query doesn't match the document and so it isn't deleted.
What if we do index it?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
It works again!
Ok, so if the field is not analyzed it can still be queried if it only contains a number? Am I missing something?
Thanks for taking some time.
Note:
Technically, there are two fields, making it an AND query. As such, I'd prefer to delete the documents with a Query rather than a Term. I'm not sure if that makes a difference but wanted to emphasize I would like to stick with a solution using a Query.

According to this question, you have to use a PhraseQuery to search a not analyzed field. Your code
Query query = parser.parse("MyIdField:ID1234");
would yield a TermQuery instead, and thus won't match.
I recommend you to try a KeywordAnalyzer instead (remember that, even if your field isn't analyzed, the query parser could still analyze your query string and therefore your match could fail anyway).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.