How to STORE numeric values in lucene documents? - java

I am writing Java code that uses Lucene 3.4 to index text documents. Each document has an id and some other numeric values, as well as content and a title.
I add each document to the index according to the following code:
Document doc = new Document();
doc.add(new NumericField("id").setIntValue(writer.numDocs()));
doc.add(new NumericField("year").setIntValue(1988));
doc.add(new Field("content", new FileReader(file)));
writer.addDocument(doc);
writer.close();
But when I search and try to read the results, these fields come back as null. I know that a field added with Field.Store.NO returns null, but why does it happen here? What should I do to get the values of these fields?
doc.get("id"); // why does this return null? what should I do?

Numeric fields are by default not stored.
Use the NumericField(String, Field.Store, boolean) constructor to specify that it should be stored if you would like to retrieve it later.

Related

Lucene 6 How to avoid duplicate entries

Story:
I need to search for a list of transactionIds by a given username query, e.g. "Peter M*".
Question: How is it possible to keep the stored transactionIds unique?
I have populated my index with following documents:
Document doc = new Document();
doc.add(new StoredField(TRANSACTION_ID, data.getTransactionId()));
doc.add(new TextField(MARCHANT_NAME, data.getName(), Store.NO));
I have already tried two strategies (to avoid duplicate entries) for adding a new entry:
IndexWriter.updateDocument with a Term holding the transactionId to store.
Search for the current transactionId, delete it and store it:
You are using a StoredField for the TRANSACTION_ID field. That means it can be retrieved from the index, but is not indexed and can't be searched, and as such, it can't be used as a key to updateDocument. Use a StringField, instead.
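The difference between "stored" and "indexed" can be sketched with a toy model in plain Java. This is not the Lucene API: ToyIndex, Doc, update and countAfterTwoUpdates are hypothetical names made up for this illustration, and the only assumption carried over from the answer is that an update locates the old document through indexed terms, never through stored values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for an index; NOT the Lucene API. */
final class ToyIndex {
    /** A document that tracks stored values and indexed terms separately. */
    static final class Doc {
        final Map<String, String> stored = new HashMap<>();
        final Map<String, String> indexed = new HashMap<>();
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Mimics an update: delete every doc whose INDEXED term matches, then add. */
    void update(String field, String value, Doc doc) {
        docs.removeIf(d -> value.equals(d.indexed.get(field)));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    /** Builds a doc whose id is always stored and, optionally, also indexed. */
    static Doc doc(String id, boolean indexId) {
        Doc d = new Doc();
        d.stored.put("transactionId", id);
        if (indexId) {
            d.indexed.put("transactionId", id);
        }
        return d;
    }

    /** Writes the same id twice through update() and returns the doc count. */
    static int countAfterTwoUpdates(boolean indexId) {
        ToyIndex index = new ToyIndex();
        index.update("transactionId", "tx-1", doc("tx-1", indexId));
        index.update("transactionId", "tx-1", doc("tx-1", indexId));
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println("stored only:        " + countAfterTwoUpdates(false)); // 2
        System.out.println("stored and indexed: " + countAfterTwoUpdates(true));  // 1
    }
}
```

In the toy model the stored-only id leaves a duplicate behind because the delete step never finds the old document. In Lucene itself, a StringField created with Field.Store.YES is both indexed and stored, so it can serve as the update key and still be read back from search results.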

Remove Lucene document by query of not analyzed text field

I'm 99% sure I had this working in the past, maybe I'm wrong.
Anyway, I'd like to delete a Lucene document by a Field which is stored but not analyzed and contains text.
So the problem, it seems, is that calling luceneWriter.deleteDocuments(query) doesn't delete the document unless the field referenced in query is Field.Index.ANALYZED or a simple number.
Some code:
Integer myId = 1234;
Document doc = new Document();
Field field = new Field("MyIdField", String.valueOf(myId), Field.Store.YES, Field.Index.ANALYZED);
doc.add(field);
indexWriter.addDocument(doc);
indexWriter.commit();
...
QueryParser parser = new QueryParser(VERSION, "MyIdField", ANALYZER);
Query query = parser.parse("MyIdField:1234");
indexWriter.deleteDocuments(query);
indexWriter.commit();
Everything works!
Sweet.. what if the field is not analyzed?
Field field = new Field("MyIdField", String.valueOf(myId), Field.Store.YES, Field.Index.NOT_ANALYZED);
Still works!
Awesome, what if it's not just a number?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.NOT_ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
Doesn't work!.. darn.
The query doesn't match the document and so it isn't deleted.
What if we do index it?
Field field = new Field("MyIdField", "ID" + myId, Field.Store.YES, Field.Index.ANALYZED);
...
Query query = parser.parse("MyIdField:ID1234");
It works again!
Ok, so if the field is not analyzed it can still be queried if it only contains a number? Am I missing something?
Thanks for taking some time.
Note:
Technically, there are two fields, making it an AND query. As such, I'd prefer to delete the documents with a Query rather than a Term. I'm not sure if that makes a difference but wanted to emphasize I would like to stick with a solution using a Query.
According to this question, you have to use a PhraseQuery to search a not analyzed field. Your code
Query query = parser.parse("MyIdField:ID1234");
would yield a TermQuery instead, and thus won't match.
I recommend trying a KeywordAnalyzer instead. Remember that even if your field isn't analyzed, the query parser may still analyze your query string, so the match can fail anyway.
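The point about the query parser analyzing the query string can be illustrated with a toy model in plain Java. It is not the Lucene API: AnalyzerMismatchDemo and matches are hypothetical names, and it assumes the ANALYZER in the question lowercases its input (as StandardAnalyzer does), which the question does not actually state.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Toy model of query-time analysis; NOT the Lucene API. */
final class AnalyzerMismatchDemo {
    /** Assumption: the query analyzer lowercases terms (StandardAnalyzer-like). */
    static final UnaryOperator<String> LOWERCASING = s -> s.toLowerCase(Locale.ROOT);

    /** Assumption: a keyword-style analyzer passes the term through unchanged. */
    static final UnaryOperator<String> KEYWORD = s -> s;

    /** The indexed term must equal the query text after query-time analysis. */
    static boolean matches(String indexedTerm, String queryText, UnaryOperator<String> queryAnalyzer) {
        return indexedTerm.equals(queryAnalyzer.apply(queryText));
    }

    public static void main(String[] args) {
        // Not-analyzed field holding only digits: lowercasing changes nothing.
        System.out.println(matches("1234", "1234", LOWERCASING));     // true
        // Not-analyzed field indexed verbatim as "ID1234": query becomes "id1234".
        System.out.println(matches("ID1234", "ID1234", LOWERCASING)); // false
        // Analyzed field: the same lowercasing is applied at index and query time.
        System.out.println(matches(LOWERCASING.apply("ID1234"), "ID1234", LOWERCASING)); // true
        // Keyword-style query analysis leaves the query term untouched.
        System.out.println(matches("ID1234", "ID1234", KEYWORD));     // true
    }
}
```

Under that assumption the four cases line up with what the question observed: digits survive lowercasing, "ID1234" does not, and either analyzing the field the same way or keeping the query term verbatim restores the match.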

Apache lucene indexing

I am creating a text search application for log files using Apache Lucene. I am using the code below to index the files:
doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
doc.add(new StoredField("filename", file.getCanonicalPath()));
Here I am creating 3 indexed fields for each file, but when searching I can retrieve the value of only one of them; the other two come back as null. This is the search-side code:
Document d = searcher.doc(docId);
System.out.println(i+":File name is"+d.get("filename"));
System.out.println(i+":File name is"+d.get("modified"));
System.out.println(i+":File name is"+d.get("contents"));
The output I am getting is
2 total matching documents
0:File name is/home/maclean/NetBeansProjects/LogSearchEngine/src/SimpleSearcher.java
0:File name isnull
0:File name isnull
1:File name is/home/maclean/NetBeansProjects/LogSearchEngine/src/SimpleFileIndexer.java
1:File name isnull
1:File name isnull
What am I doing wrong?
In Lucene, if you want to retrieve the value of a field, you need to store that field. If a field is not stored, its value will be null when you read it from a search result.
For modified, you've explicitly specified it as an un-stored field by passing the argument Field.Store.NO; as a result, its value is not stored in the index, and null is returned on search. To store and retrieve its value, you need to change the constructor call to:
doc.add(new LongField("modified", file.lastModified(), Field.Store.YES));
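For contents, the Reader-based constructor you've used always creates an un-stored field, and TextField has no Reader constructor that takes a Field.Store argument. To store the text, read the file into a String first (fileText below stands for that String) and use the String constructor:
doc.add(new TextField("contents", fileText, Field.Store.YES));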
After these changes, you should be able to retrieve both the fields.
You are able to retrieve values for filename because you are using a constructor that creates stored fields by default.

Lucene 4.0 IndexWriter updateDocument for Numeric Term

I just wanted to know how to update (delete/insert) a document based on a numeric field.
So far I did this:
LuceneManager.updateDocument(writer, new Term("id", NumericUtils.intToPrefixCoded(sentenceId)), newDoc);
But now with Lucene 4.0 the NumericUtils class has changed to this which I don't really understand.
Any help?
With Lucene 5.x, this could be solved by code below:
int id = 1;
BytesRefBuilder brb = new BytesRefBuilder();
NumericUtils.intToPrefixCodedBytes(id, 0, brb);
Term term = new Term("id", brb.get());
indexWriter.updateDocument(term, doc); // or indexWriter.deleteDocuments(term);
You can use it this way:
First you must set the FieldType's numeric type:
FieldType TYPE_ID = new FieldType();
...
TYPE_ID.setNumericType(NumericType.INT);
TYPE_ID.freeze();
and then:
int id = 10;
BytesRef bytes = new BytesRef(NumericUtils.BUF_SIZE_INT);
NumericUtils.intToPrefixCoded(id, 0, bytes);
Term idTerm = new Term("id", bytes);
and now you'll be able to use idTerm to update the doc.
If possible, I would recommend storing an ID as a keyword string rather than a number. If it is simply a unique identifier, indexing it as a keyword makes much more sense and removes any need to mess with numeric formatting.
If it is actually being used as a number, then you might need to perform the update manually. That is, search for and fetch the document you wish to update, delete the old document with tryDeleteDocument, and then add the updated version with addDocument. This is basically what updateDocument does anyway, to my knowledge.
The first option would certainly be the better way, though. A non-numeric field to use as an update ID would make life easier.
With Lucene 4, you can now create IntField, LongField, FloatField or DoubleField like this:
document.add(new IntField("id", 6, Field.Store.NO));
To write the document once you modified it, it's still:
indexWriter.updateDocument(new Term("pk", "<pk value>"), document);
EDIT:
And here is a way to make a query including this numeric field:
// Query <=> id <= 7
Query query = NumericRangeQuery.newIntRange("id", Integer.MIN_VALUE, 7, true, true);
TopDocs topDocs = indexSearcher.search(query, 10);
According to the documentation of Lucene 4.0.0, an ID field should be indexed with the StringField class:
"A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache."
I had the same problem as you and I solved it by making this change. After that, my update and delete worked perfectly.

How to index date field in lucene

I am new to Lucene. I have to index a date field.
I am using the following IndexWriter constructor in Lucene 3.0.0:
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
My question is:
Why does it need an analyzer when date fields are not analyzed? While indexing I used Field.Index.NOT_ANALYZED.
You can store a date field like this:
Document doc = new Document();
doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
Field.Store.YES, Field.Index.NOT_ANALYZED));
where f is a File object.
Now add the above document through the IndexWriter.
Check out the sample code that comes with Lucene, and the following link:
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/DateTools.html
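To see why a date indexed this way still behaves sensibly as a single not-analyzed term, here is a small plain-Java sketch. It does not call Lucene: MinuteResolutionDemo and toMinuteString are hypothetical names, and the yyyyMMddHHmm pattern in GMT is my understanding of what DateTools produces at Resolution.MINUTE, so treat it as an assumption and check the javadoc linked above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Plain-Java approximation of minute-resolution date strings; NOT DateTools itself. */
final class MinuteResolutionDemo {
    // Assumed to mirror DateTools.Resolution.MINUTE: fixed width, GMT, most significant part first.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);

    /** Formats epoch milliseconds (e.g. File.lastModified()) as a sortable string. */
    static String toMinuteString(long epochMillis) {
        return MINUTE.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        String earlier = toMinuteString(0L);      // 197001010000
        String later = toMinuteString(90_000L);   // 197001010001
        System.out.println(earlier + " < " + later + " : " + (earlier.compareTo(later) < 0));
    }
}
```

Because the string has a fixed width and runs from year down to minute, plain string comparison gives chronological order, which is what makes range queries over such terms workable without any analyzer.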
UPDATE
Field.Index NOT_ANALYZED
Index the field's value without using an Analyzer, so it can be searched. As no analyzer is used the value will be stored as a single term. This is useful for unique Ids like product numbers.
As per the Lucene javadoc, you don't need an analyzer for fields using Field.Index.NOT_ANALYZED, but I think the IndexWriter expects an analyzer by design, since indexing an exact replica of the data is not efficient in terms of storage and searching.
