Java Lucene IndexReader not working correctly

The story is this: I want to mimic the behavior of a relational database using a Lucene index in Java. I need to be able to do searching (reading) and writing at the same time.
For example, I want to save project information into an index. For simplicity, let's say a project has two fields: id and name. Now, before adding a new project to the index, I search whether a project with the given id is already present. For this I'm using an IndexSearcher. This operation completes successfully (namely, the IndexSearcher returns the internal doc id of the document that contains the project id I'm looking for).
Now I want to actually read the value of this project id, so now I'm using an IndexReader to get the indexed Lucene document from which I can extract the project id field.
The problem is that the IndexReader returns a Document whose fields are all null. So, to repeat: the IndexSearcher works correctly, but the IndexReader returns bogus data.
I'm thinking that somehow this has to do with the document field data not getting saved to the hard disk when the IndexWriter is flushed. The thing is, the first time I do this indexing operation the IndexReader works fine; it is only after a restart of my application that the situation described above happens. So I'm thinking that the first time around the data floats in RAM but doesn't get flushed correctly (or completely, since the IndexSearcher works) to the hard drive.
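(For reference, a minimal sketch of the flush-and-reopen step being reasoned about here, using the indexWriter/indexReader fields from the code below; this only illustrates making buffered changes durable and visible again, it is not necessarily the fix:)
indexWriter.commit();                                // durably write buffered documents to the directory
indexReader = IndexReader.open(indexWriter, false);  // reopen a near-real-time reader that sees them
indexSearcher = new IndexSearcher(indexReader);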
Maybe it will help if I give you the source code, so here it is (you can safely ignore the tryGetIdFromMemory part; I'm using it as a speed optimization trick):
public class ProjectMetadataIndexer {
private File indexFolder;
private Directory directory;
private IndexSearcher indexSearcher;
private IndexReader indexReader;
private IndexWriter indexWriter;
private Version luceneVersion = Version.LUCENE_31;
private Map<String, Integer> inMemoryIdHolder;
private final int memoryCapacity = 10000;
public ProjectMetadataIndexer() throws IOException {
inMemoryIdHolder = new HashMap<String, Integer>();
indexFolder = new File(ConfigurationSingleton.getInstance()
.getProjectMetaIndexFolder());
directory = FSDirectory.open(indexFolder);
IndexWriterConfig config = new IndexWriterConfig(luceneVersion,
new WhitespaceAnalyzer(luceneVersion));
indexWriter = new IndexWriter(directory, config);
indexReader = IndexReader.open(indexWriter, false);
indexSearcher = new IndexSearcher(indexReader);
}
public int getProjectId(String projectName) throws IOException {
int fromMemoryId = tryGetProjectIdFromMemory(projectName);
if (fromMemoryId >= 0) {
return fromMemoryId;
} else {
int projectId;
Term projectNameTerm = new Term("projectName", projectName);
TermQuery projectNameQuery = new TermQuery(projectNameTerm);
BooleanQuery query = new BooleanQuery();
query.add(projectNameQuery, Occur.MUST);
TopDocs docs = indexSearcher.search(query, 1);
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
indexWriter.close();
indexReader.close();
indexSearcher.close();
indexReader = IndexReader.open(directory);
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
storeInMemory(projectName, projectId);
return projectId;
}
}
private int tryGetProjectIdFromMemory(String projectName) {
String key = projectName;
Integer id = inMemoryIdHolder.get(key);
if (id == null) {
return -1;
} else {
return id.intValue();
}
}
private void storeInMemory(String projectName, int projectId) {
if (inMemoryIdHolder.size() > memoryCapacity) {
inMemoryIdHolder.clear();
}
String key = projectName;
inMemoryIdHolder.put(key, projectId);
}
private void indexMeta(int projectId, String projectName)
throws CorruptIndexException, IOException {
Document document = new Document();
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Field nameField = new Field("projectName", projectName, Store.NO,
Index.ANALYZED);
document.add(nameField);
indexWriter.addDocument(document);
}
public void close() throws CorruptIndexException, IOException {
indexReader.close();
indexWriter.close();
}
}
To be more precise, all the problems occur in this if:
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
On the else branch...
I don't know what is wrong.

Do you store the respective fields? If not, the field values are "only" kept in the inverted index part, i.e. the field value is mapped to the document, but the stored document itself doesn't contain the field value.
The part of the code where you save the document might be helpful.
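For illustration, here is a minimal sketch of what storing the values could look like in the indexMeta() method above, keeping the Lucene 3.x Field API from the question; the key change is Store.YES, so that IndexReader.document() can hand the values back:
Field idField = new Field("projectId", String.valueOf(projectId),
        Store.YES, Index.ANALYZED);   // Store.YES keeps the value in the stored document
document.add(idField);
Field nameField = new Field("projectName", projectName,
        Store.YES, Index.ANALYZED);
document.add(nameField);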

I had a hard time figuring out how to index/search numbers and I just wanted to say the following snippets of code really helped me out:
projectId = Integer.valueOf(document.get("projectId"));
////////////
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Thanks!

Related

How to keep Lucene index without deleted documents

This is my first question on Stack Overflow, so wish me luck.
I am running a classification process over a Lucene index with Java, and I need to update a document field named category. I have been using Lucene 4.2 with the index writer's updateDocument() function for that purpose, and it's working very well, except for the deletion part. Even if I use the forceMergeDeletes() function after the update, the index shows me some already-deleted documents. For example, if I run the classification over an index with 1,000 documents, the final number of documents in the index stays the same and works as expected, but when I increase the index to 10,000 documents, the index shows some already-deleted documents, though not all of them. So, how can I actually erase those deleted documents from the index?
Here are some snippets of my code:
public static void main(String[] args) throws IOException, ParseException {
///////////////////////Preparing config data////////////////////////////
File indexDir = new File("/indexDir");
Directory fsDir = FSDirectory.open(indexDir);
IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
IndexReader reader = DirectoryReader.open(fsDir);
IndexSearcher indexSearcher = new IndexSearcher(reader);
KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);
classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());
System.out.println("***Before***");
showIndexedDocuments(reader);
System.out.println("***Before***");
int maxdoc = reader.maxDoc();
int j = 0;
for (int i = 0; i < maxdoc; i++) {
Document doc = reader.document(i);
String clusterClasif = doc.get("category");
String text = doc.get("text");
String docid = doc.get("doc_id");
ClassificationResult<BytesRef> result = classifier.assignClass(text);
String classified = result.getAssignedClass().utf8ToString();
if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
Term term = new Term("doc_id", docid);
doc.removeField("category");
doc.add(new StringField("category",
classified, Field.Store.YES));
indexWriter.updateDocument(term,doc);
j++;
}
}
indexWriter.forceMergeDeletes(true);
indexWriter.close();
System.out.println("Classified documents count: " + j);
System.out.println();
reader.close();
reader = DirectoryReader.open(fsDir);
System.out.println("Deleted docs: " + reader.numDeletedDocs());
System.out.println("***After***");
showIndexedDocuments(reader);
}
private static void showIndexedDocuments(IndexReader reader) throws IOException {
int maxdoc = reader.maxDoc();
for (int i = 0; i < maxdoc; i++) {
Document doc = reader.document(i);
String idDoc = doc.get("doc_id");
String text = doc.get("text");
String category = doc.get("category");
System.out.println("Id Doc: " + idDoc);
System.out.println("Category: " + category);
System.out.println("Text: " + text);
System.out.println();
}
System.out.println("Total: " + maxdoc);
}
I have spent many hours looking for a solution to this. Some say that the deleted documents in the index are not important and that they will eventually be erased as we keep adding documents to the index, but I need to control that process in such a way that I can iterate over the index documents at any time and be sure that the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a method in the IndexReader class named isDeleted(docId) that tells whether a document has been marked as deleted; that could be half of the solution to my problem, but I have not found a way to do the same with Lucene 4.2. If you know how, I would really appreciate it if you shared it.
You can check whether a document is deleted using the MultiFields class, like:
Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
So, working this into your code, perhaps something like:
int maxdoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
if (liveDocs != null && !liveDocs.get(i)) continue; // liveDocs is null when the index has no deletions
Document doc = reader.document(i);
String idDoc = doc.get("doc_id");
....
}
By the way, it sounds like you have previously been working with 3.x and are now on 4.x. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions and how to resolve them.

Lucene changing from RAMDirectory to FSDirectory - content field missing

I'm just a Lucene starter, and I got stuck on a problem during a change from a RAMDirectory to an FSDirectory:
First my code:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
Directory DIR = FSDirectory.open(new File(INDEXLOC)); //INDEXLOC = "path/to/dir/"
// RAMDirectory DIR = new RAMDirectory();
// Index some made up content
IndexWriter writer =
new IndexWriter(DIR, iwc);
// Store both position and offset information
FieldType type = new FieldType();
type.setStored(true);
type.setStoreTermVectors(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setIndexed(true);
type.setTokenized(true);
IDocumentParser p = DocumentParserFactory.getParser(f);
ArrayList<ParserDocument> DOCS = p.getParsedDocuments();
for (int i = 0; i < DOCS.size(); i++) {
Document doc = new Document();
Field id = new StringField("id", "doc_" + i, Field.Store.YES);
doc.add(id);
Field text = new Field("content", DOCS.get(i).getContent(), type);
doc.add(text);
writer.addDocument(doc);
}
writer.close();
// Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(DIR));
// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "zahl"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
ScoreDoc scoreDoc = results.scoreDocs[i];
System.out.println("Score Doc: " + scoreDoc);
}
IndexReader reader = searcher.getIndexReader();
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(), new Bits.MatchAllBits(reader.numDocs()), termContexts);
int window = 2;// get the words within two of the match
while (spans.next() == true) {
Map<Integer, String> entries = new TreeMap<Integer, String>();
System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
int start = spans.start() - window;
int end = spans.end() + window;
Terms content = reader.getTermVector(spans.doc(), "content");
TermsEnum termsEnum = content.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
// could store the BytesRef here, but String is easier for this
// example
String s = new String(term.bytes, term.offset, term.length);
DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
int i = 0;
int position = -1;
while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
if (position >= start && position <= end) {
entries.put(position, s);
}
i++;
}
}
}
System.out.println("Entries:" + entries);
}
It's just some code I found on a great website and wanted to try. Everything works great using the RAMDirectory, but if I change it to my FSDirectory it gives me a NullPointerException like:
Exception in thread "main" java.lang.NullPointerException at
com.org.test.TextDB.myMethod(TextDB.java:184) at
com.org.test.Main.main(Main.java:31)
The statement Terms content = reader.getTermVector(spans.doc(), "content"); seems to get no result and returns null, hence the exception. But why? In my RAM directory everything works fine.
It seems that the IndexWriter or the reader (I really don't know which) didn't write or didn't read the field "content" properly. But I really don't know why it is 'written' in a RAMDirectory and not in an FSDirectory?!
Does anybody have an idea?
Gave this a quick test run, and I can't reproduce your issue.
I think the most likely issue here is old documents in your index. The way this is written, every time it is run, more documents will be added to your index. Old documents from previous runs won't get deleted or overwritten; they'll just stick around. So, if you have run this before on the same directory, say, perhaps before you added the line type.setStoreTermVectors(true);, some of your results may be these old documents without term vectors, and reader.getTermVector(...) will return null if the document does not store term vectors.
Of course, anything indexed in a RAMDirectory will be dropped as soon as execution finishes, so the issue would not occur in that case.
A simple solution would be to delete the index directory and run it again.
If you want to start with a fresh index when you run this, you can set that up through the IndexWriterConfig:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
That's a guess, of course, but seems consistent with the behavior you've described.

How to access ElasticSearch parent-doc fields from custom facet over child docs

I have most of a parent/child-doc solution for a problem I'm working on, but I ran into a hitch: from inside a facet that iterates over the child docs, I need to access the value of a parent doc field. I have (or I can get) the parent doc ID (from the _parent field of the child doc, or worst case by indexing it again as a normal field), but that's an "external" ID, not the node-internal ID that I need in order to load the field value from the field cache. (I'm using default routing, so the parent doc is definitely in the same shard as the children.)
More concretely, here's what I have in the FacetCollector so far (ES 0.20.6):
protected void doSetNextReader(IndexReader reader, int docBase) throws IOException {
/* not sure this will work, otherwise I can index the field separately */
parentFieldData = (LongFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.LONG, reader, "_parent");
parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.FLOAT, reader, "springiness");
/* ... */
protected void doCollect(int doc) throws IOException {
long parentID = parentFieldData.value(doc); // or whatever the correct equivalent here is
// here's the problem:
parentSpringiness = parentSpringinessFieldData.value(parentID)
// type error: expected int (node-internal ID), got long (external ID)
Any suggestions? (I can't upgrade to 0.90 yet but would be interested to hear if that would help.)
Honking great disclaimer: (1) I ended up not using this approach at all, so this is only slightly-tested code, and (2) as far as I can see it will be pretty horribly inefficient, and it has the same memory overhead as parent queries. If another approach will work for you, do consider it (for my use case I ended up using nested documents, with a custom facet collector that iterates over both the nested and the parent documents, to have easy access to the field values of both).
The example within the ES code to work from is org.elasticsearch.index.search.child.ChildCollector. The first element you need is in the Collector initialisation:
try {
context.idCache().refresh(context.searcher().subReaders());
} catch (Exception e) {
throw new FacetPhaseExecutionException(facetName, "Failed to load parent-ID cache", e);
}
This makes possible the following line in doSetNextReader():
typeCache = context.idCache().reader(reader).type(parentType);
which gives you a lookup of the parent doc's UId in doCollect(int childDocId):
HashedBytesArray postingUid = typeCache.parentIdByDoc(childDocId);
The parent document won't necessarily be found in the same reader as the child doc: when the Collector initialises you also need to store all readers (needed to access the field value) and for each reader an IdReaderTypeCache (to resolve the parent doc's UId to a reader-internal docId).
this.readers = new Tuple[context.searcher().subReaders().length];
for (int i = 0; i < readers.length; i++) {
IndexReader reader = context.searcher().subReaders()[i];
readers[i] = new Tuple<IndexReader, IdReaderTypeCache>(reader, context.idCache().reader(reader).type(parentType));
}
this.context = context;
Then when you need the parent doc field, you have to iterate over the reader/typecache pairs looking for the right one:
int parentDocId = -1;
for (Tuple<IndexReader, IdReaderTypeCache> tuple : readers) {
IndexReader indexReader = tuple.v1();
IdReaderTypeCache idReaderTypeCache = tuple.v2();
if (idReaderTypeCache == null) { // might be if we don't have that doc with that type in this reader
continue;
}
parentDocId = idReaderTypeCache.docById(postingUid);
if (parentDocId != -1 && !indexReader.isDeleted(parentDocId)) {
FloatFieldData parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(
FieldDataType.DefaultTypes.FLOAT,
indexReader,
"springiness");
parentSpringiness = parentSpringinessFieldData.value(parentDocId);
break;
}
}
if (parentDocId == -1) {
throw new FacetPhaseExecutionException(facetName, "Parent doc " + postingUid + " could not be found!");
}

Lucene Index - single term and phrase querying

I've read some documents and built a Lucene index, which looks like this:
Documents:
id 1
keyword foo bar
keyword john
id 2
keyword foo
id 3
keyword john doe
keyword bar foo
keyword what the hell
I want to query Lucene in a way where I can combine single terms and phrases.
Let's say my query is
foo bar
should give back doc ids 1, 2 and 3
The query
"foo bar"
should give back doc id 1
The query
john
should give back doc ids 1 and 3
The query
john "foo bar"
should give back doc id 1
My implementation in Java is not working, and reading tons of documentation didn't help either.
When I query my index with
"foo bar"
I get 0 hits
When I query my index with
foo "john doe"
I get back doc ids 1, 2 and 3 (I would expect only doc id 3, since the query is meant as foo AND "john doe"). The problem is that "john doe" by itself gives back 0 hits, but foo gives back 3 hits.
My goal is to combine single terms and phrase terms. What am I doing wrong? I've also played around with the analyzers, with no luck.
My implementation looks like this:
Indexer
import ...
public class Indexer
{
private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);
private final File indexDir;
private IndexWriter writer;
public Indexer(File indexDir)
{
this.indexDir = indexDir;
this.writer = null;
}
private IndexWriter createIndexWriter()
{
try
{
Directory dir = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(256.0);
IndexWriter idx = new IndexWriter(dir, iwc);
idx.deleteAll();
return idx;
} catch (IOException e)
{
throw new RuntimeException(String.format("Could create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
}
}
public void index(TestCaseDescription desc)
{
if (writer == null)
writer = createIndexWriter();
Document doc = new Document();
addPathToDoc(desc, doc);
addLastModifiedToDoc(desc, doc);
addIdToDoc(desc, doc);
for (String keyword : desc.getKeywords())
addKeywordToDoc(doc, keyword);
updateIndex(doc, desc);
}
private void addIdToDoc(TestCaseDescription desc, Document doc)
{
Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
idField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(idField);
}
private void addKeywordToDoc(Document doc, String keyword)
{
Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(keywordField);
}
private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
{
NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
modifiedField.setLongValue(desc.getLastModified());
doc.add(modifiedField);
}
private void addPathToDoc(TestCaseDescription desc, Document doc)
{
Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(pathField);
}
private void updateIndex(Document doc, TestCaseDescription desc)
{
try
{
if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
{
// New index, so we just add the document (no old document can be there):
LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.addDocument(doc);
} else
{
// Existing index (an old copy of this document may have been indexed) so
// we use updateDocument instead to replace the old one matching the exact
// path, if present:
LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
}
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
desc.getPath()), e);
}
}
public void store()
{
try
{
writer.close();
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()));
}
writer = null;
}
}
Searcher:
import ...
public class Searcher
{
private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);
private final Analyzer analyzer;
private final QueryParser parser;
private final File indexDir;
public Searcher(File indexDir)
{
this.indexDir = indexDir;
analyzer = new StandardAnalyzer(Version.LUCENE_34);
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
parser.setAllowLeadingWildcard(true);
}
public List<String> search(String searchString)
{
List<String> testCaseIds = new ArrayList<String>();
try
{
IndexSearcher searcher = getIndexSearcher(indexDir);
Query query = parser.parse(searchString);
LOG.info("Searching for: " + query.toString(parser.getField()));
AllDocCollector results = new AllDocCollector();
searcher.search(query, results);
LOG.info("Found [{}] hit", results.getHits().size());
for (ScoreDoc scoreDoc : results.getHits())
{
Document doc = searcher.doc(scoreDoc.doc);
String id = doc.get(LuceneConstants.FIELD_ID);
testCaseIds.add(id);
}
searcher.close();
return testCaseIds;
} catch (Exception e)
{
throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
}
}
private IndexSearcher getIndexSearcher(File indexDir)
{
try
{
FSDirectory dir = FSDirectory.open(indexDir);
return new IndexSearcher(dir);
} catch (IOException e)
{
LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
throw new RuntimeException(e);
}
}
}
Why are you using DOCS_ONLY?! If you only index doc IDs, then you only have a basic inverted index with term->document mappings but no proximity information. That's why your phrase queries don't work.
I think you roughly want:
keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"
Which is to say, phrase match "foo bar" and boost it (prefer the full phrase), OR match "foo", OR match "bar".
The full query syntax is here: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html
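As an indexing-side illustration, here is a rough sketch of adding the keyword field without the DOCS_ONLY downgrade, reusing the Field API and the LuceneConstants.FIELD_KEYWORDS constant already shown in the question:
Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword,
        Field.Store.YES, Field.Index.ANALYZED);
// no setIndexOptions(IndexOptions.DOCS_ONLY) call here: the default index options
// keep term frequencies and positions, which phrase queries need
doc.add(keywordField);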
EDIT:
It looks like one thing you're missing is that the default operator is OR. So you probably want to do something like this:
+keyword:john AND +keyword:"foo bar"
The plus sign means "must contain". You put the AND in explicitly so that the document must contain both (rather than the default, which translates to "must contain john OR must contain "foo bar").
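For completeness, a sketch of the same requirement built programmatically with the Lucene 3.x query classes instead of the query parser (the field name "keyword" stands in for whatever LuceneConstants.FIELD_KEYWORDS resolves to):
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("keyword", "john")), Occur.MUST);
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("keyword", "foo"));
phrase.add(new Term("keyword", "bar"));
query.add(phrase, Occur.MUST);   // both clauses must match, like the +/AND form above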
The problem was solved by replacing
StandardAnalyzer
with
KeywordAnalyzer
for both the indexer and the searcher.
Since the StandardAnalyzer splits the input text into several words, I replaced it with KeywordAnalyzer, so the input (which can consist of one or more words) remains untouched. It will recognize a term like
bla foo
as a single keyword.
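A minimal sketch of that swap in the Indexer and Searcher shown earlier, assuming the same Lucene 3.4 setup (only the analyzer changes):
// Indexer
Analyzer analyzer = new KeywordAnalyzer();   // treats the whole field value, e.g. "bla foo", as one token
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
// Searcher
QueryParser parser = new QueryParser(Version.LUCENE_34,
        LuceneConstants.FIELD_KEYWORDS, new KeywordAnalyzer());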

Lucene 3.0.3 does not delete document

We use Lucene to index some internal documents. Sometimes we need to remove documents. These documents have a unique id and are represented by a class DocItem as follows (ALL THE CODE IS A SIMPLIFIED VERSION WITH ONLY THE SIGNIFICANT (I hope) PARTS):
public final class DocItem {
public static final String fID = "id";
public static final String fTITLE = "title";
private Document doc = new Document();
private Field id = new Field(fID, "", Field.Store.YES, Field.Index.ANALYZED);
private Field title = new Field(fTITLE, "", Field.Store.YES, Field.Index.ANALYZED);
public DocItem() {
doc.add(id);
doc.add(title);
}
... getters & setters
public Document getDoc() {
return doc;
}
}
So, to index a document, a new DocItem is created and passed to an indexer class as follows:
public static void index(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.addDocument(docitem.getDoc());
idxWriter.close();
}
We created an auxiliary method to iterate over the index directory:
public static void listAll() {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexReader reader = IndexReader.open(dir);
for (int i = 0; i < reader.maxDoc(); i++) {
Document doc = reader.document(i);
System.out.println(doc.get(DocItem.fID));
}
}
Running listAll(), we can see that our docs are being indexed properly; at least, we can see the id and other attributes.
We retrieve the document using IndexSearcher as follows:
public static DocItem search(String id) {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexSearcher searcher = new IndexSearcher(dir, true);
Query q = new QueryParser(Version.LUCENE_30, DocItem.fID, new StandardAnalyzer(Version.LUCENE_30)).parse(id);
TopDocs td = searcher.search(q, 1);
ScoreDoc[] hits = td.scoreDocs;
searcher.close();
return hits[0];
}
So after retrieving it, we are trying to delete it with:
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
The problem is that it doesn't work. The document is never deleted. If I run listAll() after the deletion, the document is still there. We tried to use IndexReader as well, with no luck.
Based on this post and this post, we think that we are using it correctly.
What are we doing wrong? Any advice? We are using Lucene 3.0.3 and Java 1.6.0_24.
TIA,
Bob
I would suggest using IndexReader's deleteDocuments(); it returns the number of documents deleted. This will help you narrow down whether the deletions are happening in the first place.
The advantage of this over the IndexWriter method is that it returns the total number of documents deleted; if none, it will return 0.
Also see "How do I delete documents from the index?" and this post.
Edit: Also, I noticed you open the IndexReader in read-only mode. Can you change the IndexReader.open() call in listAll() to pass false as the second parameter? That allows read-write access, and the read-only reader is perhaps the source of the error.
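As a sketch of that suggestion, assuming the Lucene 3.0 API and the dir/DocItem names from the question, deleting through a writable IndexReader and checking the returned count might look like:
IndexReader reader = IndexReader.open(dir, false);   // false = open for read-write, not read-only
int deleted = reader.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
System.out.println("Documents deleted: " + deleted); // 0 means the term did not match anything
reader.close();                                      // closing commits the deletions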
I call IndexWriterConfig#setMaxBufferedDeleteTerms(1) during IndexWriter instantiation/configuration, and all delete operations go to disk immediately. Maybe it's not correct design-wise, but it solves the problem explained here.
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
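Note that IndexWriterConfig only arrived in Lucene 3.1; on the asker's 3.0.3 the equivalent setting lives on IndexWriter itself, so a hedged sketch of where that call would go in the method above is:
IndexWriter idxWriter = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.setMaxBufferedDeleteTerms(1);   // flush buffered delete terms right away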
