This is my first question on Stack Overflow, so wish me luck.
I am running a classification process over a Lucene index with Java, and I need to update a document field named category. I have been using Lucene 4.2 with the IndexWriter updateDocument() function for that purpose, and it works well except for the deletion part. Even if I call forceMergeDeletes() after the update, the index still shows some already-deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index stays the same and everything works as expected; but when I increase the index to 10000 documents, the index shows some (though not all) of the already-deleted documents. So, how can I actually erase those deleted documents from the index?
Here are some snippets of my code:
public static void main(String[] args) throws IOException, ParseException {
    /////////////////////// Preparing config data ////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);
    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());
    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");
    int maxdoc = reader.maxDoc();
    int j = 0;
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();
        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category", classified, Field.Store.YES));
            indexWriter.updateDocument(term, doc);
            j++;
        }
    }
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();
    System.out.println("Classified documents count: " + j);
    System.out.println();
    reader.close();
    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");
        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}
I have spent many hours looking for a solution to this. Some people say that the deleted documents in the index are not important and that they will eventually be erased as we keep adding documents to the index, but I need to control that process so that I can iterate over the index documents at any time and be sure that the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a method in the IndexReader class named isDeleted(docId) that tells you whether a document has been marked as deleted; that could be half of the solution to my problem, but I have not found a way to do this with Lucene 4.2. If you know how to do that, I would really appreciate it if you shared it.
You can check whether a document is deleted using the MultiFields class, like:
Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
So, working this into your code, perhaps something like:
int maxdoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
    // getLiveDocs returns null when the index has no deletions at all
    if (liveDocs != null && !liveDocs.get(i)) continue;
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}
By the way, it sounds like you have previously been working with 3.x and are now on 4.x. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions and how to resolve them.
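Putting those pieces together, here is a minimal sketch (my assumption, reusing the fsDir and indexWriter from the question's code): close the writer so updates and deletes are committed, reopen the reader, and iterate only over live documents.
// Minimal sketch (Lucene 4.2): reopen the reader after the writer has committed,
// then skip documents that are marked as deleted.
indexWriter.close();
DirectoryReader freshReader = DirectoryReader.open(fsDir);
Bits live = MultiFields.getLiveDocs(freshReader); // null means there are no deletions
for (int i = 0; i < freshReader.maxDoc(); i++) {
    if (live != null && !live.get(i)) {
        continue; // skip deleted documents
    }
    Document doc = freshReader.document(i);
    System.out.println("Id Doc: " + doc.get("doc_id"));
}
freshReader.close();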
I am trying to merge two docx files that each have their own bullet numbering; after merging the Word documents, the bullet numbers are automatically renumbered.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if (counter == 1) {
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    // Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    // Table of contents - End
} else {
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">" + ReqName + "</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    // Add page break before new content
    main.addObject(objBr);
    // Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them; see main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
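To make that concrete, here is a rough sketch of the XHTML route. It assumes docx4j-ImportXHTML is on the classpath; the AltChunkType.Xhtml value and the convertAltChunks() call are the pieces referred to above, and ReqName is the same variable used in the question's code.
// Rough sketch: add the heading as an XHTML altChunk instead of HTML, then let
// docx4j-ImportXHTML convert the chunks to native WordprocessingML content.
String xhtml = "<div><h2>" + ReqName + "</h2></div>";
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Xhtml,
        xhtml.getBytes());
// Convert all XHTML altChunks in the main document part
main.convertAltChunks();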
I was able to fix my issue using the following code, which I found at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code there; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception {
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    List<BlockRange> blockRanges = new ArrayList<BlockRange>();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}
I am trying to split a decent-sized document of 300 pages using the Apache PDFBox API v2.0.2.
While trying to split the PDF file into single pages using the following code:
PDDocument document = PDDocument.load(inputFile);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here
I receive the following exception
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
This indicates that the GC is spending an amount of time clearing the heap that is not justified by the amount of memory reclaimed.
There are numerous JVM tuning options that can work around the situation; however, all of these just treat the symptom and not the real issue.
One final note: I am using JDK 6, so using the new Java 8 Consumer is not an option in my case. Thanks.
Edit:
This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:
1. I do not have the size problem mentioned in that question. I am slicing a 270-page, 13.8 MB PDF file, and after slicing each slice averages about 80 KB, with a total size of 30.7 MB.
2. The split throws the exception even before it returns the split parts.
I found that the split can succeed as long as I do not pass the whole document at once; instead, I pass it in "batches" of 20-30 pages each, which does the job.
PDFBox stores the parts resulting from the split operation as PDDocument objects in the heap, which fills the heap quickly, and even if you call close() after every round of the loop, the GC still cannot reclaim heap space at the same rate it is being filled.
An option is to perform the document split operation in batches, where each batch is a relatively manageable chunk (10 to 40 pages).
public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);
        int start = 1;
        int end = 1;
        int batchSize = 50;
        int finalBatchSize = document.getNumberOfPages() % batchSize;
        int noOfBatches = document.getNumberOfPages() / batchSize;
        for (int i = 1; i <= noOfBatches; i++) {
            start = end;
            end = start + batchSize;
            System.out.println("Batch: " + i + " start: " + start + " end: " + end);
            split(document, start, end);
        }
        // handling the remaining pages
        start = end;
        end += finalBatchSize;
        System.out.println("Final Batch start: " + start + " end: " + end);
        split(document, start, end);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    List<File> fileList = new ArrayList<File>();
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    PDFTextStripper stripper = new PDFTextStripper();
    for (int index = 0; index < splittedDocuments.size(); index++) {
        String pdfFullPath = document.getDocumentInformation().getTitle() + index + start + ".pdf";
        PDDocument splittedDocument = splittedDocuments.get(index);
        splittedDocument.save(pdfFullPath);
    }
}
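A possible refinement of that inner loop, which is not in the original snippet but follows the close() point above, is to close each part as soon as it has been saved so its in-memory structures can be reclaimed before the next part is processed:
// Sketch (my suggestion): save each split part and close it immediately.
for (int index = 0; index < splittedDocuments.size(); index++) {
    PDDocument splittedDocument = splittedDocuments.get(index);
    String pdfFullPath = document.getDocumentInformation().getTitle() + index + start + ".pdf";
    splittedDocument.save(pdfFullPath);
    splittedDocument.close(); // release this part's in-memory structures before the next one
}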
I'm just a Lucene beginner and I got stuck on a problem while changing from a RAMDirectory to an FSDirectory.
First, my code:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));

Directory DIR = FSDirectory.open(new File(INDEXLOC)); // INDEXLOC = "path/to/dir/"
// RAMDirectory DIR = new RAMDirectory();

// Index some made up content
IndexWriter writer = new IndexWriter(DIR, iwc);

// Store both position and offset information
FieldType type = new FieldType();
type.setStored(true);
type.setStoreTermVectors(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setIndexed(true);
type.setTokenized(true);

IDocumentParser p = DocumentParserFactory.getParser(f);
ArrayList<ParserDocument> DOCS = p.getParsedDocuments();

for (int i = 0; i < DOCS.size(); i++) {
    Document doc = new Document();
    Field id = new StringField("id", "doc_" + i, Field.Store.YES);
    doc.add(id);
    Field text = new Field("content", DOCS.get(i).getContent(), type);
    doc.add(text);
    writer.addDocument(doc);
}
writer.close();

// Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(DIR));

// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "zahl"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = results.scoreDocs[i];
    System.out.println("Score Doc: " + scoreDoc);
}

IndexReader reader = searcher.getIndexReader();
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(), new Bits.MatchAllBits(reader.numDocs()), termContexts);

int window = 2; // get the words within two of the match
while (spans.next()) {
    Map<Integer, String> entries = new TreeMap<Integer, String>();
    System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    int start = spans.start() - window;
    int end = spans.end() + window;
    Terms content = reader.getTermVector(spans.doc(), "content");
    TermsEnum termsEnum = content.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // could store the BytesRef here, but String is easier for this example
        String s = new String(term.bytes, term.offset, term.length);
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int i = 0;
            int position = -1;
            while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
                if (position >= start && position <= end) {
                    entries.put(position, s);
                }
                i++;
            }
        }
    }
    System.out.println("Entries: " + entries);
}
It's just some code I found on a great website and wanted to try. Everything works great using the RAMDirectory, but if I change it to my FSDirectory it gives me a NullPointerException like:
Exception in thread "main" java.lang.NullPointerException at
com.org.test.TextDB.myMethod(TextDB.java:184) at
com.org.test.Main.main(Main.java:31)
The statement Terms content = reader.getTermVector(spans.doc(), "content"); seems to get no result and returns null, hence the exception. But why? With my RAMDirectory everything works fine.
It seems that the IndexWriter or the reader (I really don't know which) didn't write or read the field "content" properly from the index. But I really don't know why it is 'written' in a RAMDirectory and not in an FSDirectory.
Does anybody have an idea about that?
I gave this a quick test run, and I can't reproduce your issue.
I think the most likely issue here is old documents in your index. The way this is written, every time it is run, more documents are added to your index. Old documents from previous runs won't get deleted or overwritten; they'll just stick around. So, if you have run this before on the same directory, say before you added the line type.setStoreTermVectors(true);, some of your results may be these old documents without term vectors, and reader.getTermVector(...) returns null if the document does not store term vectors.
Of course, anything indexed in a RAMDirectory will be dropped as soon as execution finishes, so the issue would not occur in that case.
A simple solution would be to delete the index directory and run it again.
If you want to start with a fresh index when you run this, you can set that up through the IndexWriterConfig:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
That's a guess, of course, but seems consistent with the behavior you've described.
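If you do want to keep older documents around, a small defensive check (my sketch, not part of the original code) avoids the NullPointerException for documents that have no term vectors stored:
// Sketch: skip hits whose documents have no term vectors for "content"
// (e.g. documents indexed before type.setStoreTermVectors(true) was added).
Terms content = reader.getTermVector(spans.doc(), "content");
if (content == null) {
    continue; // inside the while (spans.next()) loop
}
TermsEnum termsEnum = content.iterator(null);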
I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. Create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. Query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // The "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. Search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // The searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem-based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
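For example, a minimal sketch of that change (assuming Lucene 3.5 as in the question; the index path is a placeholder of your choosing):
// Persist the index on disk instead of in RAM ("/path/to/index" is a placeholder).
Directory index = FSDirectory.open(new File("/path/to/index"));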
If you want to keep using RAMDirectory during searching (due to the performance benefits) but don't want the index to be built from scratch every time, you can first create your index using a filesystem-based directory like NIOFSDirectory (don't use it if you're on Windows). Then, come search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
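Roughly like this (again a sketch under the Lucene 3.5 assumption, with a placeholder path):
// Build or update the index on disk once...
Directory fsDir = new NIOFSDirectory(new File("/path/to/index"));
// ...then copy it into RAM at search time for faster reads.
Directory ramDir = new RAMDirectory(fsDir);
IndexReader reader = IndexReader.open(ramDir);
IndexSearcher searcher = new IndexSearcher(reader);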
The story is this: I want to mimic the behavior of a relational database using a Lucene index in Java. I need to be able to do searching (reading) and writing at the same time.
For example, I want to save Project information into an index. For simplicity, let's say that the project has 2 fields - id and name. Now, before adding a new project to the index, I'm searching if a project with a given id is already present. For this I'm using an IndexSearcher. This operation completes with success (namely the IndexSearcher returns the internal doc id for the document that contains the project id I'm looking for).
Now I want to actually read the value of this project ID, so I use an IndexReader to get the indexed Lucene document from which I can extract the project id field.
The problem is that the IndexReader returns a Document whose fields are all NULL. So, to repeat: the IndexSearcher works correctly, but the IndexReader returns bogus stuff.
I'm thinking that somehow this has to do with the document field data not getting saved to the hard disk when the IndexWriter is flushed. The thing is that the first time I do this indexing operation, the IndexReader works fine; however, after a restart of my application, the above-mentioned situation happens. So I'm thinking that the first time around the data floats in RAM but doesn't get flushed correctly (or completely, since the IndexSearcher works) to the hard drive.
Maybe it will help if I give you the source code, so here it is (you can safely ignore the tryGetIdFromMemory part; I'm using that as a speed optimization trick):
public class ProjectMetadataIndexer {

    private File indexFolder;
    private Directory directory;
    private IndexSearcher indexSearcher;
    private IndexReader indexReader;
    private IndexWriter indexWriter;
    private Version luceneVersion = Version.LUCENE_31;
    private Map<String, Integer> inMemoryIdHolder;
    private final int memoryCapacity = 10000;

    public ProjectMetadataIndexer() throws IOException {
        inMemoryIdHolder = new HashMap<String, Integer>();
        indexFolder = new File(ConfigurationSingleton.getInstance()
                .getProjectMetaIndexFolder());
        directory = FSDirectory.open(indexFolder);
        IndexWriterConfig config = new IndexWriterConfig(luceneVersion,
                new WhitespaceAnalyzer(luceneVersion));
        indexWriter = new IndexWriter(directory, config);
        indexReader = IndexReader.open(indexWriter, false);
        indexSearcher = new IndexSearcher(indexReader);
    }

    public int getProjectId(String projectName) throws IOException {
        int fromMemoryId = tryGetProjectIdFromMemory(projectName);
        if (fromMemoryId >= 0) {
            return fromMemoryId;
        } else {
            int projectId;
            Term projectNameTerm = new Term("projectName", projectName);
            TermQuery projectNameQuery = new TermQuery(projectNameTerm);
            BooleanQuery query = new BooleanQuery();
            query.add(projectNameQuery, Occur.MUST);
            TopDocs docs = indexSearcher.search(query, 1);
            if (docs.totalHits == 0) {
                projectId = IDStore.getInstance().getProjectId();
                indexMeta(projectId, projectName);
            } else {
                int internalId = docs.scoreDocs[0].doc;
                indexWriter.close();
                indexReader.close();
                indexSearcher.close();
                indexReader = IndexReader.open(directory);
                Document document = indexReader.document(internalId);
                List<Fieldable> fields = document.getFields();
                System.out.println(document.get("projectId"));
                projectId = Integer.valueOf(document.get("projectId"));
            }
            storeInMemory(projectName, projectId);
            return projectId;
        }
    }

    private int tryGetProjectIdFromMemory(String projectName) {
        String key = projectName;
        Integer id = inMemoryIdHolder.get(key);
        if (id == null) {
            return -1;
        } else {
            return id.intValue();
        }
    }

    private void storeInMemory(String projectName, int projectId) {
        if (inMemoryIdHolder.size() > memoryCapacity) {
            inMemoryIdHolder.clear();
        }
        String key = projectName;
        inMemoryIdHolder.put(key, projectId);
    }

    private void indexMeta(int projectId, String projectName)
            throws CorruptIndexException, IOException {
        Document document = new Document();
        Field idField = new Field("projectId", String.valueOf(projectId),
                Store.NO, Index.ANALYZED);
        document.add(idField);
        Field nameField = new Field("projectName", projectName, Store.NO,
                Index.ANALYZED);
        document.add(nameField);
        indexWriter.addDocument(document);
    }

    public void close() throws CorruptIndexException, IOException {
        indexReader.close();
        indexWriter.close();
    }
}
To be more precise, all the problems occur in this if:
if (docs.totalHits == 0) {
    projectId = IDStore.getInstance().getProjectId();
    indexMeta(projectId, projectName);
} else {
    int internalId = docs.scoreDocs[0].doc;
    Document document = indexReader.document(internalId);
    List<Fieldable> fields = document.getFields();
    System.out.println(document.get("projectId"));
    projectId = Integer.valueOf(document.get("projectId"));
}
On the else branch...
I don't know what is wrong.
Do you store the respective fields? If not, the fields are "only" kept in the inverted index part, i.e. the field value is mapped to the document, but the document itself doesn't contain the field value.
The part of the code where you save the document might be helpful.
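In this case, the indexMeta method above creates both fields with Store.NO, so the field values are never stored in the index. A minimal sketch of the fix (keeping Index.ANALYZED as in the question; only the Store flag changes):
// Store the field values so IndexReader.document(...) can return them later.
Field idField = new Field("projectId", String.valueOf(projectId),
        Store.YES, Index.ANALYZED);
document.add(idField);
Field nameField = new Field("projectName", projectName, Store.YES,
        Index.ANALYZED);
document.add(nameField);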
I had a hard time figuring out how to index/search numbers and I just wanted to say the following snippets of code really helped me out:
projectId = Integer.valueOf(document.get("projectId"));
////////////
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Thanks!