Lucene changing from RAMDirectory to FSDirectory - content field missing - Java

I'm just a Lucene starter and I got stuck on a problem while changing from a RAMDirectory to an FSDirectory:
First my code:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));

Directory DIR = FSDirectory.open(new File(INDEXLOC)); // INDEXLOC = "path/to/dir/"
// RAMDirectory DIR = new RAMDirectory();

// Index some made up content
IndexWriter writer = new IndexWriter(DIR, iwc);

// Store both position and offset information
FieldType type = new FieldType();
type.setStored(true);
type.setStoreTermVectors(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setIndexed(true);
type.setTokenized(true);

IDocumentParser p = DocumentParserFactory.getParser(f);
ArrayList<ParserDocument> DOCS = p.getParsedDocuments();

for (int i = 0; i < DOCS.size(); i++) {
    Document doc = new Document();
    Field id = new StringField("id", "doc_" + i, Field.Store.YES);
    doc.add(id);
    Field text = new Field("content", DOCS.get(i).getContent(), type);
    doc.add(text);
    writer.addDocument(doc);
}
writer.close();

// Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(DIR));

// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "zahl"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = results.scoreDocs[i];
    System.out.println("Score Doc: " + scoreDoc);
}

IndexReader reader = searcher.getIndexReader();
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(), new Bits.MatchAllBits(reader.numDocs()), termContexts);

int window = 2; // get the words within two of the match
while (spans.next() == true) {
    Map<Integer, String> entries = new TreeMap<Integer, String>();
    System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    int start = spans.start() - window;
    int end = spans.end() + window;
    Terms content = reader.getTermVector(spans.doc(), "content");
    TermsEnum termsEnum = content.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // could store the BytesRef here, but String is easier for this
        // example
        String s = new String(term.bytes, term.offset, term.length);
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int i = 0;
            int position = -1;
            while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
                if (position >= start && position <= end) {
                    entries.put(position, s);
                }
                i++;
            }
        }
    }
    System.out.println("Entries:" + entries);
}
It's just some code I found on a great website and wanted to try. Everything works great using the RAMDirectory, but if I change it to my FSDirectory it gives me a NullPointerException like:
Exception in thread "main" java.lang.NullPointerException at
com.org.test.TextDB.myMethod(TextDB.java:184) at
com.org.test.Main.main(Main.java:31)
The statement Terms content = reader.getTermVector(spans.doc(), "content"); seems to get no result and returns null, hence the exception. But why? With my RAMDirectory everything works fine.
It seems that the IndexWriter or the reader (I really don't know which) didn't write or didn't read the field "content" properly from the index. But I really don't know why it's 'written' in a RAMDirectory and not in an FSDirectory?!
Does anybody have an idea?

Gave this a quick test run, and I can't reproduce your issue.
I think the most likely issue here is old documents in your index. The way this is written, every time it is run, more documents are added to your index. Old documents from previous runs won't get deleted or overwritten; they'll just stick around. So, if you have run this before on the same directory, say, before you added the line type.setStoreTermVectors(true);, some of your results may be these old documents without term vectors, and reader.getTermVector(...) returns null if the document does not store term vectors.
Of course, anything indexed in a RAMDirectory will be dropped as soon as execution finishes, so the issue would not occur in that case.
A simple solution would be to delete the index directory and run it again.
If you want to start with a fresh index when you run this, you can set that up through the IndexWriterConfig:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
That's a guess, of course, but seems consistent with the behavior you've described.
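Separately, if you want the code to degrade gracefully instead of throwing, a small guard (my sketch, reusing the variables from the question's loop) would make any stale documents visible rather than fatal:
// Hypothetical guard for the loop body: getTermVector returns null when the
// matched document was indexed without term vectors for "content".
Terms content = reader.getTermVector(spans.doc(), "content");
if (content == null) {
    System.out.println("Doc " + spans.doc() + " has no term vector for 'content'"
            + " - probably indexed before setStoreTermVectors(true) was added.");
    continue; // skip this hit instead of hitting a NullPointerException
}
TermsEnum termsEnum = content.iterator(null);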

Related

Stop bullet numbering from being updated automatically when merging Word docs using docx4j

I am trying to merge 2 docx files which have their own bullet numbering; after merging the Word docs, the bullet numbers are automatically updated.
E.g:
Doc A has 1 2 3
Doc B has 1 2 3
After merging the bullet numbering are updated to be 1 2 3 4 5 6
How do I stop this?
I am using the following code:
if (counter == 1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    //Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">" + ReqName + "</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Add Page Break before new content
    main.addObject(objBr);
    //Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them via main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
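Not tested against your document, but roughly what that alternative looks like, assuming docx4j-ImportXHTML is on the classpath (convertAltChunks() relies on it for XHTML chunks) and reusing the ReqName variable from your code:
// Add the chunk as XHTML instead of HTML, then convert it in code rather than
// leaving the conversion (and the list renumbering) to Word.
String xhtml = "<html><body><h2>" + ReqName + "</h2></body></html>";
main.addAltChunk(
        org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Xhtml,
        xhtml.getBytes(java.nio.charset.StandardCharsets.UTF_8));
// Converts the XHTML altChunks to real docx content while the package is still
// in memory, so you can inspect or adjust the numbering before saving.
main.convertAltChunks();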
I was able to fix my issue using the following code, which I found at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception {
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    List<BlockRange> blockRanges = new ArrayList<BlockRange>();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

How to keep Lucene index without deleted documents

This is my first question on Stack Overflow, so wish me luck.
I am doing a classification process over a Lucene index with Java and I need to update a document field named category. I have been using Lucene 4.2 with the IndexWriter updateDocument() function for that purpose, and it's working very well, except for the deletion part. Even if I use the forceMergeDeletes() function after the update, the index shows me some already-deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index remains the same and works as expected, but when I increase the index to 10000 documents, the index shows some already-deleted documents, but not all. So, how can I actually erase those deleted documents from the index?
Here are some snippets of my code:
public static void main(String[] args) throws IOException, ParseException {
    ///////////////////////Preparing config data////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);
    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);
    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());
    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");
    int maxdoc = reader.maxDoc();
    int j = 0;
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();
        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category",
                    classified, Field.Store.YES));
            indexWriter.updateDocument(term, doc);
            j++;
        }
    }
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();
    System.out.println("Classified documents count: " + j);
    System.out.println();
    reader.close();
    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");
        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}
I have spent many hours looking for a solution to this. Some say that the deleted documents in the index are not important and that eventually they will be erased as we keep adding documents to the index, but I need to control that process in such a way that I can iterate over the index documents at any time and be sure that the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a function in the IndexReader class named isDeleted(docId) that tells whether a document has been marked as deleted; that could be half of the solution to my problem, but I have not found a way to do that with version 4.2 of Lucene. If you know how to do that, I would really appreciate it if you shared it.
You can check whether a document is deleted using the MultiFields class, like:
Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
So, working this into your code, perhaps something like:
int maxdoc = reader.maxDoc();
// Note: getLiveDocs returns null when the index has no deletions at all.
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
    if (liveDocs != null && !liveDocs.get(i)) continue;
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}
By the way, it sounds like you have previously been working with 3.x and are now on 4.x. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions, and how to resolve them.
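If you miss the old isDeleted(docId) style of check, a small helper along these lines (my sketch, not an existing Lucene 4.x API) wraps the same live-docs lookup:
// Rough 4.x equivalent of the pre-4.0 IndexReader.isDeleted(docId):
// a document is deleted if the live-docs bitset exists and excludes it.
private static boolean isDeleted(IndexReader reader, int docID) {
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    return liveDocs != null && !liveDocs.get(docID);
}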

OpenOffice 1.1.4 export to PDF with background images using Java

I have a problem that needs solving: we use OpenOffice 1.1.4 templated reports and programmatically export them to PDF.
The team who create the templates have recently changed the header image and some images in a table to background images (before, they were just inserted). Since this change, the current program is not including the images in the PDFs it creates. We can export from OpenOffice manually and the images are included. Can anyone help with a change I may need to make to get these background images included, please?
The current code:
private void print(XInterface xComponent,
        PrintRequestDTO printReq, File sourceFile,
        Vector<String> pages) throws java.lang.Exception {
    String pageRange;
    // XXX create the PDF via OOo export facility
    com.sun.star.frame.XStorable pdfCreator = (com.sun.star.frame.XStorable) UnoRuntime
            .queryInterface(
                    com.sun.star.frame.XStorable.class,
                    xComponent);
    PropertyValue[] outputOpts = new PropertyValue[2];
    outputOpts[0] = new PropertyValue();
    outputOpts[0].Name = "CompressionMode";
    outputOpts[0].Value = "1"; // XXX Change this perhaps?
    outputOpts[1] = new PropertyValue();
    outputOpts[1].Name = "PageRange";
    if (printReq.getPageRange() == null) {
        pageRange = "1-";
    }
    else {
        if (printReq.getPageRange().length() > 0) {
            pageRange = printReq.getPageRange();
        }
        else {
            pageRange = "1-";
        }
    }
    log.debug("Print Instruction - page range = "
            + pageRange);
    PropertyValue[] filterOpts = new PropertyValue[3];
    filterOpts[0] = new PropertyValue();
    filterOpts[0].Name = "FilterName";
    filterOpts[0].Value = "writer_pdf_Export"; // MS Word 97
    filterOpts[1] = new PropertyValue();
    filterOpts[1].Name = "Overwrite";
    filterOpts[1].Value = new Boolean(true);
    filterOpts[2] = new PropertyValue();
    filterOpts[2].Name = "FilterData";
    filterOpts[2].Value = outputOpts;
    if (pages.size() == 0) { // ie no forced page breaks
        // set page range
        outputOpts[1].Value = pageRange;
        filterOpts[2] = new PropertyValue();
        filterOpts[2].Name = "FilterData";
        filterOpts[2].Value = outputOpts;
        File outputFile = new File(
                sourceFile.getParent(),
                printReq.getOutputFileName()
                        + ".pdf");
        StringBuffer sPDFUrl = new StringBuffer(
                "file:///");
        sPDFUrl.append(outputFile.getCanonicalPath()
                .replace('\\', '/'));
        log.debug("PDF file = " + sPDFUrl.toString());
        if (pdfCreator != null) {
            sleep();
            pdfCreator.storeToURL(sPDFUrl.toString(),
                    filterOpts);
        }
    }
    else if (pages.size() > 1) {
        throw new PrintDocumentException(
                "Only one forced split catered for currently");
    }
    else { // a forced split exists.
        log.debug("Page break found in "
                + (String) pages.firstElement());
        String[] newPageRanges = calculatePageRanges(
                (String) pages.firstElement(), pageRange);
        int rangeCount = newPageRanges.length;
        for (int i = 0; i < rangeCount; i++) {
            outputOpts[1].Value = newPageRanges[i];
            log.debug("page range = " + newPageRanges[i]);
            filterOpts[2] = new PropertyValue();
            filterOpts[2].Name = "FilterData";
            filterOpts[2].Value = outputOpts;
            String fileExtension = (i == 0 && rangeCount > 1) ? "__Summary.pdf"
                    : ".pdf";
            File outputFile = new File(
                    sourceFile.getParent(),
                    printReq.getOutputFileName()
                            + fileExtension);
            StringBuffer sPDFUrl = new StringBuffer(
                    "file:///");
            sPDFUrl.append(outputFile.getCanonicalPath()
                    .replace('\\', '/'));
            log.debug("PDF file = " + sPDFUrl.toString());
            if (pdfCreator != null) {
                log.debug("about to create the PDF file");
                sleep();
                pdfCreator.storeToURL(
                        sPDFUrl.toString(), filterOpts);
                log.debug("done");
            }
        }
    }
}
Thanks in advance.
Glad that the suggestion of making the document visible helped. Since it has ALSO fixed the problem, you have a timing/threading issue. I suspect you'll find that the other dodgy option, doing a sleep before executing the save to PDF, will also allow the images to appear. Neither of these solutions is good.
The most likely best fix is to upgrade to a newer version of OpenOffice (the API calls you have should still work). Another option would be to call the API to ask the document to refresh itself.
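A rough sketch of that refresh idea, assuming the same xComponent you pass into print() and not verified against OpenOffice 1.1.4:
// Ask the loaded document to refresh itself before exporting, so deferred
// layout work such as background images is (hopefully) completed first.
com.sun.star.util.XRefreshable refreshable =
        (com.sun.star.util.XRefreshable) UnoRuntime.queryInterface(
                com.sun.star.util.XRefreshable.class, xComponent);
if (refreshable != null) {
    refreshable.refresh();
}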
After finding the correct property I was able to open the file with the Hidden property set to false; this meant that when the file was exported to PDF it included the background images. It's a shame I could not find another solution that kept the file hidden, but at least it's working.
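For reference, a minimal sketch of what that load call can look like; the xComponentLoader variable and the document URL are placeholders, since the loading code isn't shown in the question:
// Load the document visibly (Hidden = false) so rendering-dependent content,
// such as background images, is available when the PDF export runs.
PropertyValue[] loadProps = new PropertyValue[1];
loadProps[0] = new PropertyValue();
loadProps[0].Name = "Hidden";
loadProps[0].Value = Boolean.FALSE;
XComponent xComponent = xComponentLoader.loadComponentFromURL(
        "file:///path/to/report-template.sxw", "_blank", 0, loadProps);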

How to persist the Lucene document index so that the documents do not need to be loaded into it each time the program starts up?

I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
If you want to keep using RAMDirectory during searching (for the performance benefits) but don't want the index to be built from scratch every time, you can first create your index using a file-system-based directory like NIOFSDirectory (don't use it if you're on Windows). Then, come search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
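A minimal sketch of that pattern against the HelloLucene code above; the index path is a placeholder, and the RAMDirectory(Directory) constructor is the 3.x one mentioned in the answer:
// 1. Build (or update) the index on disk so it survives restarts.
Directory diskIndex = FSDirectory.open(new File("/path/to/lucene-index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(diskIndex, config);
addDoc(w, "Lucene in Action");
w.close();

// 2. At search time, optionally pull a copy of the on-disk index into RAM
//    for faster queries, then search it exactly as before.
Directory searchIndex = new RAMDirectory(diskIndex);
IndexReader reader = IndexReader.open(searchIndex);
IndexSearcher searcher = new IndexSearcher(reader);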

How to get the Index Size in Solr using Java

I need to get the total size of an index in Apache Solr using Java. The following code gets the total number of documents, but I am looking for the size. Using ReplicationHandler, I was thinking I could get the index size, as suggested in this link: http://lucene.472066.n3.nabble.com/cheking-the-size-of-the-index-using-solrj-API-s-td692686.html but I am not getting the index size.
BufferedWriter out1 = null;
FileWriter fstream1 = new FileWriter("src/test/resources/solr-document-id-desc.txt");
out1 = new BufferedWriter(fstream1);
ApplicationContext context = null;
context = new ClassPathXmlApplicationContext("application-context.xml");
CommonsHttpSolrServer solrServer = (CommonsHttpSolrServer) context.getBean("solrServer");
SolrQuery solrQuery = new SolrQuery().setQuery("*:*");
QueryResponse rsp = solrServer.query(solrQuery);
//I am trying to use replicationhandler but I am not able to get the index size using statistics. Is there any way to get the index size..?
ReplicationHandler handler2 = new ReplicationHandler();
System.out.println( handler2.getDescription());
NamedList statistics = handler2.getStatistics();
System.out.println("Statistics "+ statistics);
System.out.println(rsp.getResults().getNumFound());
Iterator<SolrDocument> iter = rsp.getResults().iterator();
while (iter.hasNext()) {
    SolrDocument resultDoc = iter.next();
    System.out.println(resultDoc.getFieldNames());
    String id = (String) resultDoc.getFieldValue("numFound");
    String description = (String) resultDoc.getFieldValue("description");
    System.out.println(id + "~~" + description);
    out1.write(id + "~~" + description);
    out1.newLine();
}
out1.close();
Any suggestions will be appreciated.
Updated code:
ReplicationHandler handler2 = new ReplicationHandler();
System.out.println( handler2.getDescription());
NamedList statistics = handler2.getStatistics();
System.out.println("Statistics "+ statistics.get("indexSize"));
The index size is available with the statistics in ReplicationHandler (org.apache.solr.handler.ReplicationHandler); the relevant code is:
public NamedList getStatistics() {
    NamedList list = super.getStatistics();
    if (core != null) {
        list.add("indexSize", NumberUtils.readableSize(getIndexSize()));
    }
    return list;
}
You can use the URL http://localhost:8983/solr/replication?command=details, which returns the index size.
<lst name="details">
<str name="indexSize">26.13 KB</str>
.....
</lst>
Not sure if this works by instantiating ReplicationHandler directly, as it would need a reference to the core and the index.
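If you want the value from Java without instantiating the handler, one option (a sketch; whether the qt parameter routes to /replication this way depends on your solrconfig, otherwise just fetch the URL above directly) is to go through the SolrJ server you already have:
// Ask the ReplicationHandler running inside Solr for its details, instead of
// creating a new, core-less ReplicationHandler in the client JVM.
SolrQuery detailsQuery = new SolrQuery();
detailsQuery.set("qt", "/replication");
detailsQuery.set("command", "details");
QueryResponse detailsRsp = solrServer.query(detailsQuery);
NamedList<?> details = (NamedList<?>) detailsRsp.getResponse().get("details");
System.out.println("Index size: " + details.get("indexSize"));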
You can run the following command in the index data directory:
du -kx
As said in this post, you can use the MAT tool to see the memory consumption. I think you could use it in your code. Enjoy Solr!
