Searching for UUID in lucene not working - java

I've got a UUID field I'm adding to my document in the following format: 372d325c-e01b-432f-98bd-bc4c949f15b8. However, when I try to query for documents by the UUID it will not return them no matter how I try to escape the expression. For example:
+uuid:372d325c-e01b-432f-98bd-bc4c949f15b8
+uuid:"372d325c-e01b-432f-98bd-bc4c949f15b8"
+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8
+uuid:(372d325c-e01b-432f-98bd-bc4c949f15b8)
+uuid:("372d325c-e01b-432f-98bd-bc4c949f15b8")
And even skipping the QueryParser altogether using TermQuery like so:
new TermQuery(new Term("uuid", uuid.toString()))
Or
new TermQuery(new Term("uuid", QueryParser.escape(uuid.toString())))
None of these searches will return a document, but if I search for portions of the UUID it will return a document. For example these will return something:
+uuid:372d325c
+uuid:e01b
+uuid:432f
What should I do to index these documents so I can pull them back by their UUID? I've considered reformatting the UUID to remove the hyphens, but I haven't implemented it yet.

The only way I got this to work is to use WhitespaceAnalyzer instead of StandardAnalyzer. Then using a TermQuery like so:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36))
.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
writer = new IndexWriter( directory, config);
Then searching:
TopDocs docs = searcher.search(new TermQuery(new Term("uuid", uuid.toString())), 1);
WhitespaceAnalyzer prevented Lucene from splitting apart the UUID by the hyphens. Another option could be to eliminate the dashes from the UUID, but using the WhitespaceAnalyzer works just as well for my purposes.

According to the Lucene Query Syntax rules, the query
+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8
should work.
I guess that if it don't, that is because the uuid field is not populated as it should when the document is inserted in the index. Could you make sure of what exactly is inserted for this field? You can use Luke to crawl the index and look for the actual values stored for the uuid field.

If you plan to a UUID field as a lookup key, you will need to ask Lucene to index the whole field as a single string without doing tokenization. This is done by setting the right FieldType for your UUID field. In Lucene 4+, you can use StringField.
import java.io.IOException;
import java.util.UUID;
import junit.framework.Assert;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
/**
* Using Lucene 4.7 on Java 7.
*/
public class LuceneUUIDFieldLookupTest {
private Directory directory;
private Analyzer analyzer;
#Test
public void testUsingUUIDAsLookupKey() throws IOException, ParseException {
directory = new RAMDirectory();
analyzer = new StandardAnalyzer(Version.LUCENE_47);
UUID docUUID = UUID.randomUUID();
String docContentText1 = "Stack Overflow is a question and answer site for professional and enthusiast programmers.";
index(docUUID, docContentText1);
QueryParser parser = new QueryParser(Version.LUCENE_47, MyIndexedFields.DOC_TEXT_FIELD.name(), analyzer);
Query queryForProgrammer = parser.parse("programmers");
IndexSearcher indexSearcher = getIndexSearcher();
TopDocs hits = indexSearcher.search(queryForProgrammer, Integer.MAX_VALUE);
Assert.assertTrue(hits.scoreDocs.length == 1);
Integer internalDocId1 = hits.scoreDocs[0].doc;
Document docRetrieved1 = indexSearcher.doc(internalDocId1);
indexSearcher.getIndexReader().close();
String docText1 = docRetrieved1.get(MyIndexedFields.DOC_TEXT_FIELD.name());
Assert.assertEquals(docText1, docContentText1);
String docContentText2 = "TechCrunch is a leading technology media property, dedicated to ... according to a new report from the Wall Street Journal confirmed by Google to TechCrunch.";
reindex(docUUID, docContentText2);
Query queryForTechCrunch = parser.parse("technology");
indexSearcher = getIndexSearcher(); //you must reopen directory because the previous IndexSearcher only sees a snapshoted directory.
hits = indexSearcher.search(queryForTechCrunch, Integer.MAX_VALUE);
Assert.assertTrue(hits.scoreDocs.length == 1);
Integer internalDocId2 = hits.scoreDocs[0].doc;
Document docRetrieved2 = indexSearcher.doc(internalDocId2);
indexSearcher.getIndexReader().close();
String docText2 = docRetrieved2.get(MyIndexedFields.DOC_TEXT_FIELD.name());
Assert.assertEquals(docText2, docContentText2);
}
private void reindex(UUID myUUID, String docContentText) throws IOException {
try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
Term term = new Term(MyIndexedFields.MY_UUID_FIELD.name(), myUUID.toString());
indexWriter.updateDocument(term, buildDoc(myUUID, docContentText));
}//auto-close
}
private void index(UUID myUUID, String docContentText) throws IOException {
try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
indexWriter.addDocument(buildDoc(myUUID, docContentText));
}//auto-close
}
private IndexWriterConfig getIndexWriterConfig() {
return new IndexWriterConfig(Version.LUCENE_47, analyzer).setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
}
private Document buildDoc(UUID myUUID, String docContentText) {
Document doc = new Document();
doc.add(new Field(
MyIndexedFields.MY_UUID_FIELD.name(),
myUUID.toString(),
StringField.TYPE_STORED));//use TYPE_STORED if you want to read it back in search result.
doc.add(new Field(
MyIndexedFields.DOC_TEXT_FIELD.name(),
docContentText,
TextField.TYPE_STORED));
return doc;
}
private IndexSearcher getIndexSearcher() throws IOException {
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(ireader);
return indexSearcher;
}
enum MyIndexedFields {
MY_UUID_FIELD,
DOC_TEXT_FIELD
}
}

Related

Java Apache POI newlines randomly inserted into generated page?

I am working on a project for fun that handles creating my custom letterhead. I'm using Apache POI to handle word documents. I plan on expanding it once I have the base framework done to add a GUI using AWT and allowing for customization through it, which explains how I have some things set up in code. I am getting some really bizarre results when trying to format my header, it appears that Apache POI is inserting newlines where it wants to? I think I am not understanding something.
CreateDocument.java
package letterHeader;
import java.io.File;
import java.io.FileOutputStream;
import java.math.BigInteger;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPageMar;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr;
public class CreateDocument
{
// create blank document
static XWPFDocument document = new XWPFDocument();
// create paragraphs for header
static XWPFParagraph name = document.createParagraph();
static XWPFParagraph address = document.createParagraph();
static ArrayList<XWPFParagraph> phoneNumbers = new ArrayList<XWPFParagraph>();
static ArrayList<XWPFParagraph> emails = new ArrayList<XWPFParagraph>();
static XWPFParagraph date = document.createParagraph();
// create runner objects
static XWPFRun nameRunner = name.createRun();
static XWPFRun addressRunner = address.createRun();
// remember to make a runner for each email
static ArrayList<XWPFRun> phoneRunners = new ArrayList<XWPFRun>();
static ArrayList<XWPFRun> emailRunners = new ArrayList<XWPFRun>();
public static void main(String[] args) throws Exception
{
// make datetime for timestamp
DateFormat dateFormat = new SimpleDateFormat("MM:dd:yyyy");
Date date = new Date();
// create IO stream with document name
FileOutputStream out = new FileOutputStream( new File("letterhead" + dateFormat.format(date) + ".docx"));
CTSectPr sectPr = document.getDocument().getBody().addNewSectPr();
CTPageMar pageMar = sectPr.addNewPgMar();
pageMar.setLeft(BigInteger.valueOf(720L));
pageMar.setTop(BigInteger.valueOf(720L));
pageMar.setRight(BigInteger.valueOf(720L));
pageMar.setBottom(BigInteger.valueOf(720L));
// storing the ID automatically makes the objects
int phoneID = addListParagraph("phone");
int emailID1 = addListParagraph("email");
int emailID2 = addListParagraph("email");
// make name
nameRunner.setText("Michael Simanski");
nameRunner.setBold(true);
nameRunner.setFontSize(18);
nameRunner.setFontFamily("Times");
// make address
addressRunner.setText("address");
addressRunner.setFontSize(12);
addressRunner.setFontFamily("Times");
// make phone
phoneRunners.get(phoneID).setText("phone");
phoneRunners.get(phoneID).setFontSize(12);
phoneRunners.get(phoneID).setFontFamily("Times");
// make emails
emailRunners.get(emailID1).setText("mfsimanski#gmail.com");
emailRunners.get(emailID1).setFontSize(12);
emailRunners.get(emailID1).setFontFamily("Times");
emailRunners.get(emailID2).setText("secondemail");
emailRunners.get(emailID2).setFontSize(12);
emailRunners.get(emailID2).setFontFamily("Times");
emailRunners.get(emailID2).addCarriageReturn();
document.write(out);
out.close();
}
public static int addListParagraph(String type)
{
switch (type)
{
case "phone":
phoneNumbers.add(document.createParagraph());
phoneRunners.add(phoneNumbers.get(phoneNumbers.size() - 1).createRun());
return phoneNumbers.size() - 1;
case "email":
emails.add(document.createParagraph());
emailRunners.add(emails.get(emails.size() - 1).createRun());
return emails.size() - 1;
default:
System.out.println("ERROR: Paragraph type not found!");
return 0;
}
}
}
Pretty simple right? I expect the following results:
letterhead07/08/2019.docx
Michael Simanski
address
phone
mfsimanski#gmail.com
secondemail
But perplexingly I get:
letterhead07/08/2019.docx
Michael Simanski
address
phone
mfsimanski#gmail.com
secondemail
Am I missing something, stupid, or both?

Apache Tika and Apache Solr integration through Java API

I am trying to integrate Apache tika and Apache Solr so that I can index my parse data. I'm using Solr version 4.3.1 and Tika version as 2.11.6.
The code which I am following are like:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Main {
private static SolrServer solr;
public static void main(String[] args) throws IOException, SAXException, TikaException {
try {
solr = new HttpSolrServer("http://localhost:8983/solr/#/"); //create solr connection
//solr.deleteByQuery( "*:*" ); //delete everything in the index; good for testing
//location of source documents
//later this will be switched to a database
String path = "C:\\content\\";
String file_html = path + "mobydick.htm";
String file_txt = path + "/home/ben/abc.warc";
String file_pdf = path + "callofthewild.pdf";
processDocument(file_html);
processDocument(file_txt);
processDocument(file_pdf);
solr.commit(); //after all docs are added, commit to the index
//now you can search at http://localhost:8983/solr/browse
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void processDocument(String pathfilename) {
try {
InputStream input = new FileInputStream(new File(pathfilename));
//use Apache Tika to convert documents in different formats to plain text
ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
Metadata meta = new Metadata();
Parser parser = new AutoDetectParser();
//handles documents in different formats:
ParseContext context = new ParseContext();
parser.parse(input, textHandler, meta, context); //convert to plain text
//collect metadata and content from Tika and other sources
//document id must be unique, use guide
UUID guid = java.util.UUID.randomUUID();
String docid = guid.toString();
//Dublin Core metadata (partial set)
String doctitle = meta.get(DublinCore.TITLE);
String doccreator = meta.get(DublinCore.CREATOR);
//other metadata
String docurl = pathfilename; //document url
//content
String doccontent = textHandler.toString();
//call to index
indexDocument(docid, doctitle, doccreator, docurl, doccontent);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private static void indexDocument(String docid, String doctitle, String doccreator, String docurl, String doccontent) {
try {
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", docid);
//map metadata fields to default schema
//location: path\solr-4.7.2\example\solr\collection1\conf\schema.xml
//Dublin Core
//thought: schema could be modified to use Dublin Core
doc.addField("title", doctitle);
doc.addField("author", doccreator);
//other metadata
doc.addField("url", docurl);
//content (and text)
//per schema, the content field is not indexed by default, used for returning and highlighting document content
//the schema "copyField" command automatically copies this to the "text" field which is indexed
doc.addField("content", doccontent);
//indexing
//when a field is indexed, like "text", Solr will handle tokenization, stemming, removal of stopwords etc, per the schema defn
//add to index
solr.add(doc);
}
catch (Exception ex) {
System.out.println(ex.getMessage());
} } }
The Error I got
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException
at Main.main(Main.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.http.NoHttpResponseException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
This does not seem to be Tika related, so I would focus on Solr.
Specifically on how you included Solr libraries and its dependencies. If you pulled them through Maven, they should have worked. But if you did it manually, perhaps you missed one or two.
Specifically, the error message is about a missing class that is distributed with Apache Commons HTTP (client) library. Perhaps are missing it in dependencies or on the classpath.

Doing an MLT (More Like This) Query in Solrj

I am using the recent Solr 4.2.1 solrj libraries.
I am trying to execute an MLT Query from a java program. It works fine as long as I only provide small chunks in the stream.body, but that kind of defeats my purpose.
When I try to use the ContentStream I don't get a response back, when I do the solr.query, it makes another request.
It looks like the server is getting my solr.request() ok. Appreciate any pointers.
Oh, and I am talking to a solr 3.6.1
Here is what I have so far:
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.client.solrj.*;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.*;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.MalformedURLException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.util.ClientUtils;
public class SolrJSearcher {
public static void main(String[] args) throws MalformedURLException, SolrServerException {
HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
ModifiableSolrParams params = new ModifiableSolrParams();
String mltv[] = {"Big bunch of text for testing - redacted for brevity"};
String dvalues[] = {"mlt"};
String svalues[] = {"0"};
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/mlt");
ContentStream cs = new ContentStreamBase.StringStream(mltv[0]);
up.addContentStream( cs);
SolrQuery theQuery = new SolrQuery();;
theQuery.set("qt", dvalues);
up.setParam("start", "0");
try {
solr.request(up);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
QueryResponse response = solr.query(theQuery);
SolrDocumentList results = response.getResults();
for (int i = 0; i < results.size(); ++i) {
System.out.println(results.get(i));
}
}
}
As far as I know, MoreLikeThis is meant to find documents similar to a document already in the index. If you're searching documents similar to an input string, then just insert a temporary item in your index before you do the query, and remove it afterwards.
I've been using the following successfully:
/*
* Build up a MoreLikeThis query to retrieve documents
* similar to the one with id originalId
*/
private SolrQuery buildUpMoreLikeThisQuery(String field3, String originalId) {
SolrQuery query = new SolrQuery();
query.setQueryType("/" + MoreLikeThisParams.MLT);
query.set(MoreLikeThisParams.MATCH_INCLUDE, true);
query.set(MoreLikeThisParams.MIN_DOC_FREQ, 1);
query.set(MoreLikeThisParams.MIN_TERM_FREQ, 1);
query.set(MoreLikeThisParams.MIN_WORD_LEN, 7);
query.set(MoreLikeThisParams.BOOST, false);
query.set(MoreLikeThisParams.MAX_QUERY_TERMS, 1000);
query.set(MoreLikeThisParams.SIMILARITY_FIELDS,
"field1,field2");
query.setQuery("id:" + originalId);
query.set("fl", "id,score");
query.addFilterQuery("field3:" + field3);
int maxResults = 20;
query.setRows(maxResults);
return query;
}

Merging MS Word documents with Java

I'm looking for java libraries that read and write MS Word Document.
What I have to do is:
read a template file, .dot or .doc, and fill it with some data read from DB
take data from another Word document and merging that with the file described above, preserving paragraphs formats
users may make updates to the file.
I've searched and found POI Apache and UNO OpenOffice.
The first one can easily read a template and replace any placeholders with my own data from DB. I didn't found anything about merging two, or more, documents.
OpenOffice UNO looks more stable but complex too. Furthermore I'm not sure that it has the ability to merge documents..
We are looking the right direction?
Another solution i've thought was to convert doc file to docx. In that way I found more libraries that can help us merging documents.
But how can I do that?
Thanks!
You could take a look at Docmosis since it provides the four features you have mentioned (data population, template/document merging, DOC format and java interface). It has a couple of flavours (download, online service), but you could sign up for a free trial of the cloud service to see if Docmosis can do what you want (then you don't have to install anything) or read the online documentation.
It uses OpenOffice under the hood (you can see from the developer guide installation instructions) which does pretty decent conversions between documents. The UNO API has some complications - I would suggest either Docmosis or JODReports to isolate your project from UNO directly.
Hope that helps.
import java.io.File;
import java.util.List;
import javax.xml.bind.JAXBException;
import org.docx4j.dml.CTBlip;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageBmpPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageEpsPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageGifPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageJpegPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImagePngPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageTiffPart;
import org.docx4j.openpackaging.parts.relationships.RelationshipsPart;
import org.docx4j.openpackaging.parts.relationships.RelationshipsPart.AddPartBehaviour;
import org.docx4j.relationships.Relationship;
public class MultipleDocMerge {
public static void main(String[] args) throws Docx4JException, JAXBException {
File first = new File("D:\\Mreg.docx");
File second = new File("D:\\Mreg1.docx");
File third = new File("D:\\Mreg4&19.docx");
File fourth = new File("D:\\test12.docx");
WordprocessingMLPackage f = WordprocessingMLPackage.load(first);
WordprocessingMLPackage s = WordprocessingMLPackage.load(second);
WordprocessingMLPackage a = WordprocessingMLPackage.load(third);
WordprocessingMLPackage e = WordprocessingMLPackage.load(fourth);
List body = s.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
for(Object b : body){
List filhos = ((org.docx4j.wml.Body)b).getContent();
for(Object k : filhos)
f.getMainDocumentPart().addObject(k);
}
List body1 = a.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
for(Object b : body1){
List filhos = ((org.docx4j.wml.Body)b).getContent();
for(Object k : filhos)
f.getMainDocumentPart().addObject(k);
}
List body2 = e.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
for(Object b : body2){
List filhos = ((org.docx4j.wml.Body)b).getContent();
for(Object k : filhos)
f.getMainDocumentPart().addObject(k);
}
List<Object> blips = e.getMainDocumentPart().getJAXBNodesViaXPath("//a:blip", false);
for(Object el : blips){
try {
CTBlip blip = (CTBlip) el;
RelationshipsPart parts = e.getMainDocumentPart().getRelationshipsPart();
Relationship rel = parts.getRelationshipByID(blip.getEmbed());
Part part = parts.getPart(rel);
if(part instanceof ImagePngPart)
System.out.println(((ImagePngPart) part).getBytes());
if(part instanceof ImageJpegPart)
System.out.println(((ImageJpegPart) part).getBytes());
if(part instanceof ImageBmpPart)
System.out.println(((ImageBmpPart) part).getBytes());
if(part instanceof ImageGifPart)
System.out.println(((ImageGifPart) part).getBytes());
if(part instanceof ImageEpsPart)
System.out.println(((ImageEpsPart) part).getBytes());
if(part instanceof ImageTiffPart)
System.out.println(((ImageTiffPart) part).getBytes());
Relationship newrel = f.getMainDocumentPart().addTargetPart(part,AddPartBehaviour.RENAME_IF_NAME_EXISTS);
blip.setEmbed(newrel.getId());
f.getMainDocumentPart().addTargetPart(e.getParts().getParts().get(new PartName("/word/"+rel.getTarget())));
} catch (Exception ex){
ex.printStackTrace();
} }
File saved = new File("D:\\saved1.docx");
f.save(saved);
}
}
I've developed the next class (using Apache POI):
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBody;
public class WordMerge {
private final OutputStream result;
private final List<InputStream> inputs;
private XWPFDocument first;
public WordMerge(OutputStream result) {
this.result = result;
inputs = new ArrayList<>();
}
public void add(InputStream stream) throws Exception{
inputs.add(stream);
OPCPackage srcPackage = OPCPackage.open(stream);
XWPFDocument src1Document = new XWPFDocument(srcPackage);
if(inputs.size() == 1){
first = src1Document;
} else {
CTBody srcBody = src1Document.getDocument().getBody();
first.getDocument().addNewBody().set(srcBody);
}
}
public void doMerge() throws Exception{
first.write(result);
}
public void close() throws Exception{
result.flush();
result.close();
for (InputStream input : inputs) {
input.close();
}
}
}
And its use:
public static void main(String[] args) throws Exception {
FileOutputStream faos = new FileOutputStream("/home/victor/result.docx");
WordMerge wm = new WordMerge(faos);
wm.add( new FileInputStream("/home/victor/001.docx") );
wm.add( new FileInputStream("/home/victor/002.docx") );
wm.doMerge();
wm.close();
}
The Apache POI code does not work for Images.

RegEx matching using Lucene

I would like to find "Bug reports" with Lucene using a regular expression, but whenever I try it doesn't work.
I used the code from the Lucene page to avoid a bad setup.
Here is my code:
import java.util.regex.Pattern;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
import org.apache.lucene.search.regex.RegexCapabilities;
import org.apache.lucene.search.regex.RegexQuery;
import org.apache.lucene.store.RAMDirectory;
public class Rege {
private static IndexSearcher searcher;
private static final String FN = "field";
public static void main(String[] args) throws Exception {
RAMDirectory directory = new RAMDirectory();
try {
IndexWriter writer = new IndexWriter(directory,
new SimpleAnalyzer(), true,
IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
doc
.add(new Field(
FN,
"[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.optimize();
writer.close();
searcher = new IndexSearcher(directory, true);
} catch (Exception e) {
e.printStackTrace();
}
System.err.println(regexQueryNrHits("bug [0-9]+",null));
}
private static Term newTerm(String value) {
return new Term(FN, value);
}
private static int regexQueryNrHits(String regex,
RegexCapabilities capability) throws Exception {
RegexQuery query = new RegexQuery(newTerm(regex));
if (capability != null)
query.setRegexImplementation(capability);
return searcher.search(query, null, 1000).totalHits;
}
}
I would expect bug [0-9]+ to return 1 but it doesn't. I also tested the regex with Java and it worked.
If you have your field indexed as a "string" type (instead of "text" type), your regex would have to match the whole field value.
Try this, which takes your regex out to both ends of the field:
System.err.println(regexQueryNrHits("^.*bug [0-9]+.*$",null));
Thanks, but this alone didn't solve the problem. The problem is the Field.Index.ANALYZED flag:
It seems that lucene doesn't index numbers in a proper way so that a regex could be used with them.
I changed:
doc.add(new Field(
FN,"[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",Field.Store.NO, Field.Index.ANALYZED));
to
doc.add(new Field(
FN,"[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",Field.Store.NO, Field.Index.NOT_ANALYZED));
and with your improved regex:
System.err.println(regexQueryNrHits("^.*bug #+[0-9]+.*$",
new JavaUtilRegexCapabilities()));
it finally worked! :)

Categories

Resources