RegEx matching using Lucene - java

I would like to find "bug reports" with Lucene using a regular expression, but whenever I try, it doesn't work.
I based my code on the example from the Lucene documentation to rule out a bad setup.
Here is my code:
import java.util.regex.Pattern;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
import org.apache.lucene.search.regex.RegexCapabilities;
import org.apache.lucene.search.regex.RegexQuery;
import org.apache.lucene.store.RAMDirectory;
public class Rege {
    private static IndexSearcher searcher;
    private static final String FN = "field";

    public static void main(String[] args) throws Exception {
        RAMDirectory directory = new RAMDirectory();
        try {
            IndexWriter writer = new IndexWriter(directory,
                    new SimpleAnalyzer(), true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field(
                    FN,
                    "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
            searcher = new IndexSearcher(directory, true);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.err.println(regexQueryNrHits("bug [0-9]+", null));
    }

    private static Term newTerm(String value) {
        return new Term(FN, value);
    }

    private static int regexQueryNrHits(String regex,
            RegexCapabilities capability) throws Exception {
        RegexQuery query = new RegexQuery(newTerm(regex));
        if (capability != null)
            query.setRegexImplementation(capability);
        return searcher.search(query, null, 1000).totalHits;
    }
}
I would expect bug [0-9]+ to return 1 but it doesn't. I also tested the regex with Java and it worked.

If you have your field indexed as a "string" type (instead of "text" type), your regex would have to match the whole field value.
Try this, which takes your regex out to both ends of the field:
System.err.println(regexQueryNrHits("^.*bug [0-9]+.*$",null));

Thanks, but this alone didn't solve the problem. The real culprit is the Field.Index.ANALYZED flag:
With an analyzed field the title is split into separate terms (and SimpleAnalyzer's letter-based tokenizer drops the digits entirely), so a regex spanning words and numbers can never match a single indexed term.
I changed:
doc.add(new Field(
        FN, "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
        Field.Store.NO, Field.Index.ANALYZED));
to
doc.add(new Field(
        FN, "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
        Field.Store.NO, Field.Index.NOT_ANALYZED));
and with your improved regex, this time using JavaUtilRegexCapabilities (imported from the same contrib package in place of JakartaRegexpCapabilities):
System.err.println(regexQueryNrHits("^.*bug [0-9]+.*$",
        new JavaUtilRegexCapabilities()));
it finally worked! :)
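For reference, a minimal end-to-end sketch of the working setup (same deprecated Lucene 3.x contrib RegexQuery API as the code above; the class name RegeFixed is just for illustration):
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.regex.JavaUtilRegexCapabilities;
import org.apache.lucene.search.regex.RegexQuery;
import org.apache.lucene.store.RAMDirectory;

public class RegeFixed {
    private static final String FN = "field";

    public static void main(String[] args) throws Exception {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true,
                IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        // NOT_ANALYZED: the whole title becomes a single term, digits included.
        doc.add(new Field(FN,
                "[Phpmyadmin-devel] Commits against bug 601721 (Cookie auth mode faulty with IIS)",
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(directory, true);
        // The regex is matched against the whole term, so anchor it and allow text on both sides.
        RegexQuery query = new RegexQuery(new Term(FN, "^.*bug [0-9]+.*$"));
        query.setRegexImplementation(new JavaUtilRegexCapabilities());
        System.err.println(searcher.search(query, null, 1000).totalHits); // expected to print 1
        searcher.close();
    }
}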

Related

Error indexing text from Apache Tika in Solr

I am trying to integrate Apache Tika with Solr so that text extracted by Tika could be indexed in Solr.
I tried the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Main {
    private static SolrServer solr;

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        try {
            solr = new HttpSolrServer("http://localhost:8983/solr/#/");
            String path = "C:\\content\\";
            String file_html = path + "mobydick.htm";
            String file_txt = path + "/home/ben/abc.warc";
            String file_pdf = path + "callofthewild.pdf";
            processDocument(file_html);
            processDocument(file_txt);
            processDocument(file_pdf);
            solr.commit();
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }

    private static void processDocument(String pathfilename) {
        try {
            InputStream input = new FileInputStream(new File(pathfilename));
            //use Apache Tika to convert documents in different formats to plain text
            ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
            Metadata meta = new Metadata();
            Parser parser = new AutoDetectParser(); //handles documents in different formats
            ParseContext context = new ParseContext();
            parser.parse(input, textHandler, meta, context); //convert to plain text
            //collect metadata and content from Tika and other sources
            //document id must be unique, use guid
            UUID guid = java.util.UUID.randomUUID();
            String docid = guid.toString();
            //Dublin Core metadata (partial set)
            String doctitle = meta.get(DublinCore.TITLE);
            String doccreator = meta.get(DublinCore.CREATOR);
            //other metadata
            String docurl = pathfilename; //document url
            //content
            String doccontent = textHandler.toString();
            //call to index
            indexDocument(docid, doctitle, doccreator, docurl, doccontent);
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }

    private static void indexDocument(String docid, String doctitle, String doccreator, String docurl, String doccontent) {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docid);
            //map metadata fields to default schema
            //location: path\solr-4.7.2\example\solr\collection1\conf\schema.xml
            //Dublin Core
            //thought: schema could be modified to use Dublin Core
            doc.addField("title", doctitle);
            doc.addField("author", doccreator);
            //other metadata
            doc.addField("url", docurl);
            //content (and text)
            //per schema, the content field is not indexed by default, used for returning and highlighting document content
            //the schema "copyField" command automatically copies this to the "text" field which is indexed
            doc.addField("content", doccontent);
            //indexing
            //when a field is indexed, like "text", Solr will handle tokenization, stemming, removal of stopwords etc, per the schema defn
            //add to index
            solr.add(doc);
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }
}
Unfortunately I am hitting the error below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException
    at Main.main(Main.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.http.NoHttpResponseException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 1 more
Could you please help me with the resolution of this issue?

Apache Tika and Apache Solr integration through Java API

I am trying to integrate Apache Tika and Apache Solr so that I can index my parsed data. I'm using Solr version 4.3.1 and Tika version 2.11.6.
The code I am following is:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Main {
    private static SolrServer solr;

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        try {
            solr = new HttpSolrServer("http://localhost:8983/solr/#/"); //create solr connection
            //solr.deleteByQuery( "*:*" ); //delete everything in the index; good for testing
            //location of source documents
            //later this will be switched to a database
            String path = "C:\\content\\";
            String file_html = path + "mobydick.htm";
            String file_txt = path + "/home/ben/abc.warc";
            String file_pdf = path + "callofthewild.pdf";
            processDocument(file_html);
            processDocument(file_txt);
            processDocument(file_pdf);
            solr.commit(); //after all docs are added, commit to the index
            //now you can search at http://localhost:8983/solr/browse
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }

    private static void processDocument(String pathfilename) {
        try {
            InputStream input = new FileInputStream(new File(pathfilename));
            //use Apache Tika to convert documents in different formats to plain text
            ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
            Metadata meta = new Metadata();
            Parser parser = new AutoDetectParser();
            //handles documents in different formats:
            ParseContext context = new ParseContext();
            parser.parse(input, textHandler, meta, context); //convert to plain text
            //collect metadata and content from Tika and other sources
            //document id must be unique, use guid
            UUID guid = java.util.UUID.randomUUID();
            String docid = guid.toString();
            //Dublin Core metadata (partial set)
            String doctitle = meta.get(DublinCore.TITLE);
            String doccreator = meta.get(DublinCore.CREATOR);
            //other metadata
            String docurl = pathfilename; //document url
            //content
            String doccontent = textHandler.toString();
            //call to index
            indexDocument(docid, doctitle, doccreator, docurl, doccontent);
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }

    private static void indexDocument(String docid, String doctitle, String doccreator, String docurl, String doccontent) {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docid);
            //map metadata fields to default schema
            //location: path\solr-4.7.2\example\solr\collection1\conf\schema.xml
            //Dublin Core
            //thought: schema could be modified to use Dublin Core
            doc.addField("title", doctitle);
            doc.addField("author", doccreator);
            //other metadata
            doc.addField("url", docurl);
            //content (and text)
            //per schema, the content field is not indexed by default, used for returning and highlighting document content
            //the schema "copyField" command automatically copies this to the "text" field which is indexed
            doc.addField("content", doccontent);
            //indexing
            //when a field is indexed, like "text", Solr will handle tokenization, stemming, removal of stopwords etc, per the schema defn
            //add to index
            solr.add(doc);
        }
        catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }
}
The error I got:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException
    at Main.main(Main.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.http.NoHttpResponseException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 1 more
This does not seem to be Tika related, so I would focus on Solr.
Specifically, look at how you included the Solr libraries and their dependencies. If you pulled them in through Maven, they should have worked; if you assembled them manually, perhaps you missed one or two.
The error message is about a missing class that ships with the Apache HttpComponents (HttpClient/HttpCore) library. You are probably missing it in your dependencies or on the classpath.
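For example, a minimal sketch of the Maven route (the coordinates are standard SolrJ; the version below is an assumption, match it to your Solr server), which brings in httpclient/httpcore transitively:
<!-- hypothetical pom.xml fragment; SolrJ pulls in Apache HttpComponents,
     home of org.apache.http.NoHttpResponseException -->
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>4.3.1</version>
</dependency>
If you are adding jars by hand instead, make sure httpclient and httpcore (and their companions such as httpmime) are on the classpath.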

Searching for UUID in lucene not working

I've got a UUID field I'm adding to my document in the following format: 372d325c-e01b-432f-98bd-bc4c949f15b8. However, when I try to query for documents by the UUID it will not return them no matter how I try to escape the expression. For example:
+uuid:372d325c-e01b-432f-98bd-bc4c949f15b8
+uuid:"372d325c-e01b-432f-98bd-bc4c949f15b8"
+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8
+uuid:(372d325c-e01b-432f-98bd-bc4c949f15b8)
+uuid:("372d325c-e01b-432f-98bd-bc4c949f15b8")
And even skipping the QueryParser altogether using TermQuery like so:
new TermQuery(new Term("uuid", uuid.toString()))
Or
new TermQuery(new Term("uuid", QueryParser.escape(uuid.toString())))
None of these searches will return a document, but if I search for portions of the UUID it will return a document. For example these will return something:
+uuid:372d325c
+uuid:e01b
+uuid:432f
What should I do to index these documents so I can pull them back by their UUID? I've considered reformatting the UUID to remove the hyphens, but I haven't implemented it yet.
The only way I got this to work was to use WhitespaceAnalyzer instead of StandardAnalyzer, and then query with a TermQuery, like so:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36))
        .setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
writer = new IndexWriter(directory, config);
Then searching:
TopDocs docs = searcher.search(new TermQuery(new Term("uuid", uuid.toString())), 1);
WhitespaceAnalyzer prevented Lucene from splitting apart the UUID by the hyphens. Another option could be to eliminate the dashes from the UUID, but using the WhitespaceAnalyzer works just as well for my purposes.
According to the Lucene query syntax rules, the query
+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8
should work.
I guess that if it doesn't, it is because the uuid field is not populated as it should be when the document is inserted into the index. Could you check what exactly is indexed for this field? You can use Luke to crawl the index and look at the actual values stored for the uuid field.
If you plan to use a UUID field as a lookup key, you need to ask Lucene to index the whole field as a single token, without any tokenization. This is done by setting the right field type for your UUID field. In Lucene 4+, you can use StringField.
import java.io.IOException;
import java.util.UUID;
import junit.framework.Assert;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
/**
 * Using Lucene 4.7 on Java 7.
 */
public class LuceneUUIDFieldLookupTest {
    private Directory directory;
    private Analyzer analyzer;

    @Test
    public void testUsingUUIDAsLookupKey() throws IOException, ParseException {
        directory = new RAMDirectory();
        analyzer = new StandardAnalyzer(Version.LUCENE_47);
        UUID docUUID = UUID.randomUUID();
        String docContentText1 = "Stack Overflow is a question and answer site for professional and enthusiast programmers.";
        index(docUUID, docContentText1);

        QueryParser parser = new QueryParser(Version.LUCENE_47, MyIndexedFields.DOC_TEXT_FIELD.name(), analyzer);
        Query queryForProgrammer = parser.parse("programmers");
        IndexSearcher indexSearcher = getIndexSearcher();
        TopDocs hits = indexSearcher.search(queryForProgrammer, Integer.MAX_VALUE);
        Assert.assertTrue(hits.scoreDocs.length == 1);
        Integer internalDocId1 = hits.scoreDocs[0].doc;
        Document docRetrieved1 = indexSearcher.doc(internalDocId1);
        indexSearcher.getIndexReader().close();
        String docText1 = docRetrieved1.get(MyIndexedFields.DOC_TEXT_FIELD.name());
        Assert.assertEquals(docText1, docContentText1);

        String docContentText2 = "TechCrunch is a leading technology media property, dedicated to ... according to a new report from the Wall Street Journal confirmed by Google to TechCrunch.";
        reindex(docUUID, docContentText2);
        Query queryForTechCrunch = parser.parse("technology");
        indexSearcher = getIndexSearcher(); //you must reopen the directory because the previous IndexSearcher only sees a snapshot of it
        hits = indexSearcher.search(queryForTechCrunch, Integer.MAX_VALUE);
        Assert.assertTrue(hits.scoreDocs.length == 1);
        Integer internalDocId2 = hits.scoreDocs[0].doc;
        Document docRetrieved2 = indexSearcher.doc(internalDocId2);
        indexSearcher.getIndexReader().close();
        String docText2 = docRetrieved2.get(MyIndexedFields.DOC_TEXT_FIELD.name());
        Assert.assertEquals(docText2, docContentText2);
    }

    private void reindex(UUID myUUID, String docContentText) throws IOException {
        try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
            Term term = new Term(MyIndexedFields.MY_UUID_FIELD.name(), myUUID.toString());
            indexWriter.updateDocument(term, buildDoc(myUUID, docContentText));
        } //auto-close
    }

    private void index(UUID myUUID, String docContentText) throws IOException {
        try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
            indexWriter.addDocument(buildDoc(myUUID, docContentText));
        } //auto-close
    }

    private IndexWriterConfig getIndexWriterConfig() {
        return new IndexWriterConfig(Version.LUCENE_47, analyzer).setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    }

    private Document buildDoc(UUID myUUID, String docContentText) {
        Document doc = new Document();
        doc.add(new Field(
                MyIndexedFields.MY_UUID_FIELD.name(),
                myUUID.toString(),
                StringField.TYPE_STORED)); //use TYPE_STORED if you want to read it back in the search result
        doc.add(new Field(
                MyIndexedFields.DOC_TEXT_FIELD.name(),
                docContentText,
                TextField.TYPE_STORED));
        return doc;
    }

    private IndexSearcher getIndexSearcher() throws IOException {
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(ireader);
        return indexSearcher;
    }

    enum MyIndexedFields {
        MY_UUID_FIELD,
        DOC_TEXT_FIELD
    }
}
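As a follow-up sketch (not part of the original answer): once the uuid field is indexed with StringField, an exact lookup is just a TermQuery on the raw UUID string. The helper class below is hypothetical and only illustrates the call:
import java.io.IOException;
import java.util.UUID;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: looks a document up by its UUID, assuming the field
// was indexed with StringField (one untokenized term per document).
class UuidLookup {
    static TopDocs findByUuid(IndexSearcher searcher, String uuidField, UUID uuid) throws IOException {
        return searcher.search(new TermQuery(new Term(uuidField, uuid.toString())), 1);
    }
}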

Merging MS Word documents with Java

I'm looking for Java libraries that read and write MS Word documents.
What I have to do is:
read a template file, .dot or .doc, and fill it with data read from a DB
take data from another Word document and merge it with the file described above, preserving paragraph formats
let users make updates to the file afterwards.
I've searched and found Apache POI and OpenOffice UNO.
The first one can easily read a template and replace any placeholders with my own data from the DB, but I didn't find anything about merging two, or more, documents.
OpenOffice UNO looks more stable but also more complex. Furthermore, I'm not sure it has the ability to merge documents.
Are we looking in the right direction?
Another solution I've thought of was to convert the DOC files to DOCX; that way I found more libraries that can help us merge documents.
But how can I do that?
Thanks!
You could take a look at Docmosis since it provides the four features you have mentioned (data population, template/document merging, DOC format and java interface). It has a couple of flavours (download, online service), but you could sign up for a free trial of the cloud service to see if Docmosis can do what you want (then you don't have to install anything) or read the online documentation.
It uses OpenOffice under the hood (you can see from the developer guide installation instructions) which does pretty decent conversions between documents. The UNO API has some complications - I would suggest either Docmosis or JODReports to isolate your project from UNO directly.
Hope that helps.
import java.io.File;
import java.util.List;
import javax.xml.bind.JAXBException;
import org.docx4j.dml.CTBlip;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageBmpPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageEpsPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageGifPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageJpegPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImagePngPart;
import org.docx4j.openpackaging.parts.WordprocessingML.ImageTiffPart;
import org.docx4j.openpackaging.parts.relationships.RelationshipsPart;
import org.docx4j.openpackaging.parts.relationships.RelationshipsPart.AddPartBehaviour;
import org.docx4j.relationships.Relationship;
public class MultipleDocMerge {

    public static void main(String[] args) throws Docx4JException, JAXBException {
        File first = new File("D:\\Mreg.docx");
        File second = new File("D:\\Mreg1.docx");
        File third = new File("D:\\Mreg4&19.docx");
        File fourth = new File("D:\\test12.docx");
        WordprocessingMLPackage f = WordprocessingMLPackage.load(first);
        WordprocessingMLPackage s = WordprocessingMLPackage.load(second);
        WordprocessingMLPackage a = WordprocessingMLPackage.load(third);
        WordprocessingMLPackage e = WordprocessingMLPackage.load(fourth);

        //append the body content of the second document to the first
        List body = s.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
        for (Object b : body) {
            List filhos = ((org.docx4j.wml.Body) b).getContent();
            for (Object k : filhos)
                f.getMainDocumentPart().addObject(k);
        }
        //append the body content of the third document
        List body1 = a.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
        for (Object b : body1) {
            List filhos = ((org.docx4j.wml.Body) b).getContent();
            for (Object k : filhos)
                f.getMainDocumentPart().addObject(k);
        }
        //append the body content of the fourth document
        List body2 = e.getMainDocumentPart().getJAXBNodesViaXPath("//w:body", false);
        for (Object b : body2) {
            List filhos = ((org.docx4j.wml.Body) b).getContent();
            for (Object k : filhos)
                f.getMainDocumentPart().addObject(k);
        }

        //copy the image parts referenced by the fourth document and re-point the blips
        List<Object> blips = e.getMainDocumentPart().getJAXBNodesViaXPath("//a:blip", false);
        for (Object el : blips) {
            try {
                CTBlip blip = (CTBlip) el;
                RelationshipsPart parts = e.getMainDocumentPart().getRelationshipsPart();
                Relationship rel = parts.getRelationshipByID(blip.getEmbed());
                Part part = parts.getPart(rel);
                if (part instanceof ImagePngPart)
                    System.out.println(((ImagePngPart) part).getBytes());
                if (part instanceof ImageJpegPart)
                    System.out.println(((ImageJpegPart) part).getBytes());
                if (part instanceof ImageBmpPart)
                    System.out.println(((ImageBmpPart) part).getBytes());
                if (part instanceof ImageGifPart)
                    System.out.println(((ImageGifPart) part).getBytes());
                if (part instanceof ImageEpsPart)
                    System.out.println(((ImageEpsPart) part).getBytes());
                if (part instanceof ImageTiffPart)
                    System.out.println(((ImageTiffPart) part).getBytes());
                Relationship newrel = f.getMainDocumentPart().addTargetPart(part, AddPartBehaviour.RENAME_IF_NAME_EXISTS);
                blip.setEmbed(newrel.getId());
                f.getMainDocumentPart().addTargetPart(e.getParts().getParts().get(new PartName("/word/" + rel.getTarget())));
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        File saved = new File("D:\\saved1.docx");
        f.save(saved);
    }
}
I've developed the following class (using Apache POI):
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBody;
public class WordMerge {

    private final OutputStream result;
    private final List<InputStream> inputs;
    private XWPFDocument first;

    public WordMerge(OutputStream result) {
        this.result = result;
        inputs = new ArrayList<>();
    }

    public void add(InputStream stream) throws Exception {
        inputs.add(stream);
        OPCPackage srcPackage = OPCPackage.open(stream);
        XWPFDocument src1Document = new XWPFDocument(srcPackage);

        if (inputs.size() == 1) {
            first = src1Document;
        } else {
            CTBody srcBody = src1Document.getDocument().getBody();
            first.getDocument().addNewBody().set(srcBody);
        }
    }

    public void doMerge() throws Exception {
        first.write(result);
    }

    public void close() throws Exception {
        result.flush();
        result.close();
        for (InputStream input : inputs) {
            input.close();
        }
    }
}
And its use:
public static void main(String[] args) throws Exception {
    FileOutputStream faos = new FileOutputStream("/home/victor/result.docx");
    WordMerge wm = new WordMerge(faos);
    wm.add(new FileInputStream("/home/victor/001.docx"));
    wm.add(new FileInputStream("/home/victor/002.docx"));
    wm.doMerge();
    wm.close();
}
The Apache POI code does not work for Images.

Convert DOC file to DOCX with Java

I need to use DOCX files (actually the XML contained in them) in Java software I'm currently developing, but some people in my company still use the DOC format.
Do you know if there is a way to convert a DOC file to the DOCX format using Java? I know it's possible using C#, but that's not an option.
I googled it, but nothing came up...
Thanks
You may try Aspose.Words for Java. It allows you to load a DOC file and save it as DOCX format. The code is very simple as shown below:
// Open a document.
Document doc = new Document("input.doc");
// Save document.
doc.save("output.docx");
Please see if this helps in your scenario.
Disclosure: I work as developer evangelist at Aspose.
Check out JODConverter to see if it fits the bill. I haven't personally used it.
Use the newer versions of the jars, jodconverter-core-4.2.2.jar and jodconverter-local-4.2.2.jar:
String inputFile = "*.doc";
String outputFile = "*.docx";

LocalOfficeManager localOfficeManager = LocalOfficeManager.builder()
        .install()
        .officeHome(getDefaultOfficeHome()) //your path to openoffice
        .build();
try {
    localOfficeManager.start();
    final DocumentFormat format = DocumentFormat.builder()
            .from(DefaultDocumentFormatRegistry.DOCX)
            .build();
    LocalConverter
            .make()
            .convert(new FileInputStream(new File(inputFile)))
            .as(DefaultDocumentFormatRegistry.getFormatByMediaType("application/msword"))
            .to(new File(outputFile))
            .as(format)
            .execute();
} catch (OfficeException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (FileNotFoundException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} finally {
    OfficeUtils.stopQuietly(localOfficeManager);
}
JODConverter calls OpenOffice/LibreOffice via a network protocol. It can therefore 'do anything you can do in OpenOffice', which includes converting formats. But it only does as good a job as whatever version of OpenOffice you are running; I have some art in one of my docs, and it doesn't convert it as I hoped.
JODConverter is no longer supported, according to the Google Code website for v3.
To get JODConverter to do the job you need to do something like:
private static void transformBinaryWordDocToDocX(File in, File out) {
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
    docx.setStoreProperties(DocumentFamily.TEXT,
            Collections.singletonMap("FilterName", "MS Word 2007 XML"));
    converter.convert(in, out, docx);
}

private static void transformBinaryWordDocToW2003Xml(File in, File out) {
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat w2003xml = new DocumentFormat("Microsoft Word 2003 XML", "xml", "text/xml");
    w2003xml.setInputFamily(DocumentFamily.TEXT);
    w2003xml.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MS Word 2003 XML"));
    converter.convert(in, out, w2003xml);
}

private static OfficeManager officeManager;

@BeforeClass
public static void setupStatic() throws IOException {
    /*officeManager = new DefaultOfficeManagerConfiguration()
            .setOfficeHome("C:/Program Files/LibreOffice 3.6")
            .buildOfficeManager();
    */
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
    officeManager.start();
}

@AfterClass
public static void shutdownStatic() throws IOException {
    officeManager.stop();
}
For this to work you need to be running LibreOffice as a networked server, listening on the port configured above (8100 in this example); I could not get the 'run on demand' part of JODConverter to work very well under Windows with LO 3.6.
To convert a DOC file to HTML, look at this:
(Convert Word doc to HTML programmatically in Java)
Use this: http://poi.apache.org/
Or use this:
XWPFDocument docx = new XWPFDocument(OPCPackage.openOrCreate(new File("hello.docx")));
XWPFWordExtractor wx = new XWPFWordExtractor(docx);
String text = wx.getText();
System.out.println("text = "+text);
I needed the same conversion. After researching a lot, I found that JODConverter can be useful for it; you can download the jar from
https://code.google.com/p/jodconverter/downloads/list
Add the jodconverter-core-3.0-beta-4-sources.jar file to your project lib.
//1) Create OfficeManager object
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome(new File("/opt/libreoffice4.4"))
        .buildOfficeManager();
officeManager.start();

//2) Create JODConverter converter
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);

//3) Create DocumentFormat for docx
DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
docx.setStoreProperties(DocumentFamily.TEXT,
        Collections.singletonMap("FilterName", "MS Word 2007 XML"));

//4) Call the convert function on the converter object
converter.convert(new File("doc/AdvancedTable.doc"), new File("docx/AdvancedTable.docx"), docx);
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class TestCon {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        POIFSFileSystem fs = null;
        Document document = new Document();
        try {
            System.out.println("Starting the test");
            fs = new POIFSFileSystem(new FileInputStream("C:/Users/312845/Desktop/a.doc"));
            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);
            OutputStream file = new FileOutputStream(new File("C:/Users/312845/Desktop/test.docx"));
            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        } finally {
            // close the document
            document.close();
        }
    }
}
