Convert DOC file to DOCX with Java

I need to use DOCX files (actually the XML contained in them) in a Java application I'm currently developing, but some people in my company still use the DOC format.
Do you know if there is a way to convert a DOC file to the DOCX format using Java? I know it's possible using C#, but that's not an option.
I googled it, but nothing came up...
Thanks

You may try Aspose.Words for Java. It allows you to load a DOC file and save it in DOCX format. The code is very simple, as shown below:
// Open a document.
Document doc = new Document("input.doc");
// Save document.
doc.save("output.docx");
Please see if this helps in your scenario.
Disclosure: I work as a developer evangelist at Aspose.

Check out JODConverter to see if it fits the bill. I haven't personally used it.

Use the newer JODConverter jars, jodconverter-core-4.2.2.jar and jodconverter-local-4.2.2.jar:
// Placeholders: replace with the real input and output paths.
String inputFile = "*.doc";
String outputFile = "*.docx";
LocalOfficeManager localOfficeManager = LocalOfficeManager.builder()
        .install()
        .officeHome(getDefaultOfficeHome()) // your path to OpenOffice
        .build();
try {
    localOfficeManager.start();
    final DocumentFormat format
            = DocumentFormat.builder()
                    .from(DefaultDocumentFormatRegistry.DOCX)
                    .build();
    LocalConverter
            .make()
            .convert(new FileInputStream(new File(inputFile)))
            .as(DefaultDocumentFormatRegistry.getFormatByMediaType("application/msword"))
            .to(new File(outputFile))
            .as(format)
            .execute();
} catch (OfficeException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} catch (FileNotFoundException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
} finally {
    // Always stop the office process, even if the conversion failed.
    OfficeUtils.stopQuietly(localOfficeManager);
}

JODConverter calls OpenOffice/LibreOffice via a network protocol. It can therefore 'do anything you can do in OpenOffice', which includes converting formats. But it only does as good a job as whatever version of OpenOffice you are running. I have some art in one of my docs, and it wasn't converted as I had hoped.
JODConverter is no longer supported, according to the Google Code web site for v3.
To get JOD to do the job you need to do something like
private static void transformBinaryWordDocToDocX(File in, File out)
{
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
    docx.setStoreProperties(DocumentFamily.TEXT,
            Collections.singletonMap("FilterName", "MS Word 2007 XML"));
    converter.convert(in, out, docx);
}
private static void transformBinaryWordDocToW2003Xml(File in, File out)
{
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat w2003xml = new DocumentFormat("Microsoft Word 2003 XML", "xml", "text/xml");
    w2003xml.setInputFamily(DocumentFamily.TEXT);
    w2003xml.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MS Word 2003 XML"));
    converter.convert(in, out, w2003xml);
}
private static OfficeManager officeManager;
@BeforeClass
public static void setupStatic() throws IOException {
    /*officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("C:/Program Files/LibreOffice 3.6")
        .buildOfficeManager();
    */
    // Connect to an office instance that is already listening on port 8100.
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
    officeManager.start();
}
@AfterClass
public static void shutdownStatic() throws IOException {
    officeManager.stop();
}
For this to work you need to be running LibreOffice as a networked server (I could not get the 'run on demand' part of JODConverter to work very well under Windows with LO 3.6).
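As a rough sketch of the "networked server" setup: the command built below is how LibreOffice is commonly started as a headless socket listener, and port 8100 matches the setPortNumber(8100) call above. The class OfficeListener, its method names and the bare "soffice" executable name are assumptions made for illustration; they are not part of JODConverter, and older office versions may expect slightly different switches.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;

final class OfficeListener {

    /** Builds the command line that starts LibreOffice as a headless socket listener. */
    static List<String> listenerCommand(String sofficePath, int port) {
        return Arrays.asList(
                sofficePath,
                "--headless",
                "--norestore",
                "--accept=socket,host=127.0.0.1,port=" + port + ";urp;");
    }

    /** Returns true if something accepts TCP connections on 127.0.0.1 at the given port. */
    static boolean isListening(int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("127.0.0.1", port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        int port = 8100;
        if (!isListening(port, 500)) {
            // Assumption: "soffice" is on the PATH; otherwise pass its full path as the first argument.
            String soffice = args.length > 0 ? args[0] : "soffice";
            new ProcessBuilder(listenerCommand(soffice, port)).inheritIO().start();
        }
    }
}
```

Checking the port first avoids starting a second office process when a listener is already running.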

To convert a DOC file to HTML, look at this question:
(Convert Word doc to HTML programmatically in Java)
Use Apache POI: http://poi.apache.org/
Or use this to extract the text from a DOCX file:
XWPFDocument docx = new XWPFDocument(OPCPackage.openOrCreate(new File("hello.docx")));
XWPFWordExtractor wx = new XWPFWordExtractor(docx);
String text = wx.getText();
System.out.println("text = "+text);

I needed the same conversion. After a lot of research I found that JODConverter can do it. You can download the jar from
https://code.google.com/p/jodconverter/downloads/list
Add the jodconverter-core-3.0-beta-4-sources.jar file to your project lib
// 1) Create the OfficeManager object
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome(new File("/opt/libreoffice4.4"))
        .buildOfficeManager();
officeManager.start();
// 2) Create the JODConverter converter
OfficeDocumentConverter converter = new OfficeDocumentConverter(
        officeManager);
// 3) Create the DocumentFormat for docx
DocumentFormat docx = converter.getFormatRegistry().getFormatByExtension("docx");
docx.setStoreProperties(DocumentFamily.TEXT,
        Collections.singletonMap("FilterName", "MS Word 2007 XML"));
// 4) Call the convert function on the converter object
converter.convert(new File("doc/AdvancedTable.doc"), new File(
        "docx/AdvancedTable.docx"), docx);
// 5) Stop the office process when done
officeManager.stop();

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class TestCon {
    /**
     * @param args not used
     */
    public static void main(String[] args) {
        System.out.println("Starting the test");
        // try-with-resources closes both streams even if the conversion fails
        try (FileInputStream in = new FileInputStream("C:/Users/312845/Desktop/a.doc");
             OutputStream file = new FileOutputStream("C:/Users/312845/Desktop/test.docx")) {
            HWPFDocument doc = new HWPFDocument(new POIFSFileSystem(in));
            WordExtractor we = new WordExtractor(doc);
            // Assumption: only the paragraph text is carried over; formatting,
            // tables and images in the DOC file are not preserved.
            // XWPFDocument needs the poi-ooxml jar in addition to poi-scratchpad.
            XWPFDocument docx = new XWPFDocument();
            for (String paragraph : we.getParagraphText()) {
                // Each extracted paragraph ends with a line break; strip it.
                docx.createParagraph().createRun().setText(paragraph.replaceAll("[\\r\\n]+$", ""));
            }
            docx.write(file);
            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        }
    }
}
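Once a DOCX file has been produced by any of the approaches above, the XML the question asks about can be read with the JDK alone, because a DOCX file is a ZIP archive. The sketch below assumes the main document part sits at word/document.xml (the usual location) and is encoded as UTF-8; DocxXmlReader and readDocumentXml are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DocxXmlReader {

    /** Returns the main document part (word/document.xml) of a DOCX file as a string. */
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml is missing in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                // Assumption: the part is UTF-8 encoded, which is what Word normally writes.
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readDocumentXml(Paths.get(args[0])));
    }
}
```

The returned string can then be handed to any XML parser.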

Related

java.lang.NoClassDefFoundError: org.apache.poi.POIXMLDocumentPart when converting .docx document to .pdf

I have a java program named wordToPdf that converts the *.docx file to the *.pdf file. The program runs pretty well with Apache POI 4.1.2 along with POI OOXML 4.1.2 and the fr.opensagres.xdocreport 2.0.2. The *.pdf result is created successfully.
Below is the java program
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
public void wordToPdf(String inDocFile, String outPdfFile) {
    // try-with-resources closes both streams even if the conversion fails
    try (InputStream doc = new FileInputStream(new File(inDocFile));
         OutputStream out = new FileOutputStream(new File(outPdfFile))) {
        XWPFDocument document = new XWPFDocument(doc);
        PdfOptions options = null;
        PdfConverter.getInstance().convert(document, out, options);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
However, when the program is called while handling an HttpServletRequest, it doesn't behave the same way. Instead, a NoClassDefFoundError for org.apache.poi.POIXMLDocumentPart is thrown.
Has anyone experienced this issue previously?
Please help me. Thanks so much, guys.
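One way to narrow down a NoClassDefFoundError like this is to ask the running JVM whether, and from where, it can load the class. The sketch below is a diagnostic only, not a fix. ClassLocator and locate are hypothetical names. As far as I know, POI 4.x ships this class as org.apache.poi.ooxml.POIXMLDocumentPart, so a failed lookup of the old org.apache.poi name inside the servlet may point to a library built against POI 3.x on the web application's class path; treat that as an assumption to verify.

```java
import java.net.URL;
import java.security.CodeSource;

final class ClassLocator {

    /** Describes where a class is loaded from, or reports that it cannot be loaded. */
    static String locate(String className) {
        try {
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            return className + " -> " + (location == null ? "JDK runtime" : location.toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> NOT FOUND (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Run the same two lookups from inside the servlet and compare the results.
        System.out.println(locate("org.apache.poi.POIXMLDocumentPart"));
        System.out.println(locate("org.apache.poi.ooxml.POIXMLDocumentPart"));
    }
}
```

Calling locate from both the standalone program and the servlet shows whether the two environments see the same jars.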

How to write HTML text with Marathi text to PDF document using docx4j?

I am using docx4j to create PDF documents from HTML text. The HTML contains some English and some Marathi text. The English text is rendered properly in the PDF, but the Marathi text is not displayed.
In place of the text, it shows square boxes.
Below is the code I am using.
import java.io.FileOutputStream;
import org.docx4j.Docx4J;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class ConvertInXHTMLFragment {
static String DEST_PDF = "/home/Downloads/Sample.pdf";
public static void main(String[] args) throws Exception {
// String content = "<html>Hello</html>";
String content = "<html>पासवर्ड</html>";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert(content, null));
Docx4J.toPDF(wordMLPackage, new FileOutputStream(DEST_PDF));
}
}
EDIT 1:-
This is from one of the samples from XSLFO
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.convert.out.FOSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.model.fields.FieldUpdater;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.samples.AbstractSample;
public class ConvertOutPDFviaXSLFO extends AbstractSample {
static {
inputfilepath = "/home/Downloads/100.docx";
saveFO = true;
}
static boolean saveFO;
public static void main(String[] args)
throws Exception {
try {
getInputFilePath(args);
} catch (IllegalArgumentException e) {
}
String regex = null;
PhysicalFonts.setRegex(regex);
WordprocessingMLPackage wordMLPackage;
System.out.println("Loading file from " + inputfilepath);
wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
FieldUpdater updater = null;
Mapper fontMapper = new IdentityPlusMapper();
wordMLPackage.setFontMapper(fontMapper);
PhysicalFont font = PhysicalFonts.get("Arial Unicode MS");
fontMapper.put("Mangal", font);
FOSettings foSettings = Docx4J.createFOSettings();
if (saveFO) {
foSettings.setFoDumpFile(new java.io.File(inputfilepath + ".fo"));
}
foSettings.setWmlPackage(wordMLPackage);
String outputfilepath;
if (inputfilepath==null) {
outputfilepath = System.getProperty("user.dir") + "/OUT_FontContent.pdf";
} else {
outputfilepath = inputfilepath + ".pdf";
}
OutputStream os = new java.io.FileOutputStream(outputfilepath);
Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
System.out.println("Saved: " + outputfilepath);
if (wordMLPackage.getMainDocumentPart().getFontTablePart()!=null) {
wordMLPackage.getMainDocumentPart().getFontTablePart().deleteEmbeddedFontTempFiles();
}
// This would also do it, via finalize() methods
updater = null;
foSettings = null;
wordMLPackage = null;
}
}
Now, I get #### in place of Marathi texts in the output PDF.

Docx4j v3.3 supports PDF output in two completely different ways.
The default is to use Plutext's PDF Converter. Things work if the mangal font you linked to is installed in the Converter, and specified in the docx:
<w:r>
<w:rPr>
<w:rFonts w:ascii="mangal" w:eastAsia="mangal" w:hAnsi="mangal" w:cs="mangal"/>
</w:rPr>
<w:t>पासवर्ड</w:t>
</w:r>
Same would apply for Arial Unicode MS.
The other way is PDF via XSL FO; see https://github.com/plutext/docx4j-export-FO
If you have the relevant font installed it should just work. If you don't, then you need to tell it which font to use.
For example, suppose the docx specifies the mangal font, which I do not have. But I have Arial Unicode MS. So I tell the XSL FO process to use that instead:
fontMapper.put("mangal", PhysicalFonts.get("Arial Unicode MS"));
Note, you need to know which font your docx is specifying, and how to specify the font you want. To do that in XHTML Import, copied from my answer to your earlier question:
Fonts are handled by
https://github.com/plutext/docx4j-ImportXHTML/blob/master/src/main/java/org/docx4j/convert/in/xhtml/FontHandler.java#L58
Marathi might be relying on one of the other attributes in the RFonts
object. You'll need to look at a working docx to see. You can use
https://github.com/plutext/docx4j-ImportXHTML/blob/master/src/main/java/org/docx4j/convert/in/xhtml/FontHandler.java#L54
to inject a suitable font mapping.
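Before setting up a font mapping it can help to confirm which scripts the input text actually uses, since square boxes or #### usually mean the chosen font has no glyphs for one of them. The following sketch uses only the JDK; ScriptScanner and scriptsIn are hypothetical names, and the sample string is the Marathi word from the question written with Unicode escapes.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ScriptScanner {

    /** Returns the Unicode scripts used in the text, ignoring spaces, digits and punctuation. */
    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            // COMMON and INHERITED cover characters shared between scripts.
            .filter(s -> s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED)
            .forEach(scripts::add);
        return scripts;
    }

    public static void main(String[] args) {
        // The sample word from the question, escaped so the source file encoding does not matter.
        String marathi = "\u092A\u093E\u0938\u0935\u0930\u094D\u0921";
        System.out.println(scriptsIn("Hello " + marathi));
    }
}
```

If DEVANAGARI shows up in the result, the font used for that run needs Devanagari glyphs, for example the Mangal or Arial Unicode MS fonts mentioned above.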

SFNTLY: How to convert any font that gets uploaded to "WOFF" format?

I cannot find any documentation on this library (https://code.google.com/p/sfntly/). I've been taking stabs at it for 2 days now. I'm trying to convert any font that gets uploaded to "WOFF" format.
Could someone shed some light?

I successfully converted my TTF into a WOFF file by following these steps:
Download and install Ant following "The Short Story" steps (http://ant.apache.org/manual/install.html#getBinary).
Download SFNTLY via SVN checkout (https://code.google.com/p/sfntly/source/checkout) and follow the steps in the file "sfntly\java\quickstart.txt".
Create a new Java project and import the following jars, built in the previous steps:
sfntly.jar
woffconverter.jar
guava-16.0.1.jar
I slightly tweaked display_name's code, which contained a few syntax mistakes.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import com.google.common.io.Files;
import com.google.typography.font.sfntly.Font;
import com.google.typography.font.sfntly.FontFactory;
import com.google.typography.font.sfntly.data.WritableFontData;
import com.google.typography.font.tools.conversion.woff.WoffWriter;
public class Main {
public static void main(String[] args) {
WoffWriter ww = new WoffWriter();
FontFactory fontFactory = FontFactory.getInstance();
byte[] bytes;
try {
bytes = Files.toByteArray(new File("C:\\FontName.TTF"));
Font font = fontFactory.loadFonts(bytes)[0];
WritableFontData wfd = ww.convert(font);
FileOutputStream fs = new FileOutputStream("out.fnt");
wfd.copyTo(fs);
fs.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
}

I have only read the SFNTLY source code and am no expert in it, so use my answer at your own risk :).
I would convert the font with WoffWriter#convert() to writable font data, then copy that data to an output stream.
WoffWriter ww = new WoffWriter();
WritableFontData wfd = ww.convert(yourFont);
try {
    FileOutputStream fs = new FileOutputStream("out.fnt");
    wfd.copyTo(fs);
    fs.close();
} catch (IOException e) {
    e.printStackTrace();
}
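A quick sanity check on the converter's output needs only the JDK: a WOFF 1.0 file begins with the four ASCII bytes "wOFF". The sketch below tests just that signature and does not validate the rest of the file. WoffCheck and looksLikeWoff are hypothetical names, and "out.fnt" is simply the output name used in the answers above.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WoffCheck {

    // The bytes 'w' 'O' 'F' 'F' read as one big-endian 32-bit integer.
    private static final int WOFF_SIGNATURE = 0x774F4646;

    /** Returns true if the file starts with the WOFF 1.0 signature. */
    static boolean looksLikeWoff(Path file) throws IOException {
        if (Files.size(file) < 4) {
            return false;
        }
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            return in.readInt() == WOFF_SIGNATURE;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "out.fnt");
        System.out.println(file + (looksLikeWoff(file) ? " looks like WOFF" : " is not WOFF"));
    }
}
```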

Converting html file to pdf using iText in JApplet

I am using iText for a project. My program is supposed to run from inside a browser and I need it to convert an html file to a pdf file. When I run the program from NetBeans everything works fine. I sign my jar and run the Applet in a browser and then I get this error:
Errorjava.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.windir")
For the purpose of this post I have made a simple JApplet code which has the same problem:
public class RunApplet extends JApplet {
@Override
public void init() {
this.add(new JLabel("This is a labe"));
File f = new File("C:/ReportGen/data.html");
File pdf = new File("C:/ReportGen/data.pdf");
try {
pdf.createNewFile();
Document pdfDocument = new Document();
PdfWriter writer = PdfWriter.getInstance(pdfDocument, new FileOutputStream(pdf));
pdfDocument.open();
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
FontFactoryImp imp = new FontFactoryImp();
imp.getFont("Arial");
FontFactory.setFontImp(imp);
worker.parseXHtml(writer, pdfDocument, new FileInputStream(f));
pdfDocument.close();
writer.close();
this.add(new JLabel(f.getAbsolutePath()));
} catch (Exception ex) {
this.add(new JTextField("Error"+ex));
}
}
}
The HTML file is created and is fine, but when I create the PDF file I get the exception. The PDF file is actually created, but it is corrupt and I am unable to open it. Thanks in advance for your time.

First, I see this error in your question:
Errorjava.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.windir")
You need to sign your applet for it to access your file system.
Second, I have tried the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
public class main {
public static void main(String[] args) {
File f = new File("C:/tmp/data.htm");
File pdf = new File("C:/tmp/data.pdf");
Document pdfDocument = null;
PdfWriter pdfWriter = null;
try {
pdfDocument = new Document();
pdfWriter = PdfWriter.getInstance(pdfDocument, new FileOutputStream(pdf));
pdfDocument.open();
XMLWorkerHelper.getInstance().parseXHtml(pdfWriter, pdfDocument,
new FileInputStream(f));
pdfDocument.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
If the file data.htm (the original data) is XHTML, it works fine. But if data.htm is not well-formed XML, I get this error:
com.itextpdf.tool.xml.exceptions.RuntimeWorkerException: Invalid nested tag head found, expected closing tag meta.
at com.itextpdf.tool.xml.XMLWorker.endElement(XMLWorker.java:134)
at com.itextpdf.tool.xml.parser.XMLParser.endElement(XMLParser.java:395)
at com.itextpdf.tool.xml.parser.state.ClosingTagState.process(ClosingTagState.java:70)
at com.itextpdf.tool.xml.parser.XMLParser.parseWithReader(XMLParser.java:235)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:213)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:174)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:220)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:185)
at main.main(main.java:44)
Can you try with your data and with this example? The difference is that my example isn't an applet; it is a standalone Java program.
Regards
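The stack trace above comes from an unclosed meta tag, which is legal in HTML but not in XML. A cheap pre-check with the JDK's SAX parser can tell whether a file is well-formed before it is handed to parseXHtml. This is only a sketch: XhtmlCheck and isWellFormed are hypothetical names, and the parser feature used to skip external DTD downloads is specific to the JDK's built-in parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XhtmlCheck {

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Assumption: the JDK's built-in parser; this keeps it from downloading an external DTD.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            factory.newSAXParser().parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset='utf-8'></head><body/></html>";
        String xhtml = "<html><head><meta charset='utf-8'/></head><body/></html>";
        System.out.println("unclosed meta: " + isWellFormed(html));
        System.out.println("closed meta:   " + isWellFormed(xhtml));
    }
}
```

Input that fails this check would have to be cleaned up into XHTML first.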

Save content of HTML in local storage

I want to fetch an XML file from links like
http://api.worldbank.org/countries/GBR/indicators/NY.GDP.MKTP.KD.ZG?date=2004:2012
It returns an XML file. I don't know how to save this file in my folder named "temp" using Java or JavaScript. I don't want to display the result of that link to the user; I'm generating such links dynamically.
Please help!

I recommend using an HTML parser library like jsoup in this situation. Please have a look at the steps below for a better understanding:
1. Download jsoup core library (jsoup-1.6.1.jar) from http://jsoup.org/download
2. Add the jsoup-1.6.1.jar file to your classpath.
3. Try the below code to save the xml file from the URL.
package com.overflow.stack;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
/**
*
* @author sarath_sivan
*/
public class XmlExtractor {
public static StringBuilder fetchXmlContent(String url) throws IOException {
StringBuilder xmlContent = new StringBuilder();
Document document = Jsoup.connect(url).get();
xmlContent.append(document.body().html());
return xmlContent;
}
public static void saveXmlFile(StringBuilder xmlContent, String saveLocation) throws IOException {
FileWriter fileWriter = new FileWriter(saveLocation);
BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
bufferedWriter.write(xmlContent.toString());
bufferedWriter.close();
System.out.println("Downloading completed successfully..!");
}
public static void downloadXml() throws IOException {
String url = "http://api.worldbank.org/countries/GBR/indicators/NY.GDP.MKTP.KD.ZG?date=2004:2012";
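// java.io.tmpdir may not end with a path separator, so let File join the two parts.
String saveLocation = new java.io.File(System.getProperty("java.io.tmpdir"), "sarath.xml").getPath();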
XmlExtractor.saveXmlFile(XmlExtractor.fetchXmlContent(url), saveLocation);
}
public static void main(String[] args) throws IOException {
XmlExtractor.downloadXml();
}
}
4. Once the above code is executed successfully, a file named "sarath.xml" should be there in your temp folder.
Thank you!

Well, your response body is XML, not HTML. Just retrieve it using Apache HttpClient and pump the InputStream you read into a FileOutputStream. What was the problem? Do you want to save the parsed content in a formatted form?
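A minimal sketch of that "pump the stream into a file" idea follows. Note the swap: it uses the JDK's own java.net.URL and Files.copy rather than Apache HttpClient, so no extra jar is needed. XmlDownloader, download and the file name indicator.xml are hypothetical names picked for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class XmlDownloader {

    /** Copies the raw bytes behind the URL to the target file and returns the byte count. */
    static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create(
                "http://api.worldbank.org/countries/GBR/indicators/NY.GDP.MKTP.KD.ZG?date=2004:2012").toURL();
        // Assumption: the file goes to the JVM's temp directory under an arbitrary name.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "indicator.xml");
        System.out.println(download(url, target) + " bytes saved to " + target);
    }
}
```

Because the bytes are copied untouched, the saved file keeps the XML declaration and encoding exactly as the server sent them.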

public String execute() {
try {
String url = "http://api.worldbank.org/countries/GBR/indicators/NY.GDP.MKTP.KD.ZG?date=2004:2012";
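// java.io.tmpdir may not end with a path separator, so let File join the two parts.
String saveLocation = new java.io.File(System.getProperty("java.io.tmpdir"), "sarath.xml").getPath();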
XmlExtractor.saveXmlFile(XmlExtractor.fetchXmlContent(url), saveLocation);
} catch (Exception e) {
e.printStackTrace();
addActionError(e.getMessage());
}
return SUCCESS;
}
