How can I convert a Word document to PDF when the document contains various elements, such as tables? When I try to use iText, the converted PDF looks different from the original document. Is there an open source API or library I can use, rather than calling out to an executable?
This is quite a hard task, and even harder if you want perfect results (impossible without using Word). As such, I believe the number of open source, pure Java APIs that do it all for you is zero (Update: I am wrong, see below).
Your basic options are as follows:
Script MS Office via JNI, a C# web service, etc. (the only option for 100% perfect results)
Script Open Office using the available APIs (90+% perfect)
Use Apache POI & iText (very large job, will never be perfect).
Update - 2016-02-11
Here is a cut down copy of my blog post on this subject which outlines existing products that support Word-to-PDF in Java.
Converting Microsoft Office (Word, Excel) documents to PDFs in Java
Products that I know of that can render Office documents:
yeokm1/docs-to-pdf-converter
Irregularly maintained, Pure Java, Open Source
Ties together a number of libraries to perform the conversion.
xdocreport
Actively developed, Pure Java, Open Source
It's a Java API to merge an XML document created with MS Office (docx), OpenOffice (odt), or LibreOffice (odt) with a Java model to generate a report and, if needed, convert it to another format (PDF, XHTML...).
Snowbound Imaging SDK
Closed Source, Pure Java
Snowbound appears to be a 100% Java solution and costs over $2,500. It contains samples describing how to convert documents in the evaluation download.
OpenOffice API
Open Source, Not Pure Java - Requires Open Office installed
OpenOffice is a native Office suite which supports a Java API. This supports reading Office documents and writing PDF documents. The SDK contains an example in document conversion (examples/java/DocumentHandling/DocumentConverter.java). To write PDFs you need to pass the "writer_pdf_Export" writer rather than the "MS Word 97" one.
Or you can use the wrapper API JODConverter.
JDocToPdf - Dead as of 2016-02-11
Uses Apache POI to read the Word document and iText to write the PDF. Completely free, 100% Java but has some limitations.
You can use JODConverter for this purpose. It can convert documents between different office formats, such as:
Microsoft Office to OpenDocument, and vice versa
Any format to PDF
It supports many more conversions as well.
It can also convert MS Office 2007 documents, in almost all of their formats, to PDF.
More details about it can be found here:
http://www.artofsolving.com/opensource/jodconverter
Docx4j is open source and, in my experience, the best API for converting docx to PDF without alignment or font issues.
Maven Dependencies:
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-JAXB-Internal</artifactId>
<version>8.0.0</version>
</dependency>
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
<version>8.0.0</version>
</dependency>
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-JAXB-MOXy</artifactId>
<version>8.0.0</version>
</dependency>
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-export-fo</artifactId>
<version>8.0.0</version>
</dependency>
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class DocToPDF {
    public static void main(String[] args) {
        String inputFilePath = "D:\\Workspace\\New\\Sample.docx";
        String outputFilePath = "D:\\Workspace\\New\\Sample.pdf";
        // try-with-resources closes both streams even if the conversion fails
        try (InputStream in = new FileInputStream(inputFilePath);
             OutputStream os = new FileOutputStream(outputFilePath)) {
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(in);
            Docx4J.toPDF(wordMLPackage, os);
            os.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Check out docs-to-pdf-converter on GitHub. It's a lightweight solution designed specifically for converting documents to PDF.
Why?
I wanted a simple program that can convert Microsoft Office documents
to PDF but without dependencies like LibreOffice or expensive
proprietary solutions. Seeing as how code and libraries to convert
each individual format is scattered around the web, I decided to
combine all those solutions into one single program. Along the way, I
decided to add ODT support as well since I encountered the code too.
It's already 2019, and I can't believe there is still no easy, convenient way to convert the most popular document format, Microsoft Word, to Adobe PDF in the Java world.
I tried almost every method mentioned in the answers above, and I found that the best and only way to satisfy my requirements is to use OpenOffice or LibreOffice. I don't know the exact difference between them, but both seem to provide the soffice command line.
My requirements are:
It must run on Linux, more specifically CentOS, not on Windows, so we cannot install Microsoft Office on it.
It must support Chinese characters, so ISO-8859-1 character encoding is not an option; it must support Unicode.
The first thing that came to mind was docs-to-pdf-converter, but it lacks maintenance: the last update was 4 years ago, and I will not use an unmaintained solution. Xdocreport seems a promising choice, but it can only convert docx, not the binary doc format, which is mandatory for me. Using Java to call the OpenOffice API seems good, but it is too complicated for such a simple requirement.
Finally I found the best solution: use OpenOffice command line to finish the job:
Runtime.getRuntime().exec("soffice --convert-to pdf --outdir . /path/some.doc");
I always believe the shortest code is the best code (of course it should be understandable), that's it.
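The one-liner above neither waits for the conversion to finish nor checks whether it succeeded. Here is a minimal sketch of the same call through ProcessBuilder, assuming soffice is on the PATH and accepts the flags as LibreOffice spells them. The names SofficeConvert, buildCommand and convert are made up for illustration, and --headless is added on the assumption that no GUI is wanted:

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class SofficeConvert {

    // Builds the argument list. Passing arguments separately avoids the
    // whitespace splitting that a single command string is subject to.
    static List<String> buildCommand(File input, File outDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "pdf",
                "--outdir", outDir.getPath(),
                input.getPath());
    }

    // Runs the conversion and blocks until soffice exits.
    static void convert(File input, File outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outDir))
                .inheritIO() // forward soffice's own messages to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("soffice exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java SofficeConvert /path/some.doc
        convert(new File(args[0]), new File("."));
    }
}
```

Waiting for the exit code matters because the PDF may not exist yet when exec() returns.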
You can use the Cloudmersive native Java library. It is free for up to 50,000 conversions/month and, in my experience, gives much higher fidelity than iText or Apache POI-based methods. The documents actually look the same as they do in Microsoft Word, which for me is the key. Incidentally, it can also convert XLSX, PPTX, and the legacy DOC, XLS and PPT formats to PDF.
Here is what the code looks like, first add your imports:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.ConvertDocumentApi;
Then convert a file:
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/input.docx"); // File to perform the operation on.
try {
    byte[] result = apiInstance.convertDocumentDocxToPdf(inputFile);
    // The result holds the PDF bytes; write them to a file (placeholder path)
    // instead of printing the array reference.
    Files.write(Paths.get("/path/to/output.pdf"), result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentDocxToPdf");
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
You can get a document conversion API key for free from the portal.
I agree with the posters listing OpenOffice as a high-fidelity import/export facility for Word/PDF documents with a Java API; it also works across platforms. OpenOffice import/export filters are pretty powerful and preserve most formatting during conversion to various formats, including PDF. Docmosis and JODReports add value by making life easier than learning the OpenOffice API directly, which can be challenging because of the style of the UNO API and its crash-related bugs.
Using JACOB to call Office Word is a 100% perfect solution, but it only works on the Windows platform because it needs Office Word installed.
Download the JACOB archive (the latest version is 1.19);
Add jacob.jar to your project classpath;
Add jacob-1.19-x32.dll or jacob-1.19-x64.dll (depending on whether your JDK is 32-bit or 64-bit) to ...\Java\jdk1.x.x_xxx\jre\bin;
Use the JACOB API to call Office Word to convert doc/docx to pdf.
import java.io.File;
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.ComThread;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;
// "logger" is assumed to be a logger field of the enclosing class
public void convertDocx2pdf(String docxFilePath) {
    File docxFile = new File(docxFilePath);
    // Swap the extension for ".pdf"; lastIndexOf('.') handles both .doc and .docx
    String pdfFile = docxFilePath.substring(0, docxFilePath.lastIndexOf('.')) + ".pdf";
    if (docxFile.exists()) {
        if (!docxFile.isDirectory()) {
            ActiveXComponent app = null;
            long start = System.currentTimeMillis();
            try {
                ComThread.InitMTA(true);
                app = new ActiveXComponent("Word.Application");
                Dispatch documents = app.getProperty("Documents").toDispatch();
                Dispatch document = Dispatch.call(documents, "Open", docxFilePath, false, true).toDispatch();
                File target = new File(pdfFile);
                if (target.exists()) {
                    target.delete();
                }
                // 17 is Word's wdFormatPDF save format
                Dispatch.call(document, "SaveAs", pdfFile, 17);
                Dispatch.call(document, "Close", false);
                long end = System.currentTimeMillis();
                logger.info("============Convert Finished:" + (end - start) + "ms");
            } catch (Exception e) {
                logger.error(e.getLocalizedMessage(), e);
                // Keep the original exception as the cause
                throw new RuntimeException("pdf convert failed.", e);
            } finally {
                if (app != null) {
                    app.invoke("Quit", new Variant[] {});
                }
                ComThread.Release();
            }
        }
    }
}
unoconv is a Python tool that works on UNIX.
I use Java to invoke it through the shell on UNIX, and it works perfectly for me. My source code: UnoconvTool.java. Both JODConverter and unoconv are said to use OpenOffice/LibreOffice.
docx4j/docxreport, POI, and PDFBox are good, but they are missing some formats in conversion.
Related
I used iText 5 to create a nice-looking report that includes some tables and graphs. I wonder if iText lets you convert PDF to HTML, and if so, how can one do it?
I believe previous versions of iText allowed it, but in iText 5 I was not able to find a way to do this.
No. iText has never converted PDF to HTML, only the reverse.
Have you had a look at http://www.jpedal.org/pdf_to_html_conversion.php - there is currently a free beta.
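This is possible with Apache Tika, which uses Apache PDFBox under the hood; the code below calls the PDFBox classes directly:
public String pdfToHtml(InputStream content) throws IOException {
    PDDocument pddDocument = PDDocument.load(content);
    try {
        // Constructor as in PDFBox 1.x; newer PDFBox versions may differ
        PDFText2HTML stripper = new PDFText2HTML("UTF-8");
        return stripper.getText(pddDocument);
    } finally {
        pddDocument.close();
    }
}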
I need to convert a docx to a PDF. The following code uses the xdocreport library and works pretty well.
The problem is with some specific docx files that contain drawings: they are not visible in the final PDF. I've tested the conversion with the live demo available from GitHub and I have the same problem.
So I'm wondering: is this possible, or do I need to use another library? Which one? (docx4j doesn't seem to work either.)
final XWPFDocument document = new XWPFDocument(inputStream);
final OutputStream outPdf = new FileOutputStream("myFile.pdf");
PdfConverter.getInstance().convert(document, outPdf, optionsPdf);
outPdf.close();
XDocReport doesn't support drawings. It could support them, since docx->pdf is based on iText, which supports drawing, but it's a big task (any contributions are welcome!).
You can see the limitations of the XDocReport docx->pdf converter here.
I'm working on removing Protected View from a series of PDFs, and am trying to use the iText library within VBA. My main issue at this point is that I have no idea what method to use, and the iText documentation is pretty dense.
I'm also feeling my way forward on calling the iText library from VBA, so any help on syntax to do this is also appreciated, though I'm sure I could get there myself if I knew which method to call...
Currently, I've got:
Dim program As WshExec
program = Shell("Java.exe -jar " & mypath & "\itext-5.5.6\itextpdf-5.5.6.jar")
'Debug.print program returns a value here, so this line works.
'I'm thinking I need something like:
'Set program = RunProgram("Java.exe -jar " & mypath & "\itext-5.5.6\itextpdf-5.5.6.jar", & _
methodName, param1)
I've been using the following questions to get me this far...
Calling Java library (JAR) from VBA/VBScript/Visual Basic Classic
Microsoft Excel Macro to run Java program
Desired functionality is to have an unprotected PDF sitting in a folder on mypath.
The jar you are trying to run is not an executable jar. iText is a library that can be used in a Java application by adding itextpdf-5.5.6.jar to the CLASSPATH. If you don't write any Java code, then the jar won't do a thing, hence your Shell() and your RunProgram() methods are useless: there is nothing to execute.
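To check whether any given jar is executable, you can look for a Main-Class entry in its manifest, which is what java -jar needs in order to start anything. A minimal sketch using only the JDK; the class and method names (JarMainClassCheck, mainClassOf) are invented for this example:

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

class JarMainClassCheck {

    // Returns the Main-Class manifest entry, or null if the jar declares none
    // (in which case "java -jar" has nothing to start).
    static String mainClassOf(File jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java JarMainClassCheck some-library.jar
        String mainClass = mainClassOf(new File(args[0]));
        System.out.println(mainClass == null
                ? "Not executable: no Main-Class in the manifest"
                : "Executable, entry point: " + mainClass);
    }
}
```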
Moreover: from your question, it is far from certain that you have a Java environment on your machine. You are working in a VBA environment, which makes one wonder why you'd use the Java version of iText. Have you tried using iTextSharp, which is the .NET version of iText (written in C#)?
Take a look at this tutorial: Programmatically Complete PDF Form Fields using Visual Basic and the iTextSharp DLL
In this tutorial, we take an existing PDF, we fill out a form, and we get another PDF based on the original PDF, but with extra data. You can easily adapt the code so that it takes an existing PDF, doesn't add anything to the PDF, but saves the original PDF without its passwords, as is explained in my answer to How can I decrypt a PDF document with the owner password?
If you combine what you can learn from my Java code:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader.unethicalreading = true;
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
reader.close();
}
with what you learn from the form filling tutorial, you get something like this (provided that you use the iTextSharp DLL instead of the iText jar):
Dim pdfTemplate As String = "c:\Temp\PDF\encrypted.pdf"
Dim newFile As String = "c:\Temp\PDF\decrypted.pdf"
PdfReader.unethicalreading = true
Dim pdfReader As New PdfReader(pdfTemplate)
Dim pdfStamper As New PdfStamper(pdfReader, New FileStream(
newFile, FileMode.Create))
pdfStamper.Close()
pdfReader.Close()
IMPORTANT: this will only remove the password if the file is only protected with an owner password (which is what I assume when you talk about protected view). If the file is protected in any other way, you'll have to clarify. Also note that the parameter unethicalreading is not without meaning: make sure that you're not doing anything unethical by removing the protection.
I was having to manipulate protected PDF files using iText.
I just put in my pom.xml the following dependency and nothing more.
<!-- https://mvnrepository.com/artifact/org.bouncycastle/bcprov-jdk15on -->
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcprov-jdk15on</artifactId>
<version>1.59</version>
</dependency>
I am trying to generate a PDF document from a *.doc document.
So far, thanks to Stack Overflow, I have succeeded in generating it, but with some problems.
My sample code below generates the PDF without formatting or images, just the text.
The document includes blank spaces and images which are not included in the PDF.
Here is the code:
in = new FileInputStream(sourceFile.getAbsolutePath());
out = new FileOutputStream(outputFile);
WordExtractor wd = new WordExtractor(in);
String text = wd.getText();
Document pdf= new Document(PageSize.A4);
PdfWriter.getInstance(pdf, out);
pdf.open();
pdf.add(new Paragraph(text));
docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.
There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.
If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.
Use it like so:
wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
// Set up font mapper
Mapper fontMapper = new IdentityPlusMapper();
wordMLPackage.setFontMapper(fontMapper);
// Example of mapping missing font Algerian to installed font Comic Sans MS
PhysicalFont font
= PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
fontMapper.getFontMappings().put("Algerian", font);
org.docx4j.convert.out.pdf.PdfConversion c
= new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
// = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);
OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");
c.output(os);
Update July 2016
As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com
If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.
Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.
The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/
WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.
What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.
One option may be to find some code that turns XHTML into a PDF. Then, use Apache Tika to turn your word document into XHTML (it uses POI under the hood, and handles all the formatting stuff for you), and from the XHTML on to PDF.
Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing word files. It's a really great example of how to get at the images, the formatting, the styles etc.
I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).
It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
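The XSLT step described above can be run with the JDK's built-in javax.xml.transform API. A minimal sketch, assuming a toy input and a toy stylesheet that are stand-ins only (not real WordML and not a real WordML-to-FO stylesheet). XsltStep and transform are names made up for this example, and the resulting XSL-FO would still have to be handed to FOP for rendering:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltStep {

    // Applies an XSLT stylesheet to an XML document and returns the result as text.
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Toy input and stylesheet, NOT real WordML or a real WordML-to-FO stylesheet.
        String xml = "<doc><p>Hello</p></doc>";
        String xslt =
              "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
            + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
            + "<xsl:template match='p'>"
            + "<fo:block><xsl:value-of select='.'/></fo:block>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        // Each <p> of the input becomes an <fo:block> in the output.
        System.out.println(transform(xml, xslt));
    }
}
```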
Good luck with your project!
Wim
Use OpenOffice/LibreOffice and JODConverter
This also mostly works for .doc to .docx. There are problems with graphics that I have not yet worked out, though.
private static void transformDocXToPDFUsingJOD(File in, File out)
{
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
    converter.convert(in, out, pdf);
}
private static OfficeManager officeManager;
@BeforeClass
public static void setupStatic() throws IOException {
    /*officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("C:/Program Files/LibreOffice 3.6")
        .buildOfficeManager();
    */
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
    officeManager.start();
}
@AfterClass
public static void shutdownStatic() throws IOException {
    officeManager.stop();
}
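The external office manager above connects to an office process that must already be listening on port 8100. When the connection fails, a quick check of whether anything accepts connections on that port can help narrow down the cause. A minimal sketch using only the JDK; OfficePortCheck and isListening are hypothetical names, and the check only shows that some process is listening, not that it is an office server:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class OfficePortCheck {

    // True if something accepts TCP connections on host:port within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 8100 is the port number passed to setPortNumber(...) above.
        System.out.println(isListening("localhost", 8100, 1000)
                ? "Something is listening on port 8100"
                : "Nothing is listening on port 8100; start the office server first");
    }
}
```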
You need to be running LibreOffice as a server to make this work.
From the command line you can do this using:
"C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore
Another option I came across recently is using the OpenOffice (or LibreOffice) API (see here). I have not been able to get into this but it should be able to open documents in various formats and output them in a pdf format. If you look into this, let me know how it worked!
I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.
If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is
for Text Extraction applications such
as web spiders, index builders, and
content management systems.
P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
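As a small illustration of the two container formats named above: a binary .doc is an OLE2 compound file and a .docx is a ZIP archive, and each starts with a fixed signature. Here is a minimal sketch that tells them apart by their first bytes. The class and method names (WordFormatSniffer, sniff) are invented for this example, and it only identifies the container, not whether the content really is a Word document:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class WordFormatSniffer {

    enum Kind { OLE2_BINARY, ZIP_BASED, UNKNOWN }

    // Signature at the start of an OLE2 compound file (.doc, .xls, .ppt).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Signature at the start of a ZIP archive ("PK" 3 4), the container of .docx.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2)) return Kind.OLE2_BINARY;
        if (startsWith(header, ZIP)) return Kind.ZIP_BASED;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java WordFormatSniffer somedocfile.doc
        byte[] header = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // A short read leaves the remaining bytes zero, which matches neither signature.
            in.read(header);
        }
        System.out.println(sniff(header));
    }
}
```

Checking the signature instead of the file name also covers files that carry the wrong extension.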
Use the Apache Tika command line utility. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf ...)
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programmatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
You can use Apache POI too. It has a tool to extract text from doc/docx files (Text Extraction). If you want to extract only the text, you can use the code below. If you want to extract rich text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // if doc
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
You should consider using this library: Apache POI.
Excerpt from the website
In short, you can read and write MS
Excel files using Java. In addition,
you can read and write MS Word and MS
PowerPoint files using Java. Apache
POI is your Java Excel solution (for
Excel 97-2008). We have a complete API
for porting other OOXML and OLE2
formats and welcome others to
participate.
Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.