Convert word to pdf java


I'm trying to convert a Word document to PDF. My code is:
public static void main(String[] args) {
try {
XWPFDocument document = new XWPFDocument();
document.createStyles();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun title = paragraph.createRun();
title.setText("gLETS GO");
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(new File("C:/Users/pepe/Desktop/DocxToPdf1.pdf"));
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Done");
} catch (Exception e) {
e.printStackTrace();
}
}
I'm getting this error:
fr.opensagres.poi.xwpf.converter.core.XWPFConverterException: org.apache.xmlbeans.XmlException: error: Unexpected end of file after null
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:71)
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:39)
Caused by: org.apache.xmlbeans.XmlException: error: Unexpected end of file
I have tried other solutions, but none of them work. This is a plain Java project. Can someone help me fix this, or suggest another way to do the conversion?

This is probably a duplicate of "Trying to make simple PDF document with Apache poi". But let's have a complete example again that shows how to create a new XWPFDocument from scratch using apache poi 4.1.2 (the latest version at the time of writing), which can then be converted to PDF using PdfConverter of fr.opensagres.poi.xwpf.converter version 2.0.2 and iText.
As said there, the default *.docx documents created by apache poi lack some content which PdfConverter needs.
There must be a styles document, even if it is empty.
And there must be section properties for the page, with at least the page size set. To fulfill this we must add some additional code to our program. Unfortunately this then needs the full jar of all of the schemas, ooxml-schemas-1.4.jar, as mentioned in Faq-N10025.
And because we need to change the underlying low-level objects, the document must be written so that the underlying objects are committed. Otherwise the XWPFDocument which we hand over to the PdfConverter will be incomplete.
Minimal complete working example:
import java.io.*;
import java.math.BigInteger;
//needed jars: fr.opensagres.poi.xwpf.converter.core-2.0.2.jar,
// fr.opensagres.poi.xwpf.converter.pdf-2.0.2.jar,
// fr.opensagres.xdocreport.itext.extension-2.0.2.jar,
// itext-4.2.1.jar
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
//needed jars: apache poi and its dependencies
// and additionally: ooxml-schemas-1.4.jar
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.util.Units;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
public class XWPFToPDFConverterSampleMin {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
// there must be a styles document, even if it is empty
XWPFStyles styles = document.createStyles();
// there must be section properties for the page having at least the page size set
CTSectPr sectPr = document.getDocument().getBody().addNewSectPr();
CTPageSz pageSz = sectPr.addNewPgSz();
pageSz.setW(BigInteger.valueOf(12240)); //12240 Twips = 12240/20 = 612 pt = 612/72 = 8.5"
pageSz.setH(BigInteger.valueOf(15840)); //15840 Twips = 15840/20 = 792 pt = 792/72 = 11"
// filling the body
XWPFParagraph paragraph = document.createParagraph();
XWPFRun title = paragraph.createRun();
title.setText("gLETS GO");
//document must be written so underlying objects will be committed
ByteArrayOutputStream out = new ByteArrayOutputStream();
document.write(out);
document.close();
document = new XWPFDocument(new ByteArrayInputStream(out.toByteArray()));
PdfOptions options = PdfOptions.create();
PdfConverter converter = (PdfConverter)PdfConverter.getInstance();
converter.convert(document, new FileOutputStream("XWPFToPDFConverterSampleMin.pdf"), options);
document.close();
}
}
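The comments on the page size lines above do the twips arithmetic by hand. As a small sketch of the same conversion (the class and helper names here are made up for illustration and are not part of apache poi), the values for other paper sizes could be computed like this:

```java
import java.math.BigInteger;

public class PageSizeTwips {
    // 1 inch = 72 points and 1 point = 20 twips, so 1 inch = 1440 twips.
    static final int TWIPS_PER_INCH = 1440;

    // Hypothetical helper: inches to twips, rounded to the nearest whole twip.
    static BigInteger inchesToTwips(double inches) {
        return BigInteger.valueOf(Math.round(inches * TWIPS_PER_INCH));
    }

    // Hypothetical helper: millimetres to twips (25.4 mm per inch).
    static BigInteger mmToTwips(double mm) {
        return inchesToTwips(mm / 25.4);
    }

    public static void main(String[] args) {
        System.out.println(inchesToTwips(8.5)); // US Letter width:  12240
        System.out.println(inchesToTwips(11));  // US Letter height: 15840
        System.out.println(mmToTwips(210));     // A4 width:  11906
        System.out.println(mmToTwips(297));     // A4 height: 16838
    }
}
```

The resulting BigInteger values are what setW and setH receive in the example above.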

I would not suggest using apache poi for this, since its library for converting Word to PDF has been discontinued. As of today I don't think there is any open source library that does the conversion on its own (they need external dependencies; some require MS Word to be installed, etc.). The best way I can think of (it only works if you deploy the project on a Linux machine) is to install LibreOffice (open source) on that machine and run this:
String command = "libreoffice --headless --convert-to pdf " + inputPath + " --outdir " + outputPath;
try {
Runtime.getRuntime().exec(command);
} catch (IOException e) {
e.printStackTrace();
}
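One caveat with the snippet above: a single concatenated command string is split on whitespace, so paths containing spaces break, and the code does not wait for LibreOffice to finish. A minimal sketch of the same idea with ProcessBuilder follows; it assumes the libreoffice binary is on the PATH, and the class name, method names and timeout parameter are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class LibreOfficeConvert {
    // Builds the argument list. Each path is its own argument, so spaces need no quoting.
    static List<String> buildCommand(Path input, Path outDir) {
        return Arrays.asList("libreoffice", "--headless", "--convert-to", "pdf",
                input.toString(), "--outdir", outDir.toString());
    }

    // Starts the conversion and waits for it; returns true only on exit code 0.
    static boolean convert(Path input, Path outDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outDir))
                .inheritIO() // show LibreOffice's own messages on our console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // give up if the conversion hangs
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: LibreOfficeConvert <input.docx> <outputDir>");
            return;
        }
        boolean ok = convert(Paths.get(args[0]), Paths.get(args[1]), 120);
        System.out.println(ok ? "Done" : "Conversion failed");
    }
}
```

The PDF is written into the output directory under the input file's base name, so the caller only has to check the return value and then pick the file up from there.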

Related

Apache POI Mirroring Words in Arabic Language

I'm developing an Arabic OCR application in Java which extracts Arabic text from images and then saves the text into a Microsoft Word file. For this purpose I use the Apache POI library.
My problem is that when I extract some text, the order of the words is fine, but when I save it in a Word file the order of the words is messed up and looks mirrored.
[Screenshots: the extracted text in the correct order, and the same text mirrored after saving it as a Word file.]
Here is the code for saving the Word file:
public class SavingStringAsWordDoc {
File f=theGUI.toBeSavedWord;
public void saveAsWorddd (){
String st=TesseractPerformer.toBeShown;
try(FileOutputStream fout=new FileOutputStream(f);XWPFDocument docfile=new XWPFDocument()){
XWPFParagraph paraTit=docfile.createParagraph();
paraTit.setAlignment(ParagraphAlignment.LEFT);
XWPFRun paraTitRun=paraTit.createRun();
paraTitRun.setBold(true);
paraTitRun.setFontSize(15);
paraTit.setAlignment(ParagraphAlignment.RIGHT);
docfile.createParagraph().createRun().setText(st); //content to be written
docfile.write(fout); //adding to output stream
} catch(IOException e){
System.out.println("IO ERROR:"+e);
}
}
I noticed one thing which might help in understanding the problem:
if I copy the messed-up text in the Word file and then paste it using the (Keep Text Only) paste option, the order of the paragraph is fixed.

This needs bidirectional text direction support (bidi), which the XWPF API of apache poi does not yet provide by default. But the underlying object org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr supports it. So we must get this underlying object from the XWPFParagraph and then set Bidi on.
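If the OCR output can be either Arabic or Latin text, the paragraph flag only needs to be set for text that actually contains right-to-left characters. A minimal sketch of such a check, using the JDK's java.text.Bidi (the class and method names BidiCheck and needsBidi are hypothetical; this is not part of apache poi):

```java
import java.text.Bidi;

public class BidiCheck {
    // Returns true if the text contains characters that need bidirectional
    // handling (for example Arabic or Hebrew letters).
    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        // "\u0647\u0630\u0627" is the first Arabic word of the sample text further down,
        // written with escapes so the source file encoding does not matter.
        System.out.println(needsBidi("Hello world"));        // false
        System.out.println(needsBidi("\u0647\u0630\u0627")); // true
    }
}
```

A caller could run this on each piece of text and set Bidi on the paragraph only when it returns true.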
Example:
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWord {
public static void main(String[] args) throws Exception {
String content = Files.readString(new File("ArabicTextFile.txt").toPath(), StandardCharsets.UTF_16);
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
// set bidirectional text support on
CTP ctp = paragraph.getCTP();
CTPPr ctppr = ctp.getPPr();
if (ctppr == null) ctppr = ctp.addNewPPr();
ctppr.addNewBidi().setVal(STOnOff.ON);
XWPFRun run=paragraph.createRun();
run.setBold(true);
run.setFontSize(22);
run.setText(content);
FileOutputStream out = new FileOutputStream("CreateWord.docx");
document.write(out);
out.close();
document.close();
}
}
My ArabicTextFile.txt contains the text
هذا هو النص باللغة العربية لاختبار النص باللغة العربية
in UTF-16 encoding (Unicode).
Result in Word:

how to judge if the file is doc or docx in POI

The title may be a little confusing. The simplest method is to judge by the file extension, like this:
// is represents the InputStream
if (filePath.endsWith("doc")) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(filePath.endsWith("docx")) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
This works in most cases. But I have found that certain files whose extension is doc are actually docx files: if you open one with WinRAR, you will find XML files inside. As is well known, a docx file is a zip archive consisting of XML files.
I believe this problem cannot be rare, but I have not found any information about it. Obviously, deciding by extension whether to read a file as doc or docx is not reliable.
In my case, I have to read a lot of files, and I will even read doc or docx files inside a compressed archive (zip, 7z or even rar). Hence, I have to read the content from an InputStream instead of a File or something else. So the approach from "how to know whether a file is .docx or .doc format from Apache POI" is not suitable for my case, because it relies on ZipInputStream.
What is the best way to judge whether a file is doc or docx? I want a solution that reads the content from a file which may be either, not one that merely tells me which of the two it is. Apparently ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to decide whether the file is doc or docx by catching an exception?

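Using the current stable apache poi version 3.17 you may use FileMagic. But internally this will, of course, also have to look into the files. Note that FileMagic.valueOf(InputStream) needs a stream that supports mark/reset, which is why the example wraps each FileInputStream in a BufferedInputStream.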
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class ReadWord {
static String read(InputStream is) throws Exception {
System.out.println(FileMagic.valueOf(is));
String text = "";
if (FileMagic.valueOf(is) == FileMagic.OLE2) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(FileMagic.valueOf(is) == FileMagic.OOXML) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
return text;
}
public static void main(String[] args) throws Exception {
InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); //really a binary OLE2 Word file
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); //a OOXML Word file named *.doc
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); //really a OOXML Word file
System.out.println(read(is));
is.close();
}
}
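To illustrate what "having a look into the files" amounts to, here is a plain JDK sketch that peeks at the first bytes of a stream and then rewinds it. This is not POI's implementation; the class name and the returned labels are made up for the example. It only tells an OLE2 container from a ZIP container, so a ZIP result means "possibly docx", not "certainly docx":

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class MagicSniffer {
    // First eight bytes of an OLE2 compound file (binary .doc, .xls, .ppt).
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Local file header signature of a ZIP archive, the container used by .docx.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Reads up to eight bytes, rewinds, and reports which signature matched.
    static String sniff(InputStream is) {
        if (!is.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            is.mark(OLE2.length);
            byte[] head = new byte[OLE2.length];
            int total = 0;
            while (total < head.length) {
                int n = is.read(head, total, head.length - total);
                if (n < 0) break; // stream is shorter than eight bytes
                total += n;
            }
            is.reset(); // leave the stream positioned at its start for the real parser
            if (startsWith(head, total, OLE2)) return "OLE2";
            if (startsWith(head, total, ZIP)) return "ZIP";
            return "UNKNOWN";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        InputStream zip = new ByteArrayInputStream(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00});
        InputStream ole = new ByteArrayInputStream(OLE2);
        System.out.println(sniff(zip)); // ZIP
        System.out.println(sniff(ole)); // OLE2
    }
}
```

Because the stream is reset afterwards, the same InputStream (for example an entry read from an archive and wrapped in a BufferedInputStream) can be handed on to WordExtractor or XWPFDocument.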

try {
new ZipFile(new File("/Users/giang/Documents/a.doc"));
System.out.println("this file is .docx");
} catch (ZipException e) {
System.out.println("this file is not .docx");
e.printStackTrace();
}

Apache POI (Java) : Display embedded files on Microsoft Word (.docx)

This is my first time posting here :)
Using apache POI, I want to embed a file inside a .docx and reference it via an icon or a link inside the document.
I think I've managed to embed the file.
My problem: I can't display a reference to the embedded file.
To illustrate my problem:
With the following code, I've managed to embed the file "first.txt" inside myWord.docx at the location "/word/embeddings/first.txt".
I don't know how to reference it.
This is my code:
public void saveToDocx(OutputStream myOutputStream){
Resource r = new ClassPathResource("/myWord.docx") ;
try (FileInputStream fis= new FileInputStream(r.getFile())){
OPCPackage pkg = OPCPackage.open(fis);
XWPFDocument docx = new XWPFDocument(pkg);
fis.close();
// first.txt
final PackagePartName partName = PackagingURIHelper.createPartName("/word/embeddings/first.txt");
final PackagePart pkgPart = pkg.createPart(partName, "application/vnd.openxmlformats-officedocument.oleobject");
final OutputStream partOutputStream = pkgPart.getOutputStream();
partOutputStream.write("test test test".getBytes());
partOutputStream.close();
pkgPart.addRelationship(partName, TargetMode.INTERNAL, "http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject");
//image.jpg
String imageName = "C:/image.jpg";
//add simple picture to my document
XWPFParagraph myParagraph = docx.createParagraph();
XWPFRun run = myParagraph.createRun();
// keep the image stream open until addPicture has read it
try (InputStream imageIS = new FileInputStream(imageName)) {
run.addPicture(imageIS, XWPFDocument.PICTURE_TYPE_JPEG, imageName, Units.toEMU(77.25), Units.toEMU(49.5));
}
docx.write(myOutputStream);
//pkg.save(myOutputStream); there is a difference between this and docx.write ?
}catch(Exception e){
e.printStackTrace();
}
}
Can someone please help me? I've been really stuck (since last Friday). Thanks, guys!
(Please forgive my grammar, I'm not a native speaker :/)

Converting a pdf to word document using java

I've successfully converted JPEG to PDF using Java, but I don't know how to convert PDF to Word using Java. The code for converting JPEG to PDF is given below.
Can anyone tell me how to convert PDF to Word (.doc/.docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
public static void main(String[] args) {
try {
Document convertJpgToPdf = new Document();
PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
"c:\\java\\ConvertImagetoPDF.pdf"));
convertJpgToPdf.open();
Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
convertJpgToPdf.add(convertJpg);
convertJpgToPdf.close();
System.out.println("Successfully Converted JPG to PDF in iText");
} catch (Exception i1) {
i1.printStackTrace();
}
}
}

In fact, you need two libraries. Both libraries are open source. The first one is iText, which is used to extract the text from a PDF file. The second one is POI, which is used to create the Word document.
The code is quite simple:
//Create the word document
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
// Extract the text
String text=strategy.getResultantText();
// Create a new paragraph in the word document, adding the extracted text
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
// Adding a page break
run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: with the extraction strategy used here, you will lose all formatting. But you can fix this by plugging in your own, more complex extraction strategy.

You can use the 7-PDF library.
Have a look at this, it may help:
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: iText has some issues when the given file is a non-RGB image, so try this out!

Although it's far from being a pure Java solution, OpenOffice/LibreOffice allows one to connect to it through a TCP port; it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.

error in converting word document to pdf using iText

Below is the code that I used to convert a Word document to PDF. After compiling and running the code, the PDF file is generated, but the file contains some junk characters along with the Word document content. Please help me understand what I should change to get rid of the junk characters.
The code I used is:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
public class PdfConverter
{
private void createPdf(String inputFile, String outputFile)//, boolean isPictureFile)
{
Document pdfDocument = new Document();
String pdfFilePath = outputFile;
try
{
FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
PdfWriter writer = null;
writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
writer.open();
pdfDocument.open();
/*if (isPictureFile)
{
pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
}
else
{ */
File file = new File(inputFile);
pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils.readFileToString(file)));
//}
pdfDocument.close();
writer.close();
System.out.println("PDF has been generated");
}
catch (Exception exception)
{
System.out.println("Document Exception!" + exception);
}
}
public static void main(String args[])
{
PdfConverter pdfConversion = new PdfConverter();
pdfConversion.createPdf("C:/test.doc", "C:/test.pdf");//, true);
}
}
Thanks for your help.

Just because you name your class PdfConverter doesn't mean you have one. All you do is read the binary content as a String and write it out as one paragraph (and that's what you see). This approach will definitely not be successful. See https://stackoverflow.com/questions/437394 for a similar question.
If you are interested only in the content of your Word document, you might want to give Apache POI (the Java API for Microsoft Documents) a try, to read the document not at the binary level but at a higher level of abstraction. If your Word document has a simple (and I mean a really simple) structure, you might get reasonable results.

To do this, you have to read the doc file correctly and then use the data you read to create the PDF file.
What you are doing right now is reading raw data from the doc file, which yields garbage values because you are not using a proper API to read it, and then storing that garbage in the PDF file. Hence the issue.
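To see why the raw bytes turn into junk, it can help to check whether the input looks like text at all before treating it as a String. The following is only a rough heuristic sketch with made-up names (BinaryCheck, looksBinary), not a replacement for reading the file with a proper API:

```java
import java.nio.charset.StandardCharsets;

public class BinaryCheck {
    // Heuristic: data is treated as binary if it contains control bytes
    // other than tab, line feed and carriage return.
    static boolean looksBinary(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The first eight bytes of a binary .doc (OLE2) file.
        byte[] docHeader = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        byte[] plainText = "plain text\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksBinary(docHeader)); // true
        System.out.println(looksBinary(plainText)); // false
    }
}
```

A .doc file fails this check from its very first bytes, which is why passing its content straight into a Paragraph produces junk characters.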
