Apache POI convert HTML/XHTML to DOC/DOCX

I need to convert HTML to a DOC/DOCX file. The HTML is filled with custom information, and the images and CSS change depending on what is requested.
I'm trying to use Apache POI for this, but I'm getting an error:
`
org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.IllegalStateException: Expecting one Styles document part, but found 0
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.convert(XHTMLConverter.java:72)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:58)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
`
My code is this:
`
// Load the HTML file
//Doc file
String htmlFile = "pathToHtml/file.html";
//String htmlFile = parseHTMLTemplate(disputeLetterDetails, template, fileExtension);
//new File(htmlFile);
//File file = new FileReader(htmlFile);
Path path = Path.of(htmlFile);
OutputStream in = new FileOutputStream(htmlFile, true);
// Create a new XWPFDocument
XWPFDocument document = new XWPFDocument();
// Set up the XHTML options
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("./images/")));
options.setExtractor(new FileImageExtractor(new File("./images/")));
// Convert the HTML to XWPFDocument
XHTMLConverter.getInstance().convert(document, in, options);
// Save the document to a .docx file
FileOutputStream out = new FileOutputStream("pathToHtml/OUT_from_XHTML_TEST.docx");
document.write(out);
out.close();
`
I want to get a DOCX file from an HTML file with the same styles, but the conversion fails with the XWPFConverterException shown above.

Related

unable to read pdf document using selenium web driver

I am writing code to read a PDF file in a Selenium test using a Java PDF library (PDFBox).
This is my code:
URL url = new URL(str);
InputStream is=url.openStream();
BufferedInputStream fileParse=new BufferedInputStream(is);
PDDocument document=null;
document=PDDocument.load(fileParse);
String pdfContent=new PDFTextStripper().getText(document);
But I am getting an error at the line document=PDDocument.load(fileParse):
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2017)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1988)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:269)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1143)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1040)
I need to verify the content of the PDF file.
I'd appreciate any help.
Thanks

You can simply use the code below, which works:
//Loading an existing document
File file = new File("yourPdfFilepath");
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String pdfcontent = pdfStripper.getText(document);
I hope it helps you
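The stack trace in the question fails inside parseHeader, which may mean that the URL did not return PDF bytes at all (for example an empty response or an HTML login page). Below is a minimal sketch, using only the JDK, of checking for the `%PDF-` signature before handing the stream to PDFBox. The class PdfHeaderCheck and its method are hypothetical helpers, not part of PDFBox, and the check is deliberately strict: it only looks at the first five bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of PDFBox): peeks at the start of a stream.
final class PdfHeaderCheck {
    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the stream begins with "%PDF-"; the stream is rewound afterwards.
    static boolean startsWithPdfSignature(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(SIGNATURE.length);
        try {
            byte[] head = in.readNBytes(SIGNATURE.length); // requires Java 11+
            in.reset();
            return Arrays.equals(head, SIGNATURE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream pdfLike = new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        InputStream htmlLike = new ByteArrayInputStream(
                "<html>login</html>".getBytes(StandardCharsets.US_ASCII));
        System.out.println("pdfLike:  " + startsWithPdfSignature(pdfLike));
        System.out.println("htmlLike: " + startsWithPdfSignature(htmlLike));
    }
}
```

In the question's code, the BufferedInputStream named fileParse supports mark/reset, so it could be passed to this check before PDDocument.load(fileParse). If the check fails, the problem is likely in how the URL is fetched rather than in the PDF parsing.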

getting error for file.load (file) method when the path has \

I have a file which I am trying to read the content from.
Code tried:
String userDir = System.getProperty("user.home")+"\\Downloads";
//Loading an existing document
String s=userDir+"\\PDFStatement.pdf";
File file = new File(s);
//File file = new File(s);
System.out.println("file"+file);
PDDocument document = PDDocument.load(file);// this is where I am getting error
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
System.out.println(text);
//Closing the document
document.close();
}
Error :
java.io.IOException: Error: Expected an integer type, actual='statement'
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:531)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1154)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1125)
The same code works when I pass the same file path with / as the separator:
File file = new File("C:/abc/abc/abc/PDFStatement.pdf");
I am using pdfbox 1.8
Thanks.
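Hand-built paths with escaped backslashes are easy to get wrong. Below is a minimal sketch, assuming the file sits in the Downloads folder under user.home as in the question, that lets java.nio.file assemble the path with the platform's separator. The class name StatementPath is hypothetical, and this does not by itself explain the PDFBox parse error.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds the path without hard-coding "\\" or "/".
final class StatementPath {
    static Path inDownloads(String userHome, String fileName) {
        // Paths.get joins the segments with the separator of the current platform.
        return Paths.get(userHome, "Downloads", fileName);
    }

    public static void main(String[] args) {
        Path pdf = inDownloads(System.getProperty("user.home"), "PDFStatement.pdf");
        System.out.println("Resolved path: " + pdf);
        // pdf.toFile() could then be passed to PDDocument.load(File).
    }
}
```

Printing the resolved path makes it easy to compare it with the forward-slash path that is known to work.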

Extracted pdf text is not getting displayed in console

I am trying to extract text from a PDF using Tabula. The code runs without errors, but the extracted text is not displayed in the console. Could someone help?
I have been using PDFBox, and after some research I found that Tabula is newer and wanted to try it.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
TextStripper textStripper = new TextStripper(document,1);
System.out.println(textStripper.getText(document));
output of pdf text

You are not using the page variable. Try the following code.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
for (TextElement textElement : page.getText()) {
    System.out.print(textElement.getText());
}

Convert docx file into PDF with Java

I'm looking for a stable method to convert DOCX files from MS Word into PDF. Until now I have used OpenOffice installed as a listener, but it often hangs. The problem is that we have situations where many users want to convert SXW and DOCX files into PDF at the same time. Is there another possibility? I tried the examples from this site: https://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/ but the output is not good (the converted documents have errors and the layout is noticeably changed).
Here is the source DOCX document:
Here is the document converted with docx4j, with some exception text inside the document. The text in the upper right corner is also missing.
This one is the PDF created with OpenOffice as the DOCX-to-PDF converter. Some text in the upper right corner is missing.
Is there some other option to convert DOCX into PDF with Java?

There are many ways to do this conversion.
One commonly used method is POI and docx4j:
InputStream is = new FileInputStream(new File("your Docx PAth"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
List sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
    wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
}
Mapper fontMapper = new IdentityPlusMapper();
PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS"); // set your desired font
fontMapper.getFontMappings().put("Algerian", font);
wordMLPackage.setFontMapper(fontMapper);
PdfSettings pdfSettings = new PdfSettings();
org.docx4j.convert.out.pdf.PdfConversion conversion =
        new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
// To turn off logger
List<Logger> loggers = Collections.<Logger> list(LogManager.getCurrentLoggers());
loggers.add(LogManager.getRootLogger());
for (Logger logger : loggers) {
    logger.setLevel(Level.OFF);
}
OutputStream out = new FileOutputStream(new File("Your OutPut PDF path"));
conversion.output(out, pdfSettings);
System.out.println("DONE!!");
This works well; I have tried it on multiple DOCX files.

Add image as a header in XWPF document

How do I add an image as a header to every page in an XWPF document?
I have tried everything I could think of. Below is my code:
XWPFDocument docx = new XWPFDocument();
CTSectPr sectPr = docx.getDocument().getBody().addNewSectPr();
XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(docx, sectPr);
XWPFHeader header = policy.createHeader(XWPFHeaderFooterPolicy.FIRST);
String imgFile="sample.png";
header.addPictureData(new FileInputStream(imgFile), XWPFDocument.PICTURE_TYPE_PNG);
String nameoffile ="customer"+".docx";
FileOutputStream out = new FileOutputStream(nameoffile);
docx.write(out);
out.close();
However, this gives me a java.lang.IndexOutOfBoundsException.

You should use a correct path for the image. Probably "new FileInputStream(imgFile)" cannot find the file.
Put your image in a directory, then set the path (with the backslashes escaped) as:
String imgFile = "C:\\Users\\{user}\\Desktop\\Project\\sample.png";
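Whichever path is used, it can help to fail early with a clear message when the image is missing. Below is a small JDK-only sketch; ImagePathCheck and requireReadable are hypothetical names, and the check only validates the path, so it may not be what causes the IndexOutOfBoundsException in the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: validates an image path before it is opened.
final class ImagePathCheck {
    static Path requireReadable(String location) {
        Path p = Paths.get(location).toAbsolutePath().normalize();
        if (!Files.isRegularFile(p) || !Files.isReadable(p)) {
            throw new IllegalArgumentException("Image not found or not readable: " + p);
        }
        return p;
    }

    public static void main(String[] args) {
        try {
            Path image = requireReadable("sample.png");
            System.out.println("Using image: " + image);
        } catch (IllegalArgumentException e) {
            // The message contains the absolute path that was actually checked.
            System.out.println(e.getMessage());
        }
    }
}
```

The absolute path in the message shows which working directory a relative name such as sample.png was resolved against.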
