Reading a .docx file with Apache POI

For now, I want to read a whole .docx file and print it to the console.
I have read that this cannot be done without Apache POI or Docx4J. I tried both, and both attempts failed.
I am aware that this question already exists on Stack Overflow, but I am afraid the existing answers might be outdated.
This is my current code with Apache POI:
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class test {
    public static void readDocxFile(String fileName) {
        try {
            File file = new File(fileName);
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument document = new XWPFDocument(fis);
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (int i = 0; i < paragraphs.size(); i++) {
                System.out.println(paragraphs.get(i).getParagraphText());
            }
            fis.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        readDocxFile("C:\\Basics.docx");
    }
}
It was taken from another question on here, but it does not work.
I get the following error message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/compress/archivers/zip/ZipFile
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:37)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:142)
at test.readDocxFile(test.java:16)
at test.main(test.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.archivers.zip.ZipFile
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 5 more

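The error occurs because Apache Commons Compress, a library that POI depends on but that is not part of the POI jars themselves, is missing from the classpath.
If you use Maven, add the following dependency to your project (adjust the version to the one your POI release expects):
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.18</version>
</dependency>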
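
To confirm which jar is actually missing before changing the build, a small check like the one below can help. It is only a sketch: the class name `ClasspathCheck` and its helper methods are made up for illustration, and the class names it probes are the ones that appear in the stack traces on this page. It uses nothing but the JDK, so it runs even when POI itself is absent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of POI): reports which classes cannot be loaded.
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the subset of the given class names that could not be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Class names taken from the code and stack traces quoted on this page
        List<String> absent = missing(
                "org.apache.poi.xwpf.usermodel.XWPFDocument",
                "org.apache.commons.compress.archivers.zip.ZipFile",
                "org.apache.xmlbeans.XmlException");
        if (absent.isEmpty()) {
            System.out.println("All probed classes were found.");
        }
        for (String name : absent) {
            System.out.println("Missing from classpath: " + name);
        }
    }
}
```

Run it with the same classpath as the failing program; any class it lists points to a jar that still has to be added.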

Related

Apache POI - DOCX To PDF Conversion

I am trying to convert a docx file into a PDF file using POI, and I am getting the error shown below.
I am using poi-3.17.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordToPDF {
    public static void main(String[] args) {
        WordToPDF cwoWord = new WordToPDF();
        System.out.println("Start");
        cwoWord.ConvertToPDF("D:\\2067536.docx", "D:\\2067536.pdf");
    }

    public void ConvertToPDF(String docPath, String pdfPath) {
        try {
            InputStream doc = new FileInputStream(new File(docPath));
            XWPFDocument document = new XWPFDocument(doc);
            document.createStyles();
            PdfOptions options = PdfOptions.create();
            OutputStream out = new FileOutputStream(new File(pdfPath));
            PdfConverter.getInstance().convert(document, out, options);
            System.out.println("Done");
        } catch (FileNotFoundException ex) {
            System.out.println(ex.getMessage());
        } catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
Here is the error:
Exception in thread "main" org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:70)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
at WordToPDF.ConvertToPDF(WordToPDF.java:27)
at WordToPDF.main(WordToPDF.java:17)
Caused by: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:178)
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:111)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitHeaderRef(XWPFDocumentVisitor.java:1142)
at org.apache.poi.xwpf.converter.core.MasterPageManager.visitHeadersFooters(MasterPageManager.java:213)
at org.apache.poi.xwpf.converter.core.MasterPageManager.addSection(MasterPageManager.java:180)
at org.apache.poi.xwpf.converter.core.MasterPageManager.compute(MasterPageManager.java:127)
at org.apache.poi.xwpf.converter.core.MasterPageManager.initialize(MasterPageManager.java:90)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitBodyElements(XWPFDocumentVisitor.java:232)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.start(XWPFDocumentVisitor.java:199)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:56)
... 4 more
As this is a null pointer error I am unable to understand what exactly the issue might be, any help is appreciated. Thank you.

LibreOffice saved my life. A simple one-line command for docx-to-PDF conversion works like a charm.
A detailed answer is available under this question:
Command `libreoffice --headless --convert-to pdf test.docx --outdir /pdf` is not working
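
If the conversion has to be triggered from Java, the same command can be launched as an external process. The following is a minimal sketch, assuming LibreOffice is installed and that its launcher is reachable as `libreoffice` on the PATH (on some systems it is called `soffice`); the class name `LibreOfficePdf` and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the LibreOffice command line quoted above.
public class LibreOfficePdf {

    /** Builds the argument list: <executable> --headless --convert-to pdf <docx> --outdir <outDir>. */
    static List<String> buildCommand(String executable, Path docx, Path outDir) {
        return Arrays.asList(
                executable, "--headless", "--convert-to", "pdf",
                docx.toString(), "--outdir", outDir.toString());
    }

    /** Starts LibreOffice, waits for it to finish, and returns its exit code. */
    static int convert(String executable, Path docx, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(executable, docx, outDir))
                .inheritIO() // show LibreOffice's own messages in this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java LibreOfficePdf <input.docx> <output-dir>");
            return;
        }
        // "libreoffice" is an assumption about the local installation
        int exitCode = convert("libreoffice", Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("LibreOffice exited with code " + exitCode);
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when paths contain spaces.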

How can I add pTab elements to docx4j while converting document to pdf

I am getting an error while converting a document to PDF using the docx4j library in Java. The message is:
NOT IMPLEMENTED support for w:pict without v:imagedata
It shows up inside the converted PDF instead of being printed in my Java console.
I have gone through several articles and questions and found one titled "converting docx to pdf". However, I am not sure how to apply it to my code. This is my code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.Map;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.model.structure.SectionWrapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class docTopdf {
    public static void main(String[] args) {
        try {
            InputStream is = new FileInputStream(new File("test.docx"));
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
            List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
            for (int i = 0; i < sections.size(); i++) {
                wordMLPackage.getDocumentModel().getSections().get(i)
                        .getPageDimensions();
            }
            PhysicalFonts.discoverPhysicalFonts();
            @Deprecated
            Map<String, PhysicalFont> physicalFonts = PhysicalFonts.getPhysicalFonts();
            // 2) Prepare Pdf settings
            @Deprecated
            PdfSettings pdfSettings = new PdfSettings();
            // 3) Convert WordprocessingMLPackage to Pdf
            @Deprecated
            org.docx4j.convert.out.pdf.PdfConversion conversion =
                    new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
            @Deprecated
            OutputStream out = new FileOutputStream(new File("test.pdf"));
            conversion.output(out, pdfSettings);
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
And this is the relevant part of my pom.xml:
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j</artifactId>
    <version>3.2.1</version>
</dependency>
Any help would be appreciated, as I am new to this kind of conversion. Thanks in advance.

Creating a PDF via XSL FO doesn't support w:pict without v:imagedata (ie a graphic which isn't a simple image).
Whilst you could suppress the message by configuring logging appropriately, your PDF output would be lossy.
Your options are to correct the input docx (ie use an image instead of whatever you currently have), or to use a PDF converter with appropriate support. For one option, see https://www.docx4java.org/blog/2020/03/documents4j-for-pdf-output/
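
To find out whether a given document will hit this limitation before converting it, one can inspect the document XML for `w:pict` elements that contain no `v:imagedata`. The sketch below is an illustration only, not part of docx4j: the class `PictCheck` and its methods are hypothetical, it uses only JDK classes, and it assumes the main document part is stored as `word/document.xml` inside the .docx archive (the usual location; headers and footers live in separate parts that are not scanned here).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical pre-flight check: counts w:pict elements without a v:imagedata descendant.
public class PictCheck {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String V_NS = "urn:schemas-microsoft-com:vml";

    static int countPictWithoutImageData(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may come from an untrusted file
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        NodeList picts = doc.getElementsByTagNameNS(W_NS, "pict");
        int count = 0;
        for (int i = 0; i < picts.getLength(); i++) {
            Element pict = (Element) picts.item(i);
            if (pict.getElementsByTagNameNS(V_NS, "imagedata").getLength() == 0) {
                count++;
            }
        }
        return count;
    }

    /** Convenience overload for XML held in a string; wraps failures in an unchecked exception. */
    static int countInXml(String xml) {
        try {
            return countPictWithoutImageData(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    /** Opens the .docx as a ZIP archive and scans word/document.xml (assumed location). */
    static int countInDocx(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countPictWithoutImageData(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("w:pict without v:imagedata: " + countInDocx(args[0]));
    }
}
```

A non-zero count suggests the document contains graphics that the XSL FO route will replace with the "NOT IMPLEMENTED" message.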

Getting error while trying to copy a picture in doc file

I am getting the error below while trying to copy a picture into a doc file through Selenium.
This is the error I am getting:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
at LeadFreeTest.docCapture.main(docCapture.java:17)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
Below is the code:
package LeadFreeTest;

import java.io.*;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class docCapture {
    @SuppressWarnings("resource")
    public static void main(String[] args) throws IOException, InvalidFormatException
    {
        XWPFDocument docx = new XWPFDocument();
        XWPFParagraph par = docx.createParagraph();
        XWPFRun run = par.createRun();
        run.setText("Hello, World. This is my first java generated docx-file. Have fun.");
        run.setFontSize(13);
        InputStream pic = new FileInputStream("C:\\Naveeen\\TestScreenShot\\LoginPage.png");
        //byte [] picbytes = IOUtils.toByteArray(pic);
        //run.addPicture(picbytes, Document.PICTURE_TYPE_JPEG);
        run.addPicture(pic, Document.PICTURE_TYPE_JPEG, "3", 0, 0);
        FileOutputStream out = new FileOutputStream("C:\\Naveeen\\TestScreenShot\\LoginPage.doc");
        docx.write(out);
        out.close();
        pic.close();
    }
}

You need to add the XMLBeans dependency to your classpath, hence the
ClassNotFoundException: org.apache.xmlbeans.XmlException
The jar is usually named xmlbeans-x.x.x.jar.
You can find it here.

How to read docx file content in java api using poi jar

I have managed to read a doc file, and now I am trying to read the content of a docx file. I found a lot of sample code when I searched, but none of it worked. Here is my code for reference:
import java.io.*;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;

public class createPdfForDocx {
    public static void main(String[] args) {
        InputStream fs = null;
        Document document = new Document();
        XWPFWordExtractor extractor = null;
        try {
            fs = new FileInputStream("C:\\DATASTORE\\test.docx");
            //XWPFDocument hdoc=new XWPFDocument(fs);
            XWPFDocument hdoc = new XWPFDocument(OPCPackage.open(fs));
            //XWPFDocument hdoc=new XWPFDocument(fs);
            extractor = new XWPFWordExtractor(hdoc);
            OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/test.pdf"));
            PdfWriter.getInstance(document, fileOutput);
            document.open();
            String fileData = extractor.getText();
            System.out.println(fileData);
            document.add(new Paragraph(fileData));
            System.out.println(" pdf document created");
        } catch (IOException e) {
            System.out.println("IO Exception");
            e.printStackTrace();
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            document.close();
        }
    } // end of main()
} // end of class
For the above code I am getting the following exception:
org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:60)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:277)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:186)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:107)
at pagecode.createPdfForDocx.main(createPdfForDocx.java:20)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:67)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:521)
at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:58)
... 4 more
Caused by: java.lang.NoSuchMethodError: org/openxmlformats/schemas/wordprocessingml/x2006/main/CTStyles.getStyleList()Ljava/util/List;
at org.apache.poi.xwpf.usermodel.XWPFStyles.onDocumentRead(XWPFStyles.java:78)
at org.apache.poi.xwpf.usermodel.XWPFStyles.<init>(XWPFStyles.java:59)
... 9 more
Please help
Thank you

This is covered in the Apache POI FAQ. The entry you want is: I'm using the poi-ooxml-schemas jar, but my code is failing with "java.lang.NoClassDefFoundError: org/openxmlformats/schemas/something"
The short answer is to replace the poi-ooxml-schemas jar with the full ooxml-schemas-1.1 jar. The full answer is given in the FAQ.
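
A NoSuchMethodError like the one in the question usually means the class was loaded from a different jar version than the one the calling code was compiled against. As a rough diagnostic, the sketch below prints where a class was loaded from. The class `WhichJar` and its methods are hypothetical and use only the JDK; the class name probed in `main` is the one from the stack trace above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic: shows which jar or directory a class was loaded from.
public class WhichJar {

    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Core JDK classes typically have no code source
            return "(JDK runtime / unknown)";
        }
        return source.getLocation().toString();
    }

    static String locationOf(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            return locationOf(Class.forName(className, false, WhichJar.class.getClassLoader()));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String name = "org.openxmlformats.schemas.wordprocessingml.x2006.main.CTStyles";
        System.out.println(name + " -> " + locationOf(name));
    }
}
```

If the printed location is not the schemas jar that matches your POI version, an older or conflicting jar is shadowing it on the classpath.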

When reading Excel or docx files, make sure every jar that POI requires is on the classpath; with all of them added, these errors go away.

Reading MS Word 2007 using Java

I am trying to read a Microsoft word file through Java. I have included all the .jar files from Apache poi-3.8-beta1 to my classpath. However, when I try running this, I get the following exception:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at readingmsword07.Main.main(Main.java:27)
Following is my code:
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class Main {
public static void main(String[] args) {
try {
FileInputStream fis = new FileInputStream("C:\\TrialDoc.docx");
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor =
new XWPFWordExtractor(new XWPFDocument(fis));
System.out.print(oleTextExtractor.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
I am using the XWPFWordExtractor since I am trying to read a 2007 word document but for some reason I am unable to figure out the right POI that deals with this.
Any help is much appreciated. Thanks in advance!
~ Woods

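Remove this line:
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
POIFSFileSystem reads the older OLE2 format (.doc), which is why it rejects a .docx file with OfficeXmlFileException. XWPFDocument can read the input stream directly.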
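
When it is not known in advance whether a file is an old .doc or a newer .docx, the first bytes of the file tell the two container formats apart: OLE2 files start with the signature D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP archives and start with 50 4B 03 04. The following JDK-only sketch illustrates the idea; the class `WordFormatSniffer` and its names are made up for this example, and a ZIP signature alone does not prove the archive is a Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the file signature.
public class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file by its leading bytes. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2; // .doc: use the HWPF classes
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP;  // possibly .docx: use the XWPF classes
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to eight bytes from the file and classifies them. */
    static Format detect(String path) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than eight bytes
                }
                read += n;
            }
        }
        return detect(Arrays.copyOf(header, read));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(args[0] + " looks like: " + detect(args[0]));
    }
}
```

Note that the check opens its own stream, so the stream later handed to POI has not been partially consumed.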
