How to read docx file content in java api using poi jar - java

I have done reading doc file now i'm trying to read docx file content. when i searched for sample code i found many, nothing worked. check the code for reference...
import java.io.*;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
public class createPdfForDocx {
public static void main(String[] args) {
InputStream fs = null;
Document document = new Document();
XWPFWordExtractor extractor = null ;
try {
fs = new FileInputStream("C:\\DATASTORE\\test.docx");
//XWPFDocument hdoc=new XWPFDocument(fs);
XWPFDocument hdoc=new XWPFDocument(OPCPackage.open(fs));
//XWPFDocument hdoc=new XWPFDocument(fs);
extractor = new XWPFWordExtractor(hdoc);
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/test.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
String fileData=extractor.getText();
System.out.println(fileData);
document.add(new Paragraph(fileData));
System.out.println(" pdf document created");
} catch(IOException e) {
System.out.println("IO Exception");
e.printStackTrace();
} catch(Exception ex) {
ex.printStackTrace();
}finally {
document.close();
}
}//end of main()
}//end of class
For the above code i'm getting following Exception:
org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:60)
at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:277)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:186)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:107)
at pagecode.createPdfForDocx.main(createPdfForDocx.java:20)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:67)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:521)
at org.apache.poi.xwpf.usermodel.XWPFFactory.createDocumentPart(XWPFFactory.java:58)
... 4 more
Caused by: java.lang.NoSuchMethodError: org/openxmlformats/schemas/wordprocessingml/x2006/main/CTStyles.getStyleList()Ljava/util/List;
at org.apache.poi.xwpf.usermodel.XWPFStyles.onDocumentRead(XWPFStyles.java:78)
at org.apache.poi.xwpf.usermodel.XWPFStyles.<init>(XWPFStyles.java:59)
... 9 more
Please help
Thank you

This is covered in the Apache POI FAQ! The entry you want is I'm using the poi-ooxml-schemas jar, but my code is failing with "java.lang.NoClassDefFoundError: org/openxmlformats/schemas/something"
The short answer is to switch the poi-ooxml-schemas jar for the full ooxml-schemas-1.1 jar. The full answer is given in the FAQ

For reading excels or docx file if you want to solve errors you need to add all jars then you wont get any error.

Related

Apache POI - DOCX To PDF Conversion

I am trying to convert a docx file into pdf file using POI. Getting following error.
Using poi-3.17 ,
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class WordToPDF {
public static void main(String[] args) {
WordToPDF cwoWord = new WordToPDF();
System.out.println("Start");
cwoWord.ConvertToPDF("D:\\2067536.docx", "D:\\2067536.pdf");
}
public void ConvertToPDF(String docPath, String pdfPath) {
try {
InputStream doc = new FileInputStream(new File(docPath));
XWPFDocument document = new XWPFDocument(doc);
document.createStyles();
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(new File(pdfPath));
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Done");
} catch (FileNotFoundException ex) {
System.out.println(ex.getMessage());
} catch (IOException ex) {
System.out.println(ex.getMessage());
}
}
}
Here is the Error happening
Exception in thread "main" org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:70)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
at WordToPDF.ConvertToPDF(WordToPDF.java:27)
at WordToPDF.main(WordToPDF.java:17)
Caused by: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:178)
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:111)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitHeaderRef(XWPFDocumentVisitor.java:1142)
at org.apache.poi.xwpf.converter.core.MasterPageManager.visitHeadersFooters(MasterPageManager.java:213)
at org.apache.poi.xwpf.converter.core.MasterPageManager.addSection(MasterPageManager.java:180)
at org.apache.poi.xwpf.converter.core.MasterPageManager.compute(MasterPageManager.java:127)
at org.apache.poi.xwpf.converter.core.MasterPageManager.initialize(MasterPageManager.java:90)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitBodyElements(XWPFDocumentVisitor.java:232)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.start(XWPFDocumentVisitor.java:199)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:56)
... 4 more
As this is a null pointer error I am unable to understand what exactly the issue might be, any help is appreciated. Thank you.
Libre Office Saved my life, Simple one liner command for docx to pdf conversion works like a charm.
Detailed answer here
Command `libreoffice --headless --convert-to pdf test.docx --outdir /pdf` is not working

java.lang.NoClassDefFoundError: org.apache.poi.POIXMLDocumentPart when converting .docx document to .pdf

I have a java program named wordToPdf that converts the *.docx file to the *.pdf file. The program runs pretty well with Apache POI 4.1.2 along with POI OOXML 4.1.2 and the fr.opensagres.xdocreport 2.0.2. The *.pdf result is created successfully.
Below is the java program
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
public void wordToPdf(String inDocFile, String outPdfFile) {
try {
File src = new File(inDocFile);
InputStream doc = new FileInputStream(src);
XWPFDocument document = new XWPFDocument(doc);
PdfOptions options = null;
OutputStream out = new FileOutputStream(new File(outPdfFile));
PdfConverter.getInstance().convert(document, out, options);
} catch (Exception e) {
e.printStackTrace();
}
}
However, when the program is called from an HttpServletRequest, it doesn't work the same as the above scenario. In stead, the NoClassDefFoundError exception of org.apache.poi.POIXMLDocumentPart is returned.
Has anyone experienced this issue previously?
Please help me. Thanks so much, guys.

Reading a .docx file with Apache POI

I want to read and print out a whole .docx file into the console for now.
I read that you cannot do it without Apache POI or Docx4J, I tried both and failed twice.
Also I am aware that this question already exists on Stackoverflow but I am afraid it might be outdated.
This is my code with Apache POI right now.
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class test {
public static void readDocxFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
for (int i = 0; i < paragraphs.size(); i++) {
System.out.println(paragraphs.get(i).getParagraphText());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
readDocxFile("C:\\Basics.docx");
}
}
It was taken from another question on here but, it does not work.
I get following Error message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/compress/archivers/zip/ZipFile
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:37)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:142)
at test.readDocxFile(test.java:16)
at test.main(test.java:28)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.archivers.zip.ZipFile
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 5 more
It's due to a library that is not directly included in POI.
If you use maven add the following dependency to your project :
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-compress</artifactId>
<version>1.18</version>
</dependency>

Error in writing to .doc file by Apache POI

I'm writing a app to display and edit file .doc I'm using POI with HWPF. Now I can read text from file and write to file .doc too. But my reader only read default file .doc which is created by msoffice, It can't read the file created by my writer also msoffice can read this and all content was displayed right. It always show error:
Exception in thread "main" java.lang.RuntimeException:java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at org.apache.poi.hwpf.extractor.WordExtractor.getText(WordExtractor.java:322)
at ReadPOI.main(ReadPOI.java:18)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.poi.hwpf.usermodel.Range.binarySearchStart(Range.java:1016)
at org.apache.poi.hwpf.usermodel.Range.findRange(Range.java:1095)
at org.apache.poi.hwpf.usermodel.Range.initParagraphs(Range.java:982)
at org.apache.poi.hwpf.usermodel.Range.numParagraphs(Range.java:311)
at org.apache.poi.hwpf.converter.AbstractWordConverter.processParagraphes(AbstractWordConverter.java:1058)
at org.apache.poi.hwpf.converter.WordToTextConverter.processSection(WordToTextConverter.java:435)
at org.apache.poi.hwpf.converter.AbstractWordConverter.processSingleSection(AbstractWordConverter.java:1126)
at org.apache.poi.hwpf.converter.AbstractWordConverter.processDocument(AbstractWordConverter.java:722)
at org.apache.poi.hwpf.extractor.WordExtractor.getText(WordExtractor.java:304)
... 1 more
Are there any different between file created by msoffice and file created by my writer, and how to fix it. Please help me. There are my demo code in Java. Thank you
My reader:
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
public class ReadPOI
{
public static void main(String args[]) throws Exception
{
File file = new File("Test.doc");
FileInputStream fin = new FileInputStream(file);
HWPFDocument doc = new HWPFDocument(fin);
Range range = doc.getRange();
WordExtractor extractor = new WordExtractor(doc);
System.out.println("starting\n" + extractor.getText() + "end\n");
fin.close();
}
}
My Writer:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.HWPFDocument;
public class WritePOI
{
public static void main(String args[]) throws Exception
{
File file = new File("Template.doc");
FileInputStream fin = new FileInputStream(file);
HWPFDocument doc = new HWPFDocument(fin);
doc.getRange().replaceText("Haha\n", false);
FileOutputStream fout = new FileOutputStream("Test.doc");
doc.write(fout);
fout.close();
fin.close();
}
}
It's a bug in the WordExtractor getText() that even remains up to the version 3.10-FINAL. It should not give you an:
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.poi.hwpf.usermodel.Range.binarySearchStart(Range.java:1016)
It is not marked as deprecated in the api but it says that getTextFromPieces() is faster. I double checked it using your example and it works OK.
So in the ReadPOI use:
System.out.println(extractor.getTextFromPieces());
Or
String [] dataArray = extractor.getParagraphText();
for(int i=0;i<dataArray.length;i++)
{
System.out.println("\n–" + dataArray[i]);
}

Converting html file to pdf using iText in JApplet

I am using iText for a project. My program is supposed to run from inside a browser and I need it to convert an html file to a pdf file. When I run the program from NetBeans everything works fine. I sign my jar and run the Applet in a browser and then I get this error:
Errorjava.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.windir")
For the purpose of this post I have made a simple JApplet code which has the same problem:
public class RunApplet extends JApplet {
#Override
public void init() {
this.add(new JLabel("This is a labe"));
File f = new File("C:/ReportGen/data.html");
File pdf = new File("C:/ReportGen/data.pdf");
try {
pdf.createNewFile();
Document pdfDocument = new Document();
PdfWriter writer = PdfWriter.getInstance(pdfDocument, new FileOutputStream(pdf));
pdfDocument.open();
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
FontFactoryImp imp = new FontFactoryImp();
imp.getFont("Arial");
FontFactory.setFontImp(imp);
worker.parseXHtml(writer, pdfDocument, new FileInputStream(f));
pdfDocument.close();
writer.close();
this.add(new JLabel(f.getAbsolutePath()));
} catch (Exception ex) {
this.add(new JTextField("Error"+ex));
}
}
}
The html file is created and is fine, but when I create the pdf file I get the exception and the pdf file is actually created, but is corrupt and I am unable to open it. Thanks in advance for your time.
First, I see this error in your question:
Errorjava.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getenv.windir")
You need signed your applet for access to your filesystem. See this link and this too.
Second, I have tried following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
public class main {
public static void main(String[] args) {
File f = new File("C:/tmp/data.htm");
File pdf = new File("C:/tmp/data.pdf");
Document pdfDocument = null;
PdfWriter pdfWriter = null;
try {
pdfDocument = new Document();
pdfWriter = PdfWriter.getInstance(pdfDocument, new FileOutputStream(pdf));
pdfDocument.open();
XMLWorkerHelper.getInstance().parseXHtml(pdfWriter, pdfDocument,
new FileInputStream(f));
pdfDocument.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
If file data.htm (original data) is an htmlx, work fine. But if data.htm not is an xml, i get this error:
com.itextpdf.tool.xml.exceptions.RuntimeWorkerException: Invalid nested tag head found, expected closing tag meta.
at com.itextpdf.tool.xml.XMLWorker.endElement(XMLWorker.java:134)
at com.itextpdf.tool.xml.parser.XMLParser.endElement(XMLParser.java:395)
at com.itextpdf.tool.xml.parser.state.ClosingTagState.process(ClosingTagState.java:70)
at com.itextpdf.tool.xml.parser.XMLParser.parseWithReader(XMLParser.java:235)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:213)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:174)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:220)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:185)
at main.main(main.java:44)
Can you try with your data and with this example? The difference is that my example isn't an applet, is an java standalone.
Regards

Categories

Resources