Reading MS Word 2007 using Java

I am trying to read a Microsoft Word file through Java. I have added all the .jar files from Apache POI 3.8-beta1 to my classpath. However, when I run the program, I get the following exception:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at readingmsword07.Main.main(Main.java:27)
Following is my code:
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class Main {
public static void main(String[] args) {
try {
FileInputStream fis = new FileInputStream("C:\\TrialDoc.docx");
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor =
new XWPFWordExtractor(new XWPFDocument(fis));
System.out.print(oleTextExtractor.getText());
} catch (Exception e) {
e.printStackTrace();
}
}
}
I am using XWPFWordExtractor because I am trying to read a Word 2007 document, but I cannot figure out which part of POI handles this format.
Any help is much appreciated. Thanks in advance!
~ Woods

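Remove this line:
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
POIFSFileSystem reads the OLE2 format used by .doc files, so it throws OfficeXmlFileException when it is handed a .docx. The XWPFDocument and XWPFWordExtractor calls that follow are the ones meant for .docx, and XWPFDocument can read from the FileInputStream directly.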
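The exception message in the question says the data is in the Office 2007+ XML format while the OLE2 part of POI was called. A .docx is a ZIP package, whereas a .doc is an OLE2 compound file, and the two start with different signature bytes. The following is only a minimal sketch using plain JDK classes; `ContainerSniffer` and its `Container` enum are hypothetical names and not part of Apache POI. It shows how the leading bytes of a file could be checked before deciding which family of POI classes to call:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Apache POI: classifies a file by its first bytes.
class ContainerSniffer {
    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // Signature at the start of OLE2 compound files (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header "PK\003\004" (.docx and .xlsx are ZIP packages)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP_BASED;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ContainerSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.readNBytes(header, 0, header.length); // requires Java 9+
        }
        System.out.println(sniff(Arrays.copyOf(header, n)));
    }
}
```

A ZIP signature only says the file is some ZIP archive, not necessarily a Word document, so this check narrows the choice rather than proving the format.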

Related

Apache POI - DOCX To PDF Conversion

I am trying to convert a docx file into a PDF file using POI and get the error shown after the code.
I am using poi-3.17, with the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class WordToPDF {
public static void main(String[] args) {
WordToPDF cwoWord = new WordToPDF();
System.out.println("Start");
cwoWord.ConvertToPDF("D:\\2067536.docx", "D:\\2067536.pdf");
}
public void ConvertToPDF(String docPath, String pdfPath) {
try {
InputStream doc = new FileInputStream(new File(docPath));
XWPFDocument document = new XWPFDocument(doc);
document.createStyles();
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(new File(pdfPath));
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Done");
} catch (FileNotFoundException ex) {
System.out.println(ex.getMessage());
} catch (IOException ex) {
System.out.println(ex.getMessage());
}
}
}
Here is the error:
Exception in thread "main" org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:70)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
at WordToPDF.ConvertToPDF(WordToPDF.java:27)
at WordToPDF.main(WordToPDF.java:17)
Caused by: java.lang.NullPointerException
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:178)
at org.apache.poi.xwpf.converter.pdf.internal.PdfMapper.visitHeader(PdfMapper.java:111)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitHeaderRef(XWPFDocumentVisitor.java:1142)
at org.apache.poi.xwpf.converter.core.MasterPageManager.visitHeadersFooters(MasterPageManager.java:213)
at org.apache.poi.xwpf.converter.core.MasterPageManager.addSection(MasterPageManager.java:180)
at org.apache.poi.xwpf.converter.core.MasterPageManager.compute(MasterPageManager.java:127)
at org.apache.poi.xwpf.converter.core.MasterPageManager.initialize(MasterPageManager.java:90)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.visitBodyElements(XWPFDocumentVisitor.java:232)
at org.apache.poi.xwpf.converter.core.XWPFDocumentVisitor.start(XWPFDocumentVisitor.java:199)
at org.apache.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:56)
... 4 more
Since this is a NullPointerException, I cannot tell what exactly the issue might be. Any help is appreciated. Thank you.

LibreOffice saved my life. A simple one-line command for docx to PDF conversion works like a charm.
A detailed answer is in this question:
Command `libreoffice --headless --convert-to pdf test.docx --outdir /pdf` is not working
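Since the question is about Java, here is a rough sketch of how that command line could be launched from Java with `ProcessBuilder`. It assumes a `libreoffice` executable is on the PATH (on some installations the binary is called `soffice` instead), and the class name `LibreOfficePdf` is made up for illustration:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the LibreOffice command line; not a library API.
class LibreOfficePdf {

    // Builds the argument list. "libreoffice" is assumed to be on the PATH.
    static List<String> buildCommand(String docxPath, String outDir) {
        return Arrays.asList("libreoffice", "--headless",
                "--convert-to", "pdf", "--outdir", outDir, docxPath);
    }

    // Runs the conversion and returns the process exit code.
    static int convert(String docxPath, String outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(docxPath, outDir))
                .inheritIO() // show LibreOffice's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java LibreOfficePdf <input.docx> <outputDir>");
            return;
        }
        System.out.println("exit code: " + convert(args[0], args[1]));
    }
}
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.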

java.lang.NoClassDefFoundError: org.apache.poi.POIXMLDocumentPart when converting .docx document to .pdf

I have a java program named wordToPdf that converts the *.docx file to the *.pdf file. The program runs pretty well with Apache POI 4.1.2 along with POI OOXML 4.1.2 and the fr.opensagres.xdocreport 2.0.2. The *.pdf result is created successfully.
Below is the java program
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
public void wordToPdf(String inDocFile, String outPdfFile) {
try {
File src = new File(inDocFile);
InputStream doc = new FileInputStream(src);
XWPFDocument document = new XWPFDocument(doc);
PdfOptions options = null;
OutputStream out = new FileOutputStream(new File(outPdfFile));
PdfConverter.getInstance().convert(document, out, options);
} catch (Exception e) {
e.printStackTrace();
}
}
However, when the program is called from a servlet while handling an HttpServletRequest, it does not behave the same way. Instead, a NoClassDefFoundError for org.apache.poi.POIXMLDocumentPart is thrown.
Has anyone experienced this issue previously?
Please help me. Thanks so much, guys.
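A NoClassDefFoundError that shows up only inside the servlet container often means the web application sees a different set of jars than the standalone run. As a hedged diagnostic sketch (the class `ClassProbe` is hypothetical), one could call something like the following from inside the servlet to see which class names are visible there. To my knowledge POI 4.x keeps this class at org.apache.poi.ooxml.POIXMLDocumentPart, while org.apache.poi.POIXMLDocumentPart is the POI 3.x location, so the result may hint at mixed library versions:

```java
// Hypothetical diagnostic helper: reports whether a class is visible to the
// class loader that loaded this helper (the webapp loader when deployed in a WAR).
class ClassProbe {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: only check visibility, do not run static initializers
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.poi.POIXMLDocumentPart",       // name from the error message
            "org.apache.poi.ooxml.POIXMLDocumentPart"  // assumed POI 4.x location
        };
        for (String name : names) {
            System.out.println(name + " loadable: " + isLoadable(name));
        }
    }
}
```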

How can I add pTab elements to docx4j while converting document to pdf

I'm getting some error while converting document to pdf using docx4j library in Java. Sadly, my error is this
NOT IMPLEMENTED support for w:pict without v:imagedata
and it's showing up on the converted pdf instead of displaying the error in my java terminal.
I have gone through some articles and questions and found one about converting docx to pdf. However, I am not sure how to apply it to my code. This is my code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.Map;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.model.structure.SectionWrapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class docTopdf {
public static void main(String[] args) {
try {
InputStream is = new FileInputStream(
new File(
"test.docx"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(is);
List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
wordMLPackage.getDocumentModel().getSections().get(i)
.getPageDimensions();
}
PhysicalFonts.discoverPhysicalFonts();
@Deprecated
Map<String, PhysicalFont> physicalFonts = PhysicalFonts.getPhysicalFonts();
// 2) Prepare Pdf settings
@Deprecated
PdfSettings pdfSettings = new PdfSettings();
// 3) Convert WordprocessingMLPackage to Pdf
@Deprecated
org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(
wordMLPackage);
@Deprecated
OutputStream out = new FileOutputStream(
new File(
"test.pdf"));
conversion.output(out, pdfSettings);
} catch (Throwable e) {
e.printStackTrace();
}
}
}
And my pom.xml
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j</artifactId>
<version>3.2.1</version>
</dependency>
Any help would be appreciated, as I am new to this kind of conversion. Thanks in advance.

Creating a PDF via XSL FO doesn't support w:pict without v:imagedata (ie a graphic which isn't a simple image).
Whilst you could suppress the message by configuring logging appropriately, your PDF output would be lossy.
Your options are to correct the input docx (ie use an image instead of whatever you currently have), or to use a PDF converter with appropriate support. For one option, see https://www.docx4java.org/blog/2020/03/documents4j-for-pdf-output/

error while reading .xlsm file

I am trying to read a .xlsm file using POI.
My code is:
import java.io.*;
import java.util.List;
import jxl.*;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.read.biff.BiffException;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.xssf.usermodel.*;
public class ReadExcelSheet {
public static void main(String[] args) throws IOException {
final Workbook wb;
FileInputStream fileIn = new FileInputStream("C:\\Users\\my\\Desktop\\ExcelPORead\\Purchace.xlsm");
wb = WorkbookFactory.create(fileIn); // Error in this line
}
}
I am getting an error at the line "wb = WorkbookFactory.create(fileIn)"; Eclipse tells me to "configure Build Path".
I am using Eclipse and have downloaded poi-ooxml-3.5-beta5.jar and added it to the build path.
But I do not understand what else I need to do to make this work.
Please suggest how to resolve this error, or a better way to read .xlsm files in Java.
Thanks for your response.
Regards,
Raman

Identifying file type in Java

Please help me find out the type of a file that is being uploaded.
I want to distinguish between Excel files and CSV files.
The MIME type comes back the same for both kinds of file. Please help.

I use Apache Tika which identifies the filetype using magic byte patterns and globbing hints (the file extension) to detect the MIME type. It also supports additional parsing of file contents (which I don't really use).
Here is a quick and dirty example of how Tika can be used to detect the file type without performing any additional parsing of the file:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import org.apache.tika.metadata.HttpHeaders;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.helpers.DefaultHandler;
public class Detector {
public static void main(String[] args) throws Exception {
File file = new File("/path/to/file.xls");
AutoDetectParser parser = new AutoDetectParser();
parser.setParsers(new HashMap<MediaType, Parser>());
Metadata metadata = new Metadata();
metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
InputStream stream = new FileInputStream(file);
parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
stream.close();
String mimeType = metadata.get(HttpHeaders.CONTENT_TYPE);
System.out.println(mimeType);
}
}

I hope this will help. It is taken from an example that is not mine:
import javax.activation.MimetypesFileTypeMap;
import java.io.File;
class GetMimeType {
public static void main(String args[]) {
File f = new File("test.gif");
System.out.println("Mime Type of " + f.getName() + " is " +
new MimetypesFileTypeMap().getContentType(f));
// expected output :
// "Mime Type of test.gif is image/gif"
}
}
The same may work for Excel and CSV types; I have not tested it.

I figured out a cheaper way of doing this with java.nio.file.Files
public String getContentType(File file) throws IOException {
return Files.probeContentType(file.toPath());
}
- or -
public String getContentType(Path filePath) throws IOException {
return Files.probeContentType(filePath);
}
Hope that helps.
Cheers.

A better way without using javax.activation.*:
URLConnection.guessContentTypeFromName(f.getAbsolutePath());
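For context, `URLConnection.guessContentTypeFromName` looks only at the file name and returns null when the JDK's built-in table has no matching entry. Whether that table tells .csv and .xls apart is something I have not verified, so the sketch below (with a hypothetical `NameBasedMimeGuess` class) just shows the call with a fallback value for the null case:

```java
import java.net.URLConnection;

// Hypothetical wrapper showing name-based MIME guessing with a fallback.
class NameBasedMimeGuess {

    // Returns the guessed MIME type, or the fallback when the JDK has no entry
    // for the file name's extension.
    static String guess(String fileName, String fallback) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        return type != null ? type : fallback;
    }

    public static void main(String[] args) {
        String fallback = "application/octet-stream";
        System.out.println(guess("test.gif", fallback));
        System.out.println(guess("upload.csv", fallback));
        System.out.println(guess("upload.xls", fallback));
    }
}
```

Because only the name is inspected, a renamed file is reported according to its extension, not its actual contents.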

If you are already using Spring this works for csv and excel:
import org.springframework.mail.javamail.ConfigurableMimeFileTypeMap;
import javax.activation.FileTypeMap;
import java.io.IOException;
public class ContentTypeResolver {
private FileTypeMap fileTypeMap;
public ContentTypeResolver() {
fileTypeMap = new ConfigurableMimeFileTypeMap();
}
public String getContentType(String fileName) throws IOException {
if (fileName == null) {
return null;
}
return fileTypeMap.getContentType(fileName.toLowerCase());
}
}
Alternatively, with javax.activation you can update the mime.types file.

A CSV file starts with text, while an Excel file is most likely binary.
However, the simplest approach is to try to load the file as an Excel document using POI. If that fails, try to load it as a CSV; if that also fails, it is possibly neither type.
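The "text versus binary" observation can be turned into a rough check using only the JDK. This is a heuristic sketch under stated assumptions, not a definitive detector: it assumes CSV uploads are UTF-8 (or plain ASCII) and treats anything that fails strict decoding or contains unusual control characters as binary. The class name `TextOrBinary` is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic: does a byte sample look like UTF-8 text?
class TextOrBinary {

    // True if the sample decodes as strict UTF-8 and contains no control
    // characters other than tab, carriage return and line feed.
    // Note: a sample cut in the middle of a multi-byte character is rejected,
    // so pass the whole content for small files where possible.
    static boolean looksLikeText(byte[] sample) {
        String decoded;
        try {
            decoded = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample))
                    .toString();
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8, so treat as binary
        }
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\r' && c != '\n') {
                return false; // control bytes are typical of binary formats
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] csv = "name,amount\nwidget,3\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("CSV sample looks like text: " + looksLikeText(csv));
    }
}
```

A file that passes this check is only a candidate for CSV parsing; a CSV saved in another encoding would be rejected here, which is why the load-and-fall-back approach remains the more reliable one.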
