Apache POI - read docx with image in header - java

I'm trying to process a docx file with Apache POI. For now I simply read the file and write it back out. Here is my code:
FileInputStream fileInputStream = new FileInputStream(inputFile);
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream));
FileOutputStream fileOutputStream = new FileOutputStream(outputFile);
document.write(fileOutputStream);
fileOutputStream.flush();
fileOutputStream.close();
fileInputStream.close();
The problem is that the input file has a small image in its header. After processing the input file with POI, opening the output file in Microsoft Word gives a corrupted-file error:
Microsoft Office cannot open this file because some parts are missing or invalid.
Location: Part: /word/settings.xml, Line: 2, Column: 0
Everything works in OpenOffice Writer, but not in Microsoft Office.
The question is: what is wrong? Does Apache POI not handle files with an image in the header? Do you know any way to work around the problem?
I need to use Apache POI; other tools are not an option. I am using POI 3.8.

The problem is not the image in the header but the version of the Apache POI jars. Use the latest jars:
poi-3.10-FINAL.jar
poi-ooxml-3.10-FINAL.jar
poi-ooxml-schemas-3.10-FINAL.jar
ooxml-schemas-1.1.jar
Using the jars above solved the issue for me.

Related

Convert faulty Webpage/Excel to proper Excel

I have an app that automatically processes a range of Excel files, but I have one issue. Some files appear to be HTML files with an .xls extension (opening one in Excel gives a corruption warning, and resaving shows that Excel wants to save it as HTML).
When using Apache POI:
try (Workbook wkbk = WorkbookFactory.create(myCorruptFile)) {
//myCorruptFile is of type File
This fails with the Apache POI NotOLE2FileException below:
Invalid header signature; read 0x0A0D3E6C6D74683C, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document, { }
If I manually resave the file as .xls, it processes correctly, but is there a way to detect and resave/convert this file with Java 11? Converting the files manually isn't an option; I need an automated approach.
myCorruptFile.getContentType()
gives the content type as:
application/vnd.ms-excel
and Apache Tika detects the type as:
tika.detect(myCorruptFile.getBytes())
text/html
(My maven pom has no filtering)
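For the detection half of the question, one possible approach is to look at the first bytes of the file. The signature reported in the exception, 0x0A0D3E6C6D74683C, read as little-endian bytes spells "<html>" followed by CR LF, whereas a real OLE2 .xls file starts with D0 CF 11 E0 A1 B1 1A E1. The following is a minimal sketch using only the JDK. The class name SpreadsheetSniffer, the Kind enum and the 64-byte window are assumptions made for illustration and are not part of POI or Tika. It only detects the format; turning the HTML table into a real workbook is a separate step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper: classifies a file by its leading bytes.
final class SpreadsheetSniffer {

    enum Kind { OLE2_XLS, ZIP_XLSX, HTML, UNKNOWN }

    // Signature of an OLE2 compound document (binary .xls).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (.xlsx is a ZIP container).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies the first bytes of a file.
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2_XLS;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP_XLSX;
        }
        // Assumption: HTML exports begin with <html or a doctype,
        // possibly after leading whitespace.
        String text = new String(head, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (text.startsWith("<html") || text.startsWith("<!doctype html")) {
            return Kind.HTML;
        }
        return Kind.UNKNOWN;
    }

    // Reads at most 64 bytes from the file and classifies them (Java 11+).
    static Kind sniffFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(64));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could pass only files classified as OLE2_XLS or ZIP_XLSX to WorkbookFactory.create and route HTML files to a separate conversion path.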

Corrupt Excel Maven

I have a problem with Maven, Excel and the POI package.
I access an Excel file with this code:
Workbook workbook = WorkbookFactory.create(new File("src/main/resources/file.xlsx"));
Sheet sheet = workbook.getSheet(sheetName);
This code works correctly, and I can read the data later in my code.
Instead of new File(..), I have to use the code below to access resources both in dev mode and once the jar is built:
ClassLoader classLoader = ClassLoader.getSystemClassLoader();
String path = classLoader.getResource(fileName).toURI().getPath();
The resulting path is under target/classes, because Maven copies the file into the myproject/target/classes folder of the current project (as expected).
However, the xlsx file copied by Maven is corrupted, and I cannot open it even in Excel. The original file is 500 KB, while the copy is more than 1 MB. (All other files, such as images and text files, are copied correctly; only the xlsx files are affected.)
I have searched a lot and found answers such as:
FileInputStream vs ClassPathResource vs getResourceAsStream and file integrity
I tried every solution I could find, but none solved my problem, and I always get the same error:
InvalidOperationException: Could not open the specified zip entry source stream
Or
java.io.FileNotFoundException: file.xlsx
Using the same class loader approach, I can access my JSON, text and image files.
Does anyone have an answer to this issue?
Why does Maven double the size of the xlsx files, and why are they corrupted?
Is there any solution?
I need help.
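Independently of the corruption itself, the getResource(fileName).toURI().getPath() approach shown in the question is fragile once the application is packaged: for a resource inside a jar the URI is opaque and getPath() returns null. A sketch of a stream-based alternative follows. The class name ResourceCopier and its method are hypothetical names chosen for illustration, and the sketch does not explain or repair a copy that is already damaged in target/classes.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: reads a classpath resource as a stream, so it
// behaves the same from target/classes and from inside a built jar.
final class ResourceCopier {

    // Copies the named resource to a temporary file and returns its path.
    static Path copyResourceToTempFile(ClassLoader loader, String resourceName)
            throws IOException {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new FileNotFoundException(
                        "Resource not on classpath: " + resourceName);
            }
            Path tmp = Files.createTempFile("resource-", ".tmp");
            // createTempFile already created the file, so replace it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}
```

The returned path can be handed to WorkbookFactory.create(path.toFile()). Alternatively, the InputStream can be passed straight to WorkbookFactory.create(InputStream) without a temporary file. Comparing the size of the temporary file with the original 500 KB file is a quick way to confirm whether the bytes on the classpath are already altered.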

Java - PdfReader is blocking my program when reading a specific PDF

I'm using PdfReader to get the number of pages of a .pdf file. I tested my application on 13 PDFs today; the first 12 worked fine, but the last one blocks my application. I don't understand why: I can open the file with a FileInputStream, and I can open it with Adobe, so I don't think the file has an issue.
Here is how I create my PdfReader:
// This line blocks my application on the 13th file:
PdfReader pdf = new PdfReader(filename);
int pageCount = pdf.getNumberOfPages();
Edit:
Some of these PDF files had been zipped into a Zip archive and then unzipped. The file causing trouble is one of them, but the other zipped/unzipped files work fine.
I solved my issue by adding a dependency to my pom.xml:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.9</version>
</dependency>
I had been using the itextpdf library from Maven at version 2.1.7.
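Upgrading the library was the fix here. As a separate defensive measure for batch jobs, a call that may hang can be run under a time limit so that one bad file does not stall the whole run. The following is a sketch with plain java.util.concurrent; the class name TimedCall and the method callWithTimeout are hypothetical. Note that cancelling only interrupts the worker thread, so a tight loop that ignores interruption keeps running in the background on a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task on a worker thread with a time limit.
final class TimedCall {

    // Returns the task's result, or throws TimeoutException if the task
    // does not finish within the given time.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        // Daemon thread, so a stuck task cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the code from the question, a call could look like TimedCall.callWithTimeout(() -> new PdfReader(filename).getNumberOfPages(), 30, TimeUnit.SECONDS), where the 30-second limit is an arbitrary example value.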

Convert dot to dotx in JAVA

I have a .dot file that I use to create a doc file with placeholder values replaced.
e.g. the .dot file contains <claimId>
I replace <claimId> with the real claim ID, say 1234, and generate a doc file.
I am using Apache POI's HWPFDocument, but I run into issues when I replace text inside a table.
So I tried XWPFDocument, but it only accepts .dotx files.
With a .dotx file, XWPFDocument works without issue and I can generate docx files. Now I need to convert .dot files to .dotx files from Java.
Can someone help me with this?
There is no automated way to do this conversion in Apache POI as far as I know.
You can manually convert .dot files to .dotx by opening the file in Microsoft Word and saving it in the newer format.

JXL + POI : incompatibility

I first use JXL to modify an xls file created by POI. After that I try to read the file with POI. When creating the POIFSFileSystem
poFileSystem = new POIFSFileSystem(input);
I get the exception
java.io.IOException: block[ 907 ] already removed - does your POIFS have circular or duplicate block references?
Is this a compatibility problem between the two libraries, or something else?
I'm using POI version 3.6 and the latest version of JXL.
Thanks
Changing POIFSFileSystem to NPOIFSFileSystem solved my problem.
