Apache POI fails to save (HWPFDocument.write) large word doc files - java

I want to remove Word metadata from .doc files. My .docx files work fine with XWPFDocument, but the following code for removing metadata fails for large (> 1 MB) .doc files. For example, given a 6 MB .doc file with images, it outputs a 4.5 MB file in which some images are missing.
public static InputStream removeMetaData(InputStream inputStream) throws IOException {
    POIFSFileSystem fss = new POIFSFileSystem(inputStream);
    HWPFDocument doc = new HWPFDocument(fss);
    // **it even fails on large files if you remove from here to 'until' below**
    SummaryInformation si = doc.getSummaryInformation();
    si.removeAuthor();
    si.removeComments();
    si.removeLastAuthor();
    si.removeKeywords();
    si.removeSubject();
    si.removeTitle();
    doc.getDocumentSummaryInformation().removeCategory();
    doc.getDocumentSummaryInformation().removeCompany();
    doc.getDocumentSummaryInformation().removeManager();
    try {
        doc.getDocumentSummaryInformation().removeCustomProperties();
    } catch (Exception e) {
        // can not remove above
    }
    // until
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    doc.write(os);
    os.flush();
    os.close();
    return new ByteArrayInputStream(os.toByteArray());
}
Related posts:
How to save the Word Document using POI API?
https://stackoverflow.com/questions/9758955/saving-poi-document-correctly

Which version of Apache POI are you using?
This looks like Bug 46220 - Regression: Some embedded images being lost.
Please upgrade to the latest release of POI (3.8 at the time of writing) and try again.
Hope that helps.

Related

aspose convert pdf to excel as temporary file

The question is rather simple. I am using the Aspose library to convert a PDF file to Excel. The Excel file is then written to the database, and the generated file is not needed afterwards.
My method:
public void main(MultipartFile file) throws IOException {
    InputStream inputStream = file.getInputStream();
    Document document = new Document(inputStream);
    ExcelSaveOptions options = new ExcelSaveOptions();
    options.setFormat(ExcelSaveOptions.ExcelFormat.XLSX);
    document.save("newExcelFile.xlsx", options);
}
With this method, the file is saved to the root folder of the project (when running locally). How can I avoid storing this file permanently and make it temporary instead? The project runs on a server, and I would not like to create directories specifically for this file.
The Document.save() method has an overload for saving to an OutputStream (see the API reference).
Because the result can go to any OutputStream, you can choose whichever implementation suits you: a ByteArrayOutputStream keeps the result in memory, or you can call Files.createTempFile() and open a FileOutputStream on that file.
For example, your code may be rewritten thus:
public byte[] convertToExcel(MultipartFile file) throws IOException {
    InputStream inputStream = file.getInputStream();
    Document document = new Document(inputStream);
    ExcelSaveOptions options = new ExcelSaveOptions();
    options.setFormat(ExcelSaveOptions.ExcelFormat.XLSX);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    document.save(baos, options);
    return baos.toByteArray();
}
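The Files.createTempFile() alternative mentioned in this answer could look roughly like the sketch below. It uses only the JDK so that it runs on its own; the StreamSaver interface and the class and method names are hypothetical stand-ins, with saveTo(out) playing the role of a call such as document.save(out, options). The temporary file is deleted in a finally block, so nothing is left on the server even if saving fails.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileConversion {

    /** Hypothetical stand-in for a call such as document.save(out, options). */
    interface StreamSaver {
        void saveTo(OutputStream out) throws IOException;
    }

    /** Writes the saver's output to a temp file, reads it back, and always deletes the file. */
    static byte[] saveViaTempFile(StreamSaver saver) throws IOException {
        // Created in the system temp directory, not in the project folder
        Path tempFile = Files.createTempFile("converted_", ".xlsx");
        try {
            try (OutputStream out = Files.newOutputStream(tempFile)) {
                saver.saveTo(out);
            }
            return Files.readAllBytes(tempFile);
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

With the Aspose code from the question, the call might be saveViaTempFile(out -> document.save(out, options)). If the result fits comfortably in memory, the ByteArrayOutputStream version is simpler; the temp file mainly helps when a later step needs a real file.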

Apache PDFBox to open temporary created PDF file

I'm using Apache PDFBox 2.x and I am trying to read a temporary file that I have just created.
Below is my code to create a temp file and read it:
Path mergedTempFile = null;
try {
mergedTempFile = Files.createTempFile("merge_", ".pdf");
PDDocument pdDocument = PDDocument.load(mergedTempFile.toFile());
But it gives this error:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2577)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1006)
at com.howtodoinjava.demo.PdfboxApi.test(PdfboxApi.java:326)
at com.howtodoinjava.demo.PdfboxApi.main(PdfboxApi.java:317)
I found a reference in this link, but it did not help:
Similar Issue Link
Please help me with this; I still cannot get rid of the error.
PDDocument.load(...) is used to parse an existing PDF.
Files.createTempFile(...) creates an empty file, so the temporary file you pass (mergedTempFile) contains no PDF header to parse, hence the exception. Instead, create a PDDocument with its constructor (the document lives in memory) and later save it with PDDocument.save(...).
Path mergedTempFile = null;
try {
    mergedTempFile = Files.createTempFile("merge_", ".pdf");
    try (PDDocument pdDocument = new PDDocument()) {
        // add content
        pdDocument.addPage(new PDPage()); // empty page as an example
        pdDocument.save(mergedTempFile.toFile());
    }
} catch (IOException e) {
    // exception handling
}
// use mergedTempFile for further logic
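To see why the original code failed, it can help to check the size of the file before handing it to a parser. The sketch below uses only the JDK; the class and method names are made up for illustration. A file that has just come from Files.createTempFile(...) has a size of zero, so the guard returns false for it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileGuard {

    /** Returns true only if the path is a regular file containing at least one byte. */
    static boolean hasContent(Path path) throws IOException {
        return Files.isRegularFile(path) && Files.size(path) > 0;
    }
}
```

A caller could test hasContent(mergedTempFile) and only call PDDocument.load(...) when it returns true. This says nothing about whether the content is a valid PDF; it only rules out the empty-file case.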

Validate to check uploaded file is pdf

How can I validate that an uploaded file really is a PDF, not only by its extension (.pdf) but also by its content? If someone changes the extension of some other file to .pdf, the upload should fail.
You can use Apache Tika for this, available at http://tika.apache.org/
You can also find a practical example here: https://dzone.com/articles/determining-file-types-java
There are many ways to validate a PDF file. I used iText to check whether a PDF is corrupted.
try {
    PdfReader pdfReader = new PdfReader(file);
    PdfTextExtractor.getTextFromPage(pdfReader, 1);
    LOGGER.info("pdfFileValidator ==> Exit");
    return true;
} catch (InvalidPdfException e) {
    e.printStackTrace();
    LOGGER.error("pdfFileValidator ==> Exit. Error ==> " + e.getMessage());
    return false;
}
If the file is not a PDF, or the file is corrupted, it will throw InvalidPdfException.
For the above example you need the iText library.
There are many validation libraries you can use to check whether a file is PDF compliant, for instance veraPDF or PDFBox. Of course you can use any other library that does the job. As already mentioned, Tika is another library that can read file metadata and tell you what the file is.
As a bare example, you can do something like this with PDFBox. Keep in mind that this validates whether the file is PDF/A compliant, which is a stricter test than whether it is a PDF.
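boolean validateImpl(File file) {
    try {
        // PDFBox 2.x Preflight: the file must be parsed before the document is available
        PreflightParser parser = new PreflightParser(file);
        parser.parse();
        try (PreflightDocument document = parser.getPreflightDocument()) {
            document.validate();
            ValidationResult validationResult = document.getResult();
            return validationResult.isValid();
        }
    } catch (Exception e) {
        // Error validating (including files that cannot be parsed as PDF at all)
        return false;
    }
}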
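or with Tika, you can do something like
public String tikaDetect(File file) throws IOException {
    Tika tika = new Tika();
    // detect(File) returns the MIME type as a String, e.g. "application/pdf"
    String detectedType = tika.detect(file);
    return detectedType;
}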
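As a cheap first filter before any of the library checks above, one can test whether the upload starts with the "%PDF-" signature that PDF files begin with. This is a different and much weaker technique than what Tika, iText or Preflight do: it only rejects files that were obviously renamed, and it would also reject PDFs with junk bytes before the header, which some readers tolerate. A minimal sketch using only the JDK; the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class PdfSignatureCheck {

    private static final byte[] PDF_SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file's first bytes are exactly "%PDF-". */
    static boolean startsWithPdfSignature(Path file) throws IOException {
        byte[] head = new byte[PDF_SIGNATURE.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        return read == head.length && Arrays.equals(head, PDF_SIGNATURE);
    }
}
```

A file that passes this check can still be corrupted, so it makes sense to follow it with one of the parsers above when the content matters.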

Write over a pdf file (template) with a servlet

I just heard about an API called iText, and I'm not really familiar with it.
My problem is that I want to write the information provided in a JSP form onto an existing PDF file (a template).
I tried some code I found on the internet; it works fine, but not in servlets.
Thanks.
EDIT: Here is the code I found and tried to put into a servlet.
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    Document document = new Document(PageSize.A4);
    try {
        PdfWriter.getInstance(document, new FileOutputStream(new File(
                "test.pdf")));
        document.open();
        String content = request.getParameter("aa");
        Paragraph paragraph = new Paragraph(content);
        document.add(paragraph);
    } catch (DocumentException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        document.close();
    }
}
I look at your servlet and I see:
new FileOutputStream(new File("test.pdf"))
This means that you are writing a file to the file system on your server. I don't see you sending any bytes to the response object, so nothing shows up in the browser.
You claim that iText "doesn't work in a servlet", but that's not correct: if no exception is thrown, a file named "test.pdf" is created in your working directory on the server side. That's not very smart, because the more people use your servlet, the more PDFs will be saved on the server. You probably don't want that. You probably want to create the PDF in memory, and serve the PDF bytes to the browser.
The short answer to your question is that you should write the PDF to the OutputStream of the response object instead of to a FileOutputStream:
public class Hello extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("application/pdf");
        try {
            // step 1
            Document document = new Document();
            // step 2
            PdfWriter.getInstance(document, response.getOutputStream());
            // step 3
            document.open();
            // step 4
            document.add(new Paragraph("Hello World"));
            document.add(new Paragraph(new Date().toString()));
            // step 5
            document.close();
        } catch (DocumentException de) {
            throw new IOException(de.getMessage());
        }
    }
}
However, to avoid known issues with this approach, you should also read the official documentation. Search for the keyword "servlet" and you'll find these FAQ entries:
How can I serve a PDF to a browser without storing a file on the server side? (iText 5)
How can I serve a PDF to a browser without storing a file on the server side? (iText 7)
Since you are new to iText, it is surprising that you chose to use iText 5 instead of the newer iText 7. iText 7 isn't compatible with iText 5; it is a complete rewrite of the library. I recommend that you use iText 7, because we have stopped active development on iText 5.
Update:
The error known as "The document has no pages." indicates that you are trying to create a document that doesn't have any content.
Replace:
String content = request.getParameter("aa");
Paragraph paragraph = new Paragraph(content);
document.add(paragraph);
With:
document.add(new Paragraph("Hello"));
My guess is that something went wrong while fetching the parameter "aa", so no content was added to the document.
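If a missing parameter is indeed the cause, a small guard makes the problem visible instead of producing an empty document. The sketch below uses only the JDK; RequestText and textOrFallback are hypothetical names. The value would come from request.getParameter("aa"), which returns null when the parameter is absent.

```java
final class RequestText {

    /** Returns the trimmed value, or the fallback when the value is null or blank. */
    static String textOrFallback(String parameterValue, String fallback) {
        if (parameterValue == null || parameterValue.trim().isEmpty()) {
            return fallback;
        }
        return parameterValue.trim();
    }
}
```

In the servlet this might be used as document.add(new Paragraph(RequestText.textOrFallback(request.getParameter("aa"), "(no content submitted)"))), so the PDF always has at least one paragraph.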

Java POI - Error: Unable to read entire header

I'm trying to read a .doc file in Java using the POI library. Here is my code:
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
And I have this exception:
java.io.IOException: Unable to read entire header; 162 bytes read; expected 512 bytes
at org.apache.poi.poifs.storage.HeaderBlock.alertShortRead(HeaderBlock.java:226)
at org.apache.poi.poifs.storage.HeaderBlock.readFirst512(HeaderBlock.java:207)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at MicrosoftWordParser.getDocString(MicrosoftWordParser.java:277)
at MicrosoftWordParser.main(MicrosoftWordParser.java:86)
My file is not corrupted; I can open it with Microsoft Word.
I'm using POI 3.9 (the latest stable version).
Do you have an idea how to solve the problem?
Thank you.
readFirst512() reads the first 512 bytes of your InputStream and throws an exception if there are not enough bytes to read. I think your file is too small to be read by POI.
It is probably not a correct Word file. Is it really only 162 bytes long? Check in your filesystem.
I'd recommend creating a new Word file using Word or LibreOffice, and then try to read it using your program.
You should try this program.
package file_opration;
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("filepath location");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null)
                    System.out.println(fileData[i]);
            }
        } catch (Exception exep) {
            // report the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}
Ahh, you've got a file, but you're spending loads of memory buffering the whole thing by hiding your file behind an InputStream... Don't! If you have a File, give that to POI. Only give POI an InputStream if that's all you have.
Your code should be something like:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
HWPFDocument document = new HWPFDocument(fs.getRoot());
That'll be quicker and use less memory than reading it from an InputStream, and if there are problems with the file you should normally get slightly more helpful error messages too.
A 162-byte MS Word .doc is probably an "owner file": a temporary file that Word uses to signify that the document is locked/owned.
Such files have a .doc file extension, but they are not MS Word documents.
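When scanning a folder for .doc files, such files can be skipped before POI ever sees them. The sketch below uses only the JDK and rests on two assumptions that are not stated in the answers above: that Word names its owner files with a "~$" prefix, and that anything shorter than the 512-byte header mentioned in the exception cannot be a readable .doc. The class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class WordOwnerFileCheck {

    /** Header size POI expects, taken from the "expected 512 bytes" message. */
    static final int HEADER_SIZE = 512;

    /** Returns true if the file should be skipped rather than parsed as a .doc. */
    static boolean shouldSkip(Path file) throws IOException {
        String name = file.getFileName().toString();
        // Assumption: Word owner (lock) files are named with a "~$" prefix
        boolean ownerFileName = name.startsWith("~$");
        // Too short to contain the header, so POI would fail on it anyway
        boolean tooSmall = Files.size(file) < HEADER_SIZE;
        return ownerFileName || tooSmall;
    }
}
```

A caller would test shouldSkip(path) before constructing the HWPFDocument and log the skipped file, which is usually more informative than the "Unable to read entire header" exception.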
