OutOfMemoryError during the pdf merge

The code below merges PDF files and returns the combined PDF data. When I run it to combine 100 files of roughly 500 KB each, I get an OutOfMemoryError at the line document.close();. The code runs in a web environment; is the memory available to the WebSphere server the problem? I read in an article that I should use the freeReader method, but I cannot work out how to use it in my scenario.
protected ByteArrayOutputStream joinPDFs(List<InputStream> pdfStreams,
boolean paginate) {
Document document = new Document();
ByteArrayOutputStream mergedPdfStream = new ByteArrayOutputStream();
try {
//List<InputStream> pdfs = pdfStreams;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
//Iterator<InputStream> iteratorPDFs = pdfs.iterator();
Iterator<InputStream> iteratorPDFs = pdfStreams.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
if (pdf == null)
continue;
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
//clear this
pdfStreams = null;
//WeakReference ref = new WeakReference(pdfs);
//ref.clear();
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, mergedPdfStream);
writer.setFullCompression();
document.open();
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,
BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
pageOfCurrentReaderPDF++;
document.setPageSize(pdfReader
.getPageSizeWithRotation(pageOfCurrentReaderPDF));
document.newPage();
// pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader,
pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, ""
+ currentPageNumber + " of " + totalPages, 520,
5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
System.out.println("now the size is: "+pdfReader.getFileLength());
}
mergedPdfStream.flush();
document.close();
mergedPdfStream.close();
return mergedPdfStream;
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
try {
if (mergedPdfStream != null)
mergedPdfStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
return mergedPdfStream;
}
Thanks
V

This code keeps all the PDFs in memory (on the heap) while merging them, so yes, memory usage grows linearly with the number of files merged.
I don't know about the freeReader method, but you could try writing the merged PDF to a temporary file instead of a byte array: mergedPdfStream would be a FileOutputStream instead of a ByteArrayOutputStream, and you would return, for example, a File reference to the client code.
Alternatively, you could increase the amount of memory Java can use (the -Xmx JVM parameter), but if the number of files to merge keeps growing, you will run into the same problem again.
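The temporary-file idea can be sketched with the standard library alone. This is a minimal sketch, not the asker's code: the class TempFileOutput, the callback interface StreamWriter and the "merged-" file prefix are hypothetical names chosen for illustration, and the callback stands in for whatever writes the merged PDF (for instance the iText writer, which would receive the stream in place of the ByteArrayOutputStream).

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileOutput {

    /** Hypothetical callback: whatever produces the merged bytes. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /**
     * Creates a temporary file, lets the callback write into it through a
     * buffered stream, and returns the file's path. Only the stream's small
     * buffer lives on the heap, not the whole merged document.
     */
    static Path writeToTempFile(StreamWriter writer) throws IOException {
        Path tmp = Files.createTempFile("merged-", ".pdf");
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
            writer.writeTo(out);
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(tmp); // do not leave a half-written file behind
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload; a real caller would run the PDF merge here.
        Path result = writeToTempFile(out -> out.write(new byte[] {1, 2, 3}));
        System.out.println(result + " holds " + Files.size(result) + " bytes");
        Files.delete(result);
    }
}
```

The caller is then responsible for streaming the file to the client and deleting it afterwards.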

First, why clutter your code with all that Iterator<> boilerplate?
Have you heard of the enhanced for statement?
For example:
for (PdfReader pdfReader : readers) {
// code for each single PdfReader in readers
}
Second: consider closing each pdfReader as soon as you are done with it. This will hopefully release some buffers and free the memory occupied by the original PDF.

This is not the proper way to do file operations. You are merging the files entirely in memory, using an ArrayList and a byte array. You should use file I/O with buffering instead.
Do you want to show the final merged file at the end? Then you can open the file once all the merging is done.
Do not rely on in-memory buffering alone, as you have shown. Use file I/O with a buffer (a byte[], I mean).
Close each file after you have read and appended it.
The JVM only has the limited memory you allocated at startup, so merging a large number of files at once like this can crash the application. You should also run the merge in a separate thread from a thread pool, so that your application does not get stuck waiting for it.
Thanks.

100 files * 500 kB is something around 50 MB. If maximum heap size is 64 MB I'm pretty sure this code won't work in such conditions.
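The arithmetic can be checked against the running JVM. The following is a rough sketch only: the class and method names are made up for illustration, and the figure covers just the raw input bytes. In the question's code the PdfReader objects and the ByteArrayOutputStream holding the merged result come on top of that.

```java
final class HeapEstimate {

    /** Raw size of the inputs in bytes: fileCount files of fileSizeKb kilobytes each. */
    static long inputBytes(int fileCount, int fileSizeKb) {
        return (long) fileCount * fileSizeKb * 1024L;
    }

    public static void main(String[] args) {
        long needed = inputBytes(100, 500);              // 51,200,000 bytes
        long maxHeap = Runtime.getRuntime().maxMemory(); // upper bound, as set by -Xmx
        System.out.printf("inputs: %d MiB, max heap: %d MiB%n",
                needed / (1024 * 1024), maxHeap / (1024 * 1024));
        if (needed * 2 > maxHeap) {
            // Factor 2 is an arbitrary safety margin for readers plus output buffer.
            System.out.println("An in-memory merge is likely to exhaust the heap.");
        }
    }
}
```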

Related

Replace itext to pdfbox performance

I am evaluating whether to replace iText with PDFBox for our PDF processing. I ran some tests with 200 single-page PDFs (94 KB, 469 KB, 937 KB) and merged them into one PDF in our application. PDFBox version: 2.0.23.
iText version: 2.1.7. Here are the test results:
Here is the itext implementation:
byte[] l_PDFPage = null;
PdfReader l_PDFReader = null;
PdfCopy l_Copier = null;
Document l_PDFDocument = null;
OutputStream l_Stream = new FileOutputStream(m_File);
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
l_PDFPage = l_Page.getAsPdf();
l_PDFReader = new PdfReader(l_PDFPage);
l_PDFReader.getPageN(1).put(PdfName.ROTATE, new PdfNumber(l_PDFReader.getPageRotation(1) + l_Page.getRotation() % 360));
l_PDFReader.consolidateNamedDestinations();
if( i == 0 ) {
l_PDFDocument = new Document(l_PDFReader.getPageSizeWithRotation(1));
l_Copier = new PdfCopy(l_PDFDocument, l_Stream);
l_PDFDocument.open();
}
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
if( l_PDFReader.getAcroForm() != null )
l_Copier.copyAcroForm(l_PDFReader);
l_Copier.flush();
l_Copier.freeReader(l_PDFReader);
}
l_PDFDocument.close();
l_Stream.close();
Here is the pdfbox implementation:
byte[] l_PDFPage = null;
List<PDDocument> pageDocuments = new ArrayList<>();
PDDocument saveDocument = new PDDocument();
try {
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
// our wrapper object for a page
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
PDDocument document = PDDocument.load(l_PDFPage);
// save page document to close it later
pageDocuments.add(document);
PDPage page = document.getPage(0);
saveDocument.addPage(saveDocument.importPage(page));
}
saveDocument.save(l_Stream);
}
finally {
// close every page document
for(PDDocument doc : pageDocuments) {
doc.close();
}
saveDocument.close();
}
I have also tried PDFBox's PDFMergerUtility. The performance was nearly the same as with the other PDFBox implementation, but with the 937 KB files I run into an out-of-memory error with this implementation:
byte[] l_PDFPage = null;
OutputStream l_Stream = new FileOutputStream(m_File);
PDFMergerUtility merger = new PDFMergerUtility();
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
merger.addSource(new ByteArrayInputStream(l_PDFPage));
}
merger.setDestinationStream(l_Stream);
merger.mergeDocuments(null);
So my questions:
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
Am I missing something in our PDFBox implementation?
Why can't I close the "page document" right after adding its page to saveDocument? If I close it there, I get an error while saving, so I have to keep the "page documents" and close them at the end.

PDFBox and iText are architecturally different and, therefore, perform differently well for different tasks.
In particular iText attempts to write out new contents early, in your case much of the page is written to the output already during
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
and
l_PDFDocument.close();
eventually only finalizes the PDF and writes last remaining objects and the file trailer.
PDFBox on the other hand saves everything in the end at once:
saveDocument.save(l_Stream);
The approach of iText has the advantage of a smaller memory footprint (as you observed) and the disadvantage that you cannot change data of a page once it is written.
(As an aside: the iText architecture has changed from iText 5 to iText 7, in iText 7 you have the choice and can keep everything in memory, but the price here also is a big memory footprint.)
Thus,
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
The difference in memory use can partially be explained by the above. Also in iText after
l_Copier.freeReader(l_PDFReader);
the PdfReader can be closed (which you leave to the garbage collection to do for you) to free its resources while in your PDFBox code you keep all the source documents open, holding the resources up to the end. (Actually I would have assumed that when you're using importPage, you needn't keep them.)
Concerning the time, I'm not sure right now. You should do some finer-grained timing and determine where exactly the extra time is spent in PDFBox; thus, I second @Tilman's request for profiling data. I assume it's during the final save, but that's only a hunch. Also, such time differences might depend on structural details of the PDFs in question and may be less extreme for other documents.
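One way to get that finer timing without a profiler is to accumulate elapsed time per phase. The sketch below uses only the standard library; PhaseTimer and its Task interface are hypothetical helpers, and the Thread.sleep calls are stand-ins for the load, import and save steps of the PDFBox code in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PhaseTimer {

    /** A step that may throw, as most PDF library calls do. */
    interface Task {
        void run() throws Exception;
    }

    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the task and adds its elapsed wall-clock time to the named phase. */
    void time(String phase, Task task) throws Exception {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> nanosByPhase() {
        return nanosByPhase;
    }

    public static void main(String[] args) throws Exception {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 3; i++) {
            timer.time("load", () -> Thread.sleep(2));   // stand-in for loading a page document
            timer.time("import", () -> Thread.sleep(1)); // stand-in for importing the page
        }
        timer.time("save", () -> Thread.sleep(5));       // stand-in for the final save
        timer.nanosByPhase().forEach((phase, nanos) ->
                System.out.printf("%-6s %.1f ms%n", phase, nanos / 1e6));
    }
}
```

Wrapping the real calls this way would show whether the final save really dominates, as guessed above.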

iText Fill Form / Copy Page to new Document

I'm using iText to fill a template PDF that contains an AcroForm.
Now I want to use this template to create a new PDF with a dynamic number of pages.
My idea is to fill the template PDF, copy the page with the filled-in fields, and add it to a new file. The main problem is that our customer wants to design the template themselves, so I'm not sure whether this is the right way to solve the problem.
I've written the code below, which doesn't work right now: I get the error com.itextpdf.io.IOException: PDF header not found.
My code:
x = 1;
try (PdfDocument finalDoc = new PdfDocument(new PdfWriter("C:\\Users\\...Final.pdf"))) {
for (HashMap<String, String> map : testValues) {
String path1 = "C:\\Users\\.....Temp.pdf";
InputStream template = templateValues.get("Template");
PdfWriter writer = new PdfWriter(path1);
try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(template), writer)) {
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
for (HashMap.Entry<String, String> map2 : map.entrySet()) {
if (form.getField(map2.getKey()) != null) {
Map<String, PdfFormField> fields = form.getFormFields();
fields.get(map2.getKey()).setValue(map2.getValue());
}
}
} catch (IOException | PdfException ex) {
System.err.println("Ex2: " + ex.getMessage());
}
if (x != 0 && (x % 5) == 0) {
try (PdfDocument tempDoc = new PdfDocument(new PdfReader(path1))) {
PdfPage page = tempDoc.getFirstPage();
finalDoc.addPage(page.copyTo(finalDoc));
} catch (IOException | PdfException ex) {
System.err.println("Ex3: " + ex.getMessage());
}
}
x++;
}
} catch (IOException | PdfException ex) {
System.err.println("Ex: " + ex.getMessage());
}

Part 1 - PDF Header is Missing
This appears to be caused by re-reading, inside a loop, an InputStream that has already been read (and, depending on the configuration of the PdfReader, closed). The fix depends on the specific type of InputStream being used. If you want to leave it as a plain InputStream (rather than a more specific, more capable InputStream type), you first need to slurp the bytes from the stream into memory (e.g. into a ByteArrayOutputStream) and then create your PdfReaders from those bytes.
For example:
ByteArrayOutputStream templateBuffer = new ByteArrayOutputStream();
int c;
// read() returns -1 at end of stream; 0 is a valid byte value
while ((c = template.read()) != -1) templateBuffer.write(c);
for (/* your loop */) {
...
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(templateBuffer.toByteArray())), new PdfWriter(tmp));
...
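The buffering step can also be isolated in a small helper that needs no PDF library. This is a sketch under assumptions: ReusableTemplate is a hypothetical class name, and the bytes in main are stand-in data rather than a real PDF. Each call to newStream() returns a fresh stream starting at byte 0, which is what a reader created once per loop iteration requires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ReusableTemplate {
    private final byte[] bytes;

    /** Reads the source stream to its end once and keeps the bytes. */
    ReusableTemplate(InputStream source) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = source.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        this.bytes = buffer.toByteArray();
    }

    /** Returns a new stream positioned at the first byte. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = "%PDF-1.7 (stand-in bytes)".getBytes(StandardCharsets.US_ASCII);
        ReusableTemplate template = new ReusableTemplate(new ByteArrayInputStream(fake));
        for (int i = 0; i < 3; i++) {
            try (InputStream in = template.newStream()) {
                System.out.println("iteration " + i + " starts with " + (char) in.read());
            }
        }
    }
}
```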
Part 2 - other problems
A couple of things:
Make sure to grab the recently released 7.0.1 version of iText, since it includes a couple of fixes for AcroForm handling.
You can probably get away with using ByteArrayOutputStreams for your temporary PDFs (instead of writing them out to files); I'll use this approach in the example below.
PdfDocument/PdfPage is in the "kernel" module, while AcroForms are in the "form" module (meaning PdfPage is intentionally unaware of AcroForms); IPdfPageExtraCopier is sort of the bridge between the modules. To copy AcroForms properly, you need to use the two-argument version of copyTo(), passing an instance of PdfPageFormCopier.
Field names must be unique in the document (the "absolute" field name, that is; I'll skip field hierarchies for now). Since we're looping and adding the fields from the template multiple times, we need a strategy for renaming the fields to ensure uniqueness (the current API is actually a little clunky in this area).
File acroFormTemplate = new File("someTemplate.pdf");
Map<String, String> someMapOfFieldToValues = new HashMap<>();
try (
PdfDocument finalOutput = new PdfDocument(new PdfWriter(new FileOutputStream(new File("finalOutput.pdf"))));
) {
for (/* some looping condition */int x = 0; x < 5; x++) {
// for each iteration of the loop, create a temporary in-memory
// PDF to handle form field edits.
ByteArrayOutputStream tmp = new ByteArrayOutputStream();
try (
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new FileInputStream(acroFormTemplate)), new PdfWriter(tmp));
) {
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(filledInAcroFormTemplate, true);
for (PdfFormField field : acroForm.getFormFields().values()) {
if (someMapOfFieldToValues.containsKey(field.getFieldName())) {
field.setValue(someMapOfFieldToValues.get(field.getFieldName()));
}
}
// NOTE that because we're adding the template multiple times
// we need to adopt a field renaming strategy to ensure field
// uniqueness in the final document. For demonstration's sake
// we'll just rename them prefixed w/ our loop counter
List<String> fieldNames = new ArrayList<>();
fieldNames.addAll(acroForm.getFormFields().keySet()); // copy the keys to avoid ConcurrentModificationException
for (String fieldName : fieldNames) {
acroForm.renameField(fieldName, x+"_"+fieldName);
}
}
// the temp PDF needs to be "closed" for all the PDF finalization
// magic to happen...so open up new read-only version to act as
// the source for the merging from our in-memory bucket-o-bytes
try (
PdfDocument readOnlyFilledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(tmp.toByteArray())));
) {
// although PdfPage.copyTo will probably work for simple pages, PdfDocument.copyPagesTo
// is a more comprehensive copy (wider support for copying Outlines and Tagged content)
// so it's more suitable for general page-copy use. Also, since we're copying AcroForm
// content, we need to use the PdfPageFormCopier
readOnlyFilledInAcroFormTemplate.copyPagesTo(1, 1, finalOutput, new PdfPageFormCopier());
}
}
}

Close your PdfDocuments when you are done with adding content to them.

iText: Splitting pdf documents in a Multi-threaded environment

I am using iText 2.1.2 but also found the same behaviour with iText 5.4.3.
I have a requirement to split an n-page document into n single-page documents in my JAX-WS service. I have written the logic below to achieve this.
reader = new PdfReader(fileLocation + "\\" + pdfFilename + ".pdf");
int n = reader.getNumberOfPages();
int i = 0;
while (i < n) {
.
.
.
document = new Document(reader.getPageSizeWithRotation(1));
fos = new FileOutputStream(outFile);
writer = new PdfCopy(document, fos);
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
fos.close();
.
.
.
}
reader.close();
This also runs in a multi-threaded environment. The entire processing slows down and makes my service appear sequential because of the following lines.
writer = new PdfCopy(document, fos);
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
Observation in my service: for a 500-page document and 5 threads, each thread takes around 160 seconds. If the lines above are commented out, each thread takes 30 seconds.
I would like to know whether this is expected behaviour and how iText supports I/O operations in multi-threaded environments.
Please suggest any other way to split an n-page document into n documents.

how to download multiple .PDF files in java

When a button on a JSP page is clicked, I am trying to download more than one PDF, one after another, using Java code, but I have not been able to get it to work. I am using the following snippet:
Document document[]= new Document[20];
httpServletResponse.setHeader("Content-Disposition",
"attachment;filename=welcome.pdf");
httpServletResponse.setContentType("application/pdf");
try{
for(int i=0;i<3;i++)
{
System.out.println(i);
document[i]=new Document();
PdfWriter.getInstance(document[i], httpServletResponse.getOutputStream());
document[i].open();
document[i].add(new Paragraph("Hello Prakash"));
document[i].add(new Paragraph(new Date().toString()));
document[i].close();
}
}catch(Exception e){
e.printStackTrace();
}
It is not working: only one .PDF file is ever downloaded. Can anyone help me out?

One could prepare a page that makes multiple requests to the server, each of which downloads a PDF. This is not a very nice user experience.
I would use a zip file containing all PDFs:
response.setContentType("application/zip"); // application/octet-stream
response.setHeader("Content-Disposition", "inline; filename=\"all.zip\"");
try (ZipOutputStream zos = new ZipOutputStream(response.getOutputStream())) {
for (int i = 0; i < 3; i++) {
ZipEntry ze = new ZipEntry("document-" + i + ".pdf");
zos.putNextEntry(ze);
// It would be nice to write the PDF immediately to zos.
// However then you must take care to not close the PDF (and zos),
// but just flush (= write all buffered).
//PdfWriter pw = PdfWriter.getInstance(document[i], zos);
//...
//pw.flush(); // Not closing pw/zos
// Or write the PDF to memory:
ByteArrayOutputStream baos = new ...
PdfWriter pw = PdfWriter.getInstance(document[i], baos);
...
pw.close();
byte[] bytes = baos.toByteArray();
zos.write(bytes, 0, bytes.length);
zos.closeEntry();
}
}
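A self-contained version of the in-memory variant above, reduced to the ZIP handling, might look like the following. It is a sketch only: PdfZipBundle and writeZip are hypothetical names, the byte arrays in main are stand-ins for generated PDFs, and in a servlet the target stream would be the response's output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class PdfZipBundle {

    /** Writes each (entry name, file bytes) pair as one ZIP entry to the target stream. */
    static void writeZip(Map<String, byte[]> files, OutputStream target) throws IOException {
        ZipOutputStream zos = new ZipOutputStream(target);
        for (Map.Entry<String, byte[]> file : files.entrySet()) {
            zos.putNextEntry(new ZipEntry(file.getKey()));
            zos.write(file.getValue(), 0, file.getValue().length);
            zos.closeEntry();
        }
        zos.finish(); // completes the archive but leaves closing the target to the caller
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (int i = 0; i < 3; i++) {
            files.put("document-" + i + ".pdf", new byte[] {(byte) i}); // stand-in content
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeZip(files, out);
        System.out.println("archive size: " + out.size() + " bytes");
    }
}
```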
I just read that you cannot use a ZIP download.
Maybe you could use HTML5, which offers a nicer download experience (progress bars?).

iText Pdf Page Byte Size

I have a business requirement to split PDFs into multiple documents.
Let's say I have a 100 MB PDF; for simplicity's sake, I need to split it into multiple PDFs of no more than 10 MB apiece.
I am using iText.
I am going to take the original PDF and loop through its pages, but how can I determine the file size of each page without writing it to disk separately?
Sample code, for simplicity:
int numPages = reader.getNumberOfPages();
PdfImportedPage page;
for (int currentPage = 0; currentPage < numPages; ){
++currentPage;
//Get page from reader
page = writer.getImportedPage(reader, currentPage);
// I need the size in bytes here of the page
}

I think the easiest way is to write it to the disk and delete it afterwards:
Document document = new Document();
File f= new File("C:\\delete.pdf"); //for instance
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(f));
document.open();
document.add(page);
document.close();
long filesize = f.length(); //this is the filesize in byte
f.delete();
I'm not absolutely sure, I admit, but I don't see how it would be possible to determine the file size if the file does not exist.
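As an alternative to the temporary file, and not something the answer above proposes, the bytes can be counted as they are written without being stored anywhere. The sketch below is a plain OutputStream subclass (CountingOutputStream is a name chosen here); the assumption is that it would be handed to the writer in place of the FileOutputStream. Whether a page's standalone size predicts its share of a multi-page file is not guaranteed, since resources such as fonts may be shared between pages.

```java
import java.io.OutputStream;

final class CountingOutputStream extends OutputStream {
    private long count;

    /** Counts a single byte and discards it. */
    @Override
    public void write(int b) {
        count++;
    }

    /** Counts len bytes and discards them. */
    @Override
    public void write(byte[] b, int off, int len) {
        count += len;
    }

    /** Total number of bytes written so far. */
    long getCount() {
        return count;
    }

    public static void main(String[] args) {
        CountingOutputStream counter = new CountingOutputStream();
        byte[] standIn = new byte[1500]; // stand-in for the bytes a PDF writer would emit
        counter.write(standIn, 0, standIn.length);
        System.out.println("bytes written: " + counter.getCount());
    }
}
```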
