I am using iText 2.1.2, but I also found the same behaviour on iText 5.4.3.
I have a requirement to split an n-page document into n single-page documents in my JAX-WS service. I have written the logic below to achieve this.
reader = new PdfReader(fileLocation + "\\" + pdfFilename + ".pdf");
int n = reader.getNumberOfPages();
int i = 0;
while (i < n) {
.
.
.
document = new Document(reader.getPageSizeWithRotation(1));
fos = new FileOutputStream(outFile);
writer = new PdfCopy(document, fos);
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
fos.close();
.
.
.
}
reader.close();
This also runs in a multithreaded environment. The entire processing slows down and makes my service appear sequential because of the following lines:
writer = new PdfCopy(document, fos);
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
Observation in my service: for a 500-page document and 5 threads, each thread takes around 160 seconds. If the lines above are commented out, each thread takes 30 seconds.
I would like to know whether this is expected behaviour and how iText supports multithreaded environments for IO operations.
Please suggest whether any other way is possible to split an n-page document into n documents.
This is my code:
try {
dozen = magazijn.getFfd().vraagDozenOp();
for (int i = 0; i < dozen.size(); i++) {
PdfWriter.getInstance(doc, new FileOutputStream("Order" + x + ".pdf"));
System.out.println("Writer instance created");
doc.open();
System.out.println("doc open");
Paragraph ordernummer = new Paragraph(order.getOrdernummer());
doc.add(ordernummer);
doc.add( Chunk.NEWLINE );
for (String t : text) {
Paragraph klant = new Paragraph(t);
doc.add(klant);
}
doc.add( Chunk.NEWLINE );
Paragraph datum = new Paragraph (order.getDatum());
doc.add(datum);
doc.add( Chunk.NEWLINE );
artikelen = magazijn.getFfd().vraagArtikelenOp(i);
for (Artikel a : artikelen){
artikelnr.add(a.getArtikelNaam());
}
for (String nr: artikelnr){
Paragraph Artikelnr = new Paragraph(nr);
doc.add(Artikelnr);
}
doc.close();
artikelnr.clear();
x++;
System.out.println("doc closed");
}
} catch (Exception e) {
System.out.println(e);
}
I get this exception: com.itextpdf.text.DocumentException: The document has been closed. You can't add any Elements.
Can someone help me fix this so that the other PDFs can be created and paragraphs added?
Alright, your intent is not very clear from your code and question, so I'm going to operate under the following assumptions:
You are creating a report for each box you're processing
Each report needs to be a separate PDF file
You're getting a DocumentException on the second iteration of the loop because you're trying to add content to a Document that was closed in the previous iteration via doc.close();. doc.close() finalizes the Document and writes everything still pending to any linked PdfWriter.
If you wish to create separate pdfs for each box, you need to create a seperate Document in your loop statement as well, since creating a new PdfWriter via PdfWriter.getInstance(doc, new FileOutputStream("Order" + x + ".pdf")); will not create a new Document on its own.
If I'm wrong about assumption 2 and you wish to add everything to a single PDF, move doc.close(); outside of the loop and create only a single PdfWriter.
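A minimal sketch of the per-box fix, reusing the names from your code (dozen, order, and x are assumed to still be in scope; the remaining content lines are elided):
for (int i = 0; i < dozen.size(); i++) {
    Document doc = new Document(); // a fresh Document for every file
    PdfWriter.getInstance(doc, new FileOutputStream("Order" + x + ".pdf"));
    doc.open();
    doc.add(new Paragraph(order.getOrdernummer()));
    // ... add the remaining paragraphs exactly as in your loop ...
    doc.close(); // closes only this file's Document
    x++;
}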
You can try something like this using Apache PDFBox:
File outputFile = new File(path);
outputFile.createNewFile();
PDDocument newDoc = new PDDocument();
Then create a PDPage and write what you want to write on that page. After your page is ready, add it to newDoc, and at the end save and close it:
newDoc.save(outputFile);
newDoc.close();
Repeat this dozen.size() times and keep changing the file name in path for every new document; a consolidated sketch follows.
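A minimal sketch of that loop, assuming PDFBox 2.x (classes from org.apache.pdfbox.pdmodel; the file naming and the placeholder text are invented for illustration):
for (int i = 0; i < dozen.size(); i++) {
    PDDocument newDoc = new PDDocument();
    PDPage page = new PDPage();
    // write some content on the page
    try (PDPageContentStream cs = new PDPageContentStream(newDoc, page)) {
        cs.beginText();
        cs.setFont(PDType1Font.HELVETICA, 12);
        cs.newLineAtOffset(50, 750);
        cs.showText("Order " + i); // placeholder content
        cs.endText();
    }
    newDoc.addPage(page);
    newDoc.save(new File("Order" + i + ".pdf"));
    newDoc.close();
}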
Here I am combining 2 PDF documents using the iText packages.
Merging was done successfully using the code below:
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
for (InputStream in : list)
{
PdfReader reader = new PdfReader(in);
for (int i = 1; i <= reader.getNumberOfPages(); i++)
{
document.newPage();
//import the page from source pdf
PdfImportedPage page = writer.getImportedPage(reader, i);
//add the page to the destination pdf
cb.addTemplate(page, 0, 0);
}
}
outputStream.flush();
document.close();
outputStream.close();
Here list is a List of InputStreams, and outputStream is an OutputStream.
The problem I am having is that I want the PDF documents in the list to be appended after the 1st PDF is added
(i.e. the 1st PDF has 4 lines; I want the 2nd PDF to continue on the same page, after the 4th line).
What I am getting is the 2nd PDF added on the second page.
Is there any alternative to document.newPage();?
Can anyone help me with it?
Thanks, I would like to hear any responses :)
It depends on the requirements you have. As long as
you are only interested in the page contents of the merged PDFs, not in the page annotations, and
the pages have no content but the text lines you mention, in particular no background graphics, watermarks, or header/footer lines,
you can use either the
PdfDenseMergeTool from this answer or the
PdfVeryDenseMergeTool from this answer.
If you are interested in annotations, it should be no problem to extend those classes accordingly. If your PDFs have background graphics or watermarks, headers or footers, they should be removed beforehand.
I'm using the following code to merge PDFs together using iText:
public static void concatenatePdfs(List<File> listOfPdfFiles, File outputFile) throws DocumentException, IOException {
Document document = new Document();
FileOutputStream outputStream = new FileOutputStream(outputFile);
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
for (File inFile : listOfPdfFiles) {
PdfReader reader = new PdfReader(inFile.getAbsolutePath());
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
document.newPage();
PdfImportedPage page = writer.getImportedPage(reader, i);
cb.addTemplate(page, 0, 0);
}
}
outputStream.flush();
document.close();
outputStream.close();
}
This usually works great! But once in a while, it rotates some of the pages by 90 degrees. Anyone ever have this happen?
I am looking into the PDFs themselves to see what is special about the ones that are being flipped.
There are errors once in a while because you are using the wrong method to concatenate documents. Please read chapter 6 of my book and you'll notice that using PdfWriter to concatenate (or merge) PDF documents is wrong:
You completely ignore the page size of the pages in the original document (you assume they are all of size A4),
You ignore page boundaries such as the crop box (if present),
You ignore the rotation value stored in the page dictionary,
You throw away all interactivity that is present in the original document, and so on.
Concatenating PDFs is done using PdfCopy, see for instance the FillFlattenMerge2 example:
Document document = new Document();
PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
document.open();
PdfReader reader;
String line = br.readLine(); // br and baos come from the original FillFlattenMerge2 example
// loop over readers
// add the PDF to PdfCopy
reader = new PdfReader(baos.toByteArray());
copy.addDocument(reader);
reader.close();
// end loop
document.close();
There are other examples in the book.
In case anyone is looking for it: using Bruno Lowagie's correct answer above, here is the version of the function that does not seem to have the page-flipping issue I described above:
public static void concatenatePdfs(List<File> listOfPdfFiles, File outputFile) throws DocumentException, IOException {
Document document = new Document();
FileOutputStream outputStream = new FileOutputStream(outputFile);
PdfCopy copy = new PdfSmartCopy(document, outputStream);
document.open();
for (File inFile : listOfPdfFiles) {
PdfReader reader = new PdfReader(inFile.getAbsolutePath());
copy.addDocument(reader);
reader.close();
}
document.close();
}
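Hypothetical usage of this helper (the file names are invented for illustration, and java.util.Arrays is assumed to be imported):
concatenatePdfs(
        Arrays.asList(new File("in1.pdf"), new File("in2.pdf")),
        new File("merged.pdf"));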
I have a business requirement that requires me to split PDFs into multiple documents.
Let's say I have a 100 MB PDF; for simplicity's sake, I need to split it into multiple PDFs no larger than 10 MB apiece.
I am using iText.
I am going to get the original PDF and loop through the pages, but how can I determine the file size of each page without writing it separately to disk?
Sample code, for simplicity:
int numPages = reader.getNumberOfPages();
PdfImportedPage page;
for (int currentPage = 1; currentPage <= numPages; currentPage++) {
    //Get page from reader
    page = writer.getImportedPage(reader, currentPage);
    // I need the size in bytes of the page here
}
I think the easiest way is to write it to the disk and delete it afterwards:
Document document = new Document();
File f = new File("C:\\delete.pdf"); // for instance
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(f));
document.open();
// a PdfImportedPage is not an Element, so re-import the page with this
// writer and wrap it in an Image before adding it
PdfImportedPage page = writer.getImportedPage(reader, currentPage);
document.add(Image.getInstance(page));
document.close();
long filesize = f.length(); // this is the file size in bytes
f.delete();
I'm not absolutely sure, I admit, but I don't know how it would be possible to figure out the file size if the file does not exist.
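For what it's worth, a minimal sketch of an in-memory variant (an assumption on my part, not part of the answer above): a ByteArrayOutputStream stands in for the file, so the size can be measured without touching the disk.
Document document = new Document();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfWriter writer = PdfWriter.getInstance(document, baos);
document.open();
// import the page with this writer and add it wrapped in an Image
document.add(Image.getInstance(writer.getImportedPage(reader, currentPage)));
document.close();
long filesize = baos.size(); // the size the single-page PDF would have on disk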
The code below merges the PDF files and returns the combined PDF data. When this code runs and I try to combine 100 files, each approximately 500 kB, I get an OutOfMemoryError on the line document.close();. This code runs in a web environment; is the memory available to the WebSphere server the problem? I read in an article about using the freeReader method, but I cannot work out how to use it in my scenario.
protected ByteArrayOutputStream joinPDFs(List<InputStream> pdfStreams,
boolean paginate) {
Document document = new Document();
ByteArrayOutputStream mergedPdfStream = new ByteArrayOutputStream();
try {
//List<InputStream> pdfs = pdfStreams;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
//Iterator<InputStream> iteratorPDFs = pdfs.iterator();
Iterator<InputStream> iteratorPDFs = pdfStreams.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
if (pdf == null)
continue;
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
//clear this
pdfStreams = null;
//WeakReference ref = new WeakReference(pdfs);
//ref.clear();
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, mergedPdfStream);
writer.setFullCompression();
document.open();
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,
BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
pageOfCurrentReaderPDF++;
document.setPageSize(pdfReader
.getPageSizeWithRotation(pageOfCurrentReaderPDF));
document.newPage();
// pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader,
pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, ""
+ currentPageNumber + " of " + totalPages, 520,
5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
System.out.println("now the size is: "+pdfReader.getFileLength());
}
mergedPdfStream.flush();
document.close();
mergedPdfStream.close();
return mergedPdfStream;
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
try {
if (mergedPdfStream != null)
mergedPdfStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
return mergedPdfStream;
}
Thanks
V
This code merges all the PDFs in an array in memory (the heap), so yes, memory usage will grow linearly with the number of files merged.
I don't know about the freeReader method, but maybe you could try to write the merged PDF into a temporary file instead of a byte array? mergedPdfStream would be a FileOutputStream instead of a ByteArrayOutputStream, and you would then return e.g. a File reference to the client code; a sketch follows.
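A minimal sketch of that change (the temporary-file prefix is invented for illustration):
File tmp = File.createTempFile("merged", ".pdf");
FileOutputStream mergedPdfStream = new FileOutputStream(tmp);
// ... the same merge logic as in joinPDFs ...
// return tmp (a File) to the caller instead of a ByteArrayOutputStream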
Or you could increase the amount of memory Java can use (the -Xmx JVM parameter), but if the number of files to merge keeps increasing, you will find yourself with the same problem.
First, why do you clutter your code with all that Iterator<> boilerplate?
Have you ever heard of the enhanced for statement?
I.e.:
for (PdfReader pdfReader : readers) {
// code for each single PDF reader in readers
}
Second: consider closing each pdfReader as soon as it is done. This will hopefully flush some buffers and free the memory occupied by the original PDF.
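For instance, a minimal sketch of what that could look like at the end of the merge loop (PdfWriter.freeReader is the method the question mentions; this assumes iText 2.x/5.x):
// after copying all pages of the current source PDF:
writer.freeReader(pdfReader); // let the writer flush the data taken from this reader
pdfReader.close();            // release the memory held by the source PDF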
This is not the proper way of doing file operations. You are merging the files using an ArrayList and an array in memory; you should rather use file IO with buffering techniques.
Do you wish to show the final merged file at the end? Then you can open the file after all your merging is done.
Do not use only in-memory buffering as you have shown; use file IO with buffering (a byte[], I mean).
Close each file after you have read it and appended it.
Java has only the limited amount of memory you allocated at startup, so merging a big number of files at once like this can crash the application. You should run this merging operation in a separate thread using a thread pool, so that your application does not get stuck; see the sketch below.
Thanks.
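A minimal sketch of the thread-pool suggestion (the executor setup is invented for illustration; joinPDFs is the question's method, assumed callable from here; classes from java.util.concurrent):
ExecutorService pool = Executors.newFixedThreadPool(2);
Future<ByteArrayOutputStream> merged =
        pool.submit(() -> joinPDFs(pdfStreams, true)); // run the merge off the request thread
ByteArrayOutputStream result = merged.get();           // block only when the result is needed
pool.shutdown();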
100 files * 500 kB is somewhere around 50 MB. If the maximum heap size is 64 MB, I'm pretty sure this code won't work under such conditions.