Using iText 2.1.7 to merge large PDFs


I am using an older version of iText (2.1.7) to merge PDFs because it is the last version available to me under the MPL. I cannot change this.
I am trying to merge multiple PDFs. Everything seems to work, but once the result goes over about 1500 pages, the generated PDF fails to open (it behaves as if it were corrupted).
This is how I am doing it:
private byte[] mergePDFs(List<byte[]> pdfBytesList) throws DocumentException, IOException {
    Document document = new Document();
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    PdfCopy copy = new PdfCopy(document, outputStream);
    document.open();
    for (byte[] pdfByteArray : pdfBytesList) {
        ByteArrayInputStream readerStream = new ByteArrayInputStream(pdfByteArray);
        PdfReader reader = new PdfReader(readerStream);
        for (int i = 0; i < reader.getNumberOfPages(); ) {
            copy.addPage(copy.getImportedPage(reader, ++i));
        }
        copy.freeReader(reader);
        reader.close();
    }
    document.close();
    return outputStream.toByteArray();
}
Is this the correct approach? Is there anything about it that would hint at breaking once the output exceeds a certain number of pages? No exceptions are thrown.

For anyone curious, the issue had nothing to do with iText and instead was the code responsible for returning the response from iText.
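Since the problem turned out to be in the code that passed the bytes on, checking the byte array at each hand-off can help narrow down where a file gets damaged. The sketch below is only a heuristic and uses no iText classes. It relies on the fact that a PDF file begins with a %PDF- header and carries an %%EOF marker near its end. The class name PdfBytesCheck, the method looksComplete and the 1024-byte tail window are assumptions made for this illustration, not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfBytesCheck {

    /**
     * Heuristic only: returns true if the bytes start with "%PDF-" and
     * "%%EOF" appears somewhere in the last 1024 bytes (assumed window).
     */
    static boolean looksComplete(byte[] pdf) {
        if (pdf == null || pdf.length < 5) {
            return false;
        }
        String head = new String(pdf, 0, 5, StandardCharsets.ISO_8859_1);
        if (!head.equals("%PDF-")) {
            return false;
        }
        int tailLength = Math.min(pdf.length, 1024);
        String tail = new String(pdf, pdf.length - tailLength, tailLength,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        // Stand-in data, not a real PDF: only the header and end marker matter here.
        byte[] complete = "%PDF-1.4\n...body...\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        // Simulate a response that was cut short on its way to the client.
        byte[] truncated = Arrays.copyOf(complete, complete.length - 8);
        System.out.println("complete:  " + looksComplete(complete));
        System.out.println("truncated: " + looksComplete(truncated));
    }
}
```

Calling looksComplete on the array returned by mergePDFs, and again on the bytes that actually reach the client, would show whether the data was cut short in between. A passing check does not prove that the PDF is valid.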

Related

Keep only the used font subsets when splitting a PDF in Java

I am using iText to split a PDF document into separate single-page PDF files. Each file seems to be too large because all fonts used in the input PDF are saved into every result page, which is not very clean.
The splitting code is below. Note that PdfSmartCopy and setFullCompression do not help to reduce the size (I have no idea why).
public List<byte[]> split(byte[] input) throws IOException, DocumentException {
    PdfReader pdfReader = new PdfReader(input);
    List<byte[]> pdfFiles = new ArrayList<>();
    int pageCount = pdfReader.getNumberOfPages();
    int pageIndex = 0;
    while (++pageIndex <= pageCount) {
        Document document = new Document(pdfReader.getPageSizeWithRotation(pageIndex));
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        PdfCopy pdfCopy = new PdfSmartCopy(document, byteArrayOutputStream);
        pdfCopy.setFullCompression();
        PdfImportedPage pdfImportedPage = pdfCopy.getImportedPage(pdfReader, pageIndex);
        document.open();
        pdfCopy.addPage(pdfImportedPage);
        document.close();
        pdfCopy.close();
        pdfFiles.add(byteArrayOutputStream.toByteArray());
    }
    return pdfFiles;
}
So is there a way in Java (iText or not) to solve this problem?
Update with demo PDF
Here is a 377KB PDF using multiple CJK fonts, where each page uses 1 or 2 of them. The total size of the sub-PDFs is 1.2MB. Considering that CJK fonts are very large, I would like to find a way to remove unused fonts and even remove unused characters from the fonts that are used.
So my idea is to keep only the used characters of the used fonts, embed them in the sub-files, and un-embed all other fonts. Any advice?

Merge documents to create TOC in iText (Java)

When creating documents with iText that need a table of contents, I have usually used a process where I create the main document in memory, create the TOC as a separate document in memory (using dummy links), merge them as a third document, and then use a PdfStamper to reconcile the links into the document and write it to a file.
This works with all versions of iText except the most recent (5.5.6). I will include a simple program that does this process (the real programs are much more complex). When running this with iText 5.5.5 or earlier, it creates the desired document (2 pages with the first page containing text that provides a link to open the second page). With 5.5.6 the call to makeRemoteNamedDestinationsLocal causes an exception com.itextpdf.text.pdf.PdfDictionary cannot be cast to com.itextpdf.text.pdf.PdfArray.
As this had always worked until the latest version, I have some suspicion that this may be a bug in the newest version. Is this a bug, or am I doing something wrong? How should I do this task if it is not a bug? Additionally, how are bug reports usually submitted for iText? From the website, it looks like they expect a question to be submitted here as a report.
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.draw.*;
import java.io.*;
// WORKS CORRECTLY USING itext version 5.5.5
// FAILS WITH 5.5.6
// CAUSES AN EXCEPTION
// "com.itextpdf.text.pdf.PdfDictionary cannot be cast to com.itextpdf.text.pdf.PdfArray"
// with makeRemoteNamedDestinationsLocal()
public class testPdf {
    public static void main (String[] args) throws Exception {
        // Create simple document
        ByteArrayOutputStream main = new ByteArrayOutputStream();
        Document doc = new Document(new Rectangle(612f,792f),54f,54f,36f,36f);
        PdfWriter pdfwrite = PdfWriter.getInstance(doc,main);
        doc.open();
        doc.add(new Paragraph("Testing Page"));
        doc.close();
        // Create TOC document
        ByteArrayOutputStream two = new ByteArrayOutputStream();
        Document doc2 = new Document(new Rectangle(612f,792f),54f,54f,36f,36f);
        PdfWriter pdfwrite2 = PdfWriter.getInstance(doc2,two);
        doc2.open();
        Chunk chn = new Chunk("<<-- Link To Testing Page -->>");
        chn.setRemoteGoto("DUMMY.PDF","page-num-1");
        doc2.add(new Paragraph(chn));
        doc2.close();
        // Merge documents
        ByteArrayOutputStream three = new ByteArrayOutputStream();
        PdfReader reader1 = new PdfReader(main.toByteArray());
        PdfReader reader2 = new PdfReader(two.toByteArray());
        Document doc3 = new Document();
        PdfCopy DocCopy = new PdfCopy(doc3,three);
        doc3.open();
        DocCopy.addPage(DocCopy.getImportedPage(reader2,1));
        DocCopy.addPage(DocCopy.getImportedPage(reader1,1));
        DocCopy.addNamedDestination("page-num-1",2,new PdfDestination(PdfDestination.FIT));
        doc3.close();
        // Fix references and write to file
        PdfReader finalReader = new PdfReader(three.toByteArray());
        // Fails on this line
        finalReader.makeRemoteNamedDestinationsLocal();
        PdfStamper stamper = new PdfStamper(finalReader,new FileOutputStream("Testing.pdf"));
        stamper.close();
    }
}

You have found a bug that was introduced in iText 5.5.6. It has already been fixed in our repository.
Thank you for reporting this bug. You can find the fix on GitHub: https://github.com/itext/itextpdf/commit/eac1a4318e6c31b054e0726ad44d0da5b8a720c2

Extracting an embedded object from a pdf

I had embedded a byte array into a pdf file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clue on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));
    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";
public static void main(String[] args) throws IOException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new ExtractStreams().parse(SRC, DEST);
}
public void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream)obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            }
            catch(UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            FileOutputStream fos = new FileOutputStream(String.format(dest, i));
            fos.write(b);
            fos.flush();
            fos.close();
        }
    }
}
Note that I get all PDF objects that are streams as PRStream objects. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance, page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream), which is also the method you need to get your AVI bytes from your PDF.
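As a rough illustration of the difference between raw and decoded stream bytes, here is a sketch that uses only java.util.zip rather than iText. /FlateDecode data is zlib/deflate-compressed, so the raw bytes are what is stored in the file and the decoded bytes are what you get after inflating them. The class FlateDemo and its methods are hypothetical names for this example, and this is not meant to show how iText implements getStreamBytes internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FlateDemo {

    /** Compresses bytes in zlib format, standing in for a stored /FlateDecode stream. */
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates zlib data, i.e. turns "raw" stream bytes into "decoded" bytes. */
    static byte[] inflate(byte[] raw) {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress is possible: the input is truncated or needs a preset dictionary.
                    throw new IllegalStateException("Incomplete or unsupported deflate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A tiny piece of page content syntax, used here only as sample input.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] raw = deflate(content);
        byte[] decoded = inflate(raw);
        System.out.println("raw length:     " + raw.length);
        System.out.println("decoded syntax: " + new String(decoded, StandardCharsets.ISO_8859_1));
    }
}
```

For an embedded file such as the AVI, the raw bytes may or may not need inflating, depending on how the stream was written, so it is worth looking at the stream's /Filter entry before deciding.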
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.

function that can use iText to concatenate / merge pdfs together - causing some issues

I'm using the following code to merge PDFs together using iText:
public static void concatenatePdfs(List<File> listOfPdfFiles, File outputFile) throws DocumentException, IOException {
    Document document = new Document();
    FileOutputStream outputStream = new FileOutputStream(outputFile);
    PdfWriter writer = PdfWriter.getInstance(document, outputStream);
    document.open();
    PdfContentByte cb = writer.getDirectContent();
    for (File inFile : listOfPdfFiles) {
        PdfReader reader = new PdfReader(inFile.getAbsolutePath());
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            document.newPage();
            PdfImportedPage page = writer.getImportedPage(reader, i);
            cb.addTemplate(page, 0, 0);
        }
    }
    outputStream.flush();
    document.close();
    outputStream.close();
}
This usually works great! But once in a while, it rotates some of the pages by 90 degrees. Has anyone ever had this happen?
I am looking into the PDFs themselves to see what is special about the ones that are being flipped.

There are errors once in a while because you are using the wrong method to concatenate documents. Please read chapter 6 of my book and you'll notice that using PdfWriter to concatenate (or merge) PDF documents is wrong:
You completely ignore the page size of the pages in the original document (you assume they are all of size A4),
You ignore page boundaries such as the crop box (if present),
You ignore the rotation value stored in the page dictionary,
You throw away all interactivity that is present in the original document, and so on.
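To illustrate the rotation point: a page dictionary may carry a /Rotate entry (a multiple of 90 degrees) that a viewer applies when displaying the page. If that entry is not carried over, a page that was meant to be shown in landscape comes out with its unrotated dimensions, which matches the pages that appear turned by 90 degrees. The following plain-Java sketch uses no iText classes. The class RotatedPageSize and the method displayedSize are names made up for this illustration.

```java
final class RotatedPageSize {

    /**
     * Returns {width, height} as a viewer would display the page, given the
     * unrotated size and the /Rotate value of the page.
     */
    static float[] displayedSize(float width, float height, int rotateDegrees) {
        // Normalize to 0..359 so that negative values such as -90 are handled.
        int rotation = ((rotateDegrees % 360) + 360) % 360;
        if (rotation % 90 != 0) {
            throw new IllegalArgumentException("/Rotate must be a multiple of 90");
        }
        boolean swapped = rotation == 90 || rotation == 270;
        return swapped ? new float[] {height, width} : new float[] {width, height};
    }

    public static void main(String[] args) {
        // An A4 portrait page (595 x 842 points) with /Rotate 90 is displayed as landscape.
        float[] shown = displayedSize(595f, 842f, 90);
        System.out.println(shown[0] + " x " + shown[1]);
        // Ignoring the rotation is equivalent to treating it as 0.
        float[] ignored = displayedSize(595f, 842f, 0);
        System.out.println(ignored[0] + " x " + ignored[1]);
    }
}
```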
Concatenating PDFs is done using PdfCopy, see for instance the FillFlattenMerge2 example:
Document document = new Document();
PdfCopy copy = new PdfSmartCopy(document, new FileOutputStream(dest));
document.open();
PdfReader reader;
String line = br.readLine();
// loop over readers
// add the PDF to PdfCopy
reader = new PdfReader(baos.toByteArray());
copy.addDocument(reader);
reader.close();
// end loop
document.close();
There are other examples in the book.

In case anyone is looking for it, using Bruno Lowagie's correct answer above, here is the version of the function that does not seem to have the page flipping issue I described above:
public static void concatenatePdfs(List<File> listOfPdfFiles, File outputFile) throws DocumentException, IOException {
    Document document = new Document();
    FileOutputStream outputStream = new FileOutputStream(outputFile);
    PdfCopy copy = new PdfSmartCopy(document, outputStream);
    document.open();
    for (File inFile : listOfPdfFiles) {
        PdfReader reader = new PdfReader(inFile.getAbsolutePath());
        copy.addDocument(reader);
        reader.close();
    }
    document.close();
}

iText merge a stamped pdf with a pdf created at runtime

I want to merge two PDF documents using iText in Java. One of the PDFs is created at runtime; the other is an existing PDF that I read in and stamp an image onto using PdfStamper. I then want to merge these two PDFs and display the result using a servlet.
I want to know if this is possible and how to do it.
I have no problem creating or stamping them separately but I just can't seem to figure out how to merge them.
Thanks

I suppose this code can help you. You need the iText JAR on your classpath for it to compile.
public static void doMerge(List<InputStream> list,
        OutputStream outputStream) throws DocumentException,
        IOException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, outputStream);
    document.open();
    PdfContentByte cb = writer.getDirectContent();
    for (InputStream in : list) {
        PdfReader reader = new PdfReader(in);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            // start a new output page; without this call, every imported
            // page is drawn on top of the previous one on a single page
            document.newPage();
            //import the page from source pdf
            PdfImportedPage page = writer.getImportedPage(reader, i);
            //add the page to the destination pdf
            cb.addTemplate(page, 0, 0);
            System.out.println(page.getHeight());
        }
    }
    outputStream.flush();
    document.close();
    outputStream.close();
}
