PDFBox render image misses content

When using PDFBox, we encounter an issue where rendering a PDDocument sometimes loses content, such as fonts or certain shapes.
Having dug into this, it looks to be caused by the use of SoftReference throughout the PDFBox code base. The JVM seems to reclaim the underlying contents of the PDDocument while it is still rendering the image. As a result, we see the log message "org.apache.pdfbox.cos.COSDocument - Warning: You did not close a PDF Document" at random intervals.
Has anyone else encountered this issue? If so, how was it solved?
So far, our solution has been to write the contents to a file, then read and render.
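The first half of that workaround (spooling the PDF bytes to disk) can be sketched with the JDK alone. This is a minimal sketch, not the poster's actual code: the class name PdfSpooler and the method spoolToTempFile are hypothetical, and the PDFBox 2.0.x calls that would follow are only mentioned in comments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies PDF bytes to a temporary file so that the
// document can afterwards be loaded from disk instead of from memory.
final class PdfSpooler {

    // Writes everything from the stream to a new temp file and returns its path.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("render-", ".pdf");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // With PDFBox 2.0.x on the classpath, the next step would look roughly like:
    //   try (PDDocument doc = PDDocument.load(tmp.toFile())) {
    //       BufferedImage img = new PDFRenderer(doc).renderImageWithDPI(0, 150);
    //   }
    // Keeping the PDDocument in try-with-resources holds a strong reference
    // for the whole render and closes it only after rendering has finished.
}
```

The caller is responsible for deleting the temporary file once rendering is done.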

I had a similar issue. The PDF contained an image which was not rendered. The result was just a plain white BufferedImage.
Adding the JBIG2 library (see https://pdfbox.apache.org/2.0/dependencies.html) to my classpath and updating PDFBox from version 2.0.15 to 2.0.26 solved the issue for me.
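One way to check whether a decoder plugin such as the JBIG2 one is actually visible on the classpath is to ask ImageIO for a matching reader. This is a sketch under assumptions: the class name ImageReaderCheck is hypothetical, and "JBIG2" is assumed to be the format name the plugin registers.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical diagnostic: reports whether ImageIO can find a reader
// for a given format name on the current classpath.
final class ImageReaderCheck {

    // Returns true if at least one ImageIO reader is registered for the format.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // "JBIG2" is an assumed format name; the others ship with the JDK.
        for (String format : new String[] {"png", "jpeg", "JBIG2"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

If the JBIG2 line reports false, images stored with that compression are likely to come out blank, which would match the plain white BufferedImage described above.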

Related

PDFBox - show icon for embedded files in pdf

I developed a Java PDF viewer using Apache PDFBox. The problem is that when a page has file attachments, the PDFBox rendering shows no icon for them, whereas Adobe PDF reader shows a paper clip icon when the same file is opened.
Is it possible to have such icons appear automatically in the PDFBox rendering? I think I saw such code some time ago, something like a single line that switches this behavior on and off, but I can't find it. Thanks.

This was fixed in PDFBOX-5394 and will be in version 2.0.26. However, only a single symbol will be shown at this time: a paperclip in a fixed size.

Apache PDFBox produces unexpected changes for load followed by save

I'm working on a tool to automatically generate PDF files based on a template file and other source data. I'm using Mac OS 10.14.4, Java 1.8 and PDFBox version 2.0.15.
As a basic test, I trimmed the open-and-save code down to two lines, which have an obvious problem for one particular PDF and more subtle issues for all other PDFs I've tried:
PDDocument targetPDF = PDDocument.load(new File(templatePath));
targetPDF.save(targetFileName);
The observed problem for one particular PDF is that unexpected characters are inserted at the top of the first page. (They appear to be in an alphabet which is not otherwise used, and are clipped.) Other PDFs are visually similar, but very different when I run them through diff. Is there something tricky I should do to save the files? Is this a problem with that one file? Is PDFBox doing something odd?
I've looked for similar reports and found a few that are concerned with the size of the output files: "In PDFBox, why does file size becomes extremely large after saving?" and "Split and merge pdf files using PDFBOX produces large file". I do see a noticeable increase in file size, but not as much as those questions report. In one case, the input and output files are visually different. In others, diff -y --text template.pdf target.pdf reports large differences, but I can't detect any differences by eye in Mac's built-in "Preview" document viewer. (The breaking "template.pdf" was created in Adobe Acrobat. I don't know about the non-breaking files.)
After comparison to https://issues.apache.org/jira/browse/PDFBOX-2690 and http://useof.org/java-open-source/org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject I tried adding targetPDF.close() after the targetPDF.save(), but that made no difference.
https://pdfbox.apache.org/2.0/faq.html seems to suggest closing a content stream before saving the file, but I don't know how to do that. (File itself doesn't have a close() method. None of the methods on PDDocument seem related to streams or closing them, except PDDocument.close() itself, which prevents saving.)
I would paste in some log files here, but I'm not getting any log messages from PDFBox classes....

Displaying embedded fonts with PDFBox and Swing

I am using PDFBox to display PDF files inside a JInternalFrame. When opening a PDF I get lots of warnings like this:
Changing font on <m> from <Tahoma Negrita> to the default font
I am aware that the fonts being reported are not part of the standard set of 14 fonts. So I decided to check whether those fonts are embedded in the PDF file (thinking that there shouldn't be a problem loading embedded fonts, right?).
So I opened the file in different readers and checked properties/fonts. I am not sure whether this section reports fonts required by the document or fonts actually embedded in the document.
The information that I get is as follows:
BAAAA+Tahoma-Bold (embedded Subset), type:TrueType, Encoding:
CAAAA+Tahoma (Embedded Subset), type:TrueType, Encoding:
Confused about this, I researched how to embed fonts from OpenOffice and found that the PDF/A-1a option should be checked. So I made another PDF using this option (in case it was not used when making the original PDF file), yet I got the same results.
I would like your guidance in understanding how this works. I would like to be able to open PDF files just as PDF readers do. I also read about PDFBox_External_Fonts.properties, but I am guessing this file shouldn't be modified since I am dealing with embedded fonts.
Thanks.

pdfbox is not able to parse embedded subsets of TrueType fonts.
As far as I understand it, embedded TrueType subsets are missing some metadata for the font file that pdfbox needs.
The bug is known but not easy to solve. Right now I can only advise using embedded Type 1 fonts if possible; pdfbox can deal with them.
You can also try setting the path to your complete font files in your pdfbox.jar under org/apache/pdfbox/resources/PDFBox_External_Fonts.properties, so that if pdfbox cannot parse the subset, it can at least find a full path to the original font file. Maybe that works, but I have not tested this.
Good Luck!

Custom PDF creation - Large images

I'm looking for a Java-based PDF creation library. We're currently using Apache Velocity with HTML to render PDFs on the fly.
We'd like to find a way to render large images (sometimes as big as 3000 x 1700) in a creative manner within the PDF container, for instance a scrollable image pane within a PDF. This might not be possible within a PDF, though I might be wrong.
Open source would be ideal.

For a good PDF library you should take a look at iText: http://itextpdf.com/
I have used images of around 5000x4000 with iText without any problems.
I don't know if it is possible to create a working scrollpane inside a PDF, unless of course you were doing it through a custom PDF creator/viewer.
iText is open source, but make sure to check out the AGPL license before you use it commercially: http://itextpdf.com/terms-of-use/agpl.php

For just creating PDF files from images, iText is a little oversized. Give xsPDF a chance; it has no limits on image sizes and seems appropriate for your problem.

Just an FYI for anyone who may run into this in the future:
I used a library called PDFBox (http://pdfbox.apache.org/) to open a pre-existing PDF and add a page with a custom-sized PDRectangle matching the dimensions of the image. I then inserted the image into that new page and got the desired results.
I didn't realize you could have multiple page sizes in a single PDF.
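The page-size arithmetic behind that approach can be sketched as follows. This is an illustration only, not the poster's code: the class name PageSizeCalc and the 300 DPI figure are assumptions. PDF page dimensions are given in points (1/72 inch), so pixel dimensions convert once a print resolution is chosen.

```java
// Hypothetical helper: converts image pixel dimensions into a PDF page
// size in points, given the resolution at which the image should appear.
final class PageSizeCalc {

    // PDF default user space uses 72 units per inch.
    static final float POINTS_PER_INCH = 72f;

    // Returns {widthInPoints, heightInPoints} for an image shown at the given DPI.
    static float[] pageSizeInPoints(int widthPx, int heightPx, float dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return new float[] {
            widthPx * POINTS_PER_INCH / dpi,
            heightPx * POINTS_PER_INCH / dpi
        };
    }

    public static void main(String[] args) {
        // The 3000 x 1700 image from the question, at an assumed 300 DPI.
        float[] size = pageSizeInPoints(3000, 1700, 300f);
        System.out.println(size[0] + " x " + size[1] + " points");
        // With PDFBox, these values would be passed to something like
        //   new PDPage(new PDRectangle(size[0], size[1]))
    }
}
```

Because each page carries its own rectangle, a page sized this way can sit alongside ordinary A4 or Letter pages in the same document.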

How to extract images from pdf using Java (not using pdfbox)

I've been researching how to extract images from a big (> 300 MB) PDF file. I'm using pdfbox, but for some reason that I can't figure out, some pages are not correctly extracted.
I'm using the PDFToImage class of pdfbox as base for my code.
So, do you know another library that may help me to do this? I know that iText may be used, but I read that it can't be used for commercial products.
I've installed the packages xpdf and xpdf-utils, and the utility called pdfimages works perfectly. But I need to solve this problem from Java, and it should be portable.

I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage will output an image for every page, while pdfimages extracts all embedded images (e.g. a text document has 0 images).
Take a look at org.apache.pdfbox.tools.ExtractImages (source code) to see if it does what you want.

The most likely reason why it is hard to work with 300 MB PDFs is that you run out of memory. If it works well for smaller PDFs, I would take a closer look at why it fails.

Have you tried icepdf or JPedal (both pure java)?
