PDFRenderer creates images of greater size than the PDF itself

I have a sample of scanned PDFs that I need to edit and re-export. I use PDFBox to render the PDF into a series of images (one image per page), perform some OpenCV calculations on the rasterized JPEGs, and then intend to insert them back into a new PDF file.
Example: the PDF is 423 KB, page 1 is 313 KB, page 2 is 287 KB, page 3 is 319 KB, page 4 is 485 KB, and page 5 is 470 KB.
The problem is that the output images, taken together, are larger than the PDF itself. As a result, my OCR step takes much longer than is acceptable (5 minutes vs. 30 seconds per document). The only way to keep the JPEGs from inflating in size is to leave them at the default DPI of 72, which produces poor-quality images that cannot be used.
Why is this happening? I should be able to get back images whose size is less than or equal to that of the PDF in question (without sacrificing quality). I'm not doing anything unusual to the images, just removing watermarks.
Here's some code illustrating how I'm extracting the jpegs from the PDF.
File file = new File(fileName);
PDDocument document = PDDocument.load(file);
PDFRenderer renderer = new PDFRenderer(document);
BufferedImage[] pageArray = new BufferedImage[document.getNumberOfPages()];
// Render every page (zero-based index) at 160 DPI
for (int pageIndex = 0; pageIndex < pageArray.length; pageIndex++) {
    pageArray[pageIndex] = renderer.renderImageWithDPI(pageIndex, 160);
}
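As a rough illustration of why the rendered pages grow with DPI: rendering at a given DPI produces about dpi/72 pixels per PDF point, so the pixel count rises with the square of the DPI, while the scans stored inside the PDF may use a more compact encoding. The sketch below only estimates raster dimensions; it assumes an A4 page of 595 x 842 points, and the class and method names are hypothetical.

```java
public class RasterSizeEstimate {
    // PDF user space uses 72 points per inch by default.
    static final double POINTS_PER_INCH = 72.0;

    /** Approximate pixel length of a side given in PDF points, at the requested DPI. */
    static int toPixels(double points, double dpi) {
        return (int) Math.floor(points * dpi / POINTS_PER_INCH);
    }

    /** Uncompressed size in bytes of an RGB raster (3 bytes per pixel). */
    static long rawRgbBytes(double widthPt, double heightPt, double dpi) {
        return 3L * toPixels(widthPt, dpi) * toPixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        double widthPt = 595, heightPt = 842; // assumption: A4 page
        for (double dpi : new double[] {72, 160}) {
            System.out.printf("%.0f DPI: %d x %d px, %d raw bytes%n", dpi,
                    toPixels(widthPt, dpi), toPixels(heightPt, dpi),
                    rawRgbBytes(widthPt, heightPt, dpi));
        }
    }
}
```

The JPEG files on disk are smaller than these raw figures, but they still have to encode every one of those pixels, which is why file size tends to follow the chosen DPI rather than the size of the source PDF.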

Related

Convert html to pdf in landscape mode using iText

I'm trying to convert html to pdf using iText.
Here is the simple code that is working fine :
ByteArrayOutputStream pdfStream = new ByteArrayOutputStream();
HtmlConverter.convertToPdf(htmlAsStringToConvert, pdfStream);
Now, I want to convert the pdf to LANDSCAPE mode, so I've tried :
ConverterProperties converterProperties = new ConverterProperties();
MediaDeviceDescription mediaDeviceDescription = new MediaDeviceDescription(MediaType.SCREEN);
mediaDeviceDescription.setOrientation(LANDSCAPE);
converterProperties.setMediaDeviceDescription(mediaDeviceDescription);
HtmlConverter.convertToPdf(htmlAsStringToConvert, pdfStream, converterProperties);
and also :
PdfDocument pdfDoc = new PdfDocument(writer);
pdfDoc.setDefaultPageSize(PageSize.A4.rotate());
HtmlConverter.convertToPdf(htmlAsStringToConvert, pdfDoc, new ConverterProperties());
I've also mixed both, but the result remains the same, the final PDF is still in default mode.

The best way to achieve landscape page size when converting HTML to PDF is to provide the corresponding CSS instruction for the page to become landscape.
This is done with the following CSS:
@page {
    size: landscape;
}
Now, if you have your input HTML document in the htmlAsStringToConvert variable, you can process it as HTML using the Jsoup library that iText embeds. Basically, we are just adding the necessary CSS instruction to the <head>:
Document htmlDoc = Jsoup.parse(htmlAsStringToConvert);
htmlDoc.head().append("<style>" +
        "@page { size: landscape; } "
        + "</style>");
HtmlConverter.convertToPdf(htmlDoc.outerHtml(), new FileOutputStream(outPdf));
Beware that if you already have @page declarations in your HTML, the one you append might conflict with them. In that case, make sure your declaration is inserted last (this should be the case with the code above as long as all of your declarations are in the <head> element).
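If pulling in an HTML parser is not an option, the same "append as the last declaration in the head" idea can be sketched with plain string handling. This is a naive illustration rather than a replacement for the Jsoup approach: the class and method names are hypothetical, and it assumes the markup contains at most one closing head tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LandscapeCss {
    static final String LANDSCAPE_STYLE = "<style>@page { size: landscape; }</style>";
    private static final Pattern HEAD_END = Pattern.compile("</head>", Pattern.CASE_INSENSITIVE);

    /**
     * Inserts the landscape rule right before the closing head tag, so that it
     * comes after any style declarations already present in the head.
     */
    static String withLandscapePage(String html) {
        Matcher m = HEAD_END.matcher(html);
        if (!m.find()) {
            // No head element found: put the style in front of the markup.
            return LANDSCAPE_STYLE + html;
        }
        return html.substring(0, m.start()) + LANDSCAPE_STYLE + html.substring(m.start());
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body><p>Hi</p></body></html>";
        System.out.println(withLandscapePage(html));
    }
}
```

The returned string can then be passed to the converter in place of the original HTML.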

Why is PDFBox PDFRenderer slow?

I want to convert a PDF to a TIFF using PDFBox 2.x and the PDFRenderer Class.
But it runs very slowly compared to ghostscript.
Here's my sample code
public class SpeedTest
{
static long startTime = System.currentTimeMillis ();
public static void logTime (String msg)
{
long now = System.currentTimeMillis ();
System.out.println (String.format ("%.3f: %s", (now - startTime) / 1000.0, msg));
startTime = now;
}
public static void main (String[] args) throws Exception
{
//System.setProperty ("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
String pdfFileName = args[0];
String tiffFileName = args[1];
PDDocument document = PDDocument.load (new File (pdfFileName));
logTime (pdfFileName + " loaded.");
PDFRenderer pdfRenderer = new PDFRenderer (document);
logTime ("intitalized renderer.");
BufferedImage img = pdfRenderer.renderImageWithDPI (0, 600, ImageType.RGB);
logTime ("page rendered as image.");
ImageIO.write (img, "TIFF", new File (tiffFileName));
logTime ("image saved as TIFF.");
}
}
The output is as follows
0.521: sample.pdf loaded.
0.013: intitalized renderer.
2.910: page rendered as image.
2.005: image saved as TIFF.
As you can see, the call to pdfRenderer.renderImageWithDPI takes almost 3 seconds, and the ImageIO.write call takes another 2 seconds.
Doing the same with Ghostscript, the complete task finishes in 0.4 seconds.
time gs -dQUIET -dBATCH -dNOPAUSE -sstdout=/dev/null -sDEVICE=tifflzw -r600 -dFirstPage=1 -dLastPage=1 -sOutputFile=sample.tif sample.pdf
real 0m0.389s
user 0m0.340s
sys 0m0.048s
I've also already tried
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
as I'm running Java 8 (1.8.0_161 to be precise) but that makes no difference.
Thanks for every idea,
regards
Thomas

Upgrade to JDK 1.8.0_191, which was released in October 2018, or to JDK 9.0.4.
From the PDFBox docs:
PDFBox and Java 8
Important notice when using PDFBox with Java 8
before 1.8.0_191 or Java 9 before 9.0.4
Due to the change of the java color management module towards
“LittleCMS”, users can experience slow performance in color
operations. A solution is to disable LittleCMS in favor of the old
KCMS (Kodak Color Management System) by:
Starting with -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
or Calling
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
Sources:
https://bugs.openjdk.java.net/browse/JDK-8041125
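The version ranges in the notice can be turned into a small start-up check. The sketch below is an assumption about how one might apply the advice, not something taken from the PDFBox docs: the helper name is hypothetical, unusual version strings are simply treated as not affected, and the property is assumed to take effect only if it is set before the first color operation.

```java
public class CmmWorkaround {
    /** Returns the numeric value of parts[i], or 0 if it is missing or not a number. */
    static int part(String[] parts, int i) {
        if (i >= parts.length || !parts[i].matches("\\d+")) {
            return 0;
        }
        return Integer.parseInt(parts[i]);
    }

    /** True for Java 8 before 1.8.0_191 and for Java 9 before 9.0.4. */
    static boolean isAffected(String javaVersion) {
        String[] p = javaVersion.split("[._+\\-]");
        if (part(p, 0) == 1 && part(p, 1) == 8) {
            return part(p, 3) < 191;                 // 1.8.0_<update>
        }
        if (part(p, 0) == 9) {
            return part(p, 1) == 0 && part(p, 2) < 4; // 9.0.<security>
        }
        return false;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (isAffected(version)) {
            // Same property as in the notice above; set it before any rendering happens.
            System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        }
        System.out.println(version + " affected: " + isAffected(version));
    }
}
```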

According to my experiments, this slowness only occurs for the first rendered page of a document. If you render all pages of a multi-page document, every page after the first one renders faster. The absolute rendering speed also depends very much on the DPI used.
Render 6 document pages at 600 DPI
4.903s: page 0 rendered as image.
4.205s: page 1 rendered as image.
3.946s: page 2 rendered as image.
3.866s: page 3 rendered as image.
3.761s: page 4 rendered as image.
3.633s: page 5 rendered as image.
Render 6 document pages at 300 DPI
3.241s: page 0 rendered as image.
1.308s: page 1 rendered as image.
1.155s: page 2 rendered as image.
1.156s: page 3 rendered as image.
1.109s: page 4 rendered as image.
1.083s: page 5 rendered as image.
Render 6 document pages at 150 DPI
2.507s: page 0 rendered as image.
0.555s: page 1 rendered as image.
0.386s: page 2 rendered as image.
0.373s: page 3 rendered as image.
0.410s: page 4 rendered as image.
0.361s: page 5 rendered as image.
Render 6 document pages at 72 DPI
2.455s: page 0 rendered as image.
0.333s: page 1 rendered as image.
0.213s: page 2 rendered as image.
0.190s: page 3 rendered as image.
0.175s: page 4 rendered as image.
0.171s: page 5 rendered as image.
I think the problem here is that the AWT graphics does all rendering in software and with a constant pixel fill rate the rendering time scales quadratically with the DPI value. The slowness of the first image is probably some initialization overhead. (But that's all a wild guess at the moment.)
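That guess can be checked against the numbers above with a few lines of arithmetic. The sketch below compares how the pixel count grows between DPI settings with how the measured time for page 5 grows; the class and method names are hypothetical, and the timings are copied from the measurements above.

```java
public class DpiScaling {
    /** Ratio of pixel counts between two resolutions for the same page. */
    static double pixelRatio(double dpiFrom, double dpiTo) {
        double linear = dpiTo / dpiFrom;
        return linear * linear;
    }

    public static void main(String[] args) {
        // Page 5 timings copied from the measurements above.
        double[] dpi = {72, 150, 300, 600};
        double[] seconds = {0.171, 0.361, 1.083, 3.633};
        for (int i = 1; i < dpi.length; i++) {
            System.out.printf("%.0f -> %.0f DPI: pixels x%.2f, time x%.2f%n",
                    dpi[i - 1], dpi[i],
                    pixelRatio(dpi[i - 1], dpi[i]),
                    seconds[i] / seconds[i - 1]);
        }
    }
}
```

In these measurements the time grows somewhat more slowly than the pixel count, especially at low DPI. That would be consistent with a fixed per-page cost on top of the fill work, though this is only one way to read the data.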

extract thumbnail from 3d pdf using itextpdf

When I view a 3D pdf (aka PDF/E) with Adobe Acrobat Reader, it shows a thumbnail on the left side:
Is it possible to extract this thumbnail from the pdf using itext or is it generated on the fly by the viewer?

This is possible, though from what I am seeing I doubt your PDF has a specific thumbnail image and just renders the page in the thumbnail.
First, let's create a PDF that has a thumbnail according to the PDF specification, since I couldn't find one. Section 12.3.4 of ISO 32000-2 (the PDF specification) states the following:
The thumbnail image for a page shall be an image XObject specified by the Thumb entry in the page object...
This can be easily created using iText like so:
PdfWriter writer = new PdfWriter(OUTPUT_FILE);
PdfDocument pdfDocument = new PdfDocument(writer);
Document document = new Document(pdfDocument);
document.add(new Paragraph("Hello world"));
PdfImageXObject thumbnail = new PdfImageXObject(ImageDataFactory.create(getInput("itext.png")));
pdfDocument.getFirstPage().getPdfObject().put(PdfName.Thumb, thumbnail.getPdfObject());
document.close();
Where getInput("itext.png") resolves to a full path of our image:
This gives us output.pdf
You'll note that neither Acrobat nor Reader displays the thumbnail image; they simply render the page. Other readers do use our new thumbnail:
Since you are using Reader, I would think the thumbnail in your PDF is simply the rendered page, since embedded thumbnails appear to be ignored.
To answer your question, getting the thumbnail is simply the reverse of the operation above: we get the page's dictionary and look for a /Thumb entry.
PdfReader reader = new PdfReader(OUTPUT_FILE);
PdfDocument pdfDocument = new PdfDocument(reader);
PdfStream thumbnailStream = pdfDocument.getFirstPage().getPdfObject().getAsStream(PdfName.Thumb);
if (thumbnailStream != null) {
PdfImageXObject thumbnail = new PdfImageXObject(thumbnailStream);
BufferedImage image = thumbnail.getBufferedImage();
//Output to file, memory, etc
}
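For the "output to file, memory, etc." step, the standard ImageIO API is enough once a BufferedImage is available. The sketch below encodes an image as PNG in memory; the class and method names are hypothetical, and a blank image stands in for the one returned by thumbnail.getBufferedImage().

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

public class ThumbnailOutput {
    /** Encodes an image as PNG and returns the encoded bytes. */
    static byte[] toPng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false if no suitable writer is registered.
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for the BufferedImage obtained from the thumbnail XObject.
        BufferedImage image = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        byte[] png = toPng(image);
        System.out.println(png.length + " bytes");
    }
}
```

Writing to disk works the same way by passing a java.io.File to ImageIO.write instead of the stream.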

Creating PDF using JAVA (Netbeans) with images and multi pages

I am developing a Java program with the following requirements:
The application will take 5 input fields and 3 images (browse and "attach" to the Java application).
Once the "form" is completed it will be submitted using a button called "submit".
Once submitted the JAVA application will create a PDF file with the 5 inputed text and the 3 attached images.
I should be able to control which goes to which page number.
How do I implement such a solution with iText?

The application will take 5 input fields and 3 images (browse and "attach" to the Java application).
Once the "form" is completed it will be submitted using a button called "submit".
These first two requirements are unclear; are they to be implemented in a Java GUI (AWT? Swing? FX?), in some independent web UI (Plain HTML? Vaadin?), or in some derived UI (Portlet? ...)?
But as the question title "Creating PDF using JAVA (Netbeans) with images and multi pages" focuses on the PDF creation, let's look at the third and fourth requirements.
Once submitted the JAVA application will create a PDF file with the 5 inputed text and the 3 attached images.
I should be able to control which goes to which page number.
Let's assume you already have those inputs in the variables
String text1, text2, text3, text4, text5;
byte[] image1, image2, image3;
The framework
With iText you now create the document like this:
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfWriter;
...
// where you want to create the PDF;
// use a FileOutputStream for creating the PDF in the file system
// use a ByteArrayOutputStream for creating the PDF in a byte[] in memory
OutputStream output = ...;
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, output);
document.open();
// Add content for the first page(s)
...
// Start a new page
document.newPage();
// Add content for the next page(s)
...
// Start a new page
document.newPage();
// etc etc
document.close();
Adding text
You can add text in one of the Add content for the ... page(s) sections using
import com.itextpdf.text.Paragraph;
...
document.add(new Paragraph(text1));
Adding an image
You can add an image in one of the Add content for the ... page(s) sections using
import com.itextpdf.text.Image;
...
document.add(Image.getInstance(image1));
Adding at a given position
Adding text or images as described above leaves the layout details to iText, and iText fills the page from top to bottom except some margins.
If you want to control the positioning of the content yourself (which also means you have to take care that the content parts do not overlap or are drawn outside the page area), you can do so like this:
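import com.itextpdf.text.Element;
import com.itextpdf.text.Image;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
...
// Direct content layer of the current page
PdfContentByte canvas = writer.getDirectContent();
// Text: left-aligned at x=200, y=572, no rotation
Phrase phrase = new Phrase(text2);
ColumnText.showTextAligned(canvas, Element.ALIGN_LEFT, phrase, 200, 572, 0);
// Image: lower-left corner at (200, 200)
Image img = Image.getInstance(image2);
img.setAbsolutePosition(200, 200);
canvas.addImage(img);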
There are many more options for manipulating your content, e.g. choosing a font, choosing text sizes, scaling images, rotating content, and so on; simply have a look at the iText samples from the book iText in Action - Second Edition.
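When positioning content yourself, a quick sanity check on the rectangles can catch overlaps and content that falls off the page before the PDF is written. The sketch below uses only java.awt.geom; it assumes an A4 page (595 x 842 points) with coordinates measured from the lower-left corner, and the helper names and box sizes are hypothetical.

```java
import java.awt.geom.Rectangle2D;

public class PlacementCheck {
    // Assumption: A4 page in points, origin at the lower-left corner.
    static final Rectangle2D PAGE = new Rectangle2D.Double(0, 0, 595, 842);

    /** True if the box with lower-left corner (x, y) lies completely on the page. */
    static boolean fitsOnPage(double x, double y, double width, double height) {
        return PAGE.contains(new Rectangle2D.Double(x, y, width, height));
    }

    /** True if the interiors of the two boxes intersect. */
    static boolean overlaps(Rectangle2D a, Rectangle2D b) {
        return a.intersects(b);
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a 150 x 100 image at (200, 200), a 200 x 12 text line at (200, 572).
        Rectangle2D image = new Rectangle2D.Double(200, 200, 150, 100);
        Rectangle2D text = new Rectangle2D.Double(200, 572, 200, 12);
        System.out.println("image fits: " + fitsOnPage(200, 200, 150, 100));
        System.out.println("text fits: " + fitsOnPage(200, 572, 200, 12));
        System.out.println("overlap: " + overlaps(image, text));
    }
}
```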

You can use XSL-FO. A basic example here. After this, you can search and add other options for your PDF.

Filling landscape PDF with PDFBox

I am trying to fill a PDF form with PDFBox, and I managed to do it well with a portrait-oriented document. But I have a problem when filling a document in landscape mode. The fields are filled in, but the text orientation is wrong: it appears vertically, as if the page were still in portrait, rotated by 90 degrees.
Here is my simplified code:
PDDocument pdfDoc = PDDocument.load(MY_FILE);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
acroForm.getField("aAddressLine1").setValue("ADDRESS1_HERE");
acroForm.getField("aAddressLine2").setValue("ADDRESS1_HERE");
acroForm.getField("country").setValue("COUNTRY_HERE");
pdfDoc.save(PATH_HERE);
pdfDoc.close();
Did you manage to fill a PDF document in landscape mode?
Thanks for your help.

The short answer
I'm afraid PDFBox does not yet (as of version 1.8.2) allow you to fill in landscape PDFs like the one you provided, because it does not seem to query and factor in information about the page the form field is located on.
The long answer
There are different ways you can define a page to be A4 landscape:
You can define it to have the A4 landscape dimensions directly by means of a media box definition:
/MediaBox [0, 0, 842, 595]
In this case the coordinates of your aAddressLine1 would be
/Rect[23.1711 86.8914 292.121 100.132]
or you can define it to have the A4 portrait dimensions and being rotated by 90° (or 270° obviously):
/MediaBox [0, 0, 595, 842]
/Rotate 90
In this case the coordinates of your aAddressLine1 are
/Rect[86.8914 23.1711 100.132 292.121]
Your example document uses the latter method.
Now PDFBox, when creating an appearance stream for that field, only looks at the rectangle defining the field but ignores the properties of the page. Thus, PDFBox sees a very narrow and very high textfield and fills it in just like that. It is completely unaware that the result will be rotated in a PDF viewer.
What it should have done is to also look at the page the field is located on. If that page has a /Rotate entry, it should create an appearance stream for the field which displays the text rotated in the opposite direction.
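The effect of the page rotation on the field's shape can be illustrated with the rectangle from the example document. The sketch below is not PDFBox code; it only computes the width and height a field rectangle has once the viewer applies the page's /Rotate value, and the class and method names are hypothetical.

```java
public class RotatedFieldSize {
    /** Returns {width, height} of a /Rect as displayed on a page with the given /Rotate value. */
    static double[] displayedSize(double llx, double lly, double urx, double ury, int rotate) {
        double width = Math.abs(urx - llx);
        double height = Math.abs(ury - lly);
        int normalized = ((rotate % 360) + 360) % 360;
        // A quarter turn in either direction swaps width and height on screen.
        boolean swapped = normalized == 90 || normalized == 270;
        return swapped ? new double[] {height, width} : new double[] {width, height};
    }

    public static void main(String[] args) {
        // /Rect of aAddressLine1 in the rotated-portrait variant of the page.
        double[] raw = displayedSize(86.8914, 23.1711, 100.132, 292.121, 0);
        double[] shown = displayedSize(86.8914, 23.1711, 100.132, 292.121, 90);
        System.out.printf("ignoring /Rotate: %.2f x %.2f%n", raw[0], raw[1]);
        System.out.printf("with /Rotate 90:  %.2f x %.2f%n", shown[0], shown[1]);
    }
}
```

Ignoring the rotation yields the narrow, tall box that the appearance stream was laid out for; taking it into account yields the wide, flat box the user actually sees.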
Alternatives
In a comment you also asked
Do you know another library I could use if PDFBox can't do what I want?
I have tested the feat with iText 5.4.2:
PdfReader reader = new PdfReader(MY_FILE);
OutputStream os = new FileOutputStream(PATH_HERE);
PdfStamper stamper = new PdfStamper(reader, os);
AcroFields acroFields = stamper.getAcroFields();
acroFields.setField("aAddressLine1", "ADDRESS1_HERE");
acroFields.setField("aAddressLine2", "ADDRESS1_HERE");
stamper.close();
(The free iText version is licensed under the AGPL; you have to decide whether that's ok for your project. There is a commercial license, too, if it's not ok.)
I'm sure other PDF libraries also can do that, it's not too exotic a feature after all...
But I also tested PDF Clown 0.1.3 (trunk version), which did not work either:
File file = new File(MY_FILE);
Document document = file.getDocument();
Form form = document.getForm();
form.getFields().get("aAddressLine1").setValue("ADDRESS1_HERE");
form.getFields().get("aAddressLine2").setValue("ADDRESS1_HERE");
file.save(new java.io.File(PATH_HERE), SerializationModeEnum.Incremental);
file.close();
