Can we compress PDF file size using iText? - java

I am using iText to read and manipulate PDF files (reports). I read in a PDF report using iText and save each page as a separate PDF file, but I am unable to reduce the size of the generated PDFs. Is there a compression technique or any other way to reduce the size of a PDF? Does PDFBox help with this?

You can try to set a compression level when using iText:
Document document = ...
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
writer.setCompressionLevel(9);
Level 9 is the slowest, but gives you the best compression available in iText.
Please note that the effect of compression largely depends on the PDF content. If your PDF file contains large binary streams, such as images, which are usually compressed already, the compression will have little to no effect on your document. Also, iText will never compress the XMP metadata stream, regardless of the configuration options.
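To illustrate why the content matters, here is a minimal sketch that uses only the JDK's java.util.zip.Deflater, not iText. It assumes that the compression level behaves like a Deflate level (0 to 9), which is the algorithm behind PDF's Flate filter. The class name DeflateLevelDemo and its helper methods are made up for this example. Repetitive, text-like content shrinks dramatically, while random bytes (standing in for already-compressed image data) barely change.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class DeflateLevelDemo {

    // Compresses the input with the given Deflate level (0-9) and returns the compressed size.
    public static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.size();
    }

    // Repetitive, text-like content, loosely resembling PDF page operators.
    public static byte[] repetitiveContent(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("BT /F1 12 Tf 72 720 Td (Report line) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Random bytes stand in for data that is already compressed (e.g. JPEG image streams).
    public static byte[] randomContent(int length) {
        byte[] data = new byte[length];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        byte[] text = repetitiveContent(2000);
        byte[] random = randomContent(text.length);
        System.out.println("text:   " + text.length + " -> " + deflatedSize(text, 9));
        System.out.println("random: " + random.length + " -> " + deflatedSize(random, 9));
    }
}
```

If the files produced by splitting are dominated by images, the second case is the relevant one, and raising the compression level alone is unlikely to help much.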

Related

How to reduce the size of split PDF document in PDFBox? [duplicate]

I have a PDF file to save, but first I have to compress it while keeping the best possible quality, and I must use an open source library (like Apache PDFBox®).
So far, what I do is get all the image resources, compress them and put them back into the PDF, but the compression ratio is too low. This is just a fragment of the code where I assign the compression parameters:
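// pdxObject, FileType and COMPRESSION_FACTOR are defined elsewhere in the asker's code
PDImageXObject imageXObject = (PDImageXObject) pdxObject;
// look up a JPEG writer from the standard javax.imageio registry
ImageWriter imageWriter = ImageIO
        .getImageWritersByFormatName(FileType.JPEG.name().toLowerCase()).next();
ImageWriteParam imageWriteParam = imageWriter.getDefaultWriteParam();
// explicit mode is required before a compression quality can be set
imageWriteParam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
// quality ranges from 0.0f (smallest file) to 1.0f (best quality)
imageWriteParam.setCompressionQuality(COMPRESSION_FACTOR);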
Is there some other mechanism to optimize a PDF? So far, compressing only the images gives rather poor results.
On compression: indeed, images are probably the largest culprits.
Images: The image dimensions (width and height) contribute to the file size too, not only the lossy image quality (your COMPRESSION_FACTOR). In general I would start by compressing a JPEG file outside the PDF. That way you can find the strongest compression that still displays and prints (!) adequately. Photos are best stored as JPEG; vector graphics (like diagrams) can best be done with Encapsulated PostScript.
Repeated images like page logos should be stored only once, not once per page. The optimisation here is internet streaming.
Fonts: The default fonts need no space, fully embedded fonts need the most space (for PDFs with forms, for instance). Subset-embedded fonts are a third possibility, storing only the symbols one needs.
PDF's own binary data: Text and other parts can be stored uncompressed, compressed using only 7-bit ASCII, or further compressed using all byte values. The ASCII option is a bit outdated.
At the moment I am not using PDFBox, hence I leave that part to you.
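The point about image dimensions versus JPEG quality can be tried outside the PDF with nothing but the JDK. The following is a minimal sketch under that assumption: it does not touch PDFBox, and the class JpegShrinkDemo with its methods sampleImage, scale and encodeJpeg is hypothetical. It encodes a synthetic test image in memory at different quality settings and sizes so the byte counts can be compared.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

public class JpegShrinkDemo {

    // Encodes the image as JPEG in memory; quality ranges from 0.0f to 1.0f.
    public static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    // Returns a copy of the image scaled to the given pixel dimensions.
    public static BufferedImage scale(BufferedImage source, int width, int height) {
        BufferedImage target = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = target.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(source, 0, 0, width, height, null);
        g.dispose();
        return target;
    }

    // Synthetic test image: two gradients plus deterministic noise in the blue channel.
    public static BufferedImage sampleImage(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Random random = new Random(7);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = (x * 255) / width;
                int g = (y * 255) / height;
                int b = random.nextInt(256);
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        BufferedImage image = sampleImage(400, 300);
        System.out.println("quality 0.9, 400x300: " + encodeJpeg(image, 0.9f).length);
        System.out.println("quality 0.3, 400x300: " + encodeJpeg(image, 0.3f).length);
        System.out.println("quality 0.9, 200x150: " + encodeJpeg(scale(image, 200, 150), 0.9f).length);
    }
}
```

Running it on a real image from the report should show which of the two knobs, quality or dimensions, saves more before visible quality suffers.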

Apache Batik SVG to PDF - Output PDF is Incorrect Size

I'm trying to convert an SVG file into a PDF for embedding into another PDF document. I'm using the batik transcoder, passing in the bytes for the SVG and getting the data for the PDF back.
My main PDF document and the SVG file passed into the transcoder both have dimensions of:
width="602.8" height="763.8"
The output PDF file generated from the SVG is smaller, however. Because of this, when it is embedded into our main document, the generated SVG PDF doesn't take up all the available space in our main PDF as I would expect. How can I force the output PDF to have the same dimensions as the main document / input SVG?
After some further research I came to a solution. We're using PDFBox as our PDF manipulation tool, which uses 72 DPI by default for documents.
Batik, on the other hand, uses 96 DPI when transcoding an SVG to a PDF file. This makes the output file smaller than the main PDFBox-generated document. To switch Batik to a DPI that matches the PDFBox default, we must change the pixel-to-millimetre conversion from 96 DPI to 72 DPI.
We can add a transcoding hint to our PDFTranscoder as follows:
transcoder.addTranscodingHint(PDFTranscoder.KEY_PIXEL_UNIT_TO_MILLIMETER,
(25.4f / 72f));
where (25.4f / 72f) is the size of one pixel in millimetres at 72 DPI. This replaces the default value for 96 DPI (25.4f / 96f).
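The arithmetic behind the hint can be checked with a few lines of plain Java. This is a sketch that assumes the transcoder turns SVG pixels into millimetres with the given factor and that PDF page units are points (1/72 inch). The class PixelUnitDemo and its methods are hypothetical helpers, not part of Batik or PDFBox.

```java
import java.util.Locale;

public class PixelUnitDemo {
    static final double MM_PER_INCH = 25.4;
    static final double POINTS_PER_INCH = 72.0;

    // Size of one pixel in millimetres at the given resolution.
    public static double pixelToMillimetre(double dpi) {
        return MM_PER_INCH / dpi;
    }

    // Converts a pixel length to PDF points, going through millimetres.
    public static double pixelsToPoints(double pixels, double dpi) {
        double millimetres = pixels * pixelToMillimetre(dpi);
        return millimetres / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        double width = 602.8;   // SVG width from the question
        double height = 763.8;  // SVG height from the question
        System.out.println(String.format(Locale.ROOT, "96 dpi: %.1f x %.1f pt",
                pixelsToPoints(width, 96), pixelsToPoints(height, 96)));
        System.out.println(String.format(Locale.ROOT, "72 dpi: %.1f x %.1f pt",
                pixelsToPoints(width, 72), pixelsToPoints(height, 72)));
    }
}
```

Under these assumptions a width of 602.8 pixels comes out at roughly 452.1 points at 96 DPI, that is 75% of the expected size, and at 602.8 points at 72 DPI, which matches the main document.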

How I can get Images in PDF with ITEXT and convert PDXObjectImage format to BufferedImage

I need to get the images on one specific page of a PDF, and I need to convert each image from the iText PDXObjectImage format to a BufferedImage. Is this possible, and how can I do it?
I have seen several extraction examples, but I need the image in memory, not in a file, because I want to pass it to the ZXing library to read a barcode.
Can somebody help me, please?

Read pdf uploadstream one page at a time with java

I am trying to read a PDF document in a J2EE application.
For a web application I have to store PDF documents on disk. To make searching easy I want to build a reverse index of the text inside the document, if it contains OCR text.
With the PDFBox library it is possible to create a pdfDocument object which contains an entire PDF file. However, to save memory and improve overall performance I'd rather handle the document as a stream and read one page at a time into a buffer.
I wonder if it is possible to read a file stream containing a PDF page by page, or even one line at a time.
For a given generic PDF document you have no way of knowing where one page ends and another one starts, using PDFBox at least.
If your concern is the use of resources, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using .getObjects(), which will give you a java.util.List. This should be easy to fit into whatever scarce resources you have.
Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.
Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.
For parsing the PDF document from an InputStream, look at the COSDocument class.
For writing Lucene indexes, look at the LucenePDFDocument class.
For in-memory representations of COSDocuments, look at FDFDocument.
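To make the reverse-index idea from the question concrete, here is a minimal sketch in plain Java. It is neither Lucene nor PDFBox API. It assumes the text of each page has already been extracted by whatever library you use, and the class PageIndexSketch with its methods is made up for illustration. Only the postings (word to page numbers) are kept in memory, so each page's text can be discarded once it has been indexed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class PageIndexSketch {
    // word -> sorted set of 1-based page numbers on which the word occurs
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Adds the words of one page; the page text is not retained afterwards.
    public void addPage(int pageNumber, String pageText) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    // Returns the pages containing the word (case-insensitive), in ascending order.
    public List<Integer> pagesContaining(String word) {
        SortedSet<Integer> pages = postings.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? new ArrayList<>() : new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage(1, "Quarterly report: revenue and costs");
        index.addPage(2, "Revenue by region");
        System.out.println(index.pagesContaining("revenue")); // prints [1, 2]
        System.out.println(index.pagesContaining("costs"));   // prints [1]
    }
}
```

For anything beyond a toy, a real search library such as Lucene, as suggested above, is probably the better choice, since it handles persistence, stemming and ranking.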
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering so that only temporary file(s) are used (no main memory), with no size restriction.
This was answered here.
Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.
Here is an example copied from the link above which shows how to draw a PDF page into an image:
import java.awt.Image;
import java.awt.Rectangle;
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;

// map the PDF file into memory read-only
File file = new File("test.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel channel = raf.getChannel();
ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
PDFFile pdffile = new PDFFile(buf);

// draw the first page to an image
PDFPage page = pdffile.getPage(0);

// get the width and height for the doc at the default zoom
Rectangle rect = new Rectangle(0, 0,
        (int) page.getBBox().getWidth(),
        (int) page.getBBox().getHeight());

// generate the image
Image img = page.getImage(
        rect.width, rect.height, // width & height
        rect,                    // clip rect
        null,                    // null for the ImageObserver
        true,                    // fill background with white
        true                     // block until drawing is done
);
I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
