Compressing PDF with Java library

I'm building a chat platform and implementing attachment uploads.
For that I need a couple of libraries to compress the uploaded files (PDF, image, video).
The compression runs in a Lambda function.
Image compression is working fine.
Video compression (mp4, avi, x264) is what I'm working on right now.
The difficult part is the PDF.
A PDF may contain embedded images or other content. To handle PDFs I'm using this library:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>3.0.0-RC1</version>
</dependency>
And the code for the compression is this
public class PdfCompressor implements MediaCompressor {
public byte[] compress(MediaData media) throws IOException {
byte[] data = media.getData();
if (data == null) {
throw new IllegalArgumentException("Data must not be null");
}
try (PDDocument document = Loader.loadPDF(new ByteArrayInputStream(data));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
for (PDPage page : document.getPages()) {
PDResources resources = page.getResources();
if (resources == null) {
continue;
}
for (COSName xObjectName : resources.getXObjectNames()) {
PDXObject xObject = resources.getXObject(xObjectName);
if (xObject instanceof PDImageXObject) {
compressImage((PDImageXObject) xObject, document, outputStream);
}
}
}
return outputStream.toByteArray();
}
}
private void compressImage(PDImageXObject imageObject, PDDocument document, ByteArrayOutputStream outputStream) throws IOException {
// Only compress supported image formats
String format = imageObject.getSuffix();
if (format == null || !(format.equals("jpg") || format.equals("jpeg") || format.equals("png") || format.equals("bmp"))) {
outputStream.write(imageObject.getStream().toByteArray());
return;
}
// Compress the image
BufferedImage image = imageObject.getImage();
ByteArrayOutputStream compressedStream = new ByteArrayOutputStream();
PDImageXObject compressedImage = LosslessFactory.createFromImage(document, image);
ImageIO.write(compressedImage.getImage(), format, compressedStream);
byte[] compressedImageData = compressedStream.toByteArray();
// Check if compressed size is less than original size
if (compressedImageData.length < imageObject.getStream().getLength()) {
outputStream.write(compressedImageData);
} else {
outputStream.write(imageObject.getStream().toByteArray());
}
}
}
But the problem is that this code increases the size of the PDF and also corrupts the file, so I can't open it afterwards.
This runs in Lambda, which then tries to upload the result to an S3 bucket.
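
One reading of the symptoms above: the method returns whatever compressImage wrote to outputStream, which is a run of raw image bytes, and the document itself is never saved, so the result is not a PDF at all. Separately, LosslessFactory produces lossless output, which usually does not shrink photographic images. The sketch below shows only a lossy re-encoding step, using javax.imageio from the JDK. The class name JpegRecompressor and the quality values are illustrative assumptions, not part of the original code. In PDFBox the re-encoded image would still have to be put back into the page resources and the document written with PDDocument.save(OutputStream).

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper: re-encodes an image as JPEG at a chosen quality.
public class JpegRecompressor {

    public static byte[] toJpeg(BufferedImage source, float quality) throws IOException {
        // JPEG has no alpha channel, so draw onto an opaque white RGB canvas first.
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
        g.drawImage(source, 0, 0, null);
        g.dispose();

        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpeg");
        if (!writers.hasNext()) {
            throw new IOException("No JPEG writer is registered with ImageIO");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality); // 0.0 = smallest file, 1.0 = best quality
            writer.setOutput(out);
            writer.write(null, new IIOImage(rgb, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Keeps the re-encoded bytes only if they are actually smaller. */
    public static byte[] smallerOf(byte[] original, byte[] recompressed) {
        return recompressed.length < original.length ? recompressed : original;
    }
}
```

The smallerOf check mirrors the size comparison in the question: an image that does not get smaller is left as it was.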

Related

Java Apache POI: insert an image "infront the text"

I have a placeholder image in my docx file and I want to replace it with a new image. The problem is that the placeholder image has the "in front of text" attribute, but the new image does not. As a result the alignment breaks. Here are my code snippet, the docx with the placeholder, and the resulting docx.
.......
replaceImage(doc, "Рисунок 1", qr, 50, 50);
ByteArrayOutputStream out = new ByteArrayOutputStream();
doc.write(out);
out.close();
return out.toByteArray();
}
}
public XWPFDocument replaceImage(XWPFDocument document, String imageOldName, byte[] newImage, int newImageWidth, int newImageHeight) throws Exception {
try {
int imageParagraphPos = -1;
XWPFParagraph imageParagraph = null;
List<IBodyElement> documentElements = document.getBodyElements();
for (IBodyElement documentElement : documentElements) {
imageParagraphPos++;
if (documentElement instanceof XWPFParagraph) {
imageParagraph = (XWPFParagraph) documentElement;
if (imageParagraph.getCTP() != null && imageParagraph.getCTP().toString().trim().contains(imageOldName)) {
break;
}
}
}
if (imageParagraph == null) {
throw new Exception("Unable to replace image data due to the exception:\n"
+ "'" + imageOldName + "' not found in document.");
}
ParagraphAlignment oldImageAlignment = imageParagraph.getAlignment();
// remove old image
boolean isDeleted = document.removeBodyElement(imageParagraphPos);
// now add new image
XWPFParagraph newImageParagraph = document.createParagraph();
XWPFRun newImageRun = newImageParagraph.createRun();
newImageParagraph.setAlignment(oldImageAlignment);
try (InputStream is = new ByteArrayInputStream(newImage)) {
newImageRun.addPicture(is, XWPFDocument.PICTURE_TYPE_JPEG, "qr",
Units.toEMU(newImageWidth), Units.toEMU(newImageHeight));
}
// set new image at the old image position
document.setParagraph(newImageParagraph, imageParagraphPos);
// NOW REMOVE REDUNDANT IMAGE FROM THE END OF DOCUMENT
document.removeBodyElement(document.getBodyElements().size() - 1);
return document;
} catch (Exception e) {
throw new Exception("Unable to replace image '" + imageOldName + "' due to the exception:\n" + e);
}
}
The document with the placeholder image: [screenshot not included]
The resulting document: [screenshot not included]

To replace picture templates in Microsoft Word there is no need to delete them.
The storage works like this:
The embedded media is stored as a binary file. This is the picture data (XWPFPictureData). In the document, a picture element (XWPFPicture) links to that picture data.
The XWPFPicture has the settings for position, size and text flow. These don't need to be changed.
The change is needed in XWPFPictureData: there one can replace the old binary content with the new one.
So the task is to find the XWPFPicture in the document. A non-visual picture name is stored when the picture is inserted into the document, so if one knows that name, it can serve as the criterion for finding the picture.
Once found, one can get the XWPFPictureData from the XWPFPicture using XWPFPicture.getPictureData. Then one can replace the old binary content of the XWPFPictureData with the new content. XWPFPictureData is backed by a package part, so getPackagePart().getOutputStream() provides an output stream to write to.
The following complete example shows all of that.
The source.docx needs to have an embedded picture named "QRTemplate.jpg". This is the name of the source file used when the picture was inserted into the Word document using the Word GUI. There also needs to be a file QR.jpg which contains the new content.
The result.docx then has all pictures named "QRTemplate.jpg" replaced with the content of the given file QR.jpg.
import java.io.FileInputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordReplacePictureData {
static XWPFPicture getPictureByName(XWPFRun run, String pictureName) {
if (pictureName == null) return null;
for (XWPFPicture picture : run.getEmbeddedPictures()) {
String nonVisualPictureName = picture.getCTPicture().getNvPicPr().getCNvPr().getName();
if (pictureName.equals(nonVisualPictureName)) {
return picture;
}
}
return null;
}
static void replacePictureData(XWPFPictureData source, String pictureResultPath) {
try ( FileInputStream in = new FileInputStream(pictureResultPath);
OutputStream out = source.getPackagePart().getOutputStream();
) {
byte[] buffer = new byte[2048];
int length;
while ((length = in.read(buffer)) > 0) {
out.write(buffer, 0, length);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
static void replacePicture(XWPFRun run, String pictureName, String pictureResultPath) {
XWPFPicture picture = getPictureByName(run, pictureName);
if (picture != null) {
XWPFPictureData source = picture.getPictureData();
replacePictureData(source, pictureResultPath);
}
}
public static void main(String[] args) throws Exception {
String templatePath = "./source.docx";
String resultPath = "./result.docx";
String pictureTemplateName = "QRTemplate.jpg";
String pictureResultPath = "./QR.jpg";
try ( XWPFDocument document = new XWPFDocument(new FileInputStream(templatePath));
FileOutputStream out = new FileOutputStream(resultPath);
) {
for (IBodyElement bodyElement : document.getBodyElements()) {
if (bodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)bodyElement;
for (XWPFRun run : paragraph.getRuns()) {
replacePicture(run, pictureTemplateName, pictureResultPath);
}
}
}
document.write(out);
}
}
}

I have a dirty workaround. Since the text block on the right side of the image is static, I replaced the text with a screenshot in the original docx. Now, when the placeholder image has been substituted by the new image, everything is rendered as expected.

Get BufferedImage from png PDXObjectImage

I'm trying to get a BufferedImage from a PDXObjectImage that has a png suffix with:
PDResources pdResources = pdPage.getResources();
Map<String, PDXObject> xobjects = (Map<String, PDXObject>) pdResources.getXObjects();
if (xobjects != null) {
for (String key : xobjects.keySet()) {
PDXObject xobject = xobjects.get(key);
if (xobject instanceof PDXObjectImage) {
PDXObjectImage imageObject = (PDXObjectImage) xobject;
String suffix = imageObject.getSuffix();
if (suffix != null) {
BufferedImage image = imageObject.getRGBImage();
}
}
}
}
This code works fine for jpg PDXObjectImages, but image is null for png images.
What is the right way to get a BufferedImage from a PDXObjectImage that has a png suffix?
I also tried:
BufferedImage image = ImageIO.read(((PDPixelMap)imageObject).getPDStream().createInputStream());
But again image is null.
I'm using org.apache.pdfbox version 1.8.11.

I finally moved to version 2.0 of PDFBox, which gave a clear warning that no JBIG2 decoder was installed. Adding the following Maven dependency solved the problem:
<dependency>
<groupId>com.levigo.jbig2</groupId>
<artifactId>levigo-jbig2-imageio</artifactId>
<version>1.6.5</version>
</dependency>
@TilmanHausherr thanks.
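
Since the fix was a missing decoder plugin, it can help to check which decoders javax.imageio has actually discovered on the classpath before blaming the PDF. This is a minimal sketch using only the JDK. The class name ImageIoFormats is hypothetical, and the format name JBIG2 passed in main is an assumption about how the plugin registers itself.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.imageio.ImageIO;

// Hypothetical diagnostic helper for listing the ImageIO readers on the classpath.
public class ImageIoFormats {

    /** True if some registered ImageIO reader claims the given format name. */
    public static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    /** Lower-cased, de-duplicated set of every readable format name. */
    public static Set<String> readableFormats() {
        Set<String> names = new TreeSet<>();
        for (String name : ImageIO.getReaderFormatNames()) {
            names.add(name.toLowerCase(Locale.ROOT));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Readable formats: " + readableFormats());
        // Assumed format name; check the plugin's documentation for the exact spelling.
        System.out.println("JBIG2 reader present: " + canRead("JBIG2"));
    }
}
```

Running this once without and once with the extra dependency shows whether the plugin was picked up.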

PDFBox locks JPEG input file until application exits

I'm using PDFBox RC2 in a Windows 7 environment, Java 1.8_66. I'm using it to create a PDF from a collection of 200dpi page-sized image files, both JPEG and PNG.
It turns out that when adding JPEG files to a PDF, the PDImageXObject.createFromFile() routine fails to close an internal file handle, thus locking the image file for the lifetime of the application. When adding PNG files to a PDF, there is no problem.
Here's some sample code that reproduces the issue. Using process explorer (from sysinternals), view the open file handles for the java.exe process and run this code. My test uses about 20 full sized JPEG files. Note that after the method exits, several locked files still remain behind.
public Boolean CreateFromImages_Broken(String pdfFilename, String[] imageFilenames) {
PDDocument doc = new PDDocument();
for (String imageFilename : imageFilenames) {
try {
PDPage page = new PDPage();
doc.addPage(page);
PDImageXObject pdImage = PDImageXObject.createFromFile(imageFilename, doc);
// at this point, if the imageFilename is a jpeg, pdImage holds onto a handle for
// the given imageFilename and that file remains locked until the application is closed
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
float scale = (float)72.0 / 200;
page.setMediaBox(new PDRectangle((int)(pdImage.getWidth() * scale), (int)(pdImage.getHeight() * scale)));
contentStream.drawImage(pdImage, 0, 0, pdImage.getWidth()*scale, pdImage.getHeight()*scale);
}
} catch (IOException ioe) {
return false;
}
}
try {
doc.save(pdfFilename);
doc.close();
} catch (IOException ex) {
return false;
}
return true;
}

As a workaround, I reviewed the source code for PNG and JPEG handling, and I've had success by implementing this, which seems to work for both file types:
public Boolean CreateFromImages_FIXED(String pdfFilename, String[] imageFilenames) {
PDDocument doc = new PDDocument();
for (String imageFilename : imageFilenames) {
FileInputStream fis = null;
try {
PDPage page = new PDPage();
doc.addPage(page);
PDImageXObject pdImage = null;
// work around JPEG issue by opening up our own stream, with which
// we can close ourselves instead of PDFBOX leaking it. For PNG
// images, the createFromFile seems to be OK
if (imageFilename.toLowerCase().endsWith(".jpg")) {
fis = new FileInputStream(new File(imageFilename));
pdImage = JPEGFactory.createFromStream(doc, fis);
} else {
pdImage = PDImageXObject.createFromFile(imageFilename, doc);
}
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
float scale = (float)72.0 / 200;
page.setMediaBox(new PDRectangle((int)(pdImage.getWidth() * scale), (int)(pdImage.getHeight() * scale)));
contentStream.drawImage(pdImage, 0, 0, pdImage.getWidth()*scale, pdImage.getHeight()*scale);
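}
} catch (IOException ioe) {
return false;
} finally {
// Close our own stream on every path, including when PDFBox throws.
if (fis != null) {
try {
fis.close();
} catch (IOException ignored) {
// nothing useful to do if close itself fails
}
}
}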
}
try {
doc.save(pdfFilename);
doc.close();
} catch (IOException ex) {
return false;
}
return true;
}
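
Another way to sidestep the lock, sketched here under assumptions: read the file into memory with java.nio.file.Files.readAllBytes, which opens and closes the file itself, and hand only the byte array to PDFBox (for example through PDImageXObject.createFromByteArray, if the PDFBox version in use provides it). The sketch below covers only the JDK half, so it runs without PDFBox. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: loads an image file without keeping a handle open.
public class ImageBytesLoader {

    /** Reads the whole file; the file handle is released before this returns. */
    public static byte[] load(String imageFilename) throws IOException {
        Path path = Paths.get(imageFilename);
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in file for the demo; these four bytes are not a usable picture.
        Path tmp = Files.createTempFile("page", ".jpg");
        Files.write(tmp, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xD9});

        byte[] bytes = load(tmp.toString());

        // No handle is left open, so the file can be deleted right away.
        // On Windows this is the step that fails while a handle is still held.
        Files.delete(tmp);
        System.out.println(bytes.length + " bytes read, file still exists: " + Files.exists(tmp));
    }
}
```

The trade-off is memory: each image is held as a byte array until it has been added to the document.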

Convert from Itext PDF byte array to multipage TIFF file

I have a PDF file (obtained from a byte[] generated by iText) that I need to send to a signature hardware device.
Due to an incompatibility with the Java printer driver I can't send the PDF directly, so I need to convert it to images first. I've succeeded in converting each PDF page to a jpg file, but the customer does not like that solution because the signatures are not applied to the whole document, only to individual pages.
As I've not found any free library, I decided to do it in four steps:
STEP 1: generate the PDF with iText and persist it.
FileOutputStream fos = new FileOutputStream("tempFile.pdf");
fos.write(myByteArray);
fos.flush();
fos.close();
STEP 2: convert the multi-page PDF to a List<java.awt.Image>
List<Image> images = null;
Ghostscript.getInstance(); // create gs instance
PDFDocument lDocument = new PDFDocument();
lDocument.load(new File("tempFile.pdf"));
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
try
{
images = renderer.render(lDocument);
}
catch (RendererException | DocumentException e)
{
e.printStackTrace();
}
STEP 3: iterate over the List<java.awt.Image> and write each page as an individual TIFF.
int filename = 1;
TIFFEncodeParam params = new TIFFEncodeParam();
Iterator<Image> imageIterator = images.iterator();
while (imageIterator.hasNext()) {
BufferedImage image = (BufferedImage) imageIterator.next();
FileOutputStream os = new FileOutputStream(/*outputDir + */ filename + ".tif");
JAI.create("encode", image , os, "TIFF", params);
os.close(); // release the file before step 4 reopens it
filename ++;
}
STEP 4: create a multi-page TIFF from the individual TIFF files
BufferedImage image[] = new BufferedImage[paginas];
for (int i = 0; i < paginas; i++) {
SeekableStream ss = new FileSeekableStream((i + 1) + ".tif");
ImageDecoder decoder = ImageCodec.createImageDecoder("tiff", ss, null);
PlanarImage pi = new NullOpImage(decoder.decodeAsRenderedImage(0),null,null,OpImage.OP_IO_BOUND);
image[i] = pi.getAsBufferedImage();
ss.close();
}
TIFFEncodeParam params = new TIFFEncodeParam();
params.setCompression(TIFFEncodeParam.COMPRESSION_DEFLATE);
OutputStream out = new FileOutputStream(nombre +".tif");
ImageEncoder encoder = ImageCodec.createImageEncoder("tiff", out, params);
List <BufferedImage>list = new ArrayList<BufferedImage>(image.length);
for (int i = 1; i < image.length; i++) {
list.add(image[i]);
}
params.setExtraImages(list.iterator());
encoder.encode(image[0]);
out.close();
System.out.println("Done.");
Done. I hope this helps someone else with the same problem.

I had the same issue a while ago. I got a lot of help from here:
Multiple page tif
Also check:
JAI (Java Advanced Imaging)
Here is the code snippet to convert PDF pages to PNG images (using the org.apache.pdfbox library):
PDDocument document = null;
document = PDDocument.load(pdf1);
int pageNum = document.getNumberOfPages();
PDFImageWriter writer = new PDFImageWriter();
String filename = pdf1.getPath() + "-";
filename = filename.replace(".pdf", "");
writer.writeImage(document, "png", "", 1, Integer.MAX_VALUE, filename);
document.close();
After that I converted each PNG image to TIFF, and then combined the TIFF images into a single multi-page TIFF.
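
On Java 9 and later, javax.imageio ships with a TIFF plugin, so writing the individual TIFF files and merging them can be collapsed into a single sequence write without JAI. This swaps in a different technique from the JAI code above and is only a sketch. The class name MultiPageTiff is hypothetical, and the "Deflate" compression type mirrors the COMPRESSION_DEFLATE choice in step 4.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper: writes all pages into one TIFF using the JDK's own plugin.
public class MultiPageTiff {

    public static void write(List<BufferedImage> pages, File target) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF writer found (needs Java 9+ or a TIFF plugin)");
        }
        ImageWriter writer = writers.next();
        // The image output stream does not truncate, so remove any old file first.
        Files.deleteIfExists(target.toPath());
        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            if (out == null) {
                throw new IOException("Cannot open " + target);
            }
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("Deflate");
            writer.setOutput(out);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                // Each call appends one page to the same file.
                writer.writeToSequence(new IIOImage(page, null, null), param);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }

    /** Counts the pages in a TIFF, useful for checking the result. */
    public static int countPages(File tiff) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff)) {
            if (in == null) {
                throw new IOException("Cannot open " + tiff);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("Not a readable image: " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }
}
```

With the rendered pages from step 2 cast to BufferedImage, a call such as MultiPageTiff.write(pages, new File("document.tif")) would produce the final file directly.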

Edit images in PDF file using COSStream object

I'm trying to edit images in a PDF file using the PDFBox library. So far I have an example that works only for JPEG images: ImageIO.read() fails to decode images with a 'png' suffix. Here is the code example. So my question is: how can I do the same for all types of images in PDF documents? Can I still use ImageIO for this, or do I need another approach?
public static void main(String[] args) throws Exception {
PDDocument doc = PDDocument.load("docs/input1.pdf");
// Get all images from first page
Map<String, PDXObjectImage> pageImages = ((PDPage) doc.getDocumentCatalog().getAllPages().get(0)).getResources().getImages();
if (pageImages != null)
{
// iterate by images
Iterator<String> imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext())
{
String key = imageIter.next();
PDXObjectImage image = pageImages.get(key); // get page image object
String suffix = image.getSuffix(); // get image suffix
String imageName = key+'.'+suffix; // compose image name
System.out.print("process "+imageName+"... ");
COSStream s = image.getCOSStream(); // get COSStream to manipulate
BufferedImage img = ImageIO.read(s.getFilteredStream()); // get BufferedImage to edit
if(img == null)
{
System.out.println("Can't decode");
}
else
{
paint(img.createGraphics()); // draw on it
ImageIO.write(img, suffix, new File("out/"+imageName)); // write file to check result...
// encode image back to COSStream
OutputStream out = s.createFilteredStream();
ImageIO.write(img, suffix, out);
out.close();
System.out.println("done");
}
}
}
doc.save("out/output1.pdf"); // save document
}
/**
* Draw a red rectangle to test
* @param g graphics
*/
public static void paint(Graphics2D g) {
int xpoints[] = {25, 245, 245, 25};
int ypoints[] = {25, 25, 545, 545};
g.setColor(Color.RED);
g.fillPolygon(xpoints, ypoints, 4);
}

It's better not to work with the stream of the PDXObjectImage, but to create a new instance of PDXObjectImage and replace it in the resources collection. This is a more generic and universal approach. Use getRGBImage() to convert a PDXObjectImage to a BufferedImage, and a constructor (PDPixelMap, PDJpeg, etc.) to convert the edited result back to a PDXObjectImage. Note that you will still have problems with JBIG2 and JPEG 2000 images due to bugs. Here is the code example I use to find and convert all images in a document:
// Recursive resource processor
// Images can also be nested inside PDXObjectForm objects
protected static void processResources(PDResources resources, PDDocument doc, String filename) throws IllegalArgumentException, SecurityException, IOException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, JBIG2Exception, ColorSpaceException, ICCProfileException
{
if(resources == null) return;
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null) return;
// iterate by images
Iterator<String> imageIter = xObjects.keySet().iterator();
while (imageIter.hasNext())
{
String key = imageIter.next();
PDXObject o = xObjects.get(key);
if(o instanceof PDXObjectImage)
xObjects.put(key, processImage((PDXObjectImage) o /*, some additional parms... */));
if(o instanceof PDXObjectForm)
processResources(((PDXObjectForm) o).getResources(), doc, filename);
}
resources.setXObjects(xObjects);
}
Note the resources.setXObjects() call at the end: without it, the changes made to the collection obtained from resources.getXObjects() will not be written back to the document.
