I am using pdfbox to manipulate PDF content. I have a big PDF file (say 500 pages). I also have a few other single page PDF files containing only a single image which are around 8-15kb per file at the max. What I need to do is to import these single page pdf's like an overlay onto certain pages of the big PDF file.
I have tried the LayerUtility of pdfbox where I've succeeded but it creates a very large sized file as the output. The source pdf is about 1MB before processing and when added with the smaller pdf files, the size goes upto 64MB. And sometimes I need to include two smaller PDF's onto the bigger one.
Is there a better way to do this or am I just doing this wrong? Posting code below trying to add two layers onto a single page:
...
...
..
overlayDoc[pCounter] = PDDocument.load("data\\" + overlay + ".pdf");
outputPage[pCounter] = (PDPage) overlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu = new LayerUtility( overlayDoc[pCounter] );
form[pCounter] = lu.importPageAsForm( bigPDFDoc, Integer.parseInt(pageNo)-1);
lu.appendFormAsLayer( outputPage[pCounter], form[pCounter], aTrans, "OVERLAY_"+pCounter );
outputDoc.addPage(outputPage[pCounter]);
mOverlayDoc[pCounter] = PDDocument.load("data\\" + overlay2 + ".pdf");
mOutputPage[pCounter] = (PDPage) mOverlayDoc[pCounter].getDocumentCatalog().getAllPages().get(0);
LayerUtility lu2 = new LayerUtility( mOverlayDoc[pCounter] );
mForm[pCounter] = lu2.importPageAsForm(outputDoc, outputDoc.getNumberOfPages()-1);
lu.appendFormAsLayer( mOutputPage[pCounter], mForm[pCounter], aTrans, "OVERLAY_2"+pCounter );
outputDoc.removePage(outputPage[pCounter]);
outputDoc.addPage(mOutputPage[pCounter]);
...
...
With code like the following I don't see any unepected growth of size:
PDDocument bigDocument = PDDocument.load(BIG_SOURCE_FILE);
LayerUtility layerUtility = new LayerUtility(bigDocument);
List bigPages = bigDocument.getDocumentCatalog().getAllPages();
// import each page to superimpose only once
PDDocument firstSuperDocument = PDDocument.load(FIRST_SUPER_FILE);
PDXObjectForm firstForm = layerUtility.importPageAsForm(firstSuperDocument, 0);
PDDocument secondSuperDocument = PDDocument.load(SECOND_SUPER_FILE);
PDXObjectForm secondForm = layerUtility.importPageAsForm(secondSuperDocument, 0);
// These things can easily be done in a loop, too
AffineTransform affineTransform = new AffineTransform(); // Identity... your requirements may differ
layerUtility.appendFormAsLayer((PDPage) bigPages.get(0), firstForm, affineTransform, "Superimposed0");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(1), secondForm, affineTransform, "Superimposed1");
layerUtility.appendFormAsLayer((PDPage) bigPages.get(2), firstForm, affineTransform, "Superimposed2");
bigDocument.save(BIG_TARGET_FILE);
As you see I superimposed the first page of FIRST_SUPER_FILE on two pages of the target file but I only imported the page once. Thus, also the resources of that imported page are imported only once.
This approach is open for loops, too, but don't import the same page multiple times! Instead import all required template pages once up front as forms and in the later loop reference those forms again and again.
(I hope this solves your issue. If not, supply more code and the sample PDFs to reproduce your issue.)
Related
I have added an icon as Image object into PDF page with OpenPdf that is based on iText core. Here is my code
// inout stream from file
InputStream inputStream = new FileInputStream(file);
// we create a reader for a certain document
PdfReader reader = new PdfReader(inputStream);
// we create a stamper that will copy the document to a new file
PdfStamper stamp = new PdfStamper(reader, new FileOutputStream(file));
// adding content to each page
PdfContentByte over;
// get watermark icon
Image img = Image.getInstance(PublicFunction.getByteFromDrawable(context, R.drawable.ic_chat_lawone_new));
img.setAnnotation(new Annotation(0, 0, 0, 0, "https://github.com/LibrePDF/OpenPDF"));
img.setAbsolutePosition(pointF.x, pointF.y);
img.scaleAbsolute(50, 50);
// get page file number count
int pageNumbers = reader.getNumberOfPages();
if (pageNumbers < pageIndex) {
// closing PdfStamper will generate the new PDF file
stamp.close();
throw new PDFException("page index is out of pdf file page numbers", new Throwable());
}
// annotation added into target page
over = stamp.getOverContent(pageIndex);
if (over == null) {
stamp.close();
throw new PDFException("getUnderContent is null", new Throwable());
}
over.addImage(img);
// closing PdfStamper will generate the new PDF file
stamp.close();
// close reader
reader.close();
now I need to delete or update the color of added image object on user click, I have the click function that returns MotionEvent, now I need to delete or update or replace added image object.
Any Idea?!
In your parallel OpenPDF issue 464 you posted additionally:
Here my progress
Now I can achieve the XObjects added into pdf file, and I can remove them from pdf page this way:
// inout stream from file
InputStream inputStream = new FileInputStream(file);
// we create a reader for a certain document
PdfReader pdfReader = new PdfReader(inputStream);
// get page file number count
int pageNumbers = pdfReader.getNumberOfPages();
// we create a stamper that will copy the document to a new file
PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileOutputStream(file));
// get page
PdfDictionary page = pdfReader.getPageN(currPage);
PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
// get page resources
PdfArray annots = resources.getAsArray(PdfName.ANNOTS);
PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
// remove Xobjects
for (PdfName key : xobjects.getKeys()) {
xobjects.remove(key);
}
// remove annots
for (PdfObject element : annots.getElements()) {
annots.remove(0);
}
// close pdf stamper
pdfStamper.close();
// close pdf reader
pdfReader.close();
So the XObjects will remove from the screen, but still there is a problem!!!
When I remove them and try to add a new one, last deleted object appears and add into the pdf page! REALLY!!! :))
I think there should be another place that these objects should be removed from.
What happens here is that you indeed do remove the bitmap image resources from the page:
for (PdfName key : xobjects.getKeys()) {
xobjects.remove(key);
}
but you don't remove the instructions for drawing these resources from the content stream. This has two consequences:
Your PDF strictly speaking becomes invalid as a resource is referenced from the content stream which is not defined in the page resources. Depending on the viewer in question this might result in warning messages.
If the PDF is further processed and some new XObject is added to the same page with the same resource name, the original image drawing instruction now again has a resource to draw and makes it appear at the original position once more.
This explains your observation:
When I remove them and try to add a new one, last deleted object appears and add into the pdf page! REALLY!!! :))
I assume you used the same source image in your test, so it looked like the original image appeared again at the original position when actually the new image appeared there.
Thus, instead of merely removing the image XObject, you have the choice of either
also removing the XObject drawing instruction from the content stream or
replacing the image XObject by an invisible XObject instead.
The former option in general is non-trivial, in particular if your PDF-tool-to-be also allows other changes of the page content. In case of iText 5 or iText 7 I'd propose using the PdfContentStreamEditor (see here) / PdfCanvasEditor (see here) class to find and remove Do operations from the page content streams but I have no OpenPDF version of that class yet.
What you can do quite easily, though, is replacing the image resources by form XObjects without any content:
PdfTemplate pdfTemplate = PdfTemplate.createTemplate(pdfStamper.getWriter(), 1, 1);
// replace Xobjects
for (PdfName key : xobjects.getKeys()) {
xobjects.put(key, pdfTemplate.getIndirectReference());
}
(RemoveImage test testRemoveImageAddedLikeHamidReza)
Beware, replacing all XObjects by empty XObjects has an obvious disadvantage, it replaces all XObjects, not merely the ones your tool created before! Thus, if the original PDFs processed by your tool also drew XObjects in their immediate content streams, those XObjects also are rendered invisible. If you don't want that, you need some specific criteria to recognize the image XObjects you added and only replace them.
Furthermore, there are other problems afoot: Each time you process the OverContent of a page in a PdfStamper, the pre-existing content of that page is wrapped into a q / Q (save-graphics-state / restore-graphics-state) envelope to prevent changes of the graphics state in that previous content bleed through and mix up your OverContent additions. Thus, if you manipulate a file many times in your tool, the original page content may be wrapped in a fairly deep nesting of such envelopes. Unfortunately PDF readers may support only a limited nesting depth, e.g. ISO 32000-1 mentions a maximum depth of 28 envelopes.
Thus, if you still have the chance to overhaul your design, you should consider putting the images into annotation appearances instead of into the page content. After all, you already do generate annotations, currently merely to transport a link, so you could also generate annotations with appearances.
im using Apache PDFBox,
I want to convert a RGB PDF file to another GRAYSCALE file WITHOUT using images method because its making huge file size -_- !!
so this my steps:
Export a (A4) First.pdf from Adobe InDesign, contain images, texts, vector-objects.
I read the First.pdf file. Done!
using LayerUtility, copy pages from First.pdf rotate them and put them to NEW PDF file (A4) Second.pdf. Done!
this method preferred because i need vector-objects to reduce the size.
then, i want to save this as GRAY-SCALE PDF file (Second-grayscale.pdf)
and this my code (not all):
PDDocument documentFirst = PDDocument.load("First.pdf"));
// Second.pdf its empty always
PDDocument documentSecond = PDDocument.load("Second.pdf"));
for (int page = 0; page < documentSecond.getNumberOfPages(); page++) {
// get current page from documentSecond
PDPage tempPage = documentSecond.getPage(page);
// create content contentStream
PDPageContentStream contentStream = new PDPageContentStream(documentSecond, tempPage);
// create layerUtility
LayerUtility layerUtility = new LayerUtility(documentSecond);
// importPageAsForm from documentFirst
PDFormXObject form = layerUtility.importPageAsForm(documentFirst, page);
// saveGraphicsState
contentStream.saveGraphicsState();
// rotate the page
Matrix matrix;
matrix.rotate(Math.toRadians(90));
contentStream.transform(matrix);
// draw the rotated page from documentFirst to documentSecond
contentStream.drawForm(form);
contentStream.close();
}
// save the new document
documentSecond.save("Second.pdf");
documentSecond.close();
documentFirst.close();
// now convert it to GRAYSCALE or do it in the Loop above!
well, i just start using Apache Box this week, i have followed some
example, but most are old and not working, until now i did what i
need, just need the Grayscale :)!!
if there are other solutions in java using open-source library
or a free tools. (i found with Ghost Script and Python)
i read this example but i didn't understand it and there are a functions deprecated!:
https://github.com/lencinhaus/pervads/blob/master/libs/pdfbox/src/java/org/apache/pdfbox/ConvertColorspace.java
its about PDF Specs, and changing Color Space...
You mentioned you would be interested in a Ghostscript based solution as far as I understood.
If you are able to call GS from your command line you can do color to grayscale conversion with this command line
gs -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o out.pdf -f input.pdf
my answer is taken from How to convert a PDF to grayscale from command line avoiding to be rasterized?
How can I add a page from external pdf doc to destination pdf if pages have different sizes?
Here is what I'd like to accomplish:
I tried to use LayerUtility (like in this example PDFBox LayerUtility - Importing layers into existing PDF), but once I import page from external pdf the process hangs:
PDDocument destinationPdfDoc = PDDocument.load(fileInputStream);
PDDocument externalPdf = PDDocument.load(EXTERNAL PDF);
List<PDPage> destinationPages = destinationPdfDoc.getDocumentCatalog().getAllPages();
LayerUtility layerUtility = new LayerUtility(destinationPdfDoc);
// process hangs here
PDXObjectForm firstForm = layerUtility.importPageAsForm(externalPdf, 0);
AffineTransform affineTransform = new AffineTransform();
layerUtility.appendFormAsLayer(destinationPages.get(0), firstForm, affineTransform, "external page");
destinationPdfDoc.save(resultTempFile);
destinationPdfDoc.close();
externalPdf.close();
What I'm doing wrong?
PDFBox dependencies
The main issue was that PDFBox has three core components and one required dependency. One core component was missing.
In comments the OP clarified that
Actually process doesn't hangs, the file is just not created at all.
As this sounds like there might have been an exception or error, trying to envelope the code as a try { ... } catch (Throwable t) { t.printStackTrace(); } block has been proposed in chat. And indeed,
java.lang.NoClassDefFoundError: org/apache/fontbox/util/BoundingBox
at org.apache.pdfbox.util.LayerUtility.importPageAsForm(LayerUtility.java:203)
at org.apache.pdfbox.util.LayerUtility.importPageAsForm(LayerUtility.java:135)
at ...
As it turned out, fontbox.jar was missing from the OP's setup.
The PDFBox version 1.8.x dependencies are described here. Especially there are the three core components pdfbox, fontbox, and jempbox all of which shall be present in the same version, and there is the required dependency commons-logging.
As soon as the missing component had been added, the sample worked properly.
Positioning the imported page
The imported page can be positioned on the target page by means of a translation in the AffineTransform parameter. This parameter also allows for other transformations, e.g. to scale, rotate, mirror, skew,...*
For the original sample files this PDF page
was added onto onto this page
which resulted in this page
The OP then wondered
how to position the imported layer
The parameter for that in the layerUtility.appendFormAsLayer call is the AffineTransform affineTransform. The OP used new AffineTransform() here which creates an identity matrix which in turn causes the source page to be added at the origin of coordinate system, in this case at the bottom.
By using a translation instead of the identity, e.g
PDRectangle destCrop = destinationPages.get(0).findCropBox();
PDRectangle sourceBox = firstForm.getBBox();
AffineTransform affineTransform = AffineTransform.getTranslateInstance(0, destCrop.getUpperRightY() - sourceBox.getHeight());
one can position the source page elsewhere, e.g. at the top:
PDFBox LayerUtility's expectations
Unfortunately it turns out that layerUtility.appendFormAsLayer appends the form to the page without resetting the graphics context.
layerUtility.appendFormAsLayer uses this code to add an additional content stream:
PDPageContentStream contentStream = new PDPageContentStream(
targetDoc, targetPage, true, !DEBUG);
Unfortunately a content stream generated by this constructor inherits the graphics state as is at the end of the existing content of the target page. This especially means that the user space coordinate system may not be in its default state anymore. Some software e.g. mirrors the coordinate system to have y coordinates increasing downwards.
If instead
PDPageContentStream contentStream = new PDPageContentStream(
targetDoc, targetPage, true, !DEBUG, true);
had been used, the graphics state would have been reset to its default state and, therefore, be known.
By itself, therefore, this method is not usable in a controlled manner for arbitrary input.
Fortunately, though, the LayerUtility also has a method wrapInSaveRestore(PDPage) to overcome this weakness by manipulating the content of the given page to have the default graphics state at the end.
Thus, one should replace
layerUtility.appendFormAsLayer(destinationPages.get(0), firstForm, affineTransform, "external page");
by
PDPage destPage = destinationPages.get(0);
layerUtility.wrapInSaveRestore(destPage);
layerUtility.appendFormAsLayer(destPage, firstForm, affineTransform, "external page");
I use different tools like processing to create vector plots. These plots are written as single or multi-page pdfs. I would like to include these plots in a single report-like pdf using pdfbox.
My current workflow includes these pdfs as images with the following pseudo code
PDDocument inFile = PDDocument.load(file);
PDPage firstPage = (PDPage) inFile.getDocumentCatalog().getAllPages().get(0);
BufferedImage image = firstPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
PDXObjectImage ximage = new PDPixelMap(document, image);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.drawXObject(ximage, 0, 0, ximage.getWidth(), ximage.getHeight());
contentStream.close();
While this works it looses the benefits of the vector file formats, espectially file/size vs. printing qualitity.
Is it possible to use pdfbox to include other pdf pages as embedded objects within a page (Not added as a separate page)? Could I e.g. use a PDStream? I would prefer a solution like pdflatex is able to embed pdf figures into a new pdf document.
What other Java libraries can you recommend for that task?
Is it possible to use pdfbox to include other pdf pages as embedded objects within a page
It should be possible. The PDF format allows the use of so called form xobjects to serve as such embedded objects. I don't see an explicit implementation for that, though, but the procedure is similar enough to what PageExtractor or PDFMergerUtility do.
A proof of concept derived from PageExtractor using the current SNAPSHOT of the PDFBox 2.0.0 development version:
PDDocument source = PDDocument.loadNonSeq(SOURCE, null);
List<PDPage> pages = source.getDocumentCatalog().getAllPages();
PDDocument target = new PDDocument();
PDPage page = new PDPage();
PDRectangle cropBox = page.findCropBox();
page.setResources(new PDResources());
target.addPage(page);
PDFormXObject xobject = importAsXObject(target, pages.get(0));
page.getResources().addXObject(xobject, "X");
PDPageContentStream content = new PDPageContentStream(target, page);
AffineTransform transform = new AffineTransform(0, 0.5, -0.5, 0, cropBox.getWidth(), 0);
content.drawXObject(xobject, transform);
transform = new AffineTransform(0.5, 0.5, -0.5, 0.5, 0.5 * cropBox.getWidth(), 0.2 * cropBox.getHeight());
content.drawXObject(xobject, transform);
content.close();
target.save(TARGET);
target.close();
source.close();
This code imports the first page of a source document to a target document as XObject and puts it twice onto a page there with different scaling and rotation transformations, e.g. for this source
it creates this
The helper method importAsXObject actually doing the import is defined like this:
PDFormXObject importAsXObject(PDDocument target, PDPage page) throws IOException
{
final PDStream src = page.getContents();
if (src != null)
{
final PDFormXObject xobject = new PDFormXObject(target);
OutputStream os = xobject.getPDStream().createOutputStream();
InputStream is = src.createInputStream();
try
{
IOUtils.copy(is, os);
}
finally
{
IOUtils.closeQuietly(is);
IOUtils.closeQuietly(os);
}
xobject.setResources(page.findResources());
xobject.setBBox(page.findCropBox());
return xobject;
}
return null;
}
As mentioned above this is only a proof of concept, corner cases have not yet been taken into account.
To update this question:
There is already a helper class in org.apache.pdfbox.multipdf.LayerUtility to do the import.
Example to show superimposing a PDF page onto another PDF: SuperimposePage.
This class is part of the Apache PDFBox Examples and sample transformations as shown by #mkl were added to it.
As mkl appropriately suggested, PDFClown is among the Java libraries which provide explicit support for page embedding (so-called Form XObjects (see PDF Reference 1.7, ยง 4.9)).
In order to let you get a taste of the way PDFClown works, the following code represents the equivalent of mkl's PDFBox solution (NOTE: as mkl later stated, his code sample was by no means optimised, so this comparison may not correspond to the actual status of PDFBox -- comments are welcome to clarify this):
Document source = new File(SOURCE).getDocument();
Pages sourcePages = source.getPages();
Document target = new File().getDocument();
Page targetPage = new Page(target);
target.getPages().add(targetPage);
XObject xobject = sourcePages.get(0).toXObject(target);
PrimitiveComposer composer = new PrimitiveComposer(targetPage);
Dimension2D targetSize = targetPage.getSize();
Dimension2D sourceSize = xobject.getSize();
composer.showXObject(xobject, new Point2D.Double(targetSize.getWidth() * .5, targetSize.getHeight() * .35), new Dimension(sourceSize.getWidth() * .6, sourceSize.getHeight() * .6), XAlignmentEnum.Center, YAlignmentEnum.Middle, 45);
composer.showXObject(xobject, new Point2D.Double(targetSize.getWidth() * .35, targetSize.getHeight()), new Dimension(sourceSize.getWidth() * .4, sourceSize.getHeight() * .4), XAlignmentEnum.Left, YAlignmentEnum.Top, 90);
composer.flush();
target.getFile().save(TARGET, SerializationModeEnum.Standard);
source.getFile().close();
Comparing this code to PDFBox's equivalent you can notice some relevant differences which show PDFClown's neater style (it would be nice if some PDFBox expert could validate my assertions):
Page-to-FormXObject conversion: PDFClown natively supports a dedicated method (Page.toXObject()), so there's no need for additional heavy-lifting such as the helper method importAsXObject();
Resource management: PDFClown automatically (and transparently) allocates page resources, so there's no need for explicit calls such as page.getResources().addXObject(xobject, "X");
XObject drawing: PDFClown supports both high-level (explicit scale, translation and rotation anchors) and low-level (affine transformations) methods to place your FormXObject into the page, so there's no need to necessarily deal with affine transformations.
The whole point is that PDFClown features a rich architecture made up of multiple abstraction layers: according to your requirements, you can choose the most appropriate coding style (either to delve into PDF's low-level basic structures or to leverage its convenient and elegant high-level model). PDFClown lets you tweak every single byte and solve complex tasks with a ridiculously simple method call, at your will.
DISCLOSURE: I'm the lead developer of PDFClown.
I've got a printer which doesn't support the feature I need.
The printer prints A2 paper size. I would like to print two A3 size pages, that would fit on a single A2 paper, but my printer doesn't support this.
I already called the support of the company, but they told me I need to buy a newer one because my printer doesn't support this function. (Its very funny because an even older version of that printer does support this function).
So I tried to use the Apache PDFBox, where I can load my pdf file like this:
File pdfFile = new File(path);
PDDocument pdfDocument = load(pdfFile);
The file I loaded is size A3. I think it would be enough if I could get a new PDDocument with A2 paper size. Then put my loaded pdfFile twice in an A2 paper.
All in all, I need the file I loaded there two times on one page. I just don't know how to do that.
Best regards.
You might want to look at the PageCombinationSample.java which according to its JavaDoc does nearly what you need:
This sample demonstrates how to combine multiple pages into single bigger pages (for example
two A4 modules into one A3 module) using form XObjects [PDF:1.6:4.9].
Form XObjects are a convenient way to represent contents multiple times on multiple pages as
templates.
The central code:
// 1. Opening the source PDF file...
File sourceFile = new File(filePath);
// 2. Instantiate the target PDF file!
File file = new File();
// 3. Source page combination into target file.
Document document = file.getDocument();
Pages pages = document.getPages();
int pageIndex = -1;
PrimitiveComposer composer = null;
Dimension2D targetPageSize = PageFormat.getSize(SizeEnum.A4);
for(Page sourcePage : sourceFile.getDocument().getPages())
{
pageIndex++;
int pageMod = pageIndex % 2;
if(pageMod == 0)
{
if(composer != null)
{composer.flush();}
// Add a page to the target document!
Page page = new Page(
document,
PageFormat.getSize(SizeEnum.A3, OrientationEnum.Landscape)
); // Instantiates the page inside the document context.
pages.add(page); // Puts the page in the pages collection.
// Create a composer for the target content stream!
composer = new PrimitiveComposer(page);
}
// Add the form to the target page!
composer.showXObject(
sourcePage.toXObject(document), // Converts the source page into a form inside the target document.
new Point2D.Double(targetPageSize.getWidth() * pageMod, 0),
targetPageSize,
XAlignmentEnum.Left,
YAlignmentEnum.Top,
0
);
}
composer.flush();
// 4. Serialize the PDF file!
serialize(file, "Page combination", "combining multiple pages into single bigger ones", "page combination");
// 5. Closing the PDF file...
sourceFile.close();