I want to iterate through the pages of a PDF and write a new PDF where all images have interpolation set to false. I was expecting to be able to do something like the following, but I cannot find a method of accessing the Images or Rectangles on the PDF page.
PdfCopy copy = new PdfCopy(document, new FileOutputStream(outFileName));
copy.newPage();
PdfReader reader = new PdfReader(inFileName);
for(int i = 1; i <= reader.getNumberOfPages(); i++) {
PdfImportedPage importedPage = copy.getImportedPage(reader, i);
for(Image image : importedPage.images())
image.isInterpolated(false);
copy.addPage(importedPage);
}
reader.close();
There is, however, no PdfImportedPage.images(). Any suggestions on how I might otherwise do the same?
Cheers
Nik
It won't be that easy. There's no high-level way of doing what you want. You'll have to enumerate the resources looking for XObject Images, and clear their /Interpolate flag.
And you'll have to do it before creating the PdfImportedPage because there's no public way to access their resources. Grr.
void removeInterpolation( int pageNum ) {
PdfDictionary page = someReader.getPageN(pageNum);
PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
enumResources(resources);
}
void enumResources(PdfDictionary resources) {
PdfDictionary xobjs = resources.getAsDict(PdfName.XOBJECT);
Set<PdfName> xobjNames = xobjs.getKeys();
for (PdfName name : xobjNames) {
PdfStream xobjStream = xobjs.getAsStream(name);
if (PdfName.FORM.equals( xobjStream.getAsName(PdfName.SUBTYPE))) {
// xobject forms have their own nested resources.
PdfDictionary nestedResources = xobjStream.getAsDict(PdfName.RESOURCES);
enumResources(nestedResources);
} else {
xobjStream.remove(PdfName.INTERPOLATE);
}
}
}
There's quite a bit of null-checking that's skipped in the above code. A page doesn't have to have a resource dictionary, though they almost always do. Ditto for XObject Forms. All the getAs* functions will return null if the given key is missing or of a different type... you get the idea.
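One way the missing checks might be arranged is sketched below. This is a library-free model, not iText code: plain Map<String, Object> values stand in for PdfDictionary and PdfStream, and the names ResourceWalker and clearInterpolate are made up for illustration. Each instanceof test plays the role of a null check on a getAs* result.

```java
import java.util.Map;

final class ResourceWalker {
    // Removes "Interpolate" from every image XObject reachable from the given
    // resources "dictionary" and returns how many images were visited.
    static int clearInterpolate(Map<String, Object> resources) {
        if (resources == null) {
            return 0; // a page or form XObject without a resource dictionary
        }
        Object xobjs = resources.get("XObject");
        if (!(xobjs instanceof Map)) {
            return 0; // key missing or of a different type
        }
        int images = 0;
        for (Object value : ((Map<?, ?>) xobjs).values()) {
            if (!(value instanceof Map)) {
                continue; // not a stream dictionary in this model
            }
            @SuppressWarnings("unchecked")
            Map<String, Object> xobj = (Map<String, Object>) value;
            Object subtype = xobj.get("Subtype");
            if ("Form".equals(subtype)) {
                // form XObjects may carry their own nested resources
                Object nested = xobj.get("Resources");
                if (nested instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> nestedResources = (Map<String, Object>) nested;
                    images += clearInterpolate(nestedResources);
                }
            } else if ("Image".equals(subtype)) {
                xobj.remove("Interpolate");
                images++;
            }
        }
        return images;
    }
}
```

Unlike the snippet above, this sketch only touches entries whose subtype is Image; whether that extra check is wanted depends on the documents being processed.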
Related
I'm currently using the PDFBox 2.x library to remove and add a QR code image after loading a PDF file from the file system. After removing the QR code, saving the file, and opening the modified document in Adobe Reader, I get the warning "An error exists on this page. Acrobat may not display the page correctly". The QR code image is removed successfully, but the warning appears when the document is opened.
Also, before removing the QR code image, the file size was 6.8 MB. After removing the QR code, the file size increases to 8.1 MB.
The modified document without the QR code image should open without the warning "An error exists on this page. Acrobat may not display the page correctly". The original file opens without any warning.
I also expected that after removing the QR code image the file size would decrease or stay the same, not increase.
Can you please help?
Below is the code for removing qr code image from the pdf file.
pdDocument = PDDocument.load(new File(aBarcodeVO.getSourceFilePath()));
newDocument = new PDDocument();
for (int pageCount = 0; pageCount < pdDocument.getNumberOfPages(); pageCount++) {
PDPage pdPage = newDocument.importPage(pdDocument.getPage(pageCount));
String imgUniqueId = aBarcodeVO.getImgUniqueId().concat(String.valueOf(pageCount));
boolean hasQRCodeOnPage = removeQRCodeImage(newDocument, pdPage, imgUniqueId);
qRCodePageList.add(hasQRCodeOnPage);
}
if(qRCodePageList.contains(true)) {
newDocument.save(aBarcodeVO.getDestinationFilePath(true));
}
newDocument.close();
pdDocument.close();
public static boolean removeQRCodeImage(PDDocument document, PDPage page, String imgUniqueId) throws Exception {
String qrCodeCosName = null;
PDResources pdResources = page.getResources();
boolean hasQRCodeOnPage=false;
for (COSName propertyName : pdResources.getXObjectNames()) {
if (!pdResources.isImageXObject(propertyName)) {
continue;
}
PDXObject o;
try {
o = pdResources.getXObject(propertyName);
if (o instanceof PDImageXObject) {
PDImageXObject pdImageXObject = (PDImageXObject) o;
if (pdImageXObject.getMetadata() != null) {
DomXmpParser xmpParser = new DomXmpParser();
XMPMetadata xmpMetadata = xmpParser.parse(pdImageXObject.getMetadata().toByteArray());
if(xmpMetadata.getDublinCoreSchema()!=null && StringUtils.isNoneBlank(xmpMetadata.getDublinCoreSchema().getTitle())&&xmpMetadata.getDublinCoreSchema().getTitle().contains("_barcodeimg_")) {
((COSDictionary) pdResources.getCOSObject().getDictionaryObject(COSName.XOBJECT))
.removeItem(propertyName);
log.debug("propertyName REMOVED--"+propertyName.getName());
qrCodeCosName = propertyName.getName();
hasQRCodeOnPage=true;
}
}
}
} catch (IOException e) {
log.error("Exception in removeQRCodeImage() while extracting QR image:" + e, e);
}
}
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
log.debug("original tokens size" + tokens.size());
List<Object> newTokens = new ArrayList<Object>();
for (int j = 0; j < tokens.size(); j++) {
Object token = tokens.get(j);
if (token instanceof Operator) {
Operator op = (Operator) token;
// find image - remove it
if (op.getName().equals("Do")) {
COSName cosName = (COSName) tokens.get(j - 1);
if (cosName.getName().equals(qrCodeCosName)) {
newTokens.remove(newTokens.size() - 1);
continue;
}
}
}
newTokens.add(token);
}
log.debug("tokens size" + newTokens.size());
PDStream newContents = new PDStream(document);
OutputStream out = newContents.createOutputStream();
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(newTokens);
out.close();
page.setContents(newContents);
return hasQRCodeOnPage;
}
A possible error: PDF resources can be shared across pages, even the same Resources object may be used for multiple pages. If your document is of such a type, therefore, your manipulation of the resources of a page may actually manipulate the resources of all pages while your content stream manipulation changes only a single page. Uses of the same image on other pages, therefore, could cause the error message you observed.
Another possible error: While iterating over the resources of the page, you remove all matching image Xobjects. But while iterating over the instructions of the page, you only remove the showing instructions for one matching image Xobject, the last one found. If there are multiple matching image Xobjects on a page, showing instructions for some of them may remain while the Xobjects themselves are removed; this could also cause the error message observed.
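The second point could be addressed by remembering every removed XObject name instead of only the last one. The following is a simplified, library-free sketch: content stream tokens are modelled as plain strings (a name operand such as /Im1 followed by the Do operator), and DoFilter and dropDoFor are hypothetical names, not PDFBox API. With PDFBox the same idea would apply to the COSName and Operator tokens returned by PDFStreamParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class DoFilter {
    // Returns a copy of the token list without any "<name> Do" pair whose
    // name is contained in removedNames.
    static List<String> dropDoFor(List<String> tokens, Set<String> removedNames) {
        List<String> out = new ArrayList<>();
        for (int j = 0; j < tokens.size(); j++) {
            String token = tokens.get(j);
            if ("Do".equals(token) && j > 0 && removedNames.contains(tokens.get(j - 1))) {
                // the operand was copied in the previous iteration; take it back out
                out.remove(out.size() - 1);
                continue;
            }
            out.add(token);
        }
        return out;
    }
}
```

In removeQRCodeImage this would mean replacing the single qrCodeCosName string with a Set<String> that receives every removed name during the resource loop.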
There might also be other issues. For a more specific analysis please share a representative example PDF.
I have a form-filling program that takes a PDF template with rows of fillable fields, plus JSON data for the field values, and populates those values into the form. If there are more rows than fit on one page, the page is duplicated and the extra rows are added to the duplicate page. The page is deep cloned, then the tValues of the page annotations are changed, as well as the page each annotation points to, and new fields are created for the new page. When I open the exported PDF in Chrome I can see the field values on the new page, but not in Acrobat. The PDF can be re-saved from Chrome and then opened in Acrobat with the values, but this strips away the fillable fields, and I don't want to make users do a workaround for what's likely developer error, since I don't have an in-depth understanding of the PDF spec.
Below is my code for duplicating the page, I'm hoping someone more generally knowledgeable about PDFs can identify what I'm doing wrong so that I can fix it. I haven't included the other helper functions called at the end, but they will essentially find the correct PDField instance and call setValue() on it.
private void createContent(PDDocument document, JSONObject jsonObj) throws IOException {
final PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
final JSONArray rowsArr = (JSONArray)jsonObj.get("rows");
final int pages = (int)Math.ceil(rowsArr.size() / 34.0);
if (pages > 1)
{
// Add additional pages
final PDFCloneUtility cloner = new PDFCloneUtility(document);
final PDPage oldPage = document.getPage(0);
for (int i=1; i < pages; i++)
{
final COSDictionary dupPageDict = (COSDictionary)cloner.cloneForNewDocument(oldPage);
final PDPage dupPage = new PDPage(dupPageDict);
final List<PDAnnotation> dupAnnoList = dupPage.getAnnotations();
for (PDAnnotation anno : dupAnnoList)
{
final COSDictionary annoDict = anno.getCOSObject();
final String oldTStr = annoDict.getString(COSName.T); // Field name, ex: INC0
// Change annotation to new page and create a field for it
if (oldTStr.endsWith(String.valueOf(i-1)))
{
// Change page link and add to dupPage
anno.setPage(dupPage);
dupPage.getAnnotations().add(anno);
// Update anno name for new page
final String dupTStr = oldTStr.substring(0, oldTStr.length() - 1) + i; // ex: INC1
annoDict.setItem(COSName.T, new COSString(dupTStr));
annoDict.setItem(COSName.AP, null);
// All the fields should be text fields.
COSBase ftBase = annoDict.getItem(COSName.FT);
if (ftBase instanceof COSName && COSName.TX == ftBase)
{
final PDTextField oldField = (PDTextField) acroForm.getField(oldTStr);
if (oldField != null)
{
// Create new field for anno
final PDTextField dupField = new PDTextField(acroForm);
dupField.setPartialName(dupTStr);
dupField.setDefaultAppearance(oldField.getDefaultAppearance());
if (anno instanceof PDAnnotationWidget)
{
dupField.getWidgets().add((PDAnnotationWidget) anno);
acroForm.getFields().add(dupField);
}
}
}
}
}
// Append dupPage before instructions page (which is at the end of the doc)
final PDPageTree pgTree = document.getDocumentCatalog().getPages();
pgTree.insertBefore(dupPage, pgTree.get(pgTree.getCount() - 1));
}
}
PDFont font = PDType1Font.HELVETICA;
PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
acroForm.setDefaultResources(resources);
for (int i = 0; i < pages; i++)
{
addHeaderInfo(acroForm, jsonObj, i);
addMainInfo(acroForm, rowsArr, i);
addFooterInfo(acroForm, jsonObj, i);
}
}
Example output file: https://www.mediafire.com/file/7zu7xxo2fflpdnw/example_out_73130945.pdf/file
We use PDFBox in one of our applications.
Some PDFs that are overlaid result in "broken" output and fonts.
Below is the sample code I'm using to overlay PDFs.
The PDFs sometimes have different numbers of pages.
We flatten AcroForms and set annotations to read-only.
PDF page rotation and bbox sizing are sometimes set differently (especially by scanners), so we try to correct for this.
PDDocument baseDocument = PDDocument.load(new File("base.pdf"));
PDDocument overlayDocument = PDDocument.load(new File("overlay.pdf"));
Iterator<PDPage> baseDocumentIterator = baseDocument.getPages().iterator();
Iterator<PDPage> overlayIterator = overlayDocument.getPages().iterator();
PDDocument finalOverlayDoc = new PDDocument();
while(baseDocumentIterator.hasNext() && overlayIterator.hasNext()) {
PDPage backing = baseDocumentIterator.next();
//locking annotations per page
List<PDAnnotation> annotations = backing.getAnnotations();
for (PDAnnotation a :annotations) {
a.setLocked(true);
a.setReadOnly(true);
}
// setting size so there's no weird overflow issues
PDRectangle rect = new PDRectangle();
rect.setLowerLeftX(0);
rect.setLowerLeftY(0);
rect.setUpperRightX(backing.getBBox().getWidth());
rect.setUpperRightY(backing.getBBox().getHeight());
backing.setCropBox(rect);
backing.setMediaBox(rect);
backing.setBleedBox(rect);
PDPage pg = overlayIterator.next();
//setting rotation if different. Some scanners cause issues.
if(backing.getRotation()!= pg.getRotation())
{
pg.setRotation(-backing.getRotation());
}
finalOverlayDoc.addPage(pg);
}
finalOverlayDoc.close();
//flatten acroform
PDAcroForm acroForm = baseDocument.getDocumentCatalog().getAcroForm();
if (acroForm != null) {
acroForm.flatten();
acroForm.setNeedAppearances(false);
}
Overlay overlay = new Overlay();
overlay.setOverlayPosition(Overlay.Position.FOREGROUND);
overlay.setInputPDF(baseDocument);
overlay.setAllPagesOverlayPDF(finalOverlayDoc);
Map<Integer, String> ovmap = new HashMap<Integer, String>();
overlay.overlay(ovmap);
PDPageTree allOverlayPages = overlayDocument.getPages();
if(baseDocument.getPages().getCount() < overlayDocument.getPages().getCount()) //Additional pages in the overlay pdf need to be appended to the base pdf.
{
for(int i=baseDocument.getPages().getCount();i<allOverlayPages.getCount(); i++)
{
baseDocument.addPage(allOverlayPages.get(i));
}
}
PDDocument finalDocument = new PDDocument();
for(PDPage p: baseDocument.getPages()){
finalDocument.addPage(p);
}
String filename = "examples/merge_pdf_examples/debug.pdf";
filename = filename + new Date().getTime() + ".pdf";
finalDocument.save(filename);
finalDocument.close();
baseDocument.close();
overlayDocument.close();
There is no error in the PDF file you shared relevant for using Overlay.
It does use one seldom-used PDF feature, though: the pages inherit resources from their parent node. Page objects in a PDF are arranged in a tree with the actual pages as leaves. A page object in this tree often carries all the information defining it, but a number of page properties can also be carried by an inner node and inherited by descendant pages unless they override them.
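The inheritance rule can be illustrated with a small model. This is a sketch only: PageNode and inherited are hypothetical names, and plain maps stand in for PDF dictionaries. In the PDF specification the inheritable page attributes are Resources, MediaBox, CropBox and Rotate.

```java
import java.util.HashMap;
import java.util.Map;

final class PageNode {
    final PageNode parent;                                // null for the root of the page tree
    final Map<String, Object> entries = new HashMap<>();  // the node's own dictionary entries

    PageNode(PageNode parent) {
        this.parent = parent;
    }

    // Looks up an inheritable attribute: the node's own entry wins,
    // otherwise the nearest ancestor that defines the key is used.
    Object inherited(String key) {
        for (PageNode node = this; node != null; node = node.parent) {
            Object value = node.entries.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

A page that defines no Resources entry of its own resolves it through its parent. Once only the page's own entries are carried into another tree, that lookup comes back empty, which corresponds to the missing fonts in the overlay output.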
After you shared your code it turns out that you have a preparation step which loses all inherited information: When you generate finalOverlayDoc from overlayDocument you essentially do:
while(overlayIterator.hasNext()) {
PDPage pg = overlayIterator.next();
//setting rotation if different. Some scanners cause issues.
finalOverlayDoc.addPage(pg);
}
(OverlayDocuments test testOverlayPreparationExampleBroken)
Here you only transport the page object itself, losing all inherited properties.
For the document at hand you can fix this by explicitly setting the page resources to the inherited ones:
while(overlayIterator.hasNext()) {
PDPage pg = overlayIterator.next();
pg.setResources(pg.getResources());
//setting rotation if different. Some scanners cause issues.
finalOverlayDoc.addPage(pg);
}
(OverlayDocuments test testOverlayPreparationFixedExampleBroken)
Beware, though: This only explicitly sets the page resources but there also are other page attributes which can be inherited.
I would propose, therefore, that you don't create a new PDDocument at all; instead of moving the overlayDocument pages to finalOverlayDoc only change them in place. If overlayDocument has more pages than baseDocument, you additionally have to remove excess pages from overlayDocument. Then use overlayDocument in overlaying instead of finalOverlayDoc.
Looking further down your code I see you repeat the anti-pattern of moving page objects to other documents without respecting inherited properties again and again. I guess you should completely overhaul that code, removing that anti-pattern.
I'm using iText to fill a template PDF that contains an AcroForm.
Now I want to use this template to create a new PDF with a dynamic number of pages.
My idea is to fill the template PDF, copy the page with the filled-in fields, and add it to a new file. The main problem is that our customer wants to design the template themselves, so I'm not sure whether this is the right way to solve the problem.
I've written the code below, but it doesn't work: I get the error com.itextpdf.io.IOException: PDF header not found.
My Code
x = 1;
try (PdfDocument finalDoc = new PdfDocument(new PdfWriter("C:\\Users\\...Final.pdf"))) {
for (HashMap<String, String> map : testValues) {
String path1 = "C:\\Users\\.....Temp.pdf";
InputStream template = templateValues.get("Template");
PdfWriter writer = new PdfWriter(path1);
try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(template), writer)) {
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
for (HashMap.Entry<String, String> map2 : map.entrySet()) {
if (form.getField(map2.getKey()) != null) {
Map<String, PdfFormField> fields = form.getFormFields();
fields.get(map2.getKey()).setValue(map2.getValue());
}
}
} catch (IOException | PdfException ex) {
System.err.println("Ex2: " + ex.getMessage());
}
if (x != 0 && (x % 5) == 0) {
try (PdfDocument tempDoc = new PdfDocument(new PdfReader(path1))) {
PdfPage page = tempDoc.getFirstPage();
finalDoc.addPage(page.copyTo(finalDoc));
} catch (IOException | PdfException ex) {
System.err.println("Ex3: " + ex.getMessage());
}
}
x++;
}
} catch (IOException | PdfException ex) {
System.err.println("Ex: " + ex.getMessage());
}
Part 1 - PDF Header is Missing
This appears to be caused by re-reading, within a loop, an InputStream that has already been read (and, depending on the configuration of the PdfReader, closed). The fix depends on the specific type of InputStream being used. If you want to keep it as a plain InputStream (rather than a more specific, more capable InputStream type), you'll first need to slurp the bytes from the stream into memory (e.g. into a ByteArrayOutputStream) and then create each PdfReader from those bytes.
i.e.
ByteArrayOutputStream templateBuffer = new ByteArrayOutputStream();
int c;
// read() returns -1 at end of stream; a 0 byte is valid data
while ((c = template.read()) != -1) templateBuffer.write(c);
for (/* your loop */) {
...
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(templateBuffer.toByteArray())), new PdfWriter(tmp));
...
}
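To show the buffering idea in isolation, here is a small sketch that uses only java.io and no iText classes. TemplateBuffer, readFully and freshStream are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class TemplateBuffer {
    // Drains the stream once and returns everything it contained.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) { // -1 signals end of stream
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Every call yields an independent stream positioned at the first byte.
    static InputStream freshStream(byte[] bytes) {
        return new ByteArrayInputStream(bytes);
    }
}
```

The template would be drained once before the loop with readFully, and each iteration would hand a freshStream over the same bytes to its own reader, so no iteration sees an exhausted stream.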
Part 2 - other problems
A couple of things:
Make sure to grab the recently released 7.0.1 version of iText, since it includes a couple of fixes for AcroForm handling.
You can probably get away with using ByteArrayOutputStreams for your temporary PDFs (instead of writing them out to files). I'll use this approach in the example below.
PdfDocument/PdfPage is in the "kernel" module, yet AcroForms are in the "form" module (meaning PdfPage is intentionally unaware of AcroForms). IPdfPageExtraCopier is sort of the bridge between the modules. In order to properly copy AcroForms, you need to use the two-arg copyTo() version, passing an instance of PdfPageFormCopier.
Field names must be unique in the document (the "absolute" field name, that is; I'll skip field hierarchies for now). Since we're looping through and adding the fields from the template multiple times, we need to come up with a strategy to rename the fields to ensure uniqueness (the current API is actually a little bit clunky in this area).
File acroFormTemplate = new File("someTemplate.pdf");
Map<String, String> someMapOfFieldToValues = new HashMap<>();
try (
PdfDocument finalOutput = new PdfDocument(new PdfWriter(new FileOutputStream(new File("finalOutput.pdf"))));
) {
for (/* some looping condition */int x = 0; x < 5; x++) {
// for each iteration of the loop, create a temporary in-memory
// PDF to handle form field edits.
ByteArrayOutputStream tmp = new ByteArrayOutputStream();
try (
PdfDocument filledInAcroFormTemplate = new PdfDocument(new PdfReader(new FileInputStream(acroFormTemplate)), new PdfWriter(tmp));
) {
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(filledInAcroFormTemplate, true);
for (PdfFormField field : acroForm.getFormFields().values()) {
if (someMapOfFieldToValues.containsKey(field.getFieldName())) {
field.setValue(someMapOfFieldToValues.get(field.getFieldName()));
}
}
// NOTE that because we're adding the template multiple times
// we need to adopt a field renaming strategy to ensure field
// uniqueness in the final document. For demonstration's sake
// we'll just rename them prefixed w/ our loop counter
List<String> fieldNames = new ArrayList<>();
fieldNames.addAll(acroForm.getFormFields().keySet()); // copy the keys to avoid a ConcurrentModificationException
for (String fieldName : fieldNames) {
acroForm.renameField(fieldName, x+"_"+fieldName);
}
}
// the temp PDF needs to be "closed" for all the PDF finalization
// magic to happen...so open up new read-only version to act as
// the source for the merging from our in-memory bucket-o-bytes
try (
PdfDocument readOnlyFilledInAcroFormTemplate = new PdfDocument(new PdfReader(new ByteArrayInputStream(tmp.toByteArray())));
) {
// although PdfPage.copyTo will probably work for simple pages, PdfDocument.copyPagesTo
// is a more comprehensive copy (wider support for copying Outlines and Tagged content)
// so it's more suitable for general page-copy use. Also, since we're copying AcroForm
// content, we need to use the PdfPageFormCopier
readOnlyFilledInAcroFormTemplate.copyPagesTo(1, 1, finalOutput, new PdfPageFormCopier());
}
}
}
Close your PdfDocuments when you are done with adding content to them.
I'd like to have a program that removes all rectangles from a PDF file. One use case for this is to unblacken a given PDF file to see if there is any hidden information behind the rectangles. The rest of the PDF file should be kept as-is.
Which PDF library is suitable to this task? In Java, I would like the code to look like this:
PdfDocument doc = PdfDocument.load(new File("original.pdf"));
PdfDocument unblackened = doc.transform(new CopyingPdfVisitor() {
public void visitRectangle(PdfRect rect) {
if (rect.getFillColor().getBrightness() >= 0.1) {
super.visitRectangle(rect);
}
}
});
unblackened.save(new File("unblackened.pdf"));
The CopyingPdfVisitor would copy a PDF document exactly as-is, and my custom code would leave out all the dark rectangles.
The iText PDF library has ways to modify PDF content.
The iText content parser example may give you an idea. The "qname" (qualified name) parameter may be used to detect rectangle elements.
http://itextpdf.com/book/chapter.php?id=15
Another option, if you want to obtain the text of the document, is to use the PdfReaderContentParser to extract the text content:
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
example at http://itextpdf.com/examples/iia.php?id=277