apache pdfbox - how to test if a document is flattened?

apache pdfbox - how to test if a document is flattened? - java

I have written the following small Java main method. It takes in a (hardcoded for testing purposes!) PDF document I know contains active elements in the form and need to flatten it.
public static void main(String [] args) {
try {
// for testing
Tika tika = new Tika();
String filePath = "<path-to>/<pdf-document-with-active-elements>.pdf";
String fileName = filePath.substring(0, filePath.length() -4);
File file = new File(filePath);
if (tika.detect(file).equalsIgnoreCase("application/pdf")) {
PDDocument pdDocument = PDDocument.load(file);
PDAcroForm pdAcroForm = pdDocument.getDocumentCatalog().getAcroForm();
if (pdAcroForm != null) {
pdAcroForm.flatten();
pdAcroForm.refreshAppearances();
pdDocument.save(fileName + "-flattened.pdf");
}
pdDocument.close();
}
}
catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}
What kind of test would assert the File(<path-to>/<pdf-document-with-active-elements>-flattened.pdf) generated by this code would, in fact, be flat?

What kind of test would assert that the file generated by this code would, in fact, be flat?
Load that document anew and check whether it has any form fields in its PDAcroForm (if there is a PDAcroForm at all).
If you want to be thorough, also iterate through the pages and assure that there are no Widget annotations associated to them anymore.
And to really be thorough, additionally determine the field positions and contents before flattening and apply text extraction at those positions to the flattened pdf. This verifies that the form has not merely been dropped but indeed flattened.

Related

PDF Generation using iText7 with Java

I am trying to add content to an existing PDF using iText7. I have been able to create new PDFs and add content to them using Paragraphs and Tables. However, once I go to reopen a PDF that I have created and attempt to write more content to it, the new content starts overwriting the old content. I want the new content to be appended to the Document after the old content. How can I achieve this?
Edit
This is the Class which sets up some common methods that will be executed with each change done to a PDF document.
public class PDFParent {
private static Document document;
private static PdfWriter writer;
private static PdfReader reader;
private static PageSize ps;
private static PdfDocument pdfDoc;
public static Document getDocument() {
return document;
}
public static void setDocument(Document document) {
PDFParent.document = document;
}
public static void setupPdf(byte[] inParamInPDFBinary){
writer = new PdfWriter(new ByteArrayOutputStream());
try {
reader = new PdfReader(new ByteArrayInputStream(inParamInPDFBinary));
} catch (IOException e) {
e.printStackTrace();
}
pdfDoc = new PdfDocument(reader, writer);
ps = PageSize.A4;
document = new Document(pdfDoc, ps);
}
public static byte[] writePdf(){
ByteArrayOutputStream stream = (ByteArrayOutputStream) writer.getOutputStream();
return stream.toByteArray();
}
public static void closePdf(){
pdfDoc.close();
}
And this is how I am adding the content to the pdf
public class ActAddParagraphToPDF extends PDFParent{
// output parameters
public static byte[] outParamOutPDFBinary;
public static ActAddParagraphToPDF mosAddParagraphToPDF(byte[] inParamInPDFBinary, String inParamParagraph) throws IOException {
ActAddParagraphToPDF result = new ActAddParagraphToPDF();
setupPdf(inParamInPDFBinary);
//---------------------begin content-------------------//
getDocument().add((Paragraph) new Paragraph(inParamParagraph));
//---------------------end content-------------------//
closePdf();
outParamOutPDFBinary = writePdf();
return result;
}
When I go to execute this second class, it appears to be treating the original document as if it is blank. Then writes the new Paragraph on top of the original content. I know that I am missing something, just not sure what that is.

Is reopening the document every time a requirement? If you keep the document open, you can append as much content as you want and you won't have to deal with content overlapping problems.
If it is a requirement, then you will have to track the last free content position yourself and reset it to new DocumentRenderer.
A Rectangle would be enough to store the free area that is left on the last page. Right before closing the document, save the free area in some Rectangle in the following way:
Rectangle savedBbox = document.getRenderer().getCurrentArea().getBBox();
After that, when you have to reopen the document, first jump to the last page:
document.add(new AreaBreak(AreaBreakType.LAST_PAGE));
And then reset the free occupied area left from the previous time you dealt with the document:
document.getRenderer().getCurrentArea().setBBox(savedBbox);
After that you are free to add new content to the document and it will appear at the saved position:
document.add(new Paragraph("Hello again"));
Please note that this approach works if you know which documents you are dealing with (i.e. you can associate last "free" position with the document's ID) and this document is not changed outside of your environment. If this is not the case, I recommend that you look into content extraction and in particular PdfDocumentContentParser. It can help you to extract the content you have on the page and determine which positions it occupies. Then you can calculate the free area on a page and use document.getRenderer().getCurrentArea().setBBox approach I described above to point DocumentRenderer to the correct place to write content to.

PDFBox Form fill - saveIncremental does not work

I have a pdf file with some form field that I want to fill from java. Right now I'm trying to fill just one form which I am finding by its name. My code looks like this:
File file = new File("c:/Testy/luxmed/Skierowanie3.pdf");
PDDocument document = PDDocument.load(file);
PDDocumentCatalog doc = document.getDocumentCatalog();
PDAcroForm Form = doc.getAcroForm();
String formName = "topmostSubform[0].Page1[0].pana_pania[0]";
PDField f = Form.getField(formName);
setField(document, formName, "Artur");
System.out.println("New value 2nd: " + f.getValueAsString());
document.saveIncremental(new FileOutputStream("c:/Testy/luxmed/nowy_pd3.pdf"));
document.close();
and this:
public static void setField(PDDocument pdfDocument, String name, String Value) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field instanceof PDCheckBox){
field.setValue("Yes");
}
else if (field instanceof PDTextField){
System.out.println("Original value: " + field.getValueAsString());
field.setValue(Value);
System.out.println("New value: " + field.getValueAsString());
}
else{
System.out.println("Nie znaleziono pola");
}
}
As system.out states, the value was set correctly, but in new the generated pdf file, new value is not showing up (original String is presented) so I guess the incremental saving does not work properly. What am I missing?
I use 2.0.2 version of pdfbox, and here is pdf file with which I working: pdf

In general
When saving changes to a PDF as an incremental update with PDFBox 2.0.x, you have to set the property NeedToBeUpdated to true for every PDF object changed. Furthermore, the object must be reachable from the PDF catalog via a chain of references, and each PDF object in this chain also has to have the property NeedToBeUpdated set to true.
This is due to the way PDFBox saves incrementally, starting from the catalog it inspects the NeedToBeUpdated property, and if it is set to true, PDFBox stores the object, and only in this case it recurses deeper into the objects referenced from this object in search for more objects to store.
In particular this implies that some objects unnecessarily have to be marked NeedToBeUpdated, e.g. the PDF catalog itself, and in some cases this even defeats the purpose of the incremental update at large, see below.
In case of the OP's document
Setting the NeedToBeUpdated properties
On one hand one has to extend the setField method to mark the chain of field dictionaries up to and including the changed field and also the appearance:
public static void setField(PDDocument pdfDocument, String name, String Value) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field instanceof PDCheckBox) {
field.setValue("Yes");
}
else if (field instanceof PDTextField) {
System.out.println("Original value: " + field.getValueAsString());
field.setValue(Value);
System.out.println("New value: " + field.getValueAsString());
}
else {
System.out.println("Nie znaleziono pola");
}
// vvv--- new
COSDictionary fieldDictionary = field.getCOSObject();
COSDictionary dictionary = (COSDictionary) fieldDictionary.getDictionaryObject(COSName.AP);
dictionary.setNeedToBeUpdated(true);
COSStream stream = (COSStream) dictionary.getDictionaryObject(COSName.N);
stream.setNeedToBeUpdated(true);
while (fieldDictionary != null)
{
fieldDictionary.setNeedToBeUpdated(true);
fieldDictionary = (COSDictionary) fieldDictionary.getDictionaryObject(COSName.PARENT);
}
// ^^^--- new
}
(FillInFormSaveIncremental method setField)
On the other hand the main code has to be extended to mark a chain from the catalog to the fields array:
PDDocument document = PDDocument.load(...);
PDDocumentCatalog doc = document.getDocumentCatalog();
PDAcroForm Form = doc.getAcroForm();
String formName = "topmostSubform[0].Page1[0].pana_pania[0]";
PDField f = Form.getField(formName);
setField(document, formName, "Artur");
System.out.println("New value 2nd: " + f.getValueAsString());
// vvv--- new
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
dictionary.setNeedToBeUpdated(true);
dictionary = (COSDictionary) dictionary.getDictionaryObject(COSName.ACRO_FORM);
dictionary.setNeedToBeUpdated(true);
COSArray array = (COSArray) dictionary.getDictionaryObject(COSName.FIELDS);
array.setNeedToBeUpdated(true);
// ^^^--- new
document.saveIncremental(new FileOutputStream(...));
document.close();
(FillInFormSaveIncremental test testFillInSkierowanie3)
Beware: for use with generic PDFs one obviously should introduce some null tests...
Opening the result file in Adobe Reader one will unfortunately see that the program complains about changes which disable extended features in the file.
This is due to the quirk in PDFBox' incremental saving that it requires some unnecessary objects in the update section. In particular the catalog is saved there which contains a usage rights signature (the technology granting extended features). The re-saved signature obviously is not at its original position in its original revision anymore. Thus, is invalidated.
Most likely the OP OP wanted to save the PDF incrementally to not break this signature but PDFBox does not permit this. Oh well...
Thus, the only thing one can do is prevent the warning by completely removing the signature.
Removing the usage rights signature
We already have retrieved the catalog object in the additions above, so removing the signature is easy:
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
// vvv--- new
dictionary.removeItem(COSName.PERMS);
// ^^^--- new
dictionary.setNeedToBeUpdated(true);
(FillInFormSaveIncremental test testFillInSkierowanie3)
Opening the result file in Adobe Reader one will unfortunately see that the program complains about missing extended features in the file to save it.
This is due to the fact that Adobe Reader requires extended features to save changes to XFA forms, extended features we had to remove in this step.
But the document at hand is a hybrid AcroForm & XFA form document, and Adobe Reader requires no extended features to save AcroForm documents. Thus, all we have to do is remove the XFA form. As our code only sets the AcroForm value, this is a good idea anyways...
Removing the XFA form
We already have retrieved the acroform object in the additions above, so removing the XFA form referenced from there is easy:
dictionary = (COSDictionary) dictionary.getDictionaryObject(COSName.ACRO_FORM);
// vvv--- new
dictionary.removeItem(COSName.XFA);
// ^^^--- new
dictionary.setNeedToBeUpdated(true);
(FillInFormSaveIncremental test testFillInSkierowanie3)
Opening the result file in Adobe Reader one will see that one now can without further ado edit the form and save the file.
Beware, a sufficiently new Adobe Reader version is required for this, earlier versions (up to at least version 9) did require extended features even for saving changes to an AcroForm form

Page is cropped in new document in PDFBox while copying

I am trying to split the single PDF into multiple. Like 10 page document into 10 single page document.
PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();
Here the problem is, the new document's page size is different than original document. So some text are cropped or missing in new document. I am using PDFBox 2.0 and how I can avoid this?
UPDATE:
Thanks #mkl.
Splitter did the magic. Here is the updated working part,
public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
throws IOException {
File file = new File(meta.getFilename());
Splitter splitter = new Splitter();
splitter.setStartPage(meta.getStart());
splitter.setEndPage(meta.getEnd());
splitter.setSplitAtPage(meta.getEnd());
List<PDDocument> docs = splitter.split(source);
if(docs.size() > 0){
PDDocument output = docs.get(0);
output.save(file);
output.close();
}
}
public class SplitMeta {
private String filename;
private int start;
private int end;
public SplitMeta() {
}
}

Unfortunately the OP has not provided a sample document to reproduce the issue. Thus, I have to guess.
I assume that the issue is based in objects not immediately linked to the page object but inherited from its parents.
In that case using PDDocument.addPage is the wrong choice as this method only adds the given page object to the target document page tree without consideration of inherited stuff.
Instead one should use PDDocument.importPage which is documented as:
/**
* This will import and copy the contents from another location. Currently the content stream is stored in a scratch
* file. The scratch file is associated with the document. If you are adding a page to this document from another
* document and want to copy the contents to this document's scratch file then use this method otherwise just use
* the {#link #addPage} method.
*
* Unlike {#link #addPage}, this method does a deep copy. If your page has annotations, and if
* these link to pages not in the target document, then the target document might become huge.
* What you need to do is to delete page references of such annotations. See
* here for how to do this.
*
* #param page The page to import.
* #return The page that was imported.
*
* #throws IOException If there is an error copying the page.
*/
public PDPage importPage(PDPage page) throws IOException
Actually even this method might not suffice as is as it does not consider all inherited attributes, but looking at the Splitter utility class one gets an impression what one has to do:
PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources
processAnnotations(imported);
making use of the helper method
private void processAnnotations(PDPage imported) throws IOException
{
List<PDAnnotation> annotations = imported.getAnnotations();
for (PDAnnotation annotation : annotations)
{
if (annotation instanceof PDAnnotationLink)
{
PDAnnotationLink link = (PDAnnotationLink)annotation;
PDDestination destination = link.getDestination();
if (destination == null && link.getAction() != null)
{
PDAction action = link.getAction();
if (action instanceof PDActionGoTo)
{
destination = ((PDActionGoTo)action).getDestination();
}
}
if (destination instanceof PDPageDestination)
{
// TODO preserve links to pages within the splitted result
((PDPageDestination) destination).setPage(null);
}
}
// TODO preserve links to pages within the splitted result
annotation.setPage(null);
}
}
As you are trying to split the single PDF into multiple, like 10 page document into 10 single page document, you might want to use this Splitter utility class as is.
Tests
To test those methods I used the output of the PDF Clown sample output AnnotationSample.Standard.pdf because that library heavily depends on inheritance of page tree values. Thus, I copied the content of its only page to a new document using either PDDocument.addPage, PDDocument.importPage, or Splitter like this:
PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithAddPage)
PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithImportPage)
PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithSplitter)
Only the final test copied the page faithfully.

Copy contents from docx with bullets intact with Apache POI

I am trying to copy contents from a docx file to the clipboard eventually. The code I have come up with so far is:
package config;
public class buffer {
public static void main(String[] args) throws IOException, XmlException {
XWPFDocument srcDoc = new XWPFDocument(new FileInputStream("D:\\rules.docx"));
XWPFDocument destDoc = new XWPFDocument();
OutputStream out = new FileOutputStream("D:\\test.docx");
for (IBodyElement bodyElement : srcDoc.getBodyElements()) {
XWPFParagraph srcPr = (XWPFParagraph) bodyElement;
XWPFParagraph dstPr = destDoc.createParagraph();
dstPr.createRun();
int pos = destDoc.getParagraphs().size() - 1;
destDoc.setParagraph(srcPr, pos);
}
destDoc.write(out);
out.close();
}
}
This does fetch the bullets but numbers them. I want to retain the original bullet format. Is there a way to do this?

You'll need to handle the numbering definition (in the numbering part) correctly.
The most reliable thing to do would be to copy the definition (both the instance list and the abstract one) across, and renumber it (ie give it a new ID) so that it is unique.
Then of course you'll need to update the ID's in your paragraph to match.
Note that the above is a solution only for the question you have asked.
You'll run into problems if your content contains a rel to some other part (eg an image). And you'tr not handling the style definition etc.

Extracting an embedded object from a pdf

I had embedded a byte array into a pdf file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clue on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
document.open();
RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));
PdfFileSpecification fs
= PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
RichMediaParams flashVars = new RichMediaParams();
instance.setAsset(asset);
configuration.addInstance(instance);
RichMediaActivation activation = new RichMediaActivation();
richMedia.setActivation(activation);
PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
writer.addAnnotation(richMediaAnnotation);
document.close();

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new ExtractStreams().parse(SRC, DEST);
}
public void parse(String src, String dest) throws IOException {
PdfReader reader = new PdfReader(src);
PdfObject obj;
for (int i = 1; i <= reader.getXrefSize(); i++) {
obj = reader.getPdfObject(i);
if (obj != null && obj.isStream()) {
PRStream stream = (PRStream)obj;
byte[] b;
try {
b = PdfReader.getStreamBytes(stream);
}
catch(UnsupportedPdfException e) {
b = PdfReader.getStreamBytesRaw(stream);
}
FileOutputStream fos = new FileOutputStream(String.format(dest, i));
fos.write(b);
fos.flush();
fos.close();
}
}
}
Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance: page content streams consists of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you will get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream) which is also the method you need to get your AVI-bytes from your PDF.
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

apache pdfbox - how to test if a document is flattened? - java

Related

PDF Generation using iText7 with Java

PDFBox Form fill - saveIncremental does not work

Page is cropped in new document in PDFBox while copying

Copy contents from docx with bullets intact with Apache POI

Extracting an embedded object from a pdf

Categories

Resources