I have a PDF document which might have been created by extracting a few pages from another PDF document. I am wondering how to get the page number: the starting page number is 572, whereas for a complete PDF document it should have been 1.
Do you think converting the PDF into XML will sort out this issue?
Most probably the document contains a /PageLabels entry in the Document Catalog. This entry specifies the numbering style for page numbers and the starting number, too.
You might have to update the starting number or remove the entry completely. The following document contains more information about the /PageLabels entry:
Specifying consistent page numbering for PDF documents
Example 2 in that document might be useful if you decide to update the entry.
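For example, with iText 5 the entry can simply be removed through the document catalog, so that viewers fall back to plain 1-based numbering. This is only a minimal sketch; the class and file names are placeholders:

import java.io.FileOutputStream;

import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class RemovePageLabels {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("extracted.pdf");
        // Dropping /PageLabels makes viewers show plain 1-based page numbers again.
        reader.getCatalog().remove(PdfName.PAGELABELS);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("renumbered.pdf"));
        stamper.close();
        reader.close();
    }
}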
Finally figured it out using iText. Would not have been possible without Bovrosky's hint. Tons of thanks to him. Posting the code sample:
public void process(PdfReader reader) {
    // The /PageLabels entry hangs off the document catalog.
    PdfDictionary dict = reader.getCatalog();
    PRIndirectReference obj = (PRIndirectReference) dict.get(com.itextpdf.text.pdf.PdfName.PAGELABELS);
    System.out.println(obj.getNumber());
    PdfObject ref = reader.getPdfObject(obj.getNumber());
    // The /Nums array pairs page indices with page label dictionaries.
    PdfArray array = (PdfArray) ((PdfDictionary) ref).get(com.itextpdf.text.pdf.PdfName.NUMS);
    System.out.println("Start Page: " + resolvePdfIndirectReference(array, reader));
}
private static int resolvePdfIndirectReference(PdfObject obj, PdfReader reader) {
    if (obj instanceof PdfArray) {
        PdfDictionary subDict = null;
        PdfIndirectReference indRef = null;
        ListIterator<PdfObject> itr = ((PdfArray) obj).listIterator();
        while (itr.hasNext()) {
            PdfObject pdfObj = itr.next();
            if (pdfObj instanceof PdfIndirectReference)
                indRef = (PdfIndirectReference) pdfObj;
            if (pdfObj instanceof PdfDictionary) {
                subDict = (PdfDictionary) pdfObj;
                break;
            }
        }
        if (subDict != null) {
            return resolvePdfIndirectReference(subDict, reader);
        } else if (indRef != null)
            return resolvePdfIndirectReference(indRef, reader);
    } else if (obj instanceof PdfIndirectReference) {
        PdfObject ref = reader.getPdfObject(((PdfIndirectReference) obj).getNumber());
        return resolvePdfIndirectReference(ref, reader);
    } else if (obj instanceof PdfDictionary) {
        PdfNumber num = (PdfNumber) ((PdfDictionary) obj).get(com.itextpdf.text.pdf.PdfName.ST);
        return num.intValue();
    }
    return 0;
}
Hello, thank you for answering my question. This problem has perplexed me for a long time.
I have been searching this question for a long time; I have read many articles on Stack Overflow and Google, but they are outdated or fragmented, so I have to seek help here.
I hope someone can help me, please.
public class TEST04 {
    public static void main(String[] args) throws IOException {
        System.out.println("Hi");
        // original pdf file
        String oriPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\1.pdf";
        // output pdf file
        String outPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\2.pdf";
        strip(oriPDFFile, outPDFFile);
    }

    // parse the content streams and strip the image XObjects
    public static void strip(String pdfFile, String pdfFileOut) throws IOException {
        // load the original pdf file
        PDDocument document = PDDocument.load(new File(pdfFile));
        // get all pages
        List<PDPage> pageList = IterUtil.toList(document.getDocumentCatalog().getPages());
        for (int i = 0; i < pageList.size(); i++) {
            PDPage page = pageList.get(i);
            COSDictionary newDictionary = new COSDictionary(page.getCOSObject());
            PDFStreamParser parser = new PDFStreamParser(page);
            List<Object> tokens = parser.getTokens();
            List<Object> newTokens = new ArrayList<>();
            for (int j = 0; j < tokens.size(); j++) {
                Object token = tokens.get(j);
                if (token instanceof Operator) {
                    Operator operator = (Operator) token;
                    if (operator.getName().equals("Do")) {
                        COSName cosName = (COSName) newTokens.remove(newTokens.size() - 1);
                        deleteObject(newDictionary, cosName);
                        continue;
                    }
                }
                newTokens.add(token);
            }
            PDStream newContents = new PDStream(document);
            try (OutputStream outputStream = newContents.createOutputStream()) {
                ContentStreamWriter writer = new ContentStreamWriter(outputStream);
                writer.writeTokens(newTokens);
            }
            page.setContents(newContents);
            // ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
            // writer.writeTokens( newTokens );
            // page.setContents(newContents);
            PDResources newResources = new PDResources(newDictionary);
            page.setResources(newResources);
        }
        document.save(pdfFileOut);
        document.close();
    }

    // recursively delete an entry with the given name
    public static boolean deleteObject(COSDictionary d, COSName name) {
        for (COSName key : d.keySet()) {
            if (name.equals(key)) {
                d.removeItem(key);
                return true;
            }
            COSBase object = d.getDictionaryObject(key);
            if (object instanceof COSDictionary) {
                if (deleteObject((COSDictionary) object, name)) {
                    return true;
                }
            }
        }
        return false;
    }
}
The stack trace (posted as an image) reports: "Cannot read while there is an open stream writer".
It works the same way as in the RemoveAllText.java example, just with a different operator.
Use the code from that example, but look for "Do" instead of "Tj".
Of course, if you need to load metadata etc., you should enumerate and check the images through the page resources (like in my example).
Following the tip in Ali Yavari's answer, you created a test class. Unfortunately, that test code produced an exception. This answer focuses on fixing your code.
According to the stack trace you posted an image of, the exception occurred while saving the document: some stream was asked to provide an InputStream and failed with the message "Cannot read while there is an open stream writer".
So, let's have a look where your code opens a stream writer but does not close it again:
PDStream newContents = new PDStream(document);
ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
writer.writeTokens( newTokens );
page.setContents(newContents);
Indeed, here you ask a stream (the PDStream newContents) for something to write to (newContents.createOutputStream()) but don't close it.
You can do that like this:
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
    ContentStreamWriter writer = new ContentStreamWriter(outputStream);
    writer.writeTokens(newTokens);
}
page.setContents(newContents);
A side note: you will have to re-write what you do with the newDictionary object. Currently you
1. initialize it with the page dictionary entries,
2. recursively remove all entries whose key is the name of an image you remove, and
3. set the page resources to this dictionary.
Item 2 can delete much more than you actually want; the same name in a different dictionary may refer to an entry with a completely different meaning. Furthermore, you recurse without further checks; if there is a circular relation among the dictionaries, this may result in infinite recursion, i.e. a stack overflow exception.
Item 3 inappropriately sets this manipulated page clone as the resources of the original page. This creates a completely broken page structure.
Instead you should retrieve the resources from the page (resources = page.getResources()) and remove the images by putting them to null (resources.put(cosName, (PDXObject)null)), as sketched below.
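Applied to the strip method from the question, the result might look roughly like this (a sketch assuming PDFBox 2.x; untested, and the class name is just a placeholder):

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;

public class StripImagesSketch {
    public static void strip(String pdfFile, String pdfFileOut) throws IOException {
        try (PDDocument document = PDDocument.load(new File(pdfFile))) {
            for (PDPage page : document.getPages()) {
                // keep the original page resources instead of cloning the page dictionary
                PDResources resources = page.getResources();
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<>();
                for (Object token : tokens) {
                    if (token instanceof Operator && "Do".equals(((Operator) token).getName())) {
                        // drop the Do instruction and null out only the referenced XObject
                        COSName xObjectName = (COSName) newTokens.remove(newTokens.size() - 1);
                        resources.put(xObjectName, (PDXObject) null);
                        continue;
                    }
                    newTokens.add(token);
                }
                PDStream newContents = new PDStream(document);
                try (OutputStream out = newContents.createOutputStream()) {
                    new ContentStreamWriter(out).writeTokens(newTokens);
                }
                page.setContents(newContents);
            }
            document.save(pdfFileOut);
        }
    }
}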
In my other answer I focused on advice on how to fix the code in the question. Here I focus on a different approach to the task.
In your code you try to remove the bitmap images by inspecting the page content streams, finding Do operations therein drawing XObjects, and removing both this instruction and the referenced XObject.
It is a bit easier to instead simply replace all image XObjects in the resources by an empty form XObject. This is the approach used here.
As that approach is very easy to implement, I extended it to not only go through the immediate resources of the pages but also iterate into embedded form XObjects and patterns.
void replaceBitmapImagesResources(PDDocument document) throws IOException {
    PDFormXObject pdFormXObject = new PDFormXObject(document);
    pdFormXObject.setBBox(new PDRectangle(1, 1));
    for (PDPage pdPage : document.getPages()) {
        replaceBitmapImagesResources(pdPage.getResources(), pdFormXObject);
    }
}

void replaceBitmapImagesResources(PDResources resources, PDFormXObject formXObject) throws IOException {
    if (resources == null)
        return;
    for (COSName cosName : resources.getPatternNames()) {
        PDAbstractPattern pdAbstractPattern = resources.getPattern(cosName);
        if (pdAbstractPattern instanceof PDTilingPattern) {
            PDTilingPattern pdTilingPattern = (PDTilingPattern) pdAbstractPattern;
            replaceBitmapImagesResources(pdTilingPattern.getResources(), formXObject);
        }
    }
    List<COSName> xobjectsToReplace = new ArrayList<>();
    for (COSName cosName : resources.getXObjectNames()) {
        PDXObject pdxObject = resources.getXObject(cosName);
        if (pdxObject instanceof PDImageXObject) {
            xobjectsToReplace.add(cosName);
        } else if (pdxObject instanceof PDFormXObject) {
            PDFormXObject pdFormXObject = (PDFormXObject) pdxObject;
            replaceBitmapImagesResources(pdFormXObject.getResources(), formXObject);
        }
    }
    for (COSName cosName : xobjectsToReplace) {
        resources.put(cosName, formXObject);
    }
}
(RemoveImages helper methods)
To apply this approach to a PDDocument simply call the first replaceBitmapImagesResources with that document as parameter.
Beware: I tried to keep the code simple; for production use, remember to limit the recursion here to prevent endless recursion, as in some PDFs XObjects or patterns call themselves directly or indirectly. Also you may want to inspect page annotations and the resources of template pages.
I am trying to sign a PDF with 2 signature fields using the example code provided by PDFBox (https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/CreateVisibleSignature.java). But the signed PDF shows "There have been changes made to this document that invalidate the signature".
I have uploaded my sample project to GitHub; please find it here.
The project can be opened using IntelliJ or Eclipse.
The program argument should be set to the following to simulate the problem.
keystore/lawrence.p12 12345678 pdfs/Fillable-2.pdf images/image.jpg
Grateful if any PDFBox expert can help me. Thank you.
This answer to the question "“Lock” dictionary in signature field is the reason of broken signature after signing" already contains code for signing that respects the signature Lock dictionary and creates a matching FieldMDP transformation while signing.
As clarified in a comment, though, the OP wonders
is there any way to lock the corresponding textfield after signing
Thus, not only shall changes to protected form fields invalidate the signature in question, but in the course of signing, these protected fields shall themselves be locked.
Indeed, one can improve the code from the referenced answer to do that, too:
PDSignatureField signatureField = FIND_YOUR_SIGNATURE_FIELD_TO_SIGN;
PDSignature signature = new PDSignature();
signatureField.setValue(signature);

COSBase lock = signatureField.getCOSObject().getDictionaryObject(COS_NAME_LOCK);
if (lock instanceof COSDictionary)
{
    COSDictionary lockDict = (COSDictionary) lock;
    COSDictionary transformParams = new COSDictionary(lockDict);
    transformParams.setItem(COSName.TYPE, COSName.getPDFName("TransformParams"));
    transformParams.setItem(COSName.V, COSName.getPDFName("1.2"));
    transformParams.setDirect(true);
    COSDictionary sigRef = new COSDictionary();
    sigRef.setItem(COSName.TYPE, COSName.getPDFName("SigRef"));
    sigRef.setItem(COSName.getPDFName("TransformParams"), transformParams);
    sigRef.setItem(COSName.getPDFName("TransformMethod"), COSName.getPDFName("FieldMDP"));
    sigRef.setItem(COSName.getPDFName("Data"), document.getDocumentCatalog());
    sigRef.setDirect(true);
    COSArray referenceArray = new COSArray();
    referenceArray.add(sigRef);
    signature.getCOSObject().setItem(COSName.getPDFName("Reference"), referenceArray);

    final Predicate<PDField> shallBeLocked;
    final COSArray fields = lockDict.getCOSArray(COSName.FIELDS);
    final List<String> fieldNames = fields == null ? Collections.emptyList() :
        fields.toList().stream().filter(c -> (c instanceof COSString)).map(s -> ((COSString) s).getString()).collect(Collectors.toList());
    final COSName action = lockDict.getCOSName(COSName.getPDFName("Action"));
    if (action.equals(COSName.getPDFName("Include"))) {
        shallBeLocked = f -> fieldNames.contains(f.getFullyQualifiedName());
    } else if (action.equals(COSName.getPDFName("Exclude"))) {
        shallBeLocked = f -> !fieldNames.contains(f.getFullyQualifiedName());
    } else if (action.equals(COSName.getPDFName("All"))) {
        shallBeLocked = f -> true;
    } else { // unknown action, lock nothing
        shallBeLocked = f -> false;
    }
    lockFields(document.getDocumentCatalog().getAcroForm().getFields(), shallBeLocked);
}

signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
signature.setName("blablabla");
signature.setLocation("blablabla");
signature.setReason("blablabla");
signature.setSignDate(Calendar.getInstance());
document.addSignature(signature [, ...]);
(CreateSignature helper method signAndLockExistingFieldWithLock)
with lockFields implemented like this:
boolean lockFields(List<PDField> fields, Predicate<PDField> shallBeLocked) {
    boolean isUpdated = false;
    if (fields != null) {
        for (PDField field : fields) {
            boolean isUpdatedField = false;
            if (shallBeLocked.test(field)) {
                field.setFieldFlags(field.getFieldFlags() | 1);
                if (field instanceof PDTerminalField) {
                    for (PDAnnotationWidget widget : ((PDTerminalField) field).getWidgets())
                        widget.setLocked(true);
                }
                isUpdatedField = true;
            }
            if (field instanceof PDNonTerminalField) {
                if (lockFields(((PDNonTerminalField) field).getChildren(), shallBeLocked))
                    isUpdatedField = true;
            }
            if (isUpdatedField) {
                field.getCOSObject().setNeedToBeUpdated(true);
                isUpdated = true;
            }
        }
    }
    return isUpdated;
}
(CreateSignature helper method lockFields)
I'm using apache PDFBox from java, and I have a source PDF with multiple optional content groups. What I am wanting to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original.... so text fields are still text fields, vector images are still vector images, etc. The reason that this is required is because I intend to ultimately be using a pdf form editor program that does not know how to handle optional content, and would blindly render all of them, so I want to preprocess the source pdf, and use the form editing program on a less cluttered destination pdf.
I've been trying to find something that could give me any hints on how to do this with google, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something that is outside of what the PDFBox API was designed for. I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to java), because despite the pdf I'm trying to import having optional content, there does not seem to be any OC resources when I examine the tokens on each page.
for (PDPage page : pages) {
    PDResources resources = page.getResources();
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    Collection tokens = parser.getTokens();
    ...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for (COSName name : names) {
    PDXObject xobj = resources.getXObject(name);
    PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
    parser.parse();
    Object[] tokens = parser.getTokens().toArray();
    for (int i = 0; i < tokens.length - 1; i++) {
        Object obj = tokens[i];
        if (obj instanceof COSName && obj.equals(COSName.OC)) {
            i++;
            obj = tokens[i];
            if (obj instanceof COSName) {
                PDPropertyList props = resources.getProperties((COSName) obj);
                if (props != null) {
                    ...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding any info that I can recognize from the named optional content groups.
Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {
    private final MarkedContentMatcher matcher;

    /**
     *
     */
    public MarkedContentRemover(MarkedContentMatcher matcher) {
        this.matcher = matcher;
    }

    public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
        ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
        PDResources pdResources = page.getResources();
        PDFStreamParser pdParser = new PDFStreamParser(page);
        PDStream newContents = new PDStream(doc);
        OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
        List<Object> operands = new ArrayList<>();
        Operator operator = null;
        Object token;
        int suppressDepth = 0;
        boolean resumeOutputOnNextOperator = false;
        int removedCount = 0;
        while (true) {
            operands.clear();
            token = pdParser.parseNextToken();
            while (token != null && !(token instanceof Operator)) {
                operands.add(token);
                token = pdParser.parseNextToken();
            }
            operator = (Operator) token;
            if (operator == null) break;
            if (resumeOutputOnNextOperator) {
                resumeOutputOnNextOperator = false;
                suppressDepth--;
                if (suppressDepth == 0)
                    removedCount++;
            }
            if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
                    || OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
                COSName contentId = (COSName) operands.get(0);
                final COSDictionary properties;
                if (operands.size() > 1) {
                    Object propsOperand = operands.get(1);
                    if (propsOperand instanceof COSDictionary) {
                        properties = (COSDictionary) propsOperand;
                    } else if (propsOperand instanceof COSName) {
                        properties = pdResources.getProperties((COSName) propsOperand).getCOSObject();
                    } else {
                        properties = new COSDictionary();
                    }
                } else {
                    properties = new COSDictionary();
                }
                if (matcher.matches(contentId, properties)) {
                    suppressDepth++;
                }
            }
            if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
                if (suppressDepth > 0)
                    resumeOutputOnNextOperator = true;
            }
            else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
            }
            else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
            }
            else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
            }
            if (suppressDepth == 0) {
                newContentWriter.writeTokens(operands);
                newContentWriter.writeTokens(operator);
            }
        }
        if (resumeOutputOnNextOperator)
            removedCount++;
        newContentOutput.close();
        page.setContents(newContents);
        resourceSuppressionTracker.updateResources(pdResources);
        return removedCount;
    }

    private static class ResourceSuppressionTracker {
        // if the boolean is TRUE, then the resource should be removed. If the boolean is FALSE, the resource should not be removed
        private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();

        public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
            if (!(resourceNameOperand instanceof COSName)) return;
            if (preserve) {
                markForPreservation(resourceType, (COSName) resourceNameOperand);
            } else {
                markForRemoval(resourceType, (COSName) resourceNameOperand);
            }
        }

        public void markForRemoval(COSName resourceType, COSName refId) {
            if (!resourceIsPreserved(resourceType, refId)) {
                getResourceTracker(resourceType).put(refId, Boolean.TRUE);
            }
        }

        public void markForPreservation(COSName resourceType, COSName refId) {
            getResourceTracker(resourceType).put(refId, Boolean.FALSE);
        }

        public void updateResources(PDResources pdResources) {
            for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
                for (Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
                    if (refEntry.getValue().equals(Boolean.TRUE)) {
                        // remove the entry from the dictionary of its own resource type
                        // (/XObject, /ExtGState or /Font), not always from /XObject
                        COSDictionary resourceDict = pdResources.getCOSObject().getCOSDictionary(resourceEntry.getKey());
                        if (resourceDict != null) {
                            resourceDict.removeItem(refEntry.getKey());
                        }
                    }
                }
            }
        }

        private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
            return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
        }

        private Map<COSName, Boolean> getResourceTracker(COSName resourceType) {
            if (!tracker.containsKey(resourceType)) {
                tracker.put(resourceType, new HashMap<>());
            }
            return tracker.get(resourceType);
        }
    }
}
Helper class:
public interface MarkedContentMatcher {
    public boolean matches(COSName contentId, COSDictionary props);
}
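For the optional content case, a matcher keyed on the /OC tag could be used roughly like this (a sketch; the layer name "Watermark" and the file names are placeholders):

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class RemoveOcDemo {
    public static void main(String[] args) throws IOException {
        // Match marked content tagged /OC whose property group is named "Watermark".
        MarkedContentRemover remover = new MarkedContentRemover((contentId, props) ->
                COSName.OC.equals(contentId) && "Watermark".equals(props.getString(COSName.NAME)));
        try (PDDocument doc = PDDocument.load(new File("in.pdf"))) {
            for (PDPage page : doc.getPages()) {
                remover.removeMarkedContent(doc, page);
            }
            doc.save("out.pdf");
        }
    }
}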
Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# code that was posted a while ago: How to delete an optional content group alongwith its content from pdf using pdfbox?
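A rough Java sketch of that token walk might look like the following (assuming PDFBox 2.x; untested, and the class and method names are only illustrative):

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDPropertyList;

public class OcgSectionRemover {
    /** Copies the page tokens, skipping everything between the BDC that opens the named OCG and its matching EMC. */
    static void removeOcg(PDDocument doc, PDPage page, String layerName) throws IOException {
        PDResources resources = page.getResources();
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        List<Object> kept = new ArrayList<>();
        int depth = 0; // > 0 while inside the OCG being removed
        for (int i = 0; i < tokens.size(); i++) {
            Object token = tokens.get(i);
            if (token instanceof Operator) {
                String op = ((Operator) token).getName();
                if (depth == 0 && "BDC".equals(op) && i >= 2
                        && COSName.OC.equals(tokens.get(i - 2))
                        && tokens.get(i - 1) instanceof COSName) {
                    PDPropertyList props = resources.getProperties((COSName) tokens.get(i - 1));
                    if (props instanceof PDOptionalContentGroup
                            && layerName.equals(((PDOptionalContentGroup) props).getName())) {
                        // drop the already-copied "/OC /Name" operands and start skipping
                        kept.remove(kept.size() - 1);
                        kept.remove(kept.size() - 1);
                        depth = 1;
                        continue;
                    }
                } else if (depth > 0 && ("BDC".equals(op) || "BMC".equals(op))) {
                    depth++; // nested marked content inside the removed group
                } else if (depth > 0 && "EMC".equals(op)) {
                    depth--;
                    continue; // also drop the EMC closing the removed group
                }
            }
            if (depth == 0) {
                kept.add(token);
            }
        }
        PDStream newContents = new PDStream(doc);
        try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(kept);
        }
        page.setContents(newContents);
    }
}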
I investigated that (converting it to Java) but couldn't get it to work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample, but the PDF was corrupted. Perhaps that is my lack of C# knowledge (related to Tuples etc.).
Here is what I came up with; as I said, it doesn't work. Perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
void OCGDelete(PDDocument doc, int pageNum, String OCName) throws IOException {
    PDPage pdPage = doc.getDocumentCatalog().getPages().get(pageNum);
    PDResources pdResources = pdPage.getResources();
    PDFStreamParser pdParser = new PDFStreamParser(pdPage);
    int ocgStart = -1;
    int ocgLength = 0;
    int startIndex = -1;
    pdParser.parse();
    List<Object> newTokens = new ArrayList<>(pdParser.getTokens());
    try {
        for (int index = 0; index < newTokens.size(); index++) {
            Object obj = newTokens.get(index);
            if (obj instanceof COSName && obj.equals(COSName.OC)) {
                // Found Optional Content
                startIndex = index;
                index++;
                if (index < newTokens.size()) {
                    obj = newTokens.get(index);
                    if (obj instanceof COSName) {
                        PDPropertyList prop = pdResources.getProperties((COSName) obj);
                        if (prop instanceof PDOptionalContentGroup) {
                            if (((PDOptionalContentGroup) prop).getName().equals(OCName)) {
                                System.out.println("Found the Layer to be deleted");
                                System.out.println("prop Name was " + ((PDOptionalContentGroup) prop).getName());
                                index++;
                                if (index < newTokens.size()) {
                                    obj = newTokens.get(index);
                                    if (obj instanceof Operator && ((Operator) obj).getName().equals("BDC")) {
                                        ocgStart = index;
                                        System.out.println("OCG Start " + ocgStart);
                                        ocgLength = -1;
                                        index++;
                                        while (index < newTokens.size()) {
                                            ocgLength++;
                                            obj = newTokens.get(index);
                                            System.out.println(" Loop through relevant OCG Tokens " + obj);
                                            if (obj instanceof Operator && ((Operator) obj).getName().equals("EMC")) {
                                                System.out.println("the next obj was " + obj);
                                                System.out.println("after that " + newTokens.get(index + 1) + " and then " + newTokens.get(index + 2));
                                                System.out.println("OCG End " + ocgLength++);
                                                break;
                                            }
                                            index++;
                                        }
                                        if (ocgLength > 0) {
                                            System.out.println("End Index was something " + (startIndex + ocgLength));
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
    }
    for (int i = ocgStart; i < ocgStart + ocgLength; i++) {
        newTokens.remove(i);
    }
    PDStream newContents = new PDStream(doc);
    OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(output);
    writer.writeTokens(newTokens);
    output.close();
    pdPage.setContents(newContents);
}
Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
I want to combine PDF documents with an edited set of bookmarks that keeps a clear pairing of the bookmarks with each original document. I also want a new top level bookmark describing the set as a whole to improve combining with yet other documents later if the user chooses. The number of documents combined and the number of bookmarks in each is unknown and some documents might not have any bookmarks.
For simplicity, assume I have two documents with two pages and a bookmark to the second page in each. I would want the combined document to have a bookmark structure like this where "NEW" are the ones I am creating based on meta data I have about each source document and "EXISTING" are whatever I copy from the individual documents:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
------ EXISTING Doc one link (page 2)
---- NEW Document two meta (page 3)
------ EXISTING Doc two link (page 4)
Code:
private static String combinePdf(List<String> allFile, LinkedHashMap<String, String> bookmarkMetaMap, Connection conn) throws IOException {
    System.out.println("=== combinePdf() ENTER"); // TODO REMOVE
    File outFile = File.createTempFile("combinePdf", "pdf", new File(DocumentObj.TEMP_DIR_ON_SERVER));
    if (!outFile.exists() || !outFile.canWrite()) {
        throw new IOException("Unable to create writeable file in " + DocumentObj.TEMP_DIR_ON_SERVER);
    }
    if (bookmarkMetaMap == null || bookmarkMetaMap.isEmpty()) {
        bookmarkMetaMap = new LinkedHashMap<>(); // prevent NullPointer below
        bookmarkMetaMap.put("Documents", "Documents");
    }
    try ( PdfDocument allPdfDoc = new PdfDocument(new PdfWriter(outFile)) ) {
        allPdfDoc.initializeOutlines();
        allPdfDoc.getCatalog().setPageMode(PdfName.UseOutlines);
        PdfMerger allPdfMerger = new PdfMerger(allPdfDoc, true, false); // build own outline
        Iterator<Map.Entry<String, String>> itr = bookmarkMetaMap.entrySet().iterator();
        PdfOutline rootOutline = allPdfDoc.getOutlines(false);
        PdfOutline mainOutline;
        mainOutline = rootOutline.addOutline(itr.next().getValue());
        mainOutline.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
        int fileNum = 0;
        for (String oneFile : allFile) {
            PdfDocument onePdfDoc = new PdfDocument(new PdfReader(oneFile));
            PdfAcroForm oneForm = PdfAcroForm.getAcroForm(onePdfDoc, false);
            if (oneForm != null) {
                oneForm.flattenFields();
            }
            allPdfMerger.merge(onePdfDoc, 1, onePdfDoc.getNumberOfPages());
            fileNum++;
            String bookmarkLabel = itr.hasNext() ? itr.next().getKey() : "Document " + fileNum;
            PdfOutline linkToDoc = mainOutline.addOutline(bookmarkLabel);
            linkToDoc.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
            PdfOutline srcDocOutline = onePdfDoc.getOutlines(false);
            if (srcDocOutline != null) {
                List<PdfOutline> outlineList = srcDocOutline.getAllChildren();
                if (!outlineList.isEmpty()) {
                    for (PdfOutline p : outlineList) {
                        linkToDoc.addOutline(p); // if I comment this out, no error, but links wrong order
                    }
                }
            }
            onePdfDoc.close();
        }
        System.out.println("=== combinePdf() DONE ADDING PAGES ==="); //TODO REMOVE
    }
    return outFile.getAbsolutePath();
}
Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
Error occurs after the debug line "=== combinePdf() DONE ADDING PAGES ===" so the for loop completes as expected.
This means the error occurs when allPdfDoc is automagically closed.
If I remove the line linkToDoc.addOutline(p); I get all of my links and they go to the correct pages but they are not nested/ordered as I want:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
---- NEW Document two meta (page 3)
-- EXISTING Doc one link (page 2)
-- EXISTING Doc two link (page 4)
With the aforementioned line commented out, I am not even sure how the EXISTING links are included at all. I have the mergeOutlines flag set to false in the PdfMerger constructor since I thought I had to construct my own outline. I get similar results no matter whether I set the getOutlines() to true or false as well as if I take out my arbitrary top level new bookmark.
I know how to create a flattened list of new and existing bookmarks in the desired order. So my question is about how to get both the indenting and ordering as desired.
Thanks for taking a look!
Rather than shift bookmarks in the combined PDF, I did it in the component PDF before merging.
Feedback welcome, especially if something is horribly inefficient as PDF size increases:
private static void shiftPdfBookmarksUnderNewBookmark(PdfDocument pdfDocument, String bookmarkLabel) {
    if (pdfDocument == null || pdfDocument.getWriter() == null) {
        log.warn("shiftPdfBookmarksUnderNewBookmark(): no writer linked to PDFDocument, cannot modify bookmarks");
        return;
    }
    pdfDocument.initializeOutlines();
    try {
        PdfOutline rootOutline = pdfDocument.getOutlines(false);
        PdfOutline subOutline = rootOutline.addOutline(bookmarkLabel);
        subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // Not sure why this is needed, but problems if omitted.
        List<PdfOutline> pdfOutlineChildren = rootOutline.getAllChildren();
        if (pdfOutlineChildren.size() == 1) {
            return;
        }
        int i = 0;
        for (PdfOutline p : rootOutline.getAllChildren()) {
            if (p != subOutline) {
                if (p.getDestination() == null) {
                    continue;
                }
                subOutline.addOutline(p);
            }
        }
        rootOutline.getAllChildren().clear();
        rootOutline.addOutline(subOutline);
        subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // not sure why the duplicate of the line above seems to be needed
    }
    catch (Exception logAndIgnore) {
        log.warn("shiftPdfBookmarksUnderNewBookmark ignoring error and not shifting bookmarks: " + logAndIgnore, logAndIgnore);
    }
}
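For a single component file, the method can be exercised before merging roughly like this (a sketch; the file names are placeholders, and the component has to be opened in stamping mode so it has a writer):

// Open the component PDF with a reader and a writer so the outline changes are written out on close.
try (PdfDocument onePdfDoc = new PdfDocument(new PdfReader("doc-one.pdf"), new PdfWriter("doc-one-shifted.pdf"))) {
    shiftPdfBookmarksUnderNewBookmark(onePdfDoc, "Document one meta");
}
// "doc-one-shifted.pdf" can then be fed to the merge loop from the question.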
Again a question. This time I'm parsing XML messages I receive from a server.
Someone thought they were being smart and decided to place HTML pages inside an XML message. Now I'm facing problems because I want to extract that HTML page as a string from this XML message.
Ok this is the XML message I'm parsing:
<AmigoRequest>
<From></From>
<To></To>
<MessageType>showMessage</MessageType>
<Param0>general message</Param0>
<Param1><html><head>test</head><body>Testhtml</body></html></Param1>
</AmigoRequest>
You see that in Param1 an HTML page is specified. I've tried to extract the message the following way:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results.getLength() > 0 && results != null) {
            return results.item(0).getFirstChild().getNodeValue();
        }
    }
    return "";
}
Where d is the XML message in document form.
It always returns a null value, because getNodeValue() returns null.
When I try results.item(0).getFirstChild().hasChildNodes() it returns true, because it sees there is a tag in the message.
How can I extract the HTML message <html><head>test</head><body>Testhtml</body></html> from Param1 as a string?
I'm using Android sdk 1.5 (well almost java) and a DOM Parser.
Thanks for your time and replies.
Antek
You could take the content of param1, like this:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results.getLength() > 0 && results != null) {
            // String extractHTMLTags(String s) is a function that you have
            // to implement in a way that will extract all the HTML tags inside a string.
            return extractHTMLTags(results.item(0).getTextContent());
        }
    }
    return "";
}
All you have to do is to implement a function:
String extractHTMLTags(String s)
that will remove all HTML tag occurrences from a string.
For that you can take a look at this post: Remove HTML tags from a String
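A simple regex-based sketch of such a function could be (illustrative only; a real HTML parser is more robust):

// Strips anything that looks like an HTML/XML tag from the string.
String extractHTMLTags(String s) {
    return s == null ? null : s.replaceAll("<[^>]*>", "");
}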
After checking a lot and scratching my head thousands of times, I came up with a simple alteration: you need to change your API level to 8.
EDIT: I just saw your comment above about getTextContent() not being supported on Android. I'm going to leave this answer up in case it's useful to someone who's on a different platform.
If your DOM API supports it, you can call getTextContent(), as follows:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results != null && results.getLength() > 0) {
            return results.item(0).getTextContent();
        }
    }
    return "";
}
However, getTextContent() is a DOM Level 3 API call; not all parsers are guaranteed to support it. Xerces-J does.
By the way, in your original example, your check for null is in the wrong place; it should be:
if (results != null && results.getLength() > 0) {
Otherwise, you'd get a NPE if results really does come back as null.
Since getTextContent() isn't available to you, another option would be to write it -- it isn't hard. In fact, if you're writing this solely for your own use -- or your employer doesn't have overly strict rules about open source -- you could look at Apache's implementation as a starting point; lines 610-646 seem to contain most of what you need. (Please be respectful of Apache's copyright and license.)
Otherwise, some rough pseudocode for the method would be:
String getTextContent(Node node) {
    if (node has no children)
        return "";
    if (node has 1 child)
        return getTextContent(node.getFirstChild());
    return getTextContent(node, new StringBuffer()).toString();
}

StringBuffer getTextContent(Node node, StringBuffer sb) {
    for each child of node {
        if (child is a text node) sb.append(child's text)
        else getTextContent(child, sb);
    }
    return sb;
}
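For what it's worth, a concrete Java version of that pseudocode might look like this (a sketch against the standard org.w3c.dom API):

import org.w3c.dom.Node;

static String getTextContent(Node node) {
    StringBuilder sb = new StringBuilder();
    appendTextContent(node, sb);
    return sb.toString();
}

static void appendTextContent(Node node, StringBuilder sb) {
    // walk all children, collecting the data of text and CDATA nodes
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child.getNodeType() == Node.TEXT_NODE || child.getNodeType() == Node.CDATA_SECTION_NODE) {
            sb.append(child.getNodeValue());
        } else {
            appendTextContent(child, sb);
        }
    }
}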
Well, I was almost there with the code...
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results.getLength() > 0 && results != null) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db;
            Element node = (Element) results.item(0); // get the value of Param1
            Document doc2 = null;
            try {
                db = dbf.newDocumentBuilder();
                doc2 = db.newDocument(); // create new document
                doc2.appendChild(doc2.importNode(node, true)); // import the <html>...</html> result in doc2
            } catch (ParserConfigurationException e) {
                // TODO Auto-generated catch block
                Log.d(TAG, " Exception ", e);
            } catch (DOMException e) {
                // TODO: handle exception
                Log.d(TAG, " Exception ", e);
            } catch (Exception e) {
                // TODO: handle exception
                e.printStackTrace();
            }
            return doc2. ..... // All I'm missing is something to convert a Document to a string.
        }
    }
    return "";
}
As explained in the comment in my code, all I am missing is a way to make a String out of a Document. You can't use the Transform class in Android... doc2.toString() will just give you a serialization of the object.
But my next step is to write my own parser if this doesn't work out ;)
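For reference, on platforms where javax.xml.transform is available (per the answer above, newer Android API levels include it), a node can be serialized to a String like this (a sketch; I have not tested it on SDK 1.5):

import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;

static String nodeToString(Node node) throws Exception {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    // omit the <?xml ...?> declaration so only the <html>...</html> markup is returned
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(writer));
    return writer.toString();
}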
Not the best code, but a temporary solution.
public String getParam1(String b) {
    return b.substring(b.indexOf("<Param1>") + "<Param1>".length(), b.indexOf("</Param1>"));
}
Where String b is the XML document string.